Pitch detection in Python

There are many different algorithms to estimate pitch, but a study found Praat's algorithm to be the most accurate of those tested [1]. More recently, the Parselmouth library has made it a lot easier to call Praat functions from Python [2].

[1]: Strömbergsson, Sofia. "Today's Most Frequently Used F0 Estimation Methods, and Their Accuracy in Estimating Male and Female Pitch in Clean Speech." INTERSPEECH. 2016. https://pdfs.semanticscholar.org/ff04/0316f44eab5c0497cec280bfb1fd0e7c0e85.pdf

[2]: https://github.com/YannickJadoul/Parselmouth


There are basically two classes of f0 (pitch) estimation: time-domain methods (using autocorrelation or cross-correlation, for example) and frequency-domain methods (e.g. identifying the fundamental by measuring the distances between harmonics, or finding the frequency in the spectrum with maximum power, as shown in the example above by Sahil M).

For many years I have successfully used RAPT (Robust Algorithm for Pitch Tracking), the predecessor of REAPER, also by David Talkin. The widely used Praat software which you mention also includes a RAPT-like cross-correlation algorithm option. Description and code are readily available on the web, and a DEB install archive is available here: http://www.phon.ox.ac.uk/releases

Pattern detection (rises, falls, etc.) on the resulting pitch track is a separate issue. The suggestion above by Sahil M of moving a window across the pitch track is a good way to start.
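
That moving-window idea for rise/fall detection can be sketched roughly as follows: fit a line to each window of the f0 track and label the window by the sign of the slope. The window length and threshold here are illustrative values, not tuned constants, and the contour is a toy example.

```python
import numpy as np

def label_contour(f0, win=10, thresh=2.0):
    """Label each window of an f0 track (one value per frame, in Hz) as
    'rise', 'fall' or 'level' from the slope of a least-squares line fit.
    `win` (frames) and `thresh` (Hz change per window) are illustrative."""
    labels = []
    x = np.arange(win)
    for start in range(0, len(f0) - win + 1, win):
        # Slope in Hz/frame, scaled to total Hz change over the window
        slope = np.polyfit(x, f0[start:start + win], 1)[0] * win
        if slope > thresh:
            labels.append('rise')
        elif slope < -thresh:
            labels.append('fall')
        else:
            labels.append('level')
    return labels

# Toy contour: 30 frames rising 120->180 Hz, then 30 frames falling back
f0 = np.concatenate([np.linspace(120, 180, 30), np.linspace(180, 120, 30)])
print(label_contour(f0))
```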


UPDATE (2019): there are now very accurate pitch trackers based on neural networks, and they work in Python out of the box. Check out CREPE:

https://pypi.org/project/crepe/

ORIGINAL ANSWER (2015): Pitch detection is a complex problem, and a relatively recent package from Google provides a robust solution to this non-trivial task:

https://github.com/google/REAPER

It is a command-line tool, but you can wrap it if you want to access it from Python.
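
A bare-bones wrapper sketch via `subprocess`, assuming you have built the `reaper` binary and put it on your PATH. The `-i` (input wav), `-f` (f0 output) and `-a` (ASCII output) flags follow REAPER's README; the file names are placeholders.

```python
import subprocess

def reaper_cmd(wav_path, f0_path, reaper_bin="reaper"):
    """Build the REAPER command line for f0 tracking:
    -i input wav, -f f0 output file, -a ASCII output."""
    return [reaper_bin, "-i", wav_path, "-f", f0_path, "-a"]

def track_pitch(wav_path, f0_path):
    # Requires the `reaper` binary to be built and on PATH.
    subprocess.run(reaper_cmd(wav_path, f0_path), check=True)
    # The ASCII .f0 file is an EST-format track (header, then rows of
    # time / voicing flag / f0); parse it with a few lines of Python.

print(reaper_cmd("speech.wav", "speech.f0"))
```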


You could try the following. The human voice of course contains harmonics extending well beyond 300 Hz. Nevertheless, you can move a window across your audio file and track changes in the frequency of maximum power (as shown below), or in a set of frequencies within each window. The code below is meant to build intuition:

import numpy as np

def maxFrequency(X, F_sample, Low_cutoff=80, High_cutoff=300):
    """Find the frequency with maximum spectral power in a band, using the FFT.

    Inputs
    ======
    X: 1-D numpy array, the real time-domain audio signal (single-channel time series)
    F_sample: float, sampling frequency of the signal (Hz)
    Low_cutoff: float, lower edge of the search band (Hz)
    High_cutoff: float, upper edge of the search band (Hz)

    Returns the frequency (in Hz) with the largest magnitude between the cutoffs.
    """
    M = X.size  # length of the time series
    Spectrum = np.abs(np.fft.rfft(X, n=M))  # magnitude spectrum of the real signal

    # Convert the cutoff frequencies into (integer) bin indices on the spectrum
    Low_point = int(Low_cutoff * M / F_sample)
    High_point = int(High_cutoff * M / F_sample)

    # Index of the strongest bin within the band, converted back to Hz
    peak_bin = Low_point + np.argmax(Spectrum[Low_point:High_point])
    return peak_bin * F_sample / M

voiceVector = []
for window in fullAudio:  # fullAudio: an iterable of windows (frames) cut from the audio file
    voiceVector.append(maxFrequency(window, samplingRate))

Depending on the intonation of the voice, the maximum-power frequency will shift, and you can register these shifts and map them to a given intonation. This will not always hold, and you may have to monitor shifts across many frequencies together, but it should get you started.
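
Putting the windowed approach together on a synthetic signal whose "pitch" steps from 120 Hz to 240 Hz halfway through (window length and band edges are illustrative only):

```python
import numpy as np

# One second of signal at 8 kHz: 120 Hz tone for the first half, 240 Hz for the second.
fs = 8000
t = np.arange(fs) / fs
signal = np.where(t < 0.5, np.sin(2 * np.pi * 120 * t), np.sin(2 * np.pi * 240 * t))

win = 1024
voiceVector = []
for start in range(0, len(signal) - win + 1, win):
    frame = signal[start:start + win]
    spectrum = np.abs(np.fft.rfft(frame))
    lo, hi = int(80 * win / fs), int(300 * win / fs)  # search only the 80-300 Hz band
    peak = lo + np.argmax(spectrum[lo:hi])
    voiceVector.append(peak * fs / win)  # convert bin index back to Hz

print(voiceVector)  # roughly 120 Hz for early frames, roughly 240 Hz for late ones
```

The frequency resolution here is fs/win ≈ 7.8 Hz per bin, so the tracked values land on the nearest bin rather than exactly on 120 or 240 Hz; longer windows sharpen the resolution at the cost of time locality.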