Is there a fast way to find (not necessarily recognize) human speech in an audio file?

The technical term for what you are trying to do is Voice Activity Detection (VAD). There is a Python library called SPEAR that does it (among other things).


webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection (VAD) implementation. It does the best job of any VAD I've used at correctly classifying human speech, even with noisy audio.

To use it for your purpose, you would do something like this (a minimal sketch follows the list):

  1. Convert the file to either 8 kHz or 16 kHz, 16-bit, mono format. This is required by the WebRTC code.
  2. Create a VAD object: vad = webrtcvad.Vad()
  3. Split the audio into 30 millisecond chunks.
  4. Check each chunk to see if it contains speech: vad.is_speech(chunk, sample_rate)
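
For steps 2-4, a minimal sketch of that loop might look like the following. The file name speech.wav is hypothetical, and the aggressiveness mode passed to Vad() is an assumption you can tune:

import wave
import webrtcvad

sample_rate = 16000                         # 16 kHz, 16-bit, mono PCM, as required in step 1
frame_ms = 30                               # 30 millisecond chunks
frame_bytes = int(sample_rate * frame_ms / 1000) * 2   # 2 bytes per 16-bit sample

vad = webrtcvad.Vad(2)                      # aggressiveness 0-3; 2 is a middle-of-the-road choice

# "speech.wav" is a hypothetical file that has already been converted as in step 1
with wave.open("speech.wav", "rb") as wf:
    pcm_data = wf.readframes(wf.getnframes())

speech_flags = []                           # one True/False flag per 30 ms chunk
for offset in range(0, len(pcm_data) - frame_bytes + 1, frame_bytes):
    chunk = pcm_data[offset:offset + frame_bytes]
    speech_flags.append(vad.is_speech(chunk, sample_rate))

The per-chunk flags in speech_flags are what the smoothing described next operates on.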

The VAD output may be "noisy": if it classifies a single 30 millisecond chunk of audio as speech, you don't really want to report a time for that. Instead, look over the past 0.3 seconds (or so) of audio and check whether the majority of 30 millisecond chunks in that period are classified as speech. If they are, output the start time of that 0.3 second period as the beginning of speech. Then do something similar to detect when the speech ends: wait for a 0.3 second period in which the majority of 30 millisecond chunks are classified as non-speech, and when that happens, output the end time as the end of speech.

You may have to tweak the timing a little to get good results for your purposes. For example, you might require 0.2 seconds of audio in which more than 30% of chunks are classified as speech before you trigger, and 1.0 second of audio in which more than 50% of chunks are classified as non-speech before you de-trigger.

A ring buffer (collections.deque in Python) is a helpful data structure for keeping track of the last N chunks of audio and their classification.
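
Here is a rough sketch of that bookkeeping, assuming the per-chunk speech_flags list from the earlier example; the window length and majority ratio are tunable assumptions, not fixed values:

import collections

def speech_segments(speech_flags, frame_ms=30, window_ms=300, ratio=0.5):
    """Collapse per-chunk VAD flags into (start, end) speech times, in seconds."""
    window = collections.deque(maxlen=window_ms // frame_ms)  # ring buffer of recent flags
    segments = []
    triggered = False
    start = None

    for i, is_speech in enumerate(speech_flags):
        window.append(is_speech)
        voiced = sum(window)
        if not triggered and voiced > ratio * window.maxlen:
            # majority of the last window_ms is speech: mark the start of that window
            triggered = True
            start = (i - len(window) + 1) * frame_ms / 1000.0
        elif triggered and (len(window) - voiced) > ratio * window.maxlen:
            # majority of the last window_ms is non-speech: close the segment
            triggered = False
            segments.append((start, i * frame_ms / 1000.0))

    if triggered:  # still in speech when the audio ends
        segments.append((start, len(speech_flags) * frame_ms / 1000.0))
    return segments

The 0.2 second / 30% trigger and 1.0 second / 50% de-trigger variants mentioned above would simply use different window lengths and ratios for the two conditions.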


You could run a window across your audio file and estimate what fraction of the total signal power lies in the human vocal range (fundamental frequencies lie roughly between 50 and 300 Hz). The following is meant to give intuition and is untested on real audio.

import numpy as np

def hasHumanVoice(X, threshold, F_sample, Low_cutoff=50, High_cutoff=300):
    """ Check for the presence of vocal-band frequencies in a real signal using the FFT.
    Inputs
    ======
    X: 1-D numpy array, the real time-domain audio signal (single-channel time series)
    threshold: float, fraction of total power that must fall in the vocal band to count as voice
               (has to be calibrated once against known vocal recordings)
    F_sample: float, the sampling frequency of the signal (physical frequency in units of Hz)
    Low_cutoff: float, frequency components below this frequency are excluded (Hz)
    High_cutoff: float, frequency components above this frequency are excluded (Hz)
    """
    M = X.size  # let M be the length of the time series
    Low_cutoff, High_cutoff, F_sample = map(float, [Low_cutoff, High_cutoff, F_sample])

    Spectrum = np.fft.rfft(X, n=M)

    # Convert cutoff frequencies into bin indices of the one-sided spectrum:
    # bin k of np.fft.rfft corresponds to frequency k * F_sample / M
    Low_point = int(Low_cutoff / F_sample * M)
    High_point = int(High_cutoff / F_sample * M)

    # Power is the squared magnitude of the spectrum, not the raw complex values
    Power = np.abs(Spectrum) ** 2
    totalPower = np.sum(Power)
    fractionPowerInSignal = np.sum(Power[Low_point:High_point]) / totalPower  # fraction of power in the vocal band

    return 1 if fractionPowerInSignal > threshold else 0

voiceVector = []
windowLength = int(0.1 * samplingRate)  # e.g. 100 ms windows; choose a length appropriate for your audio
for start in range(0, len(fullAudio) - windowLength + 1, windowLength):  # run a window across the audio file
    window = fullAudio[start:start + windowLength]
    voiceVector.append(hasHumanVoice(window, threshold, samplingRate))