How can I do real-time voice activity detection in Python?
Try the Python bindings for Google's WebRTC VAD. It is lightweight and fast, and its GMM-based model gives very reasonable results. Because the decision is made per frame, latency is minimal.
    # Run the VAD on 10 ms of silence. The result should be False.
    import webrtcvad

    vad = webrtcvad.Vad(2)  # aggressiveness mode, 0 (least) to 3 (most aggressive)

    sample_rate = 16000
    frame_duration = 10  # ms
    # 16-bit mono PCM: two zero bytes per sample
    frame = b'\x00\x00' * int(sample_rate * frame_duration / 1000)
    print('Contains speech: %s' % (vad.is_speech(frame, sample_rate)))
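For a continuous stream, the per-frame decision usually needs two extra pieces: cutting the incoming bytes into fixed-size frames, and smoothing the raw decisions so a single misclassified frame does not flip the state. Below is a minimal sketch of both, assuming 16-bit mono PCM input. The helper names `split_frames` and `smooth_decisions` and the `window`/`ratio` parameters are made up for illustration and are not part of webrtcvad.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List

BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono PCM


def split_frames(pcm: bytes, sample_rate: int, frame_ms: int) -> Iterator[bytes]:
    """Yield consecutive fixed-size frames; a trailing partial frame is dropped."""
    frame_bytes = int(sample_rate * frame_ms / 1000) * BYTES_PER_SAMPLE
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


def smooth_decisions(frames: Iterable[bytes],
                     is_speech: Callable[[bytes], bool],
                     window: int = 5,
                     ratio: float = 0.6) -> List[bool]:
    """Report speech when at least `ratio` of the last `window` raw
    per-frame decisions were speech (hypothetical smoothing rule)."""
    recent = deque(maxlen=window)
    smoothed = []
    for frame in frames:
        recent.append(bool(is_speech(frame)))
        smoothed.append(sum(recent) >= ratio * len(recent))
    return smoothed
```

With webrtcvad, the detector could be passed in as `lambda f: vad.is_speech(f, sample_rate)`. webrtcvad expects frames of 10, 20 or 30 ms at 8000, 16000, 32000 or 48000 Hz, so `frame_ms` and `sample_rate` should be chosen from those values.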
Also, this article might be useful for you.
LibROSA could be one solution to your problem. There's a simple tutorial on Medium that uses microphone streaming to realise real-time prediction.
The author uses the Short-Time Fourier Transform (STFT) as the feature extractor and explains:
To calculate STFT, Fast Fourier transform window size(n_fft) is used as 512. According to the equation n_stft = n_fft/2 + 1, 257 frequency bins(n_stft) are calculated over a window size of 512. The window is moved by a hop length of 256 to have a better overlapping of the windows in calculating the STFT.
stft = np.abs(librosa.stft(trimmed, n_fft=512, hop_length=256, win_length=512))
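The arithmetic in the quoted passage can be checked with a few lines of plain Python. This is only a sketch: the function name `stft_shape` is hypothetical, and the frame count assumes the signal is centre-padded (the default behaviour of `librosa.stft`), so other padding settings would give a different number of frames.

```python
def stft_shape(n_samples: int, n_fft: int = 512, hop_length: int = 256):
    """Return (frequency bins, frames) for an STFT of a real signal.

    Assumes centre padding, so the frame count is 1 + n_samples // hop_length.
    """
    n_bins = n_fft // 2 + 1                 # n_stft = n_fft/2 + 1
    n_frames = 1 + n_samples // hop_length  # assumption: centred frames
    return n_bins, n_frames


def overlap_fraction(win_length: int = 512, hop_length: int = 256) -> float:
    """Fraction of each window shared with the next one."""
    return 1 - hop_length / win_length


print(stft_shape(22050))   # one second at 22050 Hz -> (257, 87)
print(overlap_fraction())  # 0.5, i.e. consecutive windows overlap by half
```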
    import librosa
    import matplotlib.pyplot as plt
    import noisereduce as nr
    import numpy as np
    import pyaudio

    # `model` (the trained classifier) and `lb` (the fitted label encoder) come
    # from the tutorial's training step, which is not shown here.


    # Plot audio with zoomed in y axis
    def plotAudio(output):
        fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 10))
        plt.plot(output, color='blue')
        ax.set_xlim((0, len(output)))
        ax.margins(2, -0.1)
        plt.show()


    # Plot audio
    def plotAudio2(output):
        fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 4))
        plt.plot(output, color='blue')
        ax.set_xlim((0, len(output)))
        plt.show()


    def minMaxNormalize(arr):
        mn = np.min(arr)
        mx = np.max(arr)
        return (arr - mn) / (mx - mn)


    def predictSound(X):
        # Trim leading/trailing silence; empirically select top_db for every sample
        clip, index = librosa.effects.trim(X, top_db=20, frame_length=512, hop_length=64)
        stfts = np.abs(librosa.stft(clip, n_fft=512, hop_length=256, win_length=512))
        stfts = np.mean(stfts, axis=1)  # average each frequency bin over time
        stfts = minMaxNormalize(stfts)
        result = model.predict(np.array([stfts]))
        predictions = [np.argmax(y) for y in result]
        print(lb.inverse_transform([predictions]))
        plotAudio2(clip)


    CHUNKSIZE = 22050  # fixed chunk size
    RATE = 22050

    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paFloat32, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNKSIZE)

    # Preprocessing: record a noise window from the surroundings
    data = stream.read(10000)
    noise_sample = np.frombuffer(data, dtype=np.float32)
    print("Noise Sample")
    plotAudio2(noise_sample)
    loud_threshold = np.mean(np.abs(noise_sample)) * 10
    print("Loud threshold", loud_threshold)
    audio_buffer = []
    near = 0

    try:
        while True:
            # Read chunk and load it into numpy array.
            data = stream.read(CHUNKSIZE)
            current_window = np.frombuffer(data, dtype=np.float32)

            # Reduce noise in real time (keyword names as in the tutorial;
            # newer noisereduce releases use a different signature)
            current_window = nr.reduce_noise(audio_clip=current_window,
                                             noise_clip=noise_sample, verbose=False)

            if len(audio_buffer) == 0:
                audio_buffer = current_window
            elif np.mean(np.abs(current_window)) < loud_threshold:
                print("Inside silence reign")
                if near < 10:
                    audio_buffer = np.concatenate((audio_buffer, current_window))
                    near += 1
                else:
                    predictSound(np.array(audio_buffer))
                    audio_buffer = []
                    near = 0  # reset the silence counter
            else:
                print("Inside loud reign")
                near = 0
                audio_buffer = np.concatenate((audio_buffer, current_window))
    except KeyboardInterrupt:
        pass  # stop with Ctrl+C so the cleanup below is reached
    finally:
        # close stream
        stream.stop_stream()
        stream.close()
        p.terminate()
Code credit to: Chathuranga Siriwardhana
Full code can be found here.
I think there are two approaches here:
- Threshold approach
- Small, deployable neural-net approach
The first one is feasible and quick to implement and test, while the second one is a bit more difficult to implement. I think you are already somewhat familiar with the second option.
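As a rough illustration of the threshold approach, here is a minimal energy-based sketch using only the standard library. It assumes 16-bit mono PCM frames and a short stretch of audio known to contain only background noise for calibration. The function names and the `factor` value are arbitrary choices for illustration, not taken from any particular toolkit.

```python
import math
from array import array
from typing import Iterable


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit native-endian PCM."""
    samples = array('h', frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def calibrate_threshold(noise_frames: Iterable[bytes], factor: float = 3.0) -> float:
    """Scale the average noise level by `factor` (an assumed, tunable value)."""
    levels = [rms(f) for f in noise_frames]
    return factor * (sum(levels) / len(levels))


def is_active(frame: bytes, threshold: float) -> bool:
    """Flag a frame as voice activity when its energy exceeds the threshold."""
    return rms(frame) > threshold
```

A fixed threshold like this reacts to any loud sound, not only speech, so it is best treated as a baseline to compare the other options against.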
For the second approach, you will need a speech dataset labelled as a sequence of binary classifications, like
00000000111111110000000011110000. The neural net should be small and optimized for running on edge devices such as mobile phones.
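To make the label format concrete, the sketch below converts such a per-frame label string into speech segments. It assumes that 1 marks a speech frame and that each label covers a fixed 10 ms frame. Both are assumptions for illustration, since the frame length depends on how the dataset was annotated.

```python
from typing import List, Tuple


def labels_to_segments(labels: str, frame_ms: int = 10) -> List[Tuple[int, int]]:
    """Turn a per-frame 0/1 label string into (start_ms, end_ms) speech segments."""
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, label in enumerate(labels):
        if label == '1' and start is None:
            start = i
        elif label == '0' and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:  # speech runs until the end of the sequence
        segments.append((start * frame_ms, len(labels) * frame_ms))
    return segments


print(labels_to_segments("00000000111111110000000011110000"))
# [(80, 160), (240, 280)]
```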
You can check this out from TensorFlow
This is a voice activity detector. I think it fits your purpose.
Also, check these out.
Of course, you should compare the performance of the toolkits and models mentioned above, as well as the feasibility of implementing them on mobile devices.