Return predictions wav2vec fairseq

After trying various things I was able to figure this out and trained a wav2vec model from scratch.

Some background: wav2vec uses self-supervised learning to learn vector representations for preprocessed sound frames, similar to the way word2vec learns word embeddings from a text corpus. In the case of wav2vec, it samples random parts of the sound file and learns to predict whether a given part lies in the near future of the current offset position. This is somewhat similar to the masked-word task used to train transformers such as BERT. The nice thing about such prediction tasks is that they are self-supervised: the algorithm can be trained on unlabeled data, since it uses the temporal structure of the data to produce labels and random sampling to produce contrasting negative examples. It is a binary classification task (is the proposed processed sound frame in the near future of the current offset or not?).

In training for this binary classification task, wav2vec learns vector representations of sound frames (one 512-dimensional vector for each 10 ms of sound). These vectors are useful features because they concentrate information relevant to predicting speech, and they can be used in place of spectrogram vectors as inputs to speech-to-text algorithms such as wav2letter or DeepSpeech.

This is an important point: wav2vec is not a full automatic speech recognition (ASR) system. It is a useful component because, by leveraging self-supervised learning on unlabeled data (audio files containing speech but without text transcriptions), it greatly reduces the need for labeled data (speech transcribed to text). Based on their article, it appears that using wav2vec in an ASR pipeline can reduce the amount of transcribed speech needed by a factor of roughly 10 to 100. Since un-transcribed speech files are much easier to obtain than transcribed speech, this is a huge advantage of using wav2vec as an initial module in an ASR system.
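To make that training objective concrete, here is a toy sketch of the contrastive binary classification described above. It is purely illustrative (random tensors stand in for the encoder and aggregator outputs, and only a single future step is scored); it is not fairseq's actual implementation:

import torch
import torch.nn.functional as F

T, D, K = 100, 512, 10                    # number of 10 ms frames, vector size, negatives per positive
z = torch.randn(T, D)                     # frame representations (encoder output, random stand-in)
c = torch.randn(T, D)                     # context vectors at each offset (aggregator output, random stand-in)

step = 5                                  # try to recognise the frame `step` offsets into the future
pos_score = (c[:-step] * z[step:]).sum(-1)             # score of the true future frame
neg_idx = torch.randint(0, T, (T - step, K))           # frames drawn at random as negatives
neg_score = (c[:-step, None, :] * z[neg_idx]).sum(-1)  # scores of the negative frames

# Binary cross-entropy: the true future frame should score high, random frames low.
loss = F.binary_cross_entropy_with_logits(pos_score, torch.ones_like(pos_score)) \
     + F.binary_cross_entropy_with_logits(neg_score, torch.zeros_like(neg_score))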

So wav2vec is trained with data which is not annotated (no text is used to train it).

The thing which confused me was the following command for training (here):

python train.py /manifest/path --save-dir /model/path ...(etc.).........

It turns out that since wav2vec is part of fairseq, the following fairseq command line tool should be used to train it:

fairseq-train

As the arguments to this command are pretty long, this can be done using a bash script such as:

#!/bin/bash
fairseq-train /home/user/4fairseq --save-dir /home/user/4fairseq --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
--arch wav2vec --task audio_pretraining --lr 1e-06 --min-lr 1e-09 --optimizer adam --max-lr 0.005 --lr-scheduler cosine \
--conv-feature-layers "[(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)]" \
--conv-aggregator-layers "[(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)]" \
--skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion binary_cross_entropy --num-negatives 10 \
--max-sample-size 150000 --max-tokens 1500000

Most of the arguments are those suggested here; only the first two (which are filesystem paths) must be modified for your system.

Since my audio voice files were in mp3 format, I converted them to wav files using the following bash script:

#!/bin/bash
# convert every file in the directory to wav, keeping the base filename
for file in /home/user/data/soundFiles/*
do
  echo "$file"
  echo "${file%.*}.wav"
  ffmpeg -i "$file" "${file%.*}.wav"
done
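One thing to watch for: ffmpeg keeps the source sample rate by default, while the conv-feature-layers above downsample by a factor of 160, which corresponds to one vector per 10 ms only for 16 kHz, single-channel audio. A quick check along these lines can help (this uses the soundfile package, which is my own choice, not something fairseq requires):

import glob
import soundfile as sf   # assumption: pip install soundfile

for path in glob.glob('/home/user/data/soundFiles/*.wav'):
    info = sf.info(path)
    if info.samplerate != 16000 or info.channels != 1:
        print(f"{path}: {info.samplerate} Hz, {info.channels} channel(s) - consider resampling")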

They suggest that the audio files be of short duration; longer files should be split into smaller files. The files I had were already pretty short, so I did not do any splitting.
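For completeness, if you do have long recordings, a rough way to split them is sketched below. It again assumes the soundfile package, and the 30-second chunk length is arbitrary:

import soundfile as sf

def split_wav(path, chunk_seconds=30):
    """Split one long wav into consecutive fixed-length chunks."""
    data, sr = sf.read(path)
    samples_per_chunk = chunk_seconds * sr
    for i in range(0, len(data), samples_per_chunk):
        sf.write(f"{path[:-4]}_{i // samples_per_chunk:04d}.wav", data[i:i + samples_per_chunk], sr)

# example call (file name is hypothetical)
# split_wav('/home/user/data/soundFiles/long_recording.wav')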

Before training, the script wav2vec_manifest.py must be used to create a training data manifest. It creates two files (train.tsv and valid.tsv) which list the audio files to be used for training and for validation, respectively. The directory containing these two files is the first argument to the fairseq-train command.
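The script lives under examples/wav2vec in the fairseq repository, and the invocation looks something along these lines (check the script's --help, since flag names can differ between fairseq versions; --valid-percent controls what fraction of the files goes into valid.tsv, and --dest should match the first argument passed to fairseq-train above):

python examples/wav2vec/wav2vec_manifest.py /home/user/data/soundFiles --dest /home/user/4fairseq --ext wav --valid-percent 0.05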

The second argument to fairseq-train is the path at which to save the model. After training there will be these two model files:

checkpoint_best.pt
checkpoint_last.pt

These are updated at the end of each epoch, so I was able to terminate the training process early and still have those saved model files.
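To actually get representations out of a saved checkpoint (which is what the question title asks about), the wav2vec example in the fairseq repository loads the model roughly as follows; details may differ slightly between fairseq versions, and the random tensor is just a stand-in for a real 16 kHz waveform:

import torch
from fairseq.models.wav2vec import Wav2VecModel

cp = torch.load('/home/user/4fairseq/checkpoint_best.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

# stand-in for a batch of raw 16 kHz audio samples (shape: batch x samples)
wav_input_16khz = torch.randn(1, 10000)
z = model.feature_extractor(wav_input_16khz)  # local frame representations
c = model.feature_aggregator(z)               # context vectors, one 512-dim vector per 10 ms frame

The vectors in c are the ones you would feed into a downstream acoustic model such as wav2letter.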