Speaker segmentation
- Direction of arrival estimation https://github.com/morriswmz/doatools.py
- Aidan Hogg 2019 ICASSP https://github.com/ahogg/hogg2019-icassp-paper
Etc. (these are for advanced use cases, particularly diarisation with a fixed cast)
I'm more interested in just segmenting a file into component files, to avoid having two voices in a single file (for preprocessing in an STT pipeline).
First set up a conda environment for pyAudioAnalysis (pinning the version numbers of the main requirements, which pip would otherwise build itself; conda handles these better):
```sh
conda create -n speakerseg python numpy==1.18.1 matplotlib==3.1.2 scipy==1.4.1 tqdm==4.46.0 plotly==4.1.1
conda activate speakerseg
git clone [email protected]:tyiannak/pyAudioAnalysis.git
pip install -r pyAudioAnalysis/requirements.txt
pip install pyAudioAnalysis/
```
Then use the audio segmentation module:
```python
from pyAudioAnalysis.audioSegmentation import speaker_diarization

audio_filename = "/home/louis/Music/sample_audio/r4-today-feb-13_two-speakers.wav"
speaker_vec = speaker_diarization(audio_filename, 2, lda_dim=0)
```
This works well provided you know the correct number of speakers.
Otherwise it will insert spurious speaker changes, and for a short clip this can butcher the segmentation.
On the other hand, it changes the task from labelling speakers to simply counting participants. That is trivial in some situations (e.g. a meeting of N participants), and rarely difficult.
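For the file-splitting goal above, the returned `speaker_vec` (one cluster label per mid-term window) needs collapsing into contiguous time segments. Here's a minimal sketch, assuming a mid-term step of 0.1 s (an assumption: check the `mid_step` parameter of `speaker_diarization` in your installed version); `labels_to_segments` is my own helper name, not part of pyAudioAnalysis:

```python
import numpy as np

def labels_to_segments(labels, step=0.1):
    """Collapse a per-window label vector into (start_s, end_s, speaker)
    tuples. `step` is the mid-term window step in seconds (0.1 s is an
    assumed default -- check your pyAudioAnalysis version)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return []
    segments = []
    seg_start = 0
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            segments.append((seg_start * step, i * step, int(labels[i - 1])))
            seg_start = i
    segments.append((seg_start * step, len(labels) * step, int(labels[-1])))
    return segments

for start, end, speaker in labels_to_segments(speaker_vec):
    print(f"{start:7.2f}s -> {end:7.2f}s  speaker {speaker}")
```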
There's also inaSpeechSegmenter:
```sh
conda create -n inaseg
conda activate inaseg
conda install python tensorflow-gpu
pip install inaSpeechSegmenter
```
Then use the audio segmentation command-line script:
```sh
ina_speech_segmenter.py -i "/home/louis/Music/sample_audio/r4-today-feb-13_two-speakers.wav" -o "/home/louis/Music/sample_audio/seg/" -g "false"
```
It gives very nice results, though it will:
- sometimes switch the gender label on the same speaker (very modern, but perhaps inaccurate)
- very occasionally declare audio with an odd profile to be `noEnergy`, i.e. blank
  - such audio can be detected by its particularly high peak amplitude (and probably by other metrics)
- To segment well, different speakers should be separated completely, but it may suffice to just split on the `noEnergy` spaces
- Since the transcription model will likely be more accurate than the segmentation model, I personally prefer to segment on the `noEnergy` gaps but retain the entire audio (see the sketch after this list)
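Here's a minimal sketch of that `noEnergy`-gap splitting, assuming the segmenter writes a tab-separated CSV of (labels, start, stop) rows, and using pydub (my choice, not part of inaSpeechSegmenter; it needs ffmpeg) to cut the original file while keeping all of the audio:

```python
import csv
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

def split_on_noenergy(wav_path, seg_csv_path, out_dir):
    """Cut wav_path at the midpoint of each noEnergy gap reported by
    inaSpeechSegmenter, so the entire audio is retained across chunks."""
    audio = AudioSegment.from_wav(wav_path)
    gaps = []  # (start_s, stop_s) of each noEnergy row
    with open(seg_csv_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            # Skip the header line and anything malformed
            if len(row) != 3 or row[0] == "labels":
                continue
            label, start, stop = row
            if label == "noEnergy":
                gaps.append((float(start), float(stop)))
    # Split in the middle of each gap so no speech is clipped
    cuts_ms = [int(1000 * (start + stop) / 2) for start, stop in gaps]
    bounds = [0] + cuts_ms + [len(audio)]  # pydub lengths are in ms
    for i, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        audio[lo:hi].export(f"{out_dir}/chunk_{i:03d}.wav", format="wav")

# Hypothetical paths matching the command above (assuming the CSV name mirrors the input)
split_on_noenergy(
    "/home/louis/Music/sample_audio/r4-today-feb-13_two-speakers.wav",
    "/home/louis/Music/sample_audio/seg/r4-today-feb-13_two-speakers.csv",
    "/home/louis/Music/sample_audio/seg",
)
```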
You could probably do interesting things by re-processing both the audio in its entirety and the minimal-length segments...