Speaker segmentation
- Direction of arrival estimation https://github.com/morriswmz/doatools.py
- Aidan Hogg 2019 ICASSP https://github.com/ahogg/hogg2019-icassp-paper
Etc. (these are for advanced use cases, particularly diarisation with a fixed cast)
I'm more interested in just segmenting a file into component files, to avoid having two voices in a single file (for preprocessing in an STT pipeline).
First set up a conda environment for pyAudioAnalysis (pinning the version numbers of the main requirements, which pip would otherwise build itself; conda handles these better):
```sh
conda create -n speakerseg python numpy==1.18.1 matplotlib==3.1.2 scipy==1.4.1 tqdm==4.46.0 plotly==4.1.1
conda activate speakerseg
git clone [email protected]:tyiannak/pyAudioAnalysis.git
pip install -r pyAudioAnalysis/requirements.txt
pip install pyAudioAnalysis/
```
Then use the audio segmentation module:
```python
from pyAudioAnalysis.audioSegmentation import speaker_diarization

audio_filename = "/home/louis/Music/sample_audio/r4-today-feb-13_two-speakers.wav"
speaker_vec = speaker_diarization(audio_filename, 2, lda_dim=0)
```
This works well provided you know the correct number of speakers.
Otherwise it will insert spurious speaker changes, and for a short clip this can butcher the segmentation.
On the other hand, it changes the task from labelling speakers to simply counting participants. That is trivial in some situations (e.g. a meeting of N participants), and rarely difficult.
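For the file-splitting goal above, the returned `speaker_vec` (one cluster label per mid-term window) needs collapsing into contiguous time segments. Here's a minimal sketch, assuming a mid-term step of 0.1 s (an assumption: check the `mid_step` parameter of `speaker_diarization` in your installed version); `labels_to_segments` is my own helper name, not part of pyAudioAnalysis:

```python
import numpy as np

def labels_to_segments(labels, step=0.1):
    """Collapse a per-window label vector into (start_s, end_s, speaker)
    tuples. `step` is the mid-term window step in seconds (0.1 s is an
    assumed default -- check your pyAudioAnalysis version)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return []
    segments = []
    seg_start = 0
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            segments.append((seg_start * step, i * step, int(labels[i - 1])))
            seg_start = i
    segments.append((seg_start * step, len(labels) * step, int(labels[-1])))
    return segments

for start, end, speaker in labels_to_segments(speaker_vec):
    print(f"{start:7.2f}s -> {end:7.2f}s  speaker {speaker}")
```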
There's also inaSpeechSegmenter:
```sh
conda create -n inaseg
conda activate inaseg
conda install python tensorflow-gpu
pip install inaSpeechSegmenter
```
Then use the audio segmentation command-line script:
```sh
ina_speech_segmenter.py -i "/home/louis/Music/sample_audio/r4-today-feb-13_two-speakers.wav" -o "/home/louis/Music/sample_audio/seg/" -g "false"
```
It gives very nice results, though it will:
- sometimes switch the gender label on the same speaker (very modern, but perhaps inaccurate)
- very occasionally declare audio with an odd profile to be `noEnergy`, i.e. blank
  - such audio can be detected by its particularly high peak amplitude (and probably by other metrics)
- To segment well, different speakers should be separated completely, but it may suffice to just split on the `noEnergy` spaces
- Since the transcription model will likely be more accurate than the segmentation model, I personally prefer to segment on the `noEnergy` gaps but retain the entire audio (see the sketch after this list)
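Here's a minimal sketch of that `noEnergy`-gap splitting, assuming the segmenter writes a tab-separated CSV of (labels, start, stop) rows, and using pydub (my choice, not part of inaSpeechSegmenter; it needs ffmpeg) to cut the original file while keeping all of the audio:

```python
import csv
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

def split_on_noenergy(wav_path, seg_csv_path, out_dir):
    """Cut wav_path at the midpoint of each noEnergy gap reported by
    inaSpeechSegmenter, so the entire audio is retained across chunks."""
    audio = AudioSegment.from_wav(wav_path)
    gaps = []  # (start_s, stop_s) of each noEnergy row
    with open(seg_csv_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            # Skip the header line and anything malformed
            if len(row) != 3 or row[0] == "labels":
                continue
            label, start, stop = row
            if label == "noEnergy":
                gaps.append((float(start), float(stop)))
    # Split in the middle of each gap so no speech is clipped
    cuts_ms = [int(1000 * (start + stop) / 2) for start, stop in gaps]
    bounds = [0] + cuts_ms + [len(audio)]  # pydub lengths are in ms
    for i, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        audio[lo:hi].export(f"{out_dir}/chunk_{i:03d}.wav", format="wav")

# Hypothetical paths matching the command above (assuming the CSV name mirrors the input)
split_on_noenergy(
    "/home/louis/Music/sample_audio/r4-today-feb-13_two-speakers.wav",
    "/home/louis/Music/sample_audio/seg/r4-today-feb-13_two-speakers.csv",
    "/home/louis/Music/sample_audio/seg",
)
```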
You could probably do interesting things by re-processing both the audio in its entirety and the minimal-length segments...