Audio Segmentation
- Inputs
- Output Formats
- MGMs in AMP
- Notes on Use
- Use Cases and Example Workflows
- Evaluating Audio Segmentation MGMs
- AMP JSON Output
Audio segmentation is the classification of audio stream samples into types of sound, such as speech or music, which, when joined together, indicate distinct regions of homogeneous sound.
Inputs
Audio file (or audio extracted from a video file with the Extract Audio MGM)
Output Formats
- amp_segments: All segments generated from segmentation with labels and timecodes in AMP JSON format (see details below).
MGMs in AMP
INA Speech Segmenter
INA Speech Segmenter is an open-source audio segmentation tool created by L'Institut National de l'Audiovisuel that separates audio recordings into speech, silence ("no energy"), music, and noise. Silence ("no energy") is the absence of any sound (e.g., a tape that is not recording audio). Noise is any ambient sound, even minimal sound barely detectable by the human ear. Results from the AMP implementation are output as time ranges with labels denoting the type of segment. (Noise and silence segments are only counted if they are at least 10 seconds long.)
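AMP runs this tool as an MGM; purely for illustration, here is a minimal sketch of calling the upstream inaSpeechSegmenter Python package directly and applying the 10-second minimum described above. The filename is a placeholder, and AMP's wrapper may post-process labels differently (for instance, merging short noise/silence segments into their neighbors rather than dropping them).

```python
from inaSpeechSegmenter import Segmenter

MIN_NOISE_SILENCE_SECS = 10.0  # AMP counts noise/silence segments only at >= 10s

segmenter = Segmenter()  # upstream labels: male, female, music, noise, noEnergy
for label, start, end in segmenter("recording.wav"):  # hypothetical input file
    # 'noEnergy' is the upstream name for silence; 'male'/'female' denote speech
    if label in ("noise", "noEnergy") and (end - start) < MIN_NOISE_SILENCE_SECS:
        continue  # skip noise/silence segments shorter than the 10-second minimum
    print(f"{label}: {start:.2f}s - {end:.2f}s")
```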
Notes on Use
- Use the Extract Audio MGM as the first step in a workflow, regardless of whether the input is a video file or an audio file. It normalizes the audio portion of the A/V input (for instance, converting the audio to the format expected by the AMP tools).
- Each of the tools has at least one of its outputs checked by default. Checked outputs will display the workflow step and output in the dashboard when "Show Relevant Results Only" is turned on. See Tips for Creating a Workflow for an explanation of what each output option means.
Use Cases and Example Workflows
Use case 1: Speech-to-text pipeline
A collection manager wants to run speech-to-text over a collection of recordings that contain music and speech. She first sends each recording through audio segmentation to identify the segments containing speech so that AMP can separate those segments out for the speech-to-text tool, saving processing time.
Notes:
- Even though the collection manager is sending audio files through the pipeline, she adds the Extract Audio step to normalize the audio for the INA Speech Segmenter.
- AMP Diarization is unchecked in the AWS Transcribe step because the collection manager is not interested in distinguishing between speakers in the transcript.
- The Keep Speech Segments step sends only the audio for speech through the AWS Transcribe speech-to-text MGM.
- The Adjust Transcript Timestamps step takes the timestamps from the speech segments to realign the transcript text with the original audio (a sketch of this realignment follows these notes).
- The final result is the transcript in AMP JSON format (see Speech-to-text for details).
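Purely as an illustration of the realignment idea, the following sketch maps word times measured against concatenated speech-only audio back to times in the original recording. The data shapes are invented for the example and are not AMP's internal format.

```python
def realign(words, speech_segments):
    """words: [(text, time_in_concatenated_audio)];
    speech_segments: [(start, end)] in the original audio.
    Assumes every word time falls within the kept segments."""
    adjusted = []
    offset = 0.0  # seconds of concatenated audio consumed by earlier segments
    seg_iter = iter(speech_segments)
    seg_start, seg_end = next(seg_iter)
    for text, t in sorted(words, key=lambda w: w[1]):
        # advance to the speech segment containing concatenated time t
        while t - offset > (seg_end - seg_start):
            offset += seg_end - seg_start
            seg_start, seg_end = next(seg_iter)
        adjusted.append((text, seg_start + (t - offset)))
    return adjusted

# Example: speech segments at 0-10s and 60-70s of the original audio; a word
# at 12s of concatenated audio falls 2s into the second segment, i.e., 62s.
print(realign([("hello", 1.5), ("world", 12.0)], [(0.0, 10.0), (60.0, 70.0)]))
# -> [('hello', 1.5), ('world', 62.0)]
```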
Use case 2: Archival processing/digital access files
A collection manager wants to find how much dead air or unrecorded time is on a collection of recordings, so they can document the duration of actual content (or trim the audio files for access) for each recording. They send each file through audio segmentation to identify segments of silence.
Notes:
- Even though the collection manager is sending audio files through the pipeline, they add the Extract Audio step to normalize the audio for the INA Speech Segmenter.
- The final result is the list of segments with their labels and timestamps in AMP JSON format (see details below); the sketch after these notes shows how such output could be tallied.
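As an illustration of this use case, the sketch below (not an AMP feature) tallies dead air from the AMP JSON output documented later on this page; the filename is a placeholder.

```python
import json

# Load an AMP segmentation result (see the AMP JSON Output section below).
with open("amp_segments.json") as f:  # hypothetical output filename
    result = json.load(f)

# Values are carried as strings in this format, so convert before doing math.
total = float(result["media"]["duration"])
silence = sum(
    float(seg["end"]) - float(seg["start"])
    for seg in result["segments"]
    if seg["label"] == "silence"
)
print(f"silence: {silence:.1f}s of {total:.1f}s; content: {total - silence:.1f}s")
```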
Evaluating Audio Segmentation MGMs
There are two tests for evaluating the INA Speech Segmenter: INA Speech Segmenter Precision/Recall (by segments) and INA Speech Segmenter Precision/Recall (by seconds).
INA Speech Segmenter Precision/Recall (by segments)
This test takes as input a ground truth of timestamp/label annotations for an audio recording using the INA Speech Segmenter labels (speech, music, noise, silence). Ranges of silence or noise should be recorded only if they are longer than 10 seconds; otherwise, they should be included with the previous segment. Notes may be included in a separate column in the ground truth to assist with qualitative review; these are not incorporated into the scoring. Accuracy (for all segments and by segment type), precision, recall, and F1 are calculated by segment.
Parameters
Analysis threshold: the buffer, in seconds (float), for counting a true positive (a match between the ground truth and the MGM output). For example, with a 2-second threshold, a GT segment and an MGM segment are considered a match if both their start times and their end times fall within 2 seconds of each other.
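How a true positive is counted under this threshold can be made concrete with a small sketch. This is a simplified illustration, not AMP's scoring code: it assumes a true positive requires the same label with both boundaries inside the threshold, and it matches each MGM segment at most once.

```python
def score_segments(gt, mgm, threshold=2.0):
    """gt, mgm: lists of (start, end, label) tuples with times in seconds."""
    matched = set()  # indices of MGM segments already claimed by a GT segment
    tp = 0
    for g_start, g_end, g_label in gt:
        for i, (m_start, m_end, m_label) in enumerate(mgm):
            if (i not in matched and g_label == m_label
                    and abs(g_start - m_start) <= threshold
                    and abs(g_end - m_end) <= threshold):
                matched.add(i)
                tp += 1
                break
    fn = len(gt) - tp    # GT segments with no matching MGM segment
    fp = len(mgm) - tp   # MGM segments with no matching GT segment
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```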
Scores generated
- Total ground truth (GT) segments
- GT silence segments
- GT speech segments
- GT music segments
- GT noise segments
- Total MGM segments
- MGM silence segments
- MGM speech segments
- MGM music segments
- MGM noise segments
- Total true positives
- False negatives
- False positives
- Precision
- Recall
- F1
- Silence true positives
- Speech true positives
- Music true positives
- Noise true positives
- Total accuracy
- Silence accuracy
- Speech accuracy
- Music accuracy
- Noise accuracy
Output comparison
This test outputs a table with the ground truth start and end time codes and label for each segment alongside the time codes and labels for the MGM output. Time codes for true positives are listed on the same row, while time codes for false positives and false negatives are listed on separate rows. Reviewing this comparison can help you see where in the audio the MGM was incorrect and decide how important these errors are to your use case.
Example:
gt_start | gt_end | start | end | label | comparison |
---|---|---|---|---|---|
0:00:00 | 0:00:29 | 0:00:00 | 0:00:29 | noise | true positive |
0:00:29 | 0:01:05 | 0:00:29 | 0:01:05 | speech | true positive |
0:01:05 | 0:01:30 | 0:01:05 | 0:01:30 | music | true positive |
0:01:30 | 0:15:42 | | | speech | false negative |
| | 0:01:30 | 0:13:11 | speech | false positive |
| | 0:13:11 | 0:15:33 | music | false positive |
| | 0:15:33 | 0:15:42 | speech | false positive |
Creating Ground Truth
Create a CSV with a minimum of three columns: start, end, and label. Label each segment as silence, speech, music, or noise. Values for start and end should be recorded as hh:mm:ss or in seconds (with decimals). For best results, each segment should start where the previous one ends (e.g., if a segment of speech ends at 00:45:12, the next segment should start at 00:45:12).
Example:
start | end | label |
---|---|---|
00:00:00 | 00:01:19 | silence |
00:01:19 | 00:01:38 | speech |
00:01:38 | 00:02:33 | music |
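If you are producing ground truth programmatically, a minimal sketch with Python's csv module might look like this; the rows repeat the example above, and the filename is a placeholder.

```python
import csv

# Ground truth rows: start, end, label (hh:mm:ss times, as in the example).
rows = [
    ("00:00:00", "00:01:19", "silence"),
    ("00:01:19", "00:01:38", "speech"),
    ("00:01:38", "00:02:33", "music"),
]
with open("groundtruth.csv", "w", newline="") as f:  # hypothetical filename
    writer = csv.writer(f)
    writer.writerow(["start", "end", "label"])  # required column headers
    writer.writerows(rows)
```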
INA Speech Segmenter Precision/Recall (by seconds)
Similar to the precision/recall test by segments, but accuracy, precision, recall, and F1 are calculated by seconds (i.e., comparing the classification of each second rather than comparing segments of contiguous classifications). This test takes as input a ground truth of timestamp/label annotations for an audio recording using the INA Speech Segmenter labels (speech, music, noise, silence). Ranges of silence or noise should be recorded only if they are longer than 10 seconds; otherwise, they should be included with the previous segment. Notes may be included in a separate column in the ground truth to assist with qualitative review; these are not incorporated into the scoring.
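One way to picture per-second scoring: rasterize each segment list onto a one-label-per-second timeline, then compare second by second. The sketch below is a rough illustration under that assumption; AMP's implementation may handle boundaries and rounding differently.

```python
def to_timeline(segments, duration):
    """segments: [(start, end, label)] in seconds -> one label per whole second."""
    timeline = [None] * int(duration)
    for start, end, label in segments:
        for sec in range(int(start), min(int(end), int(duration))):
            timeline[sec] = label
    return timeline

def per_second_accuracy(gt, mgm, duration):
    # Assumes the ground truth covers the full duration of the recording.
    gt_tl, mgm_tl = to_timeline(gt, duration), to_timeline(mgm, duration)
    correct = sum(1 for g, m in zip(gt_tl, mgm_tl) if g == m)
    return correct / len(gt_tl)
```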
Scores generated
- Total ground truth (GT) seconds
- GT silence seconds
- GT speech seconds
- GT music seconds
- GT noise seconds
- Total MGM seconds
- MGM silence seconds
- MGM speech seconds
- MGM music seconds
- MGM noise seconds
- Total true positives
- False negatives
- False positives
- Silence true positives
- Speech true positives
- Music true positives
- Noise true positives
- Precision
- Recall
- F1
- Total accuracy
- Silence accuracy
- Speech accuracy
Output comparison
Similar to the Precision/Recall (by segments) output, but by second instead of segment.
Creating Ground Truth
Use the same method as for the Precision/Recall (by segments) test, or use the same ground truth file.
Sample Evaluation Use Cases
Use case 1 -- Speech-to-text pipeline
A collection manager wants to run speech-to-text over a collection of recordings that contain music and speech. She first sends each recording through audio segmentation to identify the segments containing speech so that AMP can separate those segments out for the speech-to-text tool, saving processing time.
Success measures
INA Speech Segmenter correctly classifies all speech segments as speech. False positives (audio classified as speech that is not speech) are kept to a minimum, but their presence will not negatively affect the output of the speech-to-text.
Key metrics are:
- Speech accuracy
- Recall
Qualitative measures:
- Review false negatives to see if they are actually speech or if they are consistently unintelligible speech that would not be picked up by STT.
- Review false positives to see if the duration is longer than desired to send through STT or if there would be any negative impact by sending through STT.
Use case 2 -- Archival processing/digital access files
A collection manager wants to find how much dead air or unrecorded time is on a collection of recordings, so they can document the duration of actual content (or trim the audio files for access) for each recording. They send each file through audio segmentation to identify segments of silence.
Success measures
INA Speech Segmenter correctly classifies all segments of silence. False positives (segments classified as silence that are not silence) are kept to the absolute minimum, as their presence may result in calculating false durations or trimming parts of the recording that should not be cut. False negatives (silences not identified) should be kept low so as not to miscalculate durations.
Key metrics are:
- Silence accuracy
- Precision
- Recall
Qualitative measures:
- Review false positives to see if they are consistently close enough to silence (i.e., "noise" that is very low ambient noise, like a recording picking up no audio) to be treated as silence for this use case.
- Review the classifications of false negatives. If they are classified as noise (representing very low ambient noise), consider experimenting with pre-processing (e.g., downsampling or filtering) to get the desired results.
AMP JSON Output
Summary: An array of segments, each with a label, start, and end. Start and end are timestamps in seconds. The label may be one of "speech", "music", or "silence". If the label is "speech", a gender may be specified as either "male" or "female".
Element | Datatype | Obligation | Definition |
---|---|---|---|
media | object | required | Wrapper for metadata about the source media file. |
media.filename | string | required | Filename of the source file. |
media.duration | string | required | The duration of the source file audio. |
numSpeakers | integer | optional | Number of speakers (if used for diarization). |
segments | array | required | Wrapper for segments of silence, speech, or music. |
segments[*] | object | optional | A segment of silence, speech, or music. |
segments[*].label | string | required | The type of segment: silence, speech, or music. |
segments[*].start | string | required | Start time in seconds. |
segments[*].end | string | required | End time in seconds. |
segments[*].gender | string | optional | The classified gender of the speaker. |
segments[*].speakerLabel | string | optional | Speaker label from speaker diarization. |
Sample output
```json
{
    "media": {
        "filename": "mysong.wav",
        "duration": "124.3"
    },
    "segments": [
        {
            "label": "speech",
            "start": "0.0",
            "end": "12.35",
            "gender": "male",
            "speakerLabel": "speaker1"
        },
        {
            "label": "music",
            "start": "10",
            "end": "20"
        }
    ]
}
```
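As a usage note, here is a short sketch of walking this structure; the filename is a placeholder, and note that numeric values are carried as strings in this format.

```python
import json

with open("segmentation.json") as f:  # hypothetical output filename
    doc = json.load(f)

print(f'{doc["media"]["filename"]}: {doc["media"]["duration"]}s')
for seg in doc.get("segments", []):
    line = f'{float(seg["start"]):8.2f} - {float(seg["end"]):8.2f}  {seg["label"]}'
    if "gender" in seg:        # optional, only for speech segments
        line += f' ({seg["gender"]})'
    if "speakerLabel" in seg:  # optional, from speaker diarization
        line += f' [{seg["speakerLabel"]}]'
    print(line)
```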