Audio Segmentation
- Inputs
- Output Formats
- MGMs in AMP
- Notes on Use
- Use Cases and Example Workflows
- Evaluating Audio Segmentation MGMs
- AMP JSON Output
Audio segmentation is the classification of audio stream samples into types of sound, such as speech or music, which, when joined together, indicate distinct regions of homogeneous sound.
Inputs
Audio file (or audio extracted from a video file with the Extract Audio MGM)
Output Formats
- amp_segments: All segments generated from segmentation with labels and timecodes in AMP JSON format (see details below).
MGMs in AMP
INA Speech Segmenter
INA Speech Segmenter is an open-source audio segmentation tool created by L'Institut National de l'Audiovisuel that separates audio recordings into speech, silence ("no energy"), music, and noise. Silence ("no energy") is the absence of any sound (e.g., a tape that is not recording audio). Noise is any ambient sound, even minimal sound barely detectable by the human ear. Results from the AMP implementation are output as time ranges with labels denoting the type of segment. (Noise and silence segments are only counted if they are at least 10 seconds long.)
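AMP runs this tool as an MGM; purely for illustration, here is a minimal sketch of calling the upstream inaSpeechSegmenter Python package directly and applying the 10-second minimum described above. The filename is a placeholder, and AMP's wrapper may post-process labels differently (for instance, merging short noise/silence segments into their neighbors rather than dropping them).

```python
from inaSpeechSegmenter import Segmenter

MIN_NOISE_SILENCE_SECS = 10.0  # AMP counts noise/silence segments only at >= 10s

segmenter = Segmenter()  # upstream labels: male, female, music, noise, noEnergy
for label, start, end in segmenter("recording.wav"):  # hypothetical input file
    # 'noEnergy' is the upstream name for silence; 'male'/'female' denote speech
    if label in ("noise", "noEnergy") and (end - start) < MIN_NOISE_SILENCE_SECS:
        continue  # skip noise/silence segments shorter than the 10-second minimum
    print(f"{label}: {start:.2f}s - {end:.2f}s")
```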
Notes on Use
- Use the Extract Audio MGM as the first step in a workflow, regardless of whether the input is a video file or an audio file. It normalizes the audio portion of the A/V input (for instance, converting the audio to the format expected by the AMP tools).
- Each of the tools has at least one of its outputs checked by default. Checked outputs will display the workflow step and output in the dashboard when "Show Relevant Results Only" is turned on. See Tips for Creating a Workflow for an explanation of what each output option means.
Use Cases and Example Workflows
Use case 1: Speech-to-text pipeline
A collection manager wants to run speech-to-text over a collection of recordings that contain music and speech. She first sends each recording through audio segmentation to identify the segments containing speech so that AMP can separate those segments out for the speech-to-text tool, saving processing time.
Notes:
- Even though the collection manager is sending audio files through the pipeline, she adds the Extract Audio step to normalize the audio for the INA Speech Segmenter.
- AMP Diarization is unchecked in the AWS Transcribe step because the collection manager is not interested in distinguishing between speakers in the transcript.
- The Keep Speech Segments step sends only the audio for speech through the AWS Transcribe speech-to-text MGM.
- The Adjust Transcript Timestamps step takes the timestamps from the speech segments to realign the transcript text with the original audio (a sketch of this realignment follows these notes).
- The final result is the transcript in AMP JSON format (see Speech-to-text for details).
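Purely as an illustration of the realignment idea, the following sketch maps word times measured against concatenated speech-only audio back to times in the original recording. The data shapes are invented for the example and are not AMP's internal format.

```python
def realign(words, speech_segments):
    """words: [(text, time_in_concatenated_audio)];
    speech_segments: [(start, end)] in the original audio.
    Assumes every word time falls within the kept segments."""
    adjusted = []
    offset = 0.0  # seconds of concatenated audio consumed by earlier segments
    seg_iter = iter(speech_segments)
    seg_start, seg_end = next(seg_iter)
    for text, t in sorted(words, key=lambda w: w[1]):
        # advance to the speech segment containing concatenated time t
        while t - offset > (seg_end - seg_start):
            offset += seg_end - seg_start
            seg_start, seg_end = next(seg_iter)
        adjusted.append((text, seg_start + (t - offset)))
    return adjusted

# Example: speech segments at 0-10s and 60-70s of the original audio; a word
# at 12s of concatenated audio falls 2s into the second segment, i.e., 62s.
print(realign([("hello", 1.5), ("world", 12.0)], [(0.0, 10.0), (60.0, 70.0)]))
# -> [('hello', 1.5), ('world', 62.0)]
```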
Use case 2: Archival processing/digital access files
A collection manager wants to find how much dead air or unrecorded time is on a collection of recordings, so they can document the duration of actual content (or trim the audio files for access) for each recording. They send each file through audio segmentation to identify segments of silence.
Notes:
- Even though the collection manager is sending audio files through the pipeline, they add the Extract Audio step to normalize the audio for the INA Speech Segmenter.
- The final result is the list of segments with their labels and timestamps in AMP JSON format (see details below); the sketch after these notes shows how such output could be tallied.
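As an illustration of this use case, the sketch below (not an AMP feature) tallies dead air from the AMP JSON output documented later on this page; the filename is a placeholder.

```python
import json

# Load an AMP segmentation result (see the AMP JSON Output section below).
with open("amp_segments.json") as f:  # hypothetical output filename
    result = json.load(f)

# Values are carried as strings in this format, so convert before doing math.
total = float(result["media"]["duration"])
silence = sum(
    float(seg["end"]) - float(seg["start"])
    for seg in result["segments"]
    if seg["label"] == "silence"
)
print(f"silence: {silence:.1f}s of {total:.1f}s; content: {total - silence:.1f}s")
```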
Evaluating Audio Segmentation MGMs
There are two tests for evaluating the INA Speech Segmenter: INA Speech Segmenter Precision/Recall (by segments) and INA Speech Segmenter Precision/Recall (by seconds).
INA Speech Segmenter Precision/Recall (by segments)
This test takes as input a ground truth of timestamp/label annotations for an audio recording using the INA Speech Segmenter labels (speech, music, noise, silence). Ranges of silence or noise should be recorded only if they are longer than 10 seconds; otherwise, they should be included with the previous segment. Notes may be included in a separate column in the ground truth to assist with qualitative review; these are not incorporated into the scoring. Accuracy (for all segments and by segment type), precision, recall, and F1 are calculated by segment.
Parameters
Analysis threshold: the buffer, in seconds (float), for counting a true positive (a match between the ground truth and the MGM output). For example, with a 2-second threshold, a GT segment and an MGM segment are considered a match if both their start times and their end times fall within 2 seconds of each other.
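How a true positive is counted under this threshold can be made concrete with a small sketch. This is a simplified illustration, not AMP's scoring code: it assumes a true positive requires the same label with both boundaries inside the threshold, and it matches each MGM segment at most once.

```python
def score_segments(gt, mgm, threshold=2.0):
    """gt, mgm: lists of (start, end, label) tuples with times in seconds."""
    matched = set()  # indices of MGM segments already claimed by a GT segment
    tp = 0
    for g_start, g_end, g_label in gt:
        for i, (m_start, m_end, m_label) in enumerate(mgm):
            if (i not in matched and g_label == m_label
                    and abs(g_start - m_start) <= threshold
                    and abs(g_end - m_end) <= threshold):
                matched.add(i)
                tp += 1
                break
    fn = len(gt) - tp    # GT segments with no matching MGM segment
    fp = len(mgm) - tp   # MGM segments with no matching GT segment
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```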
Scores generated
- Total ground truth (GT) segments
- GT silence segments
- GT speech segments
- GT music segments
- GT noise segments
- Total MGM segments
- MGM silence segments
- MGM speech segments
- MGM music segments
- MGM noise segments
- Total true positives
- False negatives
- False positives
- Precision
- Recall
- F1
- Silence true positives
- Speech true positives
- Music true positives
- Noise true positives
- Total accuracy
- Silence accuracy
- Speech accuracy
- Music accuracy
- Noise accuracy
Output comparison
This test outputs a table with the ground truth start and end time codes and label for each segment alongside the time codes and labels for the MGM output. Time codes for true positives are listed on the same row, while time codes for false positives and false negatives are listed on separate rows. Reviewing this comparison can help you see where in the audio the MGM was incorrect and decide how important these errors are to your use case.
Example:
gt_start | gt_end | start | end | label | comparison |
---|---|---|---|---|---|
0:00:00 | 0:00:29 | 0:00:00 | 0:00:29 | noise | true positive |
0:00:29 | 0:01:05 | 0:00:29 | 0:01:05 | speech | true positive |
0:01:05 | 0:01:30 | 0:01:05 | 0:01:30 | music | true positive |
0:01:30 | 0:15:42 | | | speech | false negative |
| | 0:01:30 | 0:13:11 | speech | false positive |
| | 0:13:11 | 0:15:33 | music | false positive |
| | 0:15:33 | 0:15:42 | speech | false positive |
Creating Ground Truth
Create a CSV with a minimum of three columns: start, end, and label. Label each segment as silence, speech, music, or noise. Values for start and end should be recorded as hh:mm:ss or in seconds (with decimals). For best results, each segment should start where the previous one ends (e.g., if a segment of speech ends at 00:45:12, the next segment should start at 00:45:12).
Example:
start | end | label |
---|---|---|
00:00:00 | 00:01:19 | silence |
00:01:19 | 00:01:38 | speech |
00:01:38 | 00:02:33 | music |
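If you are producing ground truth programmatically, a minimal sketch with Python's csv module might look like this; the rows repeat the example above, and the filename is a placeholder.

```python
import csv

# Ground truth rows: start, end, label (hh:mm:ss times, as in the example).
rows = [
    ("00:00:00", "00:01:19", "silence"),
    ("00:01:19", "00:01:38", "speech"),
    ("00:01:38", "00:02:33", "music"),
]
with open("groundtruth.csv", "w", newline="") as f:  # hypothetical filename
    writer = csv.writer(f)
    writer.writerow(["start", "end", "label"])  # required column headers
    writer.writerows(rows)
```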
INA Speech Segmenter Precision/Recall (by seconds)
Similar to the precision/recall test by segments, but accuracy, precision, recall, and F1 are calculated by seconds (i.e., comparing the classification of each second rather than comparing segments of contiguous classifications). This test takes as input a ground truth of timestamp/label annotations for an audio recording using the INA Speech Segmenter labels (speech, music, noise, silence). Ranges of silence or noise should be recorded only if they are longer than 10 seconds; otherwise, they should be included with the previous segment. Notes may be included in a separate column in the ground truth to assist with qualitative review; these are not incorporated into the scoring.
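One way to picture per-second scoring: rasterize each segment list onto a one-label-per-second timeline, then compare second by second. The sketch below is a rough illustration under that assumption; AMP's implementation may handle boundaries and rounding differently.

```python
def to_timeline(segments, duration):
    """segments: [(start, end, label)] in seconds -> one label per whole second."""
    timeline = [None] * int(duration)
    for start, end, label in segments:
        for sec in range(int(start), min(int(end), int(duration))):
            timeline[sec] = label
    return timeline

def per_second_accuracy(gt, mgm, duration):
    # Assumes the ground truth covers the full duration of the recording.
    gt_tl, mgm_tl = to_timeline(gt, duration), to_timeline(mgm, duration)
    correct = sum(1 for g, m in zip(gt_tl, mgm_tl) if g == m)
    return correct / len(gt_tl)
```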
Scores generated
- Total ground truth (GT) seconds
- GT silence seconds
- GT speech seconds
- GT music seconds
- GT noise seconds
- Total MGM seconds
- MGM silence seconds
- MGM speech seconds
- MGM music seconds
- MGM noise seconds
- Total true positives
- False negatives
- False positives
- Silence true positives
- Speech true positives
- Music true positives
- Noise true positives
- Precision
- Recall
- F1
- Total accuracy
- Silence accuracy
- Speech accuracy
Output comparison
Similar to the Precision/Recall (by segments) output, but by second instead of segment.
Creating Ground Truth
Use the same method as for the Precision/Recall (by segments) test, or use the same ground truth file.
Sample Evaluation Use Cases
Use case 1 -- Speech-to-text pipeline
A collection manager wants to run speech-to-text over a collection of recordings that contain music and speech. She first sends each recording through audio segmentation to identify the segments containing speech so that AMP can separate those segments out for the speech-to-text tool, saving processing time.
Success measures
INA Speech Segmenter correctly classifies all speech segments as speech. False positives (audio classified as speech that is not speech) are kept to a minimum, but their presence will not negatively affect the output of the speech-to-text.
Key metrics are:
- Speech accuracy
- Recall
Qualitative measures:
- Review false negatives to see if they are actually speech or if they are consistently unintelligible speech that would not be picked up by STT.
- Review false positives to see if the duration is longer than desired to send through STT or if there would be any negative impact by sending through STT.
Use case 2 -- Archival processing/digital access files
A collection manager wants to find how much dead air or unrecorded time is on a collection of recordings, so they can document the duration of actual content (or trim the audio files for access) for each recording. They send each file through audio segmentation to identify segments of silence.
Success measures
INA Speech Segmenter correctly classifies all segments of silence. False positives (segments classified as silence that are not silence) are kept to the absolute minimum, as their presence may result in calculating false durations or trimming parts of the recording that should not be cut. False negatives (silences not identified) should be kept low so as not to miscalculate durations.
Key metrics are:
- Silence accuracy
- Precision
- Recall
Qualitative measures:
- Review false positives to see if they are consistently close enough to silence (i.e., "noise" that is very low ambient noise, like a recording picking up no audio) to be treated as silence for this use case.
- Review the classifications of false negatives. If they are classified as noise (representing very low ambient noise), consider experimenting with pre-processing (e.g., downsampling or filtering) to get the desired results.
AMP JSON Output
Summary: An array of segments, each with a label, start, and end. Start and end are timestamps in seconds. The label may be one of "speech", "music", or "silence". If the label is "speech", a gender may be specified as either "male" or "female".
Element | Datatype | Obligation | Definition |
---|---|---|---|
media | object | required | Wrapper for metadata about the source media file. |
media.filename | string | required | Filename of the source file. |
media.duration | string | required | The duration of the source file audio. |
numSpeakers | integer | optional | Number of speakers (if used for diarization). |
segments | array | required | Wrapper for segments of silence, speech, or music. |
segments[*] | object | optional | A segment of silence, speech, or music. |
segments[*].label | string | required | The type of segment: silence, speech, or music. |
segments[*].start | string | required | Start time in seconds. |
segments[*].end | string | required | End time in seconds. |
segments[*].gender | string | optional | The classified gender of the speaker. |
segments[*].speakerLabel | string | optional | Speaker label from speaker diarization. |
Sample output
```json
{
    "media": {
        "filename": "mysong.wav",
        "duration": "124.3"
    },
    "segments": [
        {
            "label": "speech",
            "start": "0.0",
            "end": "12.35",
            "gender": "male",
            "speakerLabel": "speaker1"
        },
        {
            "label": "music",
            "start": "10",
            "end": "20"
        }
    ]
}
```
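As a usage note, here is a short sketch of walking this structure; the filename is a placeholder, and note that numeric values are carried as strings in this format.

```python
import json

with open("segmentation.json") as f:  # hypothetical output filename
    doc = json.load(f)

print(f'{doc["media"]["filename"]}: {doc["media"]["duration"]}s')
for seg in doc.get("segments", []):
    line = f'{float(seg["start"]):8.2f} - {float(seg["end"]):8.2f}  {seg["label"]}'
    if "gender" in seg:        # optional, only for speech segments
        line += f' ({seg["gender"]})'
    if "speakerLabel" in seg:  # optional, from speaker diarization
        line += f' [{seg["speakerLabel"]}]'
    print(line)
```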