Applause Detection
- Inputs
- Output Formats
- MGMs in AMP
- Notes on Use
- Use Cases and Example Workflows
- Evaluating Applause Detection MGMs
- AMP JSON Output
Applause detection classifies segments of sound as "applause" or "non-applause." When used in combination with audio segmentation, this MGM can be useful for finding start and end times of musical works or other types of performances.
Inputs
Audio file (or audio extracted from a video file with the Extract Audio MGM)
Output Formats
- amp_applause_segments: Audio segments labeled as applause or non-applause with timecodes in AMP JSON format (see details below).
- avalon_applause_sme: Applause and non-applause segments in XML format for ingest into the Avalon system's structural metadata editor.
MGMs in AMP
AMP Applause Detection
Applause Detection is a tool based on the open-source Acoustic Classification and Segmentation audio segmenter from the Brandeis Lab for Linguistics & Computation. The model used with this tool was trained by the AMP development team on audio from the MUSAN corpus, HIPSTAS applause samples, and sound from Indiana University collections. The tool takes an audio file as input and outputs segments of applause and non-applause, with start and end timecodes for each.
Notes on Use
- Use the Extract Audio MGM as the first step in a workflow, regardless of whether the input is a video file or an audio file. It normalizes the audio portion of the A/V input (for instance, converting the audio to the format expected by the AMP tools); a sketch of an equivalent conversion appears after these notes. For best results, set Sample Rate to "44.1KHz" and Channels to "Stereo".
- Each of the tools has at least one of the outputs checked by default. Checked outputs will display the workflow step and output in the dashboard when "Show Relevant Results Only" is turned on. See Tips for Creating a Workflow for an explanation of what each output option means.
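Outside of AMP, this normalization step can be approximated with ffmpeg. The sketch below is a minimal illustration, not the Extract Audio MGM's actual implementation; the function name and output filename are hypothetical.

```python
import subprocess

def extract_audio(input_path: str, output_path: str = "normalized.wav") -> None:
    """Approximate the Extract Audio step: resample to 44.1 kHz stereo WAV."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", input_path,  # source audio or video file
            "-vn",             # drop any video stream
            "-ar", "44100",    # resample to 44.1 kHz
            "-ac", "2",        # force two (stereo) channels
            output_path,
        ],
        check=True,            # raise if ffmpeg exits with an error
    )

extract_audio("lecture.mp4")
```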
Use Cases and Example Workflows
Use case 1: Indexing musical works
A cataloger is trying to index the works in a recording of a musical performance, with their timecodes, for the catalog record. If they know the timecodes of the applause, they can easily scrub to the points in the performance where works start (e.g., performers come on stage) or end (e.g., the performers complete the work and the audience responds). They run the file through a workflow that includes audio segmentation and applause detection, then combine both outputs into a CSV file they can use to scrub to the points where applause or music is present and index the performance.
Notes:
- The cataloger includes the Extract Audio step to normalize the audio files.
- The audio file is sent through both Applause Detection and INA Speech Segmenter to get segments of applause and segments of speech, music, silence, and noise.
- There is not yet an MGM to export the results to CSV, so the current outputs are in AMP JSON format; a conversion sketch follows below.
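Until a CSV export MGM exists, the AMP JSON can be flattened outside of AMP. This is a minimal sketch against the AMP JSON schema documented below; the filenames are hypothetical, and merging the Applause Detection and INA Speech Segmenter outputs into one combined file is left out.

```python
import csv
import json

def amp_segments_to_csv(json_path: str, csv_path: str) -> None:
    """Flatten AMP JSON segments into a start,end,label CSV."""
    with open(json_path) as f:
        data = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["start", "end", "label"])
        for seg in data["segments"]:
            writer.writerow([seg["start"], seg["end"], seg["label"]])

amp_segments_to_csv("applause_segments.json", "applause_segments.csv")
```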
Use case 2: Chaptering videos
A metadata specialist is trying to add chapters to videos in a collection of old talk shows, so that users will be able to easily navigate them in Avalon. In these shows, guests are introduced to rounds of applause, so the metadata specialist would like to import the structural metadata into Avalon and then use the structural editor to adjust the chapter points and add chapter titles as necessary. They run the collection of files through a workflow including applause detection, then import the resulting data into Avalon, so they can edit the chapters in Avalon's structural editor.
Notes:
- The metadata specialist includes the Extract Audio step to extract audio from the videos before sending through the Applause Detection MGM.
- The Applause Detection to Avalon XML step reformats the output data as XML that can be imported into Avalon for use in the structural metadata editor (the sketch below illustrates the idea).
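The exact XML emitted by the Applause Detection to Avalon XML MGM is not reproduced here. As a rough illustration only, Avalon's structural metadata is built from Item/Span elements with begin and end times; the sketch below assumes that shape, and the chapter labels are placeholders a specialist would edit in Avalon.

```python
import json
import xml.etree.ElementTree as ET

def segments_to_avalon_xml(json_path: str, title: str) -> str:
    """Wrap AMP applause segments in Avalon-style structural XML (assumed schema)."""
    with open(json_path) as f:
        data = json.load(f)
    item = ET.Element("Item", label=title)
    for i, seg in enumerate(data["segments"], start=1):
        ET.SubElement(
            item,
            "Span",
            label=f"{seg['label']} {i}",  # placeholder chapter title
            begin=str(seg["start"]),
            end=str(seg["end"]),
        )
    return ET.tostring(item, encoding="unicode")

print(segments_to_avalon_xml("applause_segments.json", "Talk Show Episode 1"))
```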
Evaluating Applause Detection MGMs
There is one test in the AMP MGM Evaluation module for evaluating the accuracy of applause detection.
Precision/Recall of Applause/Non-applause and Time Codes
This test takes as input a structured list of timestamp ranges and labels ("applause" or "non-applause") and compares it to the Applause Detection output to find true positives, false positives, and false negatives, matching on both time ranges and labels. Notes may be included in a separate column in the ground truth to assist with qualitative review; these are not incorporated into the scoring.
Parameters
Analysis threshold: a buffer, in seconds (float), for counting a true positive (a match between the ground truth and the MGM output). For example, a 2-second threshold will count a GT segment and an MGM segment as a match if both the start and end times of each fall within 2 seconds of the other.
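For illustration, the matching rule can be written as a predicate over a ground truth segment and an MGM segment. This is a sketch of the rule as described above, not the evaluation module's actual code.

```python
def is_match(gt: dict, mgm: dict, threshold: float = 2.0) -> bool:
    """True if labels agree and both endpoints fall within `threshold` seconds."""
    return (
        gt["label"] == mgm["label"]
        and abs(gt["start"] - mgm["start"]) <= threshold
        and abs(gt["end"] - mgm["end"]) <= threshold
    )

# Under a 2-second threshold, a GT applause segment at 10.0-13.0 matches an
# MGM segment at 10.0-12.0; under a 0.5-second threshold it would not.
print(is_match({"label": "applause", "start": 10.0, "end": 13.0},
               {"label": "applause", "start": 10.0, "end": 12.0}))  # True
```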
Scores generated
- Total GT segments of applause/non-applause
- Total MGM segments of applause/non-applause
- Count of true positives
- Count of false negatives
- Count of false positives
- Precision
- Recall
- F1
- Accuracy
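Precision, recall, and F1 follow their standard definitions over the matched segments; a minimal sketch is below (how the module derives Accuracy from segment matches is not specified here).

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# For the example comparison table below: 2 TP, 1 FP, 2 FN
print(precision_recall_f1(2, 1, 2))  # precision ~0.67, recall 0.50, F1 ~0.57
```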
Output comparison
This test outputs a table with the ground truth start and end time codes and label for each applause/non-applause segment alongside the time codes and labels for MGM output. Time codes for true positives are listed on the same row while time codes for false positives and false negatives are listed on separate rows. Reviewing this comparison can help you see where in the audio the MGM was incorrect and decide how important these errors are to your use case.
Example:
| gt_start | gt_end | start | end | label | comparison |
|---|---|---|---|---|---|
| 0:00:00 | 0:00:10 | 0:00:00 | 0:00:10 | non-applause | true positive |
| 0:00:10 | 0:00:13 | 0:00:10 | 0:00:12 | applause | true positive |
| | | 0:00:12 | 0:01:19 | non-applause | false positive |
| 0:00:13 | 0:01:04 | | | non-applause | false negative |
| 0:01:04 | 0:01:42 | | | applause | false negative |
Creating Ground Truth
Create a CSV with a minimum of three columns: start, end, label. Label each segment of applause as applause and all segments in between as non-applause. Values for start and end should be recorded as hh:mm:ss or in seconds (with decimals). For best results, each segment should start where the previous one ends (e.g., if a segment of speech ends at 00:45:12, the next segment should start at 00:45:12).
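A minimal hypothetical ground truth file in this format (the notes column is the optional extra column mentioned above):

```csv
start,end,label,notes
00:00:00,00:45:12,non-applause,
00:45:12,00:45:30,applause,quiet applause from a small audience
00:45:30,01:02:00,non-applause,
```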
Sample Evaluation Use Cases
Use case 1: Indexing musical works
A cataloger is trying to index the works in a recording of a musical performance, with their timecodes, for the catalog record. If they know the timecodes of the applause, they can easily scrub to the points in the performance where works start (e.g., performers come on stage) or end (e.g., the performers complete the work and the audience responds). They run the file through a workflow that includes audio segmentation and applause detection, then combine both outputs into a CSV file they can use to scrub to the points where applause or music is present and index the performance.
Success measures
As many segments as possible are correctly labeled as applause, so that the cataloger can efficiently find and identify the musical works without missing any. Some error is acceptable, since the cataloger is only using this information to support their manual work.
Key metrics
- High recall of applause segments detected in the file. (False positives should ideally be kept to a minimum, so that too much of the cataloger's time is not wasted scrubbing to extra points in the recording.)
Qualitative measures
- Review false negatives to see if certain volumes of applause or certain performance venue acoustics may be affecting the tool's ability to detect applause.
Use case 2: Chaptering videos
A metadata specialist is trying to add chapters to videos in a collection of old talk shows, so that users will be able to easily navigate them in Avalon. In these shows, guests are introduced to rounds of applause, so the metadata specialist would like to import the structural metadata into Avalon and then use the structural editor to adjust the chapter points and add chapter titles as necessary. They run the collection of files through a workflow including applause detection, then import the resulting data into Avalon, so they can edit the chapters in Avalon's structural editor.
Success measures
As many segments as possible are correctly labeled as applause, so that the metadata specialist can be efficient in finding and identifying all of the talk show guests. Some error is acceptable, since they are only using this information to support their manual work.
Key metrics
- High recall of applause segments detected in the file. (False positives should ideally be kept to a minimum, so that too much of the specialist's time is not wasted scrubbing to extra points in the recording.)
Qualitative measures
- Review false negatives to see if certain volumes of applause or certain performance venue acoustics may be affecting the tool's ability to detect applause.
AMP JSON Output
| Element | Datatype | Obligation | Definition |
|---|---|---|---|
| media | object | required | Wrapper for metadata about the source media file. |
| media.filename | string | required | Filename of the source file. |
| media.duration | string | required | The duration of the source file audio. |
| segments | array | required | Wrapper for segments of applause and non-applause. |
| segments[*] | object | optional | A segment of applause or non-applause. |
| segments[*].label | string | required | The type of segment: applause or non-applause. |
| segments[*].start | number | required | Start time in seconds. |
| segments[*].end | number | required | End time in seconds. |
Sample output
{
    "media": {
        "filename": "name.wav",
        "duration": "300"
    },
    "segments": [
        {
            "label": "non-applause",
            "start": 0.0,
            "end": 198.37
        },
        {
            "label": "applause",
            "start": 198.38,
            "end": 206.04
        }
    ]
}