Applause Detection
- Inputs
- Output Formats
- MGMs in AMP
- Notes on Use
- Use Cases and Example Workflows
- Evaluating Applause Detection MGMs
- AMP JSON Output
Applause detection classifies segments of sound as "applause" or "non-applause." When used in combination with audio segmentation, this MGM can be useful for finding start and end times of musical works or other types of performances.
Inputs
Audio file (or audio extracted from a video file with the Extract Audio MGM)
Output Formats
- amp_applause_segments: Audio segments labeled as applause or non-applause with timecodes in AMP JSON format (see details below).
- avalon_applause_sme: Applause and non-applause segments in XML format for ingest into the Avalon system's structural metadata editor.
MGMs in AMP
AMP Applause Detection
Applause Detection is a tool based on the open-source Acoustic Classification and Segmentation audio segmenter from the Brandeis Lab for Linguistics & Computation. The model used with this tool was trained by the AMP development team on audio from the MUSAN corpus, HIPSTAS applause samples, and sound from Indiana University collections. The tool takes an audio file as input and outputs segments of applause and non-applause, with start and end timecodes for each.
Notes on Use
- Use the Extract Audio MGM as the first step in a workflow, regardless of whether the input is a video file or an audio file. It normalizes the audio portion of the A/V input (for instance, converting the audio to the format expected by the AMP tools); a sketch of an equivalent conversion appears after these notes. For best results, set Sample Rate to "44.1KHz" and Channels to "Stereo".
- Each of the tools has at least one of the outputs checked by default. Checked outputs will display the workflow step and output in the dashboard when "Show Relevant Results Only" is turned on. See Tips for Creating a Workflow for an explanation of what each output option means.
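Outside of AMP, this normalization step can be approximated with ffmpeg. The sketch below is a minimal illustration, not the Extract Audio MGM's actual implementation; the function name and output filename are hypothetical.

```python
import subprocess

def extract_audio(input_path: str, output_path: str = "normalized.wav") -> None:
    """Approximate the Extract Audio step: resample to 44.1 kHz stereo WAV."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", input_path,  # source audio or video file
            "-vn",             # drop any video stream
            "-ar", "44100",    # resample to 44.1 kHz
            "-ac", "2",        # force two (stereo) channels
            output_path,
        ],
        check=True,            # raise if ffmpeg exits with an error
    )

extract_audio("lecture.mp4")
```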
Use Cases and Example Workflows
Use case 1: Indexing musical works
A cataloger is trying to index the works in a recording of a musical performance, with their timecodes, for the catalog record. If they know the timecodes of the applause, they can easily scrub to the points in the performance where works start (e.g., performers come on stage) or end (e.g., the performers complete the work and the audience responds). They run the file through a workflow that includes audio segmentation and applause detection, then combine both outputs into a CSV file they can use to scrub to the points where applause or music is present and index the performance.
Notes:
- The cataloger includes the Extract Audio step to normalize the audio files.
- The audio file is sent through both Applause Detection and INA Speech Segmenter to get segments of applause and segments of speech, music, silence, and noise.
- There is not yet an MGM to export the results to CSV, so the current outputs are in AMP JSON format; a conversion sketch follows below.
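Until a CSV export MGM exists, the AMP JSON can be flattened outside of AMP. This is a minimal sketch against the AMP JSON schema documented below; the filenames are hypothetical, and merging the Applause Detection and INA Speech Segmenter outputs into one combined file is left out.

```python
import csv
import json

def amp_segments_to_csv(json_path: str, csv_path: str) -> None:
    """Flatten AMP JSON segments into a start,end,label CSV."""
    with open(json_path) as f:
        data = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["start", "end", "label"])
        for seg in data["segments"]:
            writer.writerow([seg["start"], seg["end"], seg["label"]])

amp_segments_to_csv("applause_segments.json", "applause_segments.csv")
```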
Use case 2: Chaptering videos
A metadata specialist is trying to add chapters to videos in a collection of old talk shows, so that users will be able to easily navigate them in Avalon. In these shows, guests are introduced to rounds of applause, so the metadata specialist would like to import the structural metadata into Avalon and then use the structural editor to adjust the chapter points and add chapter titles as necessary. They run the collection of files through a workflow including applause detection, then import the resulting data into Avalon, so they can edit the chapters in Avalon's structural editor.
Notes:
- The metadata specialist includes the Extract Audio step to extract audio from the videos before sending through the Applause Detection MGM.
- The Applause Detection to Avalon XML step reformats the output data as XML that can be imported into Avalon for use in the structural metadata editor (the sketch below illustrates the idea).
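The exact XML emitted by the Applause Detection to Avalon XML MGM is not reproduced here. As a rough illustration only, Avalon's structural metadata is built from Item/Span elements with begin and end times; the sketch below assumes that shape, and the chapter labels are placeholders a specialist would edit in Avalon.

```python
import json
import xml.etree.ElementTree as ET

def segments_to_avalon_xml(json_path: str, title: str) -> str:
    """Wrap AMP applause segments in Avalon-style structural XML (assumed schema)."""
    with open(json_path) as f:
        data = json.load(f)
    item = ET.Element("Item", label=title)
    for i, seg in enumerate(data["segments"], start=1):
        ET.SubElement(
            item,
            "Span",
            label=f"{seg['label']} {i}",  # placeholder chapter title
            begin=str(seg["start"]),
            end=str(seg["end"]),
        )
    return ET.tostring(item, encoding="unicode")

print(segments_to_avalon_xml("applause_segments.json", "Talk Show Episode 1"))
```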
Evaluating Applause Detection MGMs
There is one test in the AMP MGM Evaluation module for evaluating the accuracy of applause detection.
Precision/Recall of Applause/Non-applause and Time Codes
This test takes as input a structured list of timestamp ranges and labels ("applause" or "non-applause") and compares it to the Applause Detection output to find true positives, false positives, and false negatives, matching on both time ranges and labels. Notes may be included in a separate column in the ground truth to assist with qualitative review; these are not incorporated into the scoring.
Parameters
Analysis threshold: a buffer, in seconds (float), for counting a true positive (a match between the ground truth and the MGM output). For example, a 2-second threshold will count a GT segment and an MGM segment as a match if both the start and end times of each fall within 2 seconds of the other.
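For illustration, the matching rule can be written as a predicate over a ground truth segment and an MGM segment. This is a sketch of the rule as described above, not the evaluation module's actual code.

```python
def is_match(gt: dict, mgm: dict, threshold: float = 2.0) -> bool:
    """True if labels agree and both endpoints fall within `threshold` seconds."""
    return (
        gt["label"] == mgm["label"]
        and abs(gt["start"] - mgm["start"]) <= threshold
        and abs(gt["end"] - mgm["end"]) <= threshold
    )

# Under a 2-second threshold, a GT applause segment at 10.0-13.0 matches an
# MGM segment at 10.0-12.0; under a 0.5-second threshold it would not.
print(is_match({"label": "applause", "start": 10.0, "end": 13.0},
               {"label": "applause", "start": 10.0, "end": 12.0}))  # True
```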
Scores generated
- Total GT segments of applause/non-applause
- Total MGM segments of applause/non-applause
- Count of true positives
- Count of false negatives
- Count of false positives
- Precision
- Recall
- F1
- Accuracy
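Precision, recall, and F1 follow their standard definitions over the matched segments; a minimal sketch is below (how the module derives Accuracy from segment matches is not specified here).

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# For the example comparison table below: 2 TP, 1 FP, 2 FN
print(precision_recall_f1(2, 1, 2))  # precision ~0.67, recall 0.50, F1 ~0.57
```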
Output comparison
This test outputs a table with the ground truth start and end time codes and label for each applause/non-applause segment alongside the time codes and labels for MGM output. Time codes for true positives are listed on the same row while time codes for false positives and false negatives are listed on separate rows. Reviewing this comparison can help you see where in the audio the MGM was incorrect and decide how important these errors are to your use case.
Example:
| gt_start | gt_end | start | end | label | comparison |
|---|---|---|---|---|---|
| 0:00:00 | 0:00:10 | 0:00:00 | 0:00:10 | non-applause | true positive |
| 0:00:10 | 0:00:13 | 0:00:10 | 0:00:12 | applause | true positive |
| | | 0:00:12 | 0:01:19 | non-applause | false positive |
| 0:00:13 | 0:01:04 | | | non-applause | false negative |
| 0:01:04 | 0:01:42 | | | applause | false negative |
Creating Ground Truth
Create a CSV with a minimum of three columns: start, end, label. Label each segment of applause as applause and all segments in between as non-applause. Values for start and end should be recorded as hh:mm:ss or in seconds (with decimals). For best results, each segment should start where the previous one ends (e.g., if a segment of speech ends at 00:45:12, the next segment should start at 00:45:12).
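A minimal hypothetical ground truth file in this format (the notes column is the optional extra column mentioned above):

```csv
start,end,label,notes
00:00:00,00:45:12,non-applause,
00:45:12,00:45:30,applause,quiet applause from a small audience
00:45:30,01:02:00,non-applause,
```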
Sample Evaluation Use Cases
Use case 1: Indexing musical works
A cataloger is trying to index the works in a recording of a musical performance, with their timecodes, for the catalog record. If they know the timecodes of the applause, they can easily scrub to the points in the performance where works start (e.g., performers come on stage) or end (e.g., the performers complete the work and the audience responds). They run the file through a workflow that includes audio segmentation and applause detection, then combine both outputs into a CSV file they can use to scrub to the points where applause or music is present and index the performance.
Success measures
As many segments as possible are correctly labeled as applause, so that the cataloger can efficiently find and identify the musical works without missing any. Some error is acceptable, since the cataloger is only using this information to support their manual work.
Key metrics
- High recall of applause segments detected in the file. (False positives should ideally be kept to a minimum, so that too much of the cataloger's time is not wasted scrubbing to extra points in the recording.)
Qualitative measures
- Review false negatives to see if certain volumes of applause or certain performance venue acoustics may be affecting the tool's ability to detect applause.
Use case 2: Chaptering videos
A metadata specialist is trying to add chapters to videos in a collection of old talk shows, so that users will be able to easily navigate them in Avalon. In these shows, guests are introduced to rounds of applause, so the metadata specialist would like to import the structural metadata into Avalon and then use the structural editor to adjust the chapter points and add chapter titles as necessary. They run the collection of files through a workflow including applause detection, then import the resulting data into Avalon, so they can edit the chapters in Avalon's structural editor.
Success measures
As many segments as possible are correctly labeled as applause, so that the metadata specialist can be efficient in finding and identifying all of the talk show guests. Some error is acceptable, since they are only using this information to support their manual work.
Key metrics
- High recall of applause segments detected in the file. (False positives should ideally be kept to a minimum, so that too much of the specialist's time is not wasted scrubbing to extra points in the recording.)
Qualitative measures
- Review false negatives to see if certain volumes of applause or certain performance venue acoustics may be affecting the tool's ability to detect applause.
AMP JSON Output
| Element | Datatype | Obligation | Definition |
|---|---|---|---|
| media | object | required | Wrapper for metadata about the source media file. |
| media.filename | string | required | Filename of the source file. |
| media.duration | string | required | The duration of the source file audio. |
| segments | array | required | Wrapper for segments of applause and non-applause. |
| segments[*] | object | optional | A segment of applause or non-applause. |
| segments[*].label | string | required | The type of segment: applause or non-applause. |
| segments[*].start | number | required | Start time in seconds. |
| segments[*].end | number | required | End time in seconds. |
Sample output
{
    "media": {
        "filename": "name.wav",
        "duration": "300"
    },
    "segments": [
        {
            "label": "non-applause",
            "start": 0.0,
            "end": 198.37
        },
        {
            "label": "applause",
            "start": 198.38,
            "end": 206.04
        }
    ]
}