Speech-to-Text

Speech to text transcription (also known as automatic speech recognition, or ASR) is the recognition of spoken language in an audio stream and conversion to text. 

No speech-to-text tools generate 100% accurate transcripts, but machine-generated transcripts are useful as substitutes for labor-intensive human-generated transcripts when "close" is good enough, or for expediting the creation of a human-generated transcript. STT transcripts are also useful as a starting point for any workflow which cares about the spoken content of the media, such as named-entity recognition (NER) or vocabulary tagging.

Inputs

Audio file (or audio extracted from a video file with the Extract Audio MGM)

Output Formats

  • amp_transcript: Transcript structured in AMP JSON format (see below).
  • amp_transcript_adjusted: Transcript in AMP JSON format that is reconstituted with the Adjust Transcript Timestamps step after running speech-only segments through an STT.
  • amp_transcript_corrected: Transcript in AMP JSON format that has been corrected by humans with the HMGM Transcript Correction MGM.
  • amp_diarization: Speaker diarization results structured in AMP JSON format (see below). Note that although Whisper produces this file, it contains only the timings used to break the transcript into small segments, not true diarization, as Whisper does not identify speakers.
  • amp_diarization_adjusted: (AWS Transcribe only) Diarization segments in AMP JSON format that are reconstituted with the Adjust Diarization Timestamps step after running speech-only segments through an STT.
  • aws_transcript: (AWS Transcribe only) Raw transcript and diarization output from AWS Transcribe.
  • whisper_transcript_text: (Whisper only) Raw transcript output from Whisper; text only.
  • whisper_transcript_json: (Whisper only) Raw transcript output from Whisper in JSON, with timings.
  • webvtt: (Whisper only) Transcript structured in WebVTT format. Whisper is capable of producing this format directly, rather than using the converter module in AMP.
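For reference, a WebVTT caption file generated from a transcript has this minimal shape (the cue text and timings below are illustrative only, not actual AMP output):

WEBVTT

00:00:00.100 --> 00:00:01.210
Professional answer.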

MGMs in AMP

AWS Transcribe

Amazon Transcribe is a proprietary web service that provides transcription (with word-level confidence scores) and diarization, or speaker identification, for up to 10 speakers. Currently, AMP only transcribes speech in English with this tool.

Parameters: 

  • Audio format: Format of the audio file. For best results, use a lossless format such as FLAC or WAV with PCM 16-bit encoding.
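Outside of AMP, the same Transcribe features (English transcription with word-level confidence and speaker labels for up to 10 speakers) map onto the AWS API roughly as in the boto3 sketch below. This is illustrative only and is not how AMP invokes the service; the job name and S3 location are hypothetical.

import boto3

# Illustrative sketch only: AMP manages its Transcribe jobs internally.
transcribe = boto3.client("transcribe")
transcribe.start_transcription_job(
    TranscriptionJobName="amp-example-job",                    # hypothetical job name
    Media={"MediaFileUri": "s3://example-bucket/myfile.wav"},  # hypothetical media location
    MediaFormat="wav",                                         # the Audio format parameter above
    LanguageCode="en-US",                                      # English only, per the note above
    Settings={
        "ShowSpeakerLabels": True,                             # enable diarization
        "MaxSpeakerLabels": 10,                                # Transcribe supports up to 10 speakers
    },
)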

Whisper

Whisper is an open-source speech-to-text model that provides transcription. It does not provide diarization (it does not perform speaker identification). Within AMP, Whisper is capable of transcribing in over 90 languages. It does not handle multi-language files by default.

Parameters: 

  • ML Training model: Choose tiny, base, small, medium, or large. Initial testing was performed on the small model (the default). The *.en models are English-only models.
  • Audio language: Auto-detect uses the first 30 seconds of audio to attempt to identify the language automatically. If the language of the content is known, it can be selected here to improve accuracy.
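Outside of AMP, these two parameters correspond roughly to the model name and language option of the open-source whisper package. A minimal sketch (not AMP's internal invocation; the filename is hypothetical):

import whisper

# Illustrative sketch only: AMP runs Whisper internally with the parameters above.
model = whisper.load_model("small")        # ML Training model: tiny, base, small, medium, or large
result = model.transcribe("myfile.wav",    # hypothetical input file
                          language=None)   # None = auto-detect from the first ~30 seconds
print(result["text"])                      # plain-text transcript (whisper_transcript_text)
for segment in result["segments"]:         # timed segments (whisper_transcript_json)
    print(segment["start"], segment["end"], segment["text"])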

Kaldi Speech to Text (archived)

[UPDATE: As of summer 2023, Kaldi is no longer included in the AMP application, as it did not provide good results and there were problems with the build.]

Kaldi is an open-source application for transcription. The AMP implementation transcribes English language speech and produces results without punctuation or speaker identification. Kaldi requires a WAV file.

Parameters: 

  • none

Notes on Use

  • Use the Extract Audio MGM as the first step in a workflow, regardless of whether the input is a video file or an audio file. It normalizes the audio portion of the A/V input (for instance, converting the audio format to the format expected by the AMP tools).
  • Each of the tools has at least one of the outputs checked by default. Checked outputs will display the workflow step and output in the dashboard when "Show Relevant Results Only" is turned on. See Tips for Creating a Workflow for an explanation of what each output option means.
  • If your audio includes long segments of silence and/or music, it may be beneficial to remove those segments before submitting the file to the STT tool (see Segmentation). Here are a few reasons why you may want to consider this option:
    • STT tools may try to interpret noise, adding bogus text to the transcript.
    • The STT tools implemented in AMP do not do a good job of transcribing speech superimposed on music.
    • Removing segments without speech may speed up the processing of your A/V content; this, however, will depend on the total size of segments removed considering that the segmentation process per se takes time.
  • The STT MGMs do not themselves convert outputs to a more human-readable format. Add additional steps to the workflow to convert STT outputs to WebVTT (AMP Transcript to VTT) or other formats.

Use Cases and Example Workflows

Use Case 1: Captioning for accessibility

A collection manager wants to generate captions for a collection of oral histories, so they can make them more accessible to all users and meet their university's web accessibility requirements. They first send the videos through a speech-to-text workflow, then download the WebVTT files for upload into their video streaming service.

Notes

  • The collection manager knows this collection is only speech, no music, so they did not include the Audio Segmentation MGM as a step in the workflow.
  • The collection manager used AWS Transcribe so that they could use the diarization feature to distinguish between the multiple speakers in the interviews.

Use case 2: NER for controlled access points

An archivist is processing a collection of radio broadcasts and wants to find all mentions of certain political figures, so they can add the names as controlled access terms in the finding aid. They send the audio files through a speech-to-text > NER workflow, then export the NER results in a CSV file which they open in Excel to find the names they are looking for. 

Notes

  • The archivist wants to ignore all of the music played in the broadcast, so they include the INA Speech Segmenter Audio Segmentation MGM in the workflow. They add the Keep Speech Segments step before the speech-to-text step to use only speech segments. They add the Adjust Transcript Timestamps step after the speech-to-text step to re-align the transcript text to its timecodes in the original audio file.
  • The archivist uses Whisper for speech-to-text and SpaCy for NER because they need to use a completely open-source workflow that does not incur costs. 
  • The archivist adds the AMP Named Entities to CSV step to the end of the workflow to convert the AMP Entities data format into a more usable format--a CSV file.

Evaluating Speech-to-text MGMs

There is one test in the AMP MGM Evaluation module for evaluating the accuracy of speech-to-text MGMs. 

Word error rate

Word error rate measures speech-to-text accuracy by aligning the generated transcript against a ground truth transcription and counting the errors (substitutions, insertions, and deletions) required to restore the output word sequence to the original input sequence: WER = (substitutions + insertions + deletions) / number of words in the ground truth. Scores are expressed as percentages, with lower scores representing higher accuracy (lower word error rate). Scores may exceed 100%, especially if the STT engine produced many insertions. Character error rate (CER) is similar to WER, but based on characters. Word information loss (WIL) measures the proportion of word information lost in a transcription, and word information processed (WIP) measures the inverse of WIL.

Scores generated

  • Word error rate (WER): The proportion of transcription errors that the MGM makes relative to the number of words spoken

  • Character error rate (CER): Similar to WER, but calculated by characters. This can be a useful measure when spelling is not as important to quality for your use case, as homophone errors will weigh less

  • Match error rate (MER): Looking at the MGM output aligned with the ground truth, this score gives the probability that a given match is incorrect (i.e., a substitution, deletion, or insertion).

  • Word information loss (WIL): A simple approximation of the word information lost in the MGM output.

  • Word information processed (WIP): The inverse of word information loss.

  • Substitution rate, insertion rate, and deletion rate (based on WER): Probability of substitution, insertion, or deletion error.
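
The AMP evaluation module computes these scores for you. For illustration only, the open-source jiwer package calculates the same family of metrics from a ground truth string and an MGM transcript string; the strings below are placeholders, not real evaluation data.

import jiwer

ground_truth = "you are here for the black filmmakers hall of fame"  # placeholder ground truth
hypothesis = "you are here for the black filmmakers call the plan"   # placeholder MGM output

print("WER:", jiwer.wer(ground_truth, hypothesis))  # word error rate
print("CER:", jiwer.cer(ground_truth, hypothesis))  # character error rate
print("MER:", jiwer.mer(ground_truth, hypothesis))  # match error rate
print("WIL:", jiwer.wil(ground_truth, hypothesis))  # word information lost
print("WIP:", jiwer.wip(ground_truth, hypothesis))  # word information processed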

Output comparison

This test generates a table of the ground truth tokens aligned with the tokens of the MGM output for the WER test. If the tokens do not match, the type of error (substitution, insertion, or deletion) will be listed in the column to the right. Reviewing this comparison is important to see what kinds of words the MGM is having trouble interpreting and to help you decide how important those errors are for your particular use case.

Example:

ground_truth mgm         error        
you          you                       
are          are                       
here         here                      
for          for                       
the          the                       
black        black                     
filmmakers   filmmakers                
hall         call        substitution 
of           the         substitution 
fame         plan        substitution 
yes          okay        substitution 
yeah                      deletion     
it                        deletion     
is                        deletion     
been                      deletion     
a                         deletion     
honor                     deletion     
well         well                      
the          the                       
logical      logical                   
question     question                  
one          one                       
must         must                      
ask          ask                       
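
A comparable side-by-side alignment can be produced outside AMP with recent versions of the jiwer package; this sketch is illustrative only and is not the AMP comparison report itself (the strings are placeholders):

import jiwer

# Align a placeholder ground truth against a placeholder MGM output.
out = jiwer.process_words("hall of fame yes", "call the plan okay")
print(jiwer.visualize_alignment(out))  # marks substitutions, insertions, and deletions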

Creating ground truth

The Word Error Rate test takes a plain text transcript (.txt file) as ground truth data. You can create this from scratch or generate it with an MGM and then edit it using the Transcript Correction HMGM. When you have completed the transcript correction, look for the amp_transcript_corrected output in the dashboard. (You may need to toggle "Show Relevant Results Only" to off.) Under the "results" section of this JSON file, copy and paste the transcript text into a new text file and upload it as your ground truth.
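Because the transcript text is nested inside the JSON output, a short script can pull it out into a plain text file. A minimal sketch, assuming the corrected output has been downloaded as amp_transcript_corrected.json (hypothetical filename):

import json

# Read the corrected AMP transcript and save its text as plain-text ground truth.
with open("amp_transcript_corrected.json", encoding="utf-8") as f:
    data = json.load(f)

with open("ground_truth.txt", "w", encoding="utf-8") as out:
    out.write(data["results"]["transcript"])  # full transcript string (see AMP JSON Output below)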

Sample evaluation use cases

Use case 1: Captioning for accessibility

A collection manager wants to generate captions for a collection of oral histories, so they can make them more accessible to all users and meet their university's web accessibility requirements. They first send the videos through a speech-to-text workflow, then download the WebVTT files for upload into their video streaming service.

Success measures
  • Speech-to-text transcribes speech with high accuracy and few errors that misrepresent names or key words.
  • Timestamps are accurate, so that captions appear when they are supposed to in the video player.

Key metrics are:

  • Word error rate (WER) and related scores

Qualitative measures

  • Review substitutions, deletions, and insertions to see how poorly the STT represented key words and names

Use case 2: NER pipeline

An archivist is processing a collection of radio broadcasts and wants to find all mentions of certain political figures, so they can add the names as controlled access terms in the finding aid. They send the audio files through the speech-to-text > NER workflow, then review the NER results to find the names they are looking for. 

Success measures
  • Speech-to-text transcribes names with high accuracy (low word error rate).

Key metrics

  • Word error rate (WER)
  • Word error rate for content words vs. function words 

Qualitative measures

  • Review substitutions, deletions, and insertions to see how poorly the STT represented names

AMP JSON Output

Summary: 

Each element is listed below with its datatype, its obligation, and a definition.

  • media (object; required): Wrapper for metadata about the source media file.
  • media.filename (string; required): Filename of the source file.
  • media.duration (string; required): The duration of the source file audio.
  • results (object; required): Wrapper for transcription results.
  • results.transcript (string; required): The full text string of the transcription.
  • results.words (array; required): Wrapper for timecoded words in the transcript.
  • results.words[*].type (string, pronunciation | punctuation; required): Type of text, pronunciation or punctuation.
  • results.words[*].text (string; required): The text of the word.
  • results.words[*].offset (integer; required): The offset of the first character of the word in the transcript.
  • results.words[*].start (string, s.fff; required if words[*].type is "pronunciation"): Start time of the word, in seconds.
  • results.words[*].end (string, s.fff; required if words[*].type is "pronunciation"): End time of the word, in seconds.
  • results.words[*].score (object; optional): A confidence or relevance score for the word.
  • results.words[*].score.type (string, confidence | relevance; required): The type of score, confidence or relevance.
  • results.words[*].score.value (number; required): The score value, typically a float in the range of 0–1.

Sample output

{
    "media": {
        "filename": "myfile.wav",
        "duration": "1.500"
    },
    "results": {
        "transcript": "Professional answer.",
        "words": [{
            "type": "pronunciation",
            "text": "Professional",
            "offset": 0,
            "start": "0.100",
            "end": "0.690"
        }, {
            "type": "pronunciation",
            "text": "answer",
            "offset": 13,
            "start": "0.690",
            "end": "1.210"
        }, {
            "type": "punctuation",
            "text": ".",
            "offset": 19
        }]
    }
}
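
The timed words array can be consumed directly by downstream scripts. The sketch below (not AMP's own converter) prints each spoken word with its start and end times; the filename is hypothetical:

import json

with open("amp_transcript.json", encoding="utf-8") as f:  # hypothetical downloaded output
    data = json.load(f)

for word in data["results"]["words"]:
    if word["type"] == "pronunciation":                   # punctuation tokens carry no timing
        print(word["start"], word["end"], word["text"])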

