Vocabulary Tagging - AudiovisualMetadataPlatform/amp_documentation GitHub Wiki

Vocabulary Tagging

Vocabulary Tagging allows the user to detect instances of specific words or short phrases in transcripts.

The tool takes two inputs:
Transcript in AMP JSON format
Text file (.txt) listing the words to tag, with one word or phrase per line
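As an illustration of the vocabulary file format, the following minimal sketch reads one entry per line, skips blank lines, and drops repeated entries. The function name `load_vocabulary` is hypothetical, and lowercasing the entries is an assumption (the sample output on this page shows lowercase words); the actual AMP script may normalize entries differently.

```python
import io


def load_vocabulary(lines):
    """Return vocabulary entries in file order, lowercased and de-duplicated.

    `lines` is any iterable of strings, for example an open text file.
    Lowercasing and whitespace collapsing are assumptions, not confirmed
    AMP behavior.
    """
    seen = set()
    entries = []
    for line in lines:
        # Collapse runs of whitespace and normalize case.
        entry = " ".join(line.split()).lower()
        if entry and entry not in seen:
            seen.add(entry)
            entries.append(entry)
    return entries


# In-memory stand-in for a .txt vocabulary file.
sample = io.StringIO("John Doe\nDoe\n\n  john   doe \n")
vocabulary = load_vocabulary(sample)
print(vocabulary)  # ['john doe', 'doe']
```

With a real file, the same function could be called as `load_vocabulary(open(path, encoding="utf-8"))`, where `path` points to the uploaded .txt file.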

The tool outputs a CSV file that lists each identified word or phrase together with its start time, one occurrence per line.
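The output layout can be sketched as follows. This is not the AMP implementation: the helper names are hypothetical, and it assumes that start times are available as seconds (floats) and are rendered as HH:MM:SS.mmm, which is the form shown in the sample output on this page.

```python
import csv
import io


def format_timestamp(seconds):
    """Render a time in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def write_tagged_words(matches, out):
    """Write (phrase, start_seconds) pairs as CSV with a Word,Start header."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Word", "Start"])
    for phrase, start in matches:
        writer.writerow([phrase, format_timestamp(start)])


# Write to an in-memory buffer; a real run would pass an open file instead.
buffer = io.StringIO()
write_tagged_words([("john doe", 83.68), ("doe", 212.82)], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```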

MGMs in AMP

The tool is a Python script written by the AMP development team that looks for occurrences of the user-provided vocabulary entries in the transcript.
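The kind of lookup described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AMP script itself: the transcript is represented as a simplified list of dicts with `text` and `start` keys (a stand-in, not the actual AMP JSON schema), matching ignores case and surrounding punctuation, and longer entries are tried first so that a full-name match is not also reported as a last-name match. All function names are hypothetical.

```python
import string


def normalize(token):
    """Lowercase a token and strip leading/trailing punctuation."""
    return token.strip(string.punctuation).lower()


def tag_vocabulary(words, vocabulary):
    """Return (phrase, start) pairs for each vocabulary match in `words`.

    `words` is a list of {"text": str, "start": float} dicts in spoken order.
    """
    phrases = []
    for entry in vocabulary:
        tokens = [t for t in (normalize(x) for x in entry.split()) if t]
        if tokens:  # skip blank or punctuation-only entries
            phrases.append(tokens)
    # Try longer phrases first (an assumption about the desired behavior).
    phrases.sort(key=len, reverse=True)

    transcript = [normalize(w["text"]) for w in words]
    matches = []
    i = 0
    while i < len(transcript):
        for phrase in phrases:
            if transcript[i:i + len(phrase)] == phrase:
                matches.append((" ".join(phrase), words[i]["start"]))
                i += len(phrase)  # continue after the matched phrase
                break
        else:
            i += 1  # no entry matched at this position
    return matches


words = [
    {"text": "Professor", "start": 82.9},
    {"text": "John", "start": 83.68},
    {"text": "Doe", "start": 84.1},
    {"text": "said", "start": 84.5},
    {"text": "Doe,", "start": 212.82},
]
matches = tag_vocabulary(words, ["John Doe", "Doe"])
print(matches)  # [('john doe', 83.68), ('doe', 212.82)]
```

Reporting every overlapping match instead would only require removing the longest-first ordering and the skip past the matched phrase.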

The .txt file with the list of words to tag should be uploaded to AMP as a Supplemental File of type Vocabulary.
See the Supplemental Files in Workflows page for details on how to use Supplemental Files in workflows.

Use Case:

A collection manager has received a request for content with references to John Doe in the lecture series.

Notes:

The collection manager creates a .txt file with one line that reads John Doe and another line with Doe (to also capture occurrences of just the last name).
The collection manager uploads this file as a Supplemental File of type Vocabulary at the collection level for the Lecture Series collection; this makes the file available to all items in the collection.
The collection manager runs the items in the collection through the workflow shown above and inspects the Tagged Word files generated to identify items where the name occurs.

Word,Start
john doe,00:01:23.680
doe,00:03:32.820
john doe,00:03:49.910
john doe,00:04:13.780
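When many items are processed, the inspection step can be partly automated. The sketch below counts how often each tagged word appears in one output file, using the sample above as input. The function name `count_tagged_words` is hypothetical, and how the Tagged Word files are downloaded from AMP is outside the scope of this example.

```python
import csv
import io
from collections import Counter


def count_tagged_words(csv_text):
    """Count occurrences of each value in the Word column of a tagged-word CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Word"] for row in reader)


sample = """Word,Start
john doe,00:01:23.680
doe,00:03:32.820
john doe,00:03:49.910
john doe,00:04:13.780
"""

counts = count_tagged_words(sample)
print(counts["john doe"], counts["doe"])  # 3 1
```

An item whose file contains only the header row yields an empty counter, which makes it easy to filter out items where the name never occurs.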

Document generated by Confluence on Feb 25, 2025 10:39