Video Optical Character Recognition - VOCR
- Inputs
- Output Formats
- Notes on Use
- Use Cases and Example Workflows
- Evaluating Video OCR MGMs
- AMP JSON Output
Video OCR (VOCR) is the recognition of text in video content, for example, words on objects like signs or clothing, subtitles and captions, or opening/ending credits. Video OCR algorithms may use a variety of methods for detecting text over a series of video frames.
Inputs
Video file
Output Formats
- amp_vocr: VOCR texts, timestamps, and bounding coordinates in AMP JSON format (see below)
- amp_vocr_dedupe: VOCR output in AMP JSON format with same or similar text from consecutive frames deduplicated
- amp_vocr_csv: VOCR output from AMP JSON in CSV format, suitable for opening in a spreadsheet
- azure_video_index: Raw output from Azure Video Indexer containing VOCR data
- azure_artifact_ocr: Raw output from Azure Video Indexer containing detailed bounding coordinates for VOCR output
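As an informal illustration of how these formats relate, the sketch below (not the actual AMP converter) flattens an amp_vocr JSON file into one CSV row per detected text. The file names and column labels are invented for the example; the columns in the real amp_vocr_csv output may differ.

```python
import csv
import json

# Illustrative only: flatten an amp_vocr JSON file (structure documented under
# "AMP JSON Output" below) into one CSV row per detected text.
with open("myfile_vocr.json") as f:          # hypothetical input file name
    vocr = json.load(f)

with open("myfile_vocr.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["start", "text", "xmin", "ymin", "xmax", "ymax"])
    for frame in vocr["frames"]:
        for obj in frame["objects"]:
            v = obj["vertices"]
            writer.writerow([frame["start"], obj["text"],
                             v["xmin"], v["ymin"], v["xmax"], v["ymax"]])
```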
MGMs in AMP
Tesseract Video OCR
Tesseract is an open-source application for OCR of still images. FFmpeg (part of this step; it does not need to be added separately) is used to extract frames from the video at a specified interval, and each extracted frame is then run through Tesseract. A rough sketch of this approach appears after the parameter list below.
Parameters:
- VOCR interval: Interval, in seconds, at which video frames are extracted for VOCR.
- Dedupe: Whether to dedupe consecutive frames with the same text.
- Duplicate gap: Gap (in seconds) within which consecutive VOCR frames with the same text are considered duplicates.
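Below is a minimal sketch of this frame-extraction approach, assuming FFmpeg is on the PATH and the third-party pytesseract and Pillow packages are available. It is not the AMP MGM code, only an outline of the same idea.

```python
import subprocess
import tempfile
from pathlib import Path

import pytesseract          # assumed third-party wrapper around the Tesseract CLI
from PIL import Image

def vocr_frames(video_path, interval=5.0):
    """Extract one frame every `interval` seconds and OCR it with Tesseract."""
    frames = []
    with tempfile.TemporaryDirectory() as tmp:
        # fps = 1/interval keeps one frame per interval of video.
        subprocess.run(
            ["ffmpeg", "-i", video_path, "-vf", f"fps={1.0 / interval}",
             str(Path(tmp) / "frame_%06d.png")],
            check=True,
        )
        for i, frame_path in enumerate(sorted(Path(tmp).glob("frame_*.png"))):
            text = pytesseract.image_to_string(Image.open(frame_path)).strip()
            if text:
                # Approximate start time of the extracted frame.
                frames.append({"start": i * interval, "text": text})
    return frames
```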
Azure Video OCR
Azure Video Indexer is a proprietary video intelligence platform from Microsoft. Video OCR is included as part of this platform.
Parameters:
- Dedupe: Whether to dedupe consecutive frames with the same text.
- Duplicate gap: Gap (in seconds) within which consecutive VOCR frames with the same text are considered duplicates (see the dedupe sketch below).
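Both tools share the Dedupe and Duplicate gap parameters. The sketch below shows one plausible reading of that logic, assuming a list of frame dictionaries like the one produced in the previous sketch; the actual AMP dedupe also collapses similar (not just identical) text and may measure the gap differently.

```python
def dedupe_frames(frames, duplicate_gap=5.0):
    """Drop a frame when it repeats the text of the previously kept frame
    within `duplicate_gap` seconds (one plausible reading of the parameter)."""
    kept = []
    for frame in frames:
        if (kept
                and frame["text"] == kept[-1]["text"]
                and frame["start"] - kept[-1]["start"] <= duplicate_gap):
            continue  # treat as a duplicate of the previous frame
        kept.append(frame)
    return kept
```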
Notes on Use
- Each of the tools has at least one of the outputs checked by default. Checked outputs will display the workflow step and output in the dashboard when "Show Relevant Results Only" is turned on. See Tips for Creating a Workflow for an explanation of what each output option means.
- Azure Video Indexer does not work well on black and white video and will usually generate an error. Try using Tesseract Video OCR for black and white.
Use Cases and Example Workflows
Use case 1: Captioning for accessibility
A collection manager wants to extract text from chapter titles and signs to include in the captions for a collection of early motion pictures, so they can make the films more accessible to all users and meet their university's web accessibility requirements. They send their video files through a Video OCR workflow, which returns a contact sheet of the frames containing detected text, along with the corresponding timestamps. The CM then has a student assistant edit these texts and incorporate them into the VTT transcript.
Notes:
- This example uses the amp_vocr_dedupe output to show the deduplicated VOCR texts in the contact sheet. Depending on the parameters set for the Tesseract Video OCR step, duplicate texts may still appear.
Use case 2: Metadata from closing credits
A collection manager wants to add all of the contributors to a motion picture to the metadata record. They need only the texts that appeared in the video, not where or how many times they appeared. They send their video files through a Video OCR workflow, which returns the list of texts along with a contact sheet of the frames containing detected text and their corresponding timestamps, which they can use as a reference while adding metadata to the records.
Notes:
- In this example, the collection manager is using Azure for video OCR. Because Azure rolls all of its video tools into one service, the collection manager must add the Azure Video Indexer step first and then add Azure Video OCR to convert the video OCR data from Azure Video Indexer into AMP JSON for further conversion. Azure Video Indexer also outputs an Azure Artifact OCR file, which includes additional data needed to create the AMP JSON output.
- This example shows two final outputs for the workflow: contact sheets with a thumbnail for each instance of on-screen text, and the same data output to a CSV file.
Evaluating Video OCR MGMs
There are two tests for evaluating the accuracy of identifying and recognizing text in video with video OCR MGMs: Precision/Recall of Texts and Precision/Recall of Unique Texts.
Precision/Recall of Texts
This test takes a structured list of ground truth texts and compares it to the Video OCR output, matching on the texts to find true positives, false positives, and false negatives and to calculate precision, recall, and F1 scores. This is useful if you don't want to record timestamps in your ground truth.
Scores Generated
- Total GT texts
- Total MGM texts
- Count of true positives
- Count of false negatives
- Count of false positives
- Precision
- Recall
- F1
- Accuracy
Output Comparison
This test outputs a table listing the ground truth texts and the texts found by the MGM. If a text was found in both the ground truth and the MGM output, the comparison is a true positive; if it was found only by the MGM, it is a false positive; and if it appears only in the ground truth, it is a false negative. The counts for each text in the ground truth data and the MGM data are also listed.
Example:
text | gt_count | mgm_count | comparison |
---|---|---|---|
INTERNATIONAL | 1 | 1 | true positive |
FILM BUREAU INC. | 1 | 1 | true positive |
presents | 1 | 0 | false negative |
present | 0 | 1 | false positive |
Journey | 1 | 1 | true positive |
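Using the standard definitions of precision, recall, and F1, the example table above (3 true positives, 1 false positive, 1 false negative) works out as follows. How the evaluation counts repeated occurrences of a text is determined by the test itself; this only shows the arithmetic.

```python
# Counts taken from the example table above.
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)                          # 3 / 4 = 0.75
recall = tp / (tp + fn)                             # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
```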
Creating Ground Truth
Create a CSV file with one column labeled "text" with a list of all texts in the video, one line of text per row.
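For example, a ground truth file covering the texts in the table above might look like this (illustrative only):

```
text
INTERNATIONAL
FILM BUREAU INC.
presents
Journey
```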
Precision/Recall of Unique Texts
This test takes a structured list of ground truth texts and compares it to the Video OCR output, matching on the list of unique texts to find true positives, false positives, and false negatives. This is useful if you don't care when or how many times a text appears in the video.
Scores Generated
- Total GT unique texts
- Total MGM unique texts
- Count of true positives
- Count of false negatives
- Count of false positives
- Precision
- Recall
- F1
- Accuracy
Output Comparison
This test outputs a table listing the ground truth texts and the texts found by the MGM. If a text was found in both the ground truth and the MGM output, the comparison is a true positive; if it was found only by the MGM, it is a false positive; and if it appears only in the ground truth, it is a false negative. The counts for each text in the ground truth data and the MGM data are also listed.
Example:
text | gt_count | mgm_count | comparison |
---|---|---|---|
INTERNATIONAL | 3 | 2 | true positive |
FILM BUREAU INC. | 3 | 2 | true positive |
presents | 1 | 0 | false negative |
present | 0 | 1 | false positive |
Journey | 3 | 4 | true positive |
Creating Ground Truth
Create a CSV file with one column labeled "text" with a list of all unique texts in the video, one line of text per row.
Sample Evaluation Use Cases
Use case 1: Metadata from closing credits
A collection manager wants to add all of the contributors to a motion picture to the metadata record. They need only the texts that appeared in the video, not where or how many times they appeared. They send their video files through a Video OCR workflow, which returns the list of texts along with a contact sheet of the frames containing detected text and their corresponding timestamps, which they can use as a reference while adding metadata to the records.
Success measures
Key metrics
- High recall of texts (most or all of the ground truth texts are found in the output, at least one occurrence of each)
- High precision of texts (results are mostly correct, few false positives)
Qualitative measures
- OCR does a good job of capturing the credits (other texts are less important)
AMP JSON Output
Element | Datatype | Obligation | Definition |
---|---|---|---|
media | object | required | Wrapper for metadata about the source media file. |
media.filename | string | required | Filename of the source file. |
media.duration | string | required | The duration of the source file. |
media.frameRate | number | required | The frame rate of the video, in FPS. |
media.numFrames | number | required | The number of frames in the video. |
media.resolution | object | required | Resolution of the video. |
media.resolution.width | number | required | Width of the frame, in pixels. |
media.resolution.height | number | required | Height of the frame, in pixels. |
frames | array | required | List of frames containing text. |
frames[*] | object | optional | A frame containing text. |
frames[*].start | string (s.fff) | required | Time of the frame, in seconds. |
frames[*].objects | list | required | List of instances in the frame containing text. |
frames[*].objects[*] | object | required | An instance in the frame containing text. |
frames[*].objects[*].text | string | required | The text within the instance. |
frames[*].objects[*].language | string | optional | The language of the detected text (localized ISO 639-1 code, e.g., "en-US").
frames[*].objects[*].score | object | optional | A confidence or relevance score for the text. |
frames[*].objects[*].score.type | string ("confidence" or "relevance") | required | The type of score, confidence or relevance.
frames[*].objects[*].score.value | number | required | The score value, typically a number in the range of 0-1. |
frames[*].objects[*].vertices | object | required | The top left (xmin, ymin) and bottom right (xmax, ymax) relative bounding coordinates. |
frames[*].objects[*].vertices.xmin | number | required | The top left x coordinate. |
frames[*].objects[*].vertices.ymin | number | required | The top left y coordinate. |
frames[*].objects[*].vertices.xmax | number | required | The bottom right x coordinate. |
frames[*].objects[*].vertices.ymax | number | required | The bottom right y coordinate. |
Sample Output
```json
{
    "media": {
        "filename": "myfile.mov",
        "duration": "8334.335",
        "frameRate": 30.000,
        "numFrames": 1547,
        "resolution": {
            "width": 654,
            "height": 486
        }
    },
    "frames": [
        {
            "start": "625.024",
            "objects": [
                {
                    "text": "Beliefs",
                    "language": "en-US",
                    "score": {
                        "type": "confidence",
                        "value": 0.9903119
                    },
                    "vertices": {
                        "xmin": 219,
                        "ymin": 21,
                        "xmax": 219,
                        "ymax": 21
                    }
                }
            ]
        }
    ]
}
```