Video Optical Character Recognition - VOCR
- Inputs
- Output Formats
- Notes on Use
- Use Cases and Example Workflows
- Evaluating Video OCR MGMs
- AMP JSON Output
Video OCR (VOCR) is the recognition of text in video content, for example, words on objects like signs or clothing, subtitles and captions, or opening/ending credits. Video OCR algorithms may use a variety of methods for detecting text over a series of video frames.
Inputs
Video file
Output Formats
- amp_vocr: VOCR texts, timestamps, and bounding coordinates in AMP JSON format (see below)
- amp_vocr_dedupe: VOCR output in AMP JSON format with same or similar text from consecutive frames deduplicated
- amp_vocr_csv: VOCR output from AMP JSON in CSV format, suitable for opening in a spreadsheet
- azure_video_index: Raw output from Azure Video Indexer containing VOCR data
- azure_artifact_ocr: Raw output from Azure Video Indexer containing detailed bounding coordinates for VOCR output
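As an informal illustration of how these formats relate, the sketch below (not the actual AMP converter) flattens an amp_vocr JSON file into one CSV row per detected text. The file names and column labels are invented for the example; the columns in the real amp_vocr_csv output may differ.

```python
import csv
import json

# Illustrative only: flatten an amp_vocr JSON file (structure documented under
# "AMP JSON Output" below) into one CSV row per detected text.
with open("myfile_vocr.json") as f:          # hypothetical input file name
    vocr = json.load(f)

with open("myfile_vocr.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["start", "text", "xmin", "ymin", "xmax", "ymax"])
    for frame in vocr["frames"]:
        for obj in frame["objects"]:
            v = obj["vertices"]
            writer.writerow([frame["start"], obj["text"],
                             v["xmin"], v["ymin"], v["xmax"], v["ymax"]])
```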
MGMs in AMP
Tesseract Video OCR
Tesseract is an open-source application for OCR of still images. FFmpeg (part of this step; it does not need to be added separately) is used to extract frames from the video at a specified interval, and each extracted frame is then run through Tesseract. A rough sketch of this approach appears after the parameter list below.
Parameters:
- VOCR interval: Interval, in seconds, at which video frames are extracted for VOCR.
- Dedupe: Whether to dedupe consecutive frames with the same text.
- Duplicate gap: Gap (in seconds) within which consecutive VOCR frames with the same text are considered duplicates.
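Below is a minimal sketch of this frame-extraction approach, assuming FFmpeg is on the PATH and the third-party pytesseract and Pillow packages are available. It is not the AMP MGM code, only an outline of the same idea.

```python
import subprocess
import tempfile
from pathlib import Path

import pytesseract          # assumed third-party wrapper around the Tesseract CLI
from PIL import Image

def vocr_frames(video_path, interval=5.0):
    """Extract one frame every `interval` seconds and OCR it with Tesseract."""
    frames = []
    with tempfile.TemporaryDirectory() as tmp:
        # fps = 1/interval keeps one frame per interval of video.
        subprocess.run(
            ["ffmpeg", "-i", video_path, "-vf", f"fps={1.0 / interval}",
             str(Path(tmp) / "frame_%06d.png")],
            check=True,
        )
        for i, frame_path in enumerate(sorted(Path(tmp).glob("frame_*.png"))):
            text = pytesseract.image_to_string(Image.open(frame_path)).strip()
            if text:
                # Approximate start time of the extracted frame.
                frames.append({"start": i * interval, "text": text})
    return frames
```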
Azure Video OCR
Azure Video Indexer is a proprietary video intelligence platform from Microsoft. Video OCR is included as part of this platform.
Parameters:
- Dedupe: Whether to dedupe consecutive frames with the same text.
- Duplicate gap: Gap (in seconds) within which consecutive VOCR frames with the same text are considered duplicates (see the dedupe sketch below).
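Both tools share the Dedupe and Duplicate gap parameters. The sketch below shows one plausible reading of that logic, assuming a list of frame dictionaries like the one produced in the previous sketch; the actual AMP dedupe also collapses similar (not just identical) text and may measure the gap differently.

```python
def dedupe_frames(frames, duplicate_gap=5.0):
    """Drop a frame when it repeats the text of the previously kept frame
    within `duplicate_gap` seconds (one plausible reading of the parameter)."""
    kept = []
    for frame in frames:
        if (kept
                and frame["text"] == kept[-1]["text"]
                and frame["start"] - kept[-1]["start"] <= duplicate_gap):
            continue  # treat as a duplicate of the previous frame
        kept.append(frame)
    return kept
```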
Notes on Use
- Each of the tools has at least one of the outputs checked by default. Checked outputs will display the workflow step and output in the dashboard when "Show Relevant Results Only" is turned on. See Tips for Creating a Workflow for an explanation of what each output option means.
- Azure Video Indexer does not work well on black and white video and will usually generate an error. Try using Tesseract Video OCR for black and white.
Use Cases and Example Workflows
Use case 1: Captioning for accessibility
A collection manager wants to extract text from chapter titles and signs to include in the captions for a collection of early motion pictures, so they can make the films more accessible to all users and meet their university's web accessibility requirements. They send their video files through a Video OCR workflow, which returns a contact sheet of the frames containing detected text, along with the corresponding timestamps. The CM then has a student assistant edit these texts and incorporate them into the VTT transcript.
Notes:
- This example uses the amp_vocr_dedupe output to show the deduplicated VOCR texts in the contact sheet. Depending on the parameters set for the Tesseract Video OCR step, duplicate texts may still appear.
Use case 2: Metadata from closing credits
A collection manager wants to add all of the contributors to a motion picture to the metadata record. They need only the texts that appeared in the video, not where or how many times they appeared. They send their video files through a Video OCR workflow, which returns the list of texts along with a contact sheet of the frames containing detected text and their corresponding timestamps, which they can use as a reference while adding metadata to the records.
Notes:
- In this example, the collection manager is using Azure for video OCR. Because Azure rolls all of its video tools into one service, the collection manager must add the Azure Video Indexer step first and then add Azure Video OCR to convert the video OCR data from Azure Video Indexer into AMP JSON for further conversion. Azure Video Indexer also outputs an Azure Artifact OCR file, which includes additional data needed to create the AMP JSON output.
- This example shows two final outputs for the workflow: contact sheets with a thumbnail for each instance of on-screen text, and the same data output to a CSV file.
Evaluating Video OCR MGMs
There are two tests for evaluating the accuracy of identifying and recognizing text in video with video OCR MGMs: Precision/Recall of Texts and Precision/Recall of Unique Texts.
Precision/Recall of Texts
This test takes a structured list of ground truth texts and compares it to the Video OCR output, matching on the texts to find true positives, false positives, and false negatives and to calculate precision, recall, and F1 scores. This is useful if you don't want to record timestamps in your ground truth.
Scores Generated
- Total GT texts
- Total MGM texts
- Count of true positives
- Count of false negatives
- Count of false positives
- Precision
- Recall
- F1
- Accuracy
Output Comparison
This test outputs a table listing the ground truth texts and the texts found by the MGM. If a text was found in both the ground truth and the MGM output, the comparison is a true positive; if it was found only by the MGM, it is a false positive; and if it appears only in the ground truth, it is a false negative. The counts for each text in the ground truth data and the MGM data are also listed.
Example:
text | gt_count | mgm_count | comparison |
---|---|---|---|
INTERNATIONAL | 1 | 1 | true positive |
FILM BUREAU INC. | 1 | 1 | true positive |
presents | 1 | 0 | false negative |
present | 0 | 1 | false positive |
Journey | 1 | 1 | true positive |
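Using the standard definitions of precision, recall, and F1, the example table above (3 true positives, 1 false positive, 1 false negative) works out as follows. How the evaluation counts repeated occurrences of a text is determined by the test itself; this only shows the arithmetic.

```python
# Counts taken from the example table above.
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)                          # 3 / 4 = 0.75
recall = tp / (tp + fn)                             # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
```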
Creating Ground Truth
Create a CSV file with one column labeled "text" with a list of all texts in the video, one line of text per row.
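For example, a ground truth file covering the texts in the table above might look like this (illustrative only):

```
text
INTERNATIONAL
FILM BUREAU INC.
presents
Journey
```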
Precision/Recall of Unique Texts
This test takes a structured list of ground truth texts and compares it to the Video OCR output, matching on the list of unique texts to find true positives, false positives, and false negatives. This is useful if you don't care when or how many times a text appears in the video.
Scores Generated
- Total GT unique texts
- Total MGM unique texts
- Count of true positives
- Count of false negatives
- Count of false positives
- Precision
- Recall
- F1
- Accuracy
Output Comparison
This test outputs a table listing the ground truth texts and the texts found by the MGM. If a text was found in both the ground truth and the MGM output, the comparison is a true positive; if it was found only by the MGM, it is a false positive; and if it appears only in the ground truth, it is a false negative. The counts for each text in the ground truth data and the MGM data are also listed.
Example:
text | gt_count | mgm_count | comparison |
---|---|---|---|
INTERNATIONAL | 3 | 2 | true positive |
FILM BUREAU INC. | 3 | 2 | true positive |
presents | 1 | 0 | false negative |
present | 0 | 1 | false positive |
Journey | 3 | 4 | true positive |
Creating Ground Truth
Create a CSV file with one column labeled "text" with a list of all unique texts in the video, one line of text per row.
Sample Evaluation Use Cases
Use case 1: Metadata from closing credits
A collection manager wants to add all of the contributors to a motion picture to the metadata record. They need only the texts that appeared in the video, not where or how many times they appeared. They send their video files through a Video OCR workflow, which returns the list of texts along with a contact sheet of the frames containing detected text and their corresponding timestamps, which they can use as a reference while adding metadata to the records.
Success measures
Key metrics
- High recall of texts (most or all of the ground truth texts are found in the output, at least one occurrence of each)
- High precision of texts (results are mostly correct, few false positives)
Qualitative measures
- OCR does a good job of capturing the credits (other texts are less important)
AMP JSON Output
Element | Datatype | Obligation | Definition |
---|---|---|---|
media | object | required | Wrapper for metadata about the source media file. |
media.filename | string | required | Filename of the source file. |
media.duration | string | required | The duration of the source file. |
media.frameRate | number | required | The frame rate of the video, in FPS. |
media.numFrames | number | required | The number of frames in the video. |
media.resolution | object | required | Resolution of the video. |
media.resolution.width | number | required | Width of the frame, in pixels. |
media.resolution.height | number | required | Height of the frame, in pixels. |
frames | array | required | List of frames containing text. |
frames[*] | object | optional | A frame containing text. |
frames[*].start | string (s.fff) | required | Time of the frame, in seconds. |
frames[*].objects | list | required | List of instances in the frame containing text. |
frames[*].objects[*] | object | required | An instance in the frame containing text. |
frames[*].objects[*].text | string | required | The text within the instance. |
frames[*].objects[*].language | string | optional | The language of the detected text (localized ISO 639-1 code, e.g., "en-US").
frames[*].objects[*].score | object | optional | A confidence or relevance score for the text. |
frames[*].objects[*].score.type | string ("confidence" or "relevance") | required | The type of score, confidence or relevance.
frames[*].objects[*].score.value | number | required | The score value, typically a number in the range of 0-1. |
frames[*].objects[*].vertices | object | required | The top left (xmin, ymin) and bottom right (xmax, ymax) relative bounding coordinates. |
frames[*].objects[*].vertices.xmin | number | required | The top left x coordinate. |
frames[*].objects[*].vertices.ymin | number | required | The top left y coordinate. |
frames[*].objects[*].vertices.xmax | number | required | The bottom right x coordinate. |
frames[*].objects[*].vertices.ymax | number | required | The bottom right y coordinate. |
Sample Output
```json
{
    "media": {
        "filename": "myfile.mov",
        "duration": "8334.335",
        "frameRate": 30.000,
        "numFrames": 1547,
        "resolution": {
            "width": 654,
            "height": 486
        }
    },
    "frames": [
        {
            "start": "625.024",
            "objects": [
                {
                    "text": "Beliefs",
                    "language": "en-US",
                    "score": {
                        "type": "confidence",
                        "value": 0.9903119
                    },
                    "vertices": {
                        "xmin": 219,
                        "ymin": 21,
                        "xmax": 219,
                        "ymax": 21
                    }
                }
            ]
        }
    ]
}
```