MGM Evaluation
Using the MGM Evaluation Module
Interpreting Ground Truth Test Scores
- How Ground Truth Testing Works
- True Positives, False Positives, and False Negatives
- Precision, Recall, and F1 Scores
Overview
As you begin to work with the different AMP MGMs, you will notice a wide range in output accuracy from tool to tool. Accuracy can depend on many variables, such as media quality, the nature of the content, and whether the material is color or black and white. An MGM that works well for one type of archival material may produce poor results for another, so it is important to test MGMs before relying on them to produce reliable results for new content or collections.
The MGM Evaluation module enables you to assess the quality of MGM outputs through ground truth testing. "Ground truth" is an ideal output from a machine learning tool, a gold standard that can be used as a benchmark to quantitatively measure an MGM's output against. Ground truth data can be created from scratch by humans or by editing the output of an MGM. The AMP MGM Evaluation module contains tests, or scripts, that compare MGM outputs against the ground truth data you create and then generate several types of accuracy scores to help you decide how or if you want to use the MGM for your collection or use case.
Follow the general guidance below for creating ground truth and conducting a test. Available tests for evaluating MGMs and understanding results are described in more detail on each MGM category wiki page:
- Speech-to-Text
- Audio Segmentation
- Named Entity Recognition (NER)
- Shot Detection
- Video OCR
- Applause Detection
- Facial Recognition
Using the MGM Evaluation Module
The MGM Evaluation home page ("All MGMs" under MGM Evaluation in header navigation) shows all MGM categories with ground truth tests. Click "View" for any category to conduct a new test or review results of previous tests.
On each MGM category page, you can read a description of the MGM category and see available MGMs in AMP. If ground truth tests have been conducted for this category, you will see entries for these tests in the table below the MGM listings. You can conduct a new test by clicking on the "New Test" tab above the table or the orange "New Test" button on the right side of the screen.
Conducting a New Test
1) Select a Test: On the New Test page, you will first need to select a test to run. Expand the test box to read a summary of the test and to download a template for creating ground truth data for each MGM output file you want to test. (You can find more detail on each of these tests and how to create ground truth on each MGM category's wiki page.)
After you have created your ground truth data, you will be ready to select your test and proceed. Select the radio button for the test you want to run. If there are additional parameters to set for the test they will display below the test list.
2) Set Parameters: In some cases, you may need to specify additional parameters for running a test. These parameters are different from the parameters you may have set when running an MGM on your media file: they control how the comparison between the MGM output and the ground truth is run. A common parameter is the analysis threshold, usually used when comparing MGM outputs that include time codes. For example, you may be running a test that compares the time codes of segments of speech, music, silence, and noise in the output of an audio segmentation MGM against ground truth data. Setting the analysis threshold to 2 seconds will count MGM segments and ground truth segments as matches (or "true positives") if their start or end times fall within two seconds of each other; setting the threshold to 1 second applies a stricter requirement for counting true positives.
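To make the threshold concrete, here is a minimal sketch of this kind of comparison in Python. The Segment structure, field names, and matching rule are simplifications for illustration only; the actual AMP test scripts may apply the threshold differently.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    label: str    # e.g. "speech", "music", "silence", "noise"
    start: float  # start time in seconds
    end: float    # end time in seconds

def is_match(gt: Segment, mgm: Segment, threshold: float = 2.0) -> bool:
    """Count a ground truth / MGM segment pair as a true positive when the
    labels agree and the start or end times fall within the threshold."""
    return gt.label == mgm.label and (
        abs(gt.start - mgm.start) <= threshold
        or abs(gt.end - mgm.end) <= threshold
    )

gt_segment = Segment("speech", 10.0, 25.0)
mgm_segment = Segment("speech", 11.5, 28.0)

print(is_match(gt_segment, mgm_segment, threshold=2.0))  # True: start times are 1.5 s apart
print(is_match(gt_segment, mgm_segment, threshold=1.0))  # False: the stricter threshold rejects the pair
```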
Follow the guidance in the parameter box to determine what value to set for each parameter. Additional guidance may be found on each MGM category's wiki page.
3) Select MGM Outputs to Test: The MGM Evaluation module assumes you have already run your file through an MGM to generate output. If you have not done this yet, you will need to send your file through a workflow to create the necessary output before running the test. You should also have created the ground truth data for each output file you want to test. If you have not yet done this, download the ground truth template and create the ground truth data using the instructions noted in the test description or using the fuller instructions included on the MGM category wiki page.
When you have generated your MGM outputs and created your ground truth, find and select the MGM outputs in the table. (You may need to use filters to help you find the right MGM outputs.) Notice that as you select the checkboxes for each MGM output, a corresponding row is added to the table below in the next step, "Upload or Select Ground Truth Data."
4) Upload or Select Ground Truth Data: After selecting your MGM outputs, you will need to associate the ground truth data with each one by uploading your ground truth data files or selecting files that may have already been uploaded. For each row in this table, click the "Upload/Select Ground Truth" button to open the upload/select modal. If there are already ground truth data files associated with this media file for this MGM category, you should see them listed in this modal. You can select one of these, or upload a new ground truth data file using the uploader on the modal. (You may wish to use the Label or Description to help distinguish ground truth files from one another if you plan to use them in the future.) Select one file as your ground truth data, then click "Save." You should now see the ground truth file name listed in the row under "Ground Truth." If you decide not to include one or more MGM outputs in your test, you can remove them by clicking "Remove Row."
<screenshot>
Once you have selected a ground truth data file for every row listed in this section, you are ready to start the test by clicking the orange "Submit" button. If the test is successful, you will be redirected to the MGM category page, where you will see your test results listed in the table and available to select for visualization.
Reviewing and Visualizing Test Results
After conducting a successful test, you can review and compare test scores for one or more MGM outputs and/or for one or more MGMs. If you aren't already on the MGM category page after submitting a test, you can reach it from the MGM Evaluation home page ("All MGMs" under "MGM Evaluation" in the header navigation) or from the MGM category link in the header bar, under "MGM Evaluation." If you are on the New Test page for an MGM category, you can also get there by clicking the "Test Results" tab.
Under the Test Results tab, select a test from the dropdown menu for the type of test results you would like to review. If any tests of this type have been conducted, their results will display in a table below. Use the filters to narrow your search and select the test results you would like to review or visualize together. When you have finished selecting test results, click the "Visualize" button at the bottom of the page.
<screenshot>
Review Scores
Clicking the "Visualize" button takes you to the "Review Scores" tab of the visualization page. Here you can view and compare the ground truth testing scores as a table or bar chart for the files you selected. This is a helpful view for comparing the accuracy of a single MGM across multiple different files or for comparing the performance of two or more MGMs by quantitatively assessing their accuracy scores for one or more files. You can learn more about how to interpret these scores below in the "Interpreting Ground Truth Test Scores" section.
<screenshot -- bar chart view>
<screenshot -- table view>
Review Outputs
Click the "Review Outputs" tab to compare the MGM outputs side-by-side against the ground truth data in a table view. This is a helpful view for qualitatively assessing the quality of MGM outputs because it allows you to see what kinds of errors the MGM is making and determine how detrimental those errors may be for your use case. You can view the outputs for only one file at a time in this view, so you will need to toggle the radio buttons on the left sidebar to view different files or the radio buttons below the table to view the results for different tools or test parameter settings.
<screenshot>
Interpreting Ground Truth Test Scores
Ground truth testing is a method of comparing the output of a machine learning tool against a "gold standard," or ideal output, to measure the accuracy of the tool. Accuracy scores like precision, recall, and F1 can be assessed quantitatively to help you estimate how an MGM will perform for your particular collection content or use cases. "Quality" can mean different things to different people for different use cases, so it is important to test the AMP MGMs for yourself to decide if the results will be good enough for what you plan to use them for. This section gives an overview of ground truth testing and of interpreting some common evaluation metrics. See each MGM category's wiki page for more detail on creating ground truth and interpreting scores for specific tests.
How Ground Truth Testing Works
Generally, the process you will use for ground truth testing will look like this (a simplified code sketch of these steps follows the list):
- Create ground truth data for a media file in a structured format, like CSV or JSON.
- Run the media file through the MGM.
- Convert the MGM output to the same data format as the ground truth.
- Compare the MGM output to the ground truth dataset using a script or other method of calculating accuracy scores.
- Calculate quality metrics, like precision, recall, and F1 scores, and output these scores into a human readable or visualizable format.
- Review the scores with your team to assess the quality of the MGM for your collections and use cases.
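As a rough illustration of this process, the sketch below compares a list of unique terms from a ground truth file against a list from an MGM output file and computes the scores described in the next section. The file names, CSV column header, and set-based comparison are assumptions for the example; the actual AMP test scripts work differently depending on the MGM category.

```python
import csv

def load_terms(path: str, column: str = "text") -> set[str]:
    """Read one column from a CSV file into a set of unique terms.
    The file layout and column name here are placeholders."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[column].strip() for row in csv.DictReader(f) if row[column].strip()}

ground_truth = load_terms("ground_truth.csv")  # hypothetical ground truth file
mgm_output = load_terms("mgm_output.csv")      # hypothetical MGM output converted to the same format

true_positives = len(ground_truth & mgm_output)   # terms found in both
false_positives = len(mgm_output - ground_truth)  # MGM terms not in the ground truth
false_negatives = len(ground_truth - mgm_output)  # ground truth terms the MGM missed

precision = true_positives / (true_positives + false_positives) if mgm_output else 0.0
recall = true_positives / (true_positives + false_negatives) if ground_truth else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```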
True Positives, False Positives, and False Negatives
The process of ground truth testing usually involves comparing MGM outputs against ground truth data and classifying those comparisons as true positives, false positives, or false negatives.
True positives are MGM predictions that match entries in the ground truth.
False positives are MGM predictions that do not appear in the ground truth.
False negatives are entries in the ground truth that the MGM failed to identify.
For example, suppose you run a Video OCR MGM on a video file with text and you run a test to determine the accuracy of the MGM based on a comparison of the unique texts found in the MGM results and the ground truth. You might visualize these comparisons as follows:
| Ground Truth | MGM | Comparison |
| --- | --- | --- |
| Joe Louis | Joe Louis | true positive |
| A Great Depression |  | false negative |
|  | A Great Depresslom | false positive |
| and World War II | and World War II | true positive |
| American Hero | American Hero | true positive |
| Lilly Library |  | false negative |
|  | Liily Llbrary | false positive |
| Indiana University | Indiana University | true positive |
| September 12, 1991 | September 12, 1991 | true positive |
Precision, Recall, and F1 Scores
We can use these comparisons to calculate several types of accuracy scores:
Precision
Precision tells us how accurate the MGM's predictions were. High precision means that the MGM did a good job of matching the ground truth without generating too many false positives. To calculate precision, we use this formula:
- Precision = True Positives / (True Positives + False Positives)
Precision is an important measure to consider when the cost of false positives is high. For example, if there is concern about potentially inaccurate or offensive terms making their way into the metadata, a larger number of false positives means more staff time spent reviewing results. Requiring a high precision score would help to minimize this risk.
Recall
Recall tells us how well the MGM did at finding as many true positives as possible (even if it also identified a lot of false positives). To calculate recall, we use this formula:
- Recall = True Positives / (True Positives + False Negatives)
Recall is an important measure to consider when the cost of false negatives is high, for instance, when missing important terms would be detrimental to a user's search. If a high number of false positives is not an issue for your use case, requiring a high recall score would help to mitigate this risk.
F1
F1 is the harmonic mean of precision and recall and is considered a balance between the two. F1 is calculated using this formula:
- F1 = 2 * (Precision * Recall) / (Precision + Recall)
While F1 can be used as a rough measure of quality, none of these metrics alone should be trusted to reveal the whole picture of quality.
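As a worked example, the Video OCR comparison table above contains 5 true positives, 2 false positives, and 2 false negatives, which gives:
- Precision = 5 / (5 + 2) ≈ 0.71
- Recall = 5 / (5 + 2) ≈ 0.71
- F1 = 2 * (0.71 * 0.71) / (0.71 + 0.71) ≈ 0.71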