Analyzing Annotations

Once you have collected the annotations, factgenie can help you compute statistics over the annotation labels:

[Image: Analysis scheme]

📊 Web interface

You can find the basic statistics tools in the web interface at /analyze. This interface provides statistics about a single annotation campaign.

[Image: Analysis table]

In the table, we can find the following columns:

  • Dataset, split, setup: The source of the corresponding inputs (see terminology).
  • Category: The annotation span category label.
  • Ex. annotated: The number of examples annotated within the campaign.
  • Count: The total number of label occurrences within the annotated examples.
  • Avg. per ex.: The average number of label occurrences per annotated example (= Count / Ex. annotated).
  • Prevalence: The ratio of annotated outputs that contain the label (range 0 to 1). See the sketch below for how these values are computed.

The statistics are provided in full detail and also grouped by various aspects (label categories, setups, datasets).
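
To make the column definitions concrete, here is a minimal sketch (not factgenie's own code) of how these values can be computed from a hypothetical list of annotated outputs, where each output is represented by the list of span category labels found in it:

```python
# Minimal sketch, assuming each annotated output is represented by the list
# of span category labels found in it (hypothetical data, not factgenie's API).
from collections import Counter

annotated_outputs = [
    ["Incorrect", "Incorrect", "Misleading"],  # output 1
    [],                                        # output 2 (no annotations)
    ["Incorrect"],                             # output 3
]

ex_annotated = len(annotated_outputs)
counts = Counter(label for labels in annotated_outputs for label in labels)

for category, count in counts.items():
    avg_per_ex = count / ex_annotated
    prevalence = sum(category in labels for labels in annotated_outputs) / ex_annotated
    print(f"{category}: Count={count}, Avg. per ex.={avg_per_ex:.2f}, Prevalence={prevalence:.2f}")
```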

Note that the page with individual statistics for each campaign can also be opened using the "View statistics" button on the campaign detail page.

[Image: Analysis view]

🖥️ Command line interface

For detailed statistical analysis of annotation campaigns, factgenie provides two commands: factgenie stats and factgenie iaa.

🧮 Statistics

Basic statistical analysis tools are accessible through the factgenie stats command group:

factgenie stats [command] [options]

Annotation Counts

The counts command provides basic statistics such as the number of annotations per example and the average span length.

Usage:

factgenie stats counts --campaign [ID] [options]

Key parameters:

  • --campaign: Campaign ID (required)
  • --annotator-group: Specific annotator group ID (optional, all groups used if omitted)
  • --include-dataset: Only include specified datasets (can be specified multiple times)
  • --include-split: Only include specified splits (can be specified multiple times)
  • --output: Output file to save results (JSON format)

The command reports:

  • Total examples considered and annotated
  • Examples with/without annotations
  • Percentage of empty examples
  • Total annotation count
  • Average annotations per example
  • Average annotation length in characters

Example:

factgenie stats counts --campaign human-eval --annotator-group 1

Confusion Matrix

The confusion command computes a confusion matrix comparing annotations between two campaigns or annotator groups, showing how the categories assigned by one group match those assigned by the other.

Usage:

factgenie stats confusion --ref-campaign [ID] --hyp-campaign [ID] [options]

Key parameters:

  • --ref-campaign: Reference campaign ID (required)
  • --ref-group: Reference annotator group (optional, all groups used if omitted)
  • --hyp-campaign: Hypothesis campaign ID (required)
  • --hyp-group: Hypothesis annotator group (optional, all groups used if omitted)
  • --normalize: Flag to normalize matrix by row (reference annotations)
  • --include-dataset, --include-split, --include-example-id: Filter options
  • --output: Output CSV file for the confusion matrix
  • --output-plot: Output file for confusion matrix visualization

The confusion matrix shows:

  • Cross-tabulation of annotation categories between two groups
  • Reference counts (row sums) and hypothesis counts (column sums)
  • Optional normalization for percentage-based comparison

Example:

factgenie stats confusion --ref-campaign human-annot --hyp-campaign llm-eval --normalize --output-plot confusion.png
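
For intuition, here is a minimal sketch (not factgenie's own code) of how such a cross-tabulation and its row normalization can be computed with pandas, assuming you already have a list of matched span pairs with the category assigned by each group:

```python
# Minimal sketch with hypothetical matched span pairs
# (reference category, hypothesis category); not factgenie's API.
import pandas as pd

pairs = [
    ("Incorrect", "Incorrect"),
    ("Incorrect", "Misleading"),
    ("Misleading", "Misleading"),
    ("Incorrect", "Incorrect"),
]

df = pd.DataFrame(pairs, columns=["reference", "hypothesis"])
matrix = pd.crosstab(df["reference"], df["hypothesis"])
print(matrix)                                  # raw counts
print(matrix.div(matrix.sum(axis=1), axis=0))  # normalized by row (reference)
```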

⚖️ Inter-annotator agreement

For analysis of inter-annotator agreement (IAA), factgenie provides a set of command-line tools that can measure agreement between different campaigns or annotator groups. These metrics help assess the reliability and consistency of annotations.

Factgenie implements three different agreement metrics, each suitable for different scenarios:

  1. F1 Score - A character-level agreement measure that computes precision, recall, and F1 by treating one annotation set as the reference and another as the hypothesis. Useful for comparing annotation quality against a gold standard.

  2. Pearson Correlation - Measures how per-example annotation counts correlate between two groups. Useful for checking whether two annotator groups identify similar numbers of errors, even if not exactly the same spans.

  3. Gamma Agreement - A comprehensive span-level agreement measure that considers both the position and categories of annotations. Particularly suitable for measuring agreement between multiple annotator groups simultaneously.

Using the IAA Command Line Interface

All IAA metrics are accessible through the factgenie iaa command group:

factgenie iaa [metric] [options]

Where [metric] is one of: f1, pearson, or gamma.

F1 Score

The F1 score measures character-level precision, recall, and F1 between reference and hypothesis annotator groups. It's useful for assessing exact annotation overlap.

[!NOTE] How do we compute the F1 score?

For each character position in the hypothesis spans, the algorithm attempts to find a matching character position in the reference spans. Each character position can only be matched once per reference span. In effect, overlapping spans contribute proportionally to the overall scores:

  • Precision = overlap_count / total_hypothesis_characters
  • Recall = overlap_count / total_reference_characters
  • F1 = 2 * Precision * Recall / (Precision + Recall)
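
A minimal sketch of this character-level matching (not factgenie's implementation; spans are hypothetical (start, end, category) triples):

```python
# Minimal sketch of character-level P/R/F1 between reference and hypothesis
# spans; not factgenie's implementation. A span is a (start, end, category)
# triple; "hard" matching additionally requires the categories to agree.

def char_categories(spans):
    """Map each annotated character position to the set of categories covering it."""
    positions = {}
    for start, end, category in spans:
        for i in range(start, end):
            positions.setdefault(i, set()).add(category)
    return positions

def span_f1(ref_spans, hyp_spans, match_mode="hard"):
    ref = char_categories(ref_spans)
    overlap = total_hyp = 0
    for start, end, category in hyp_spans:
        for i in range(start, end):
            total_hyp += 1
            ref_cats = ref.get(i)
            if ref_cats and (match_mode == "soft" or category in ref_cats):
                overlap += 1
    total_ref = sum(end - start for start, end, _ in ref_spans)
    precision = overlap / total_hyp if total_hyp else 0.0
    recall = overlap / total_ref if total_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One reference span [0, 10) and one hypothesis span [5, 15), same category:
# 5 overlapping characters out of 10 on each side -> P = R = F1 = 0.5.
print(span_f1([(0, 10, "Incorrect")], [(5, 15, "Incorrect")]))
```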

Usage:

factgenie iaa f1 --ref-campaign [ID] --hyp-campaign [ID] [options]

Key parameters:

  • --ref-campaign: Reference campaign ID (required)
  • --ref-group: Reference annotator group (optional, all groups used if omitted)
  • --hyp-campaign: Hypothesis campaign ID (required)
  • --hyp-group: Hypothesis annotator group (optional, all groups used if omitted)
  • --match-mode: Either "hard" (requires same category, default) or "soft" (allows any category)
  • --category-breakdown: Flag to calculate metrics per annotation category
  • --include-dataset, --include-split, --include-example-id: Filter options
  • --output: Output file to save results (JSON format)

Example:

factgenie iaa f1 --ref-campaign human-annot --hyp-campaign llm-eval --match-mode hard --category-breakdown

Pearson Correlation

The Pearson correlation measures the linear correlation between annotation counts per example from two annotator groups. This helps assess whether annotators identify similar numbers of errors, even if not at the same positions.

Usage:

factgenie iaa pearson --campaign1 [ID] --campaign2 [ID] [options]

Key parameters:

  • --campaign1: First campaign ID (required)
  • --group1: First annotator group (optional, all groups used if omitted)
  • --campaign2: Second campaign ID (required)
  • --group2: Second annotator group (optional, all groups used if omitted)
  • --include-dataset, --include-split, --include-example-id: Filter options
  • --output: Output file to save results (JSON format)

Example:

factgenie iaa pearson --campaign1 human-annot --campaign2 llm-eval
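
Conceptually, the score is the standard Pearson coefficient computed over per-example annotation counts. A minimal sketch with hypothetical counts (not factgenie's code):

```python
# Minimal sketch: Pearson correlation between per-example annotation counts
# from two annotator groups (hypothetical data; not factgenie's code).
from scipy.stats import pearsonr

counts_group1 = [3, 0, 2, 5, 1]  # annotations per example, group 1
counts_group2 = [2, 1, 2, 4, 0]  # same examples, same order, group 2

r, p_value = pearsonr(counts_group1, counts_group2)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```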

Gamma Agreement

Gamma agreement is a specialized metric for measuring agreement on segment annotations; see Mathet et al. (2015). In factgenie, we use the Python implementation from the pygamma-agreement package.
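
If you want to experiment with the underlying library directly, a minimal sketch using pygamma-agreement might look as follows (the spans, categories, and group names are hypothetical):

```python
# Minimal sketch using the pygamma-agreement package directly
# (hypothetical spans, categories, and annotator group names).
from pyannote.core import Segment
from pygamma_agreement import CombinedCategoricalDissimilarity, Continuum

continuum = Continuum()
continuum.add("group_1", Segment(0, 10), "Incorrect")
continuum.add("group_1", Segment(15, 22), "Misleading")
continuum.add("group_2", Segment(1, 11), "Incorrect")
continuum.add("group_2", Segment(14, 23), "Misleading")

# alpha weights the positional dissimilarity, beta the categorical one,
# mirroring the --alpha and --beta options of `factgenie iaa gamma`.
dissimilarity = CombinedCategoricalDissimilarity(alpha=1.0, beta=1.0, delta_empty=1.0)
gamma_results = continuum.compute_gamma(dissimilarity)
print(gamma_results.gamma)
```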

Usage:

factgenie iaa gamma --campaign [ID1] --campaign [ID2] [options]

Key parameters:

  • --campaign: Campaign ID (can be specified multiple times)
  • --group: Annotator group (can be specified as many times as campaigns, or omitted)
  • --alpha: Coefficient weighting the positional dissimilarity (default: 1.0)
  • --beta: Coefficient weighting the categorical dissimilarity (default: 1.0)
  • --delta-empty: Empty dissimilarity value (default: 1.0)
  • --soft_gamma: Flag to use soft version of gamma score
  • --include-dataset, --include-split, --include-example-id: Filter options
  • --save-plots: Directory to save alignment plots
  • --output: Output file to save results (JSON format)

Example:

factgenie iaa gamma --campaign human-annot --campaign llm-eval --alpha 0.5 --beta 1.5 --save-plots ./plots

Interpreting Agreement Scores

  • F1 Score: Ranges from 0 to 1. Higher is better. Values above 0.7 typically indicate good agreement.
  • Pearson Correlation: Ranges from -1 to 1. Values close to 1 indicate strong positive correlation, while values close to 0 indicate no correlation.
  • Gamma Score: Ranges from -inf to 1. Higher values indicate better agreement, with 1 representing perfect agreement.

Note

Factgenie fixes the numpy random seed before gamma computation to ensure that the scores are reproducible.

Additional Filtering Options

The stats and iaa commands support filtering options for computing statistics and agreement scores on specific subsets of your data:

  • --include-dataset: Only include specified datasets (can be specified multiple times)
  • --include-split: Only include specified splits (can be specified multiple times)
  • --include-example-id: Only include specified example IDs (can be specified multiple times)

This allows for targeted analysis of agreement on specific parts of your dataset.
