Evaluation metrics
Current metrics
The metrics used in this evaluation follow those in the original ORSUM paper and are widely used in natural language generation. They assess complementary aspects of text quality, including relevance, factual consistency, and semantic coherence.
ROUGE-L
ROUGE-L is the longest-common-subsequence variant of ROUGE (Recall-Oriented Understudy for Gisting Evaluation). It measures the longest common subsequence between the generated and reference texts, capturing the fluency and informativeness of the generated text through the overlap of in-order word sequences.
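A minimal sketch of how ROUGE-L can be computed, assuming the Hugging Face `evaluate` package (which wraps `rouge_score`) is installed; the example texts are placeholders.

```python
# Minimal sketch: ROUGE-L via the Hugging Face `evaluate` package
# (assumes `pip install evaluate rouge_score`; texts are placeholders).
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the model generated this short summary of the article"]
references = ["this is the reference summary of the article"]

results = rouge.compute(predictions=predictions, references=references)
print(results["rougeL"])  # longest-common-subsequence F-measure
```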
BERTScore
BERTScore leverages pre-trained contextual embeddings from BERT to compute token-level cosine similarities between candidate and reference sentences, aggregating the matches into precision, recall, and F1. It has been shown to correlate well with human judgment at both the sentence and system levels, making it a robust metric for assessing semantic similarity.
site: https://huggingface.co/spaces/evaluate-metric/bertscore
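A short sketch using the `evaluate` wrapper linked above; the texts are placeholders and the underlying `bert_score` package must also be installed.

```python
# Minimal sketch: BERTScore via the `evaluate` wrapper
# (assumes `pip install evaluate bert_score`; texts are placeholders).
import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

results = bertscore.compute(predictions=predictions, references=references, lang="en")
print(results["f1"])  # per-sentence F1 scores (list of floats)
```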
FactCC
FactCC evaluates the factual consistency of generated text by checking whether claims made in the generated content align with the information presented in the source document.
site: https://huggingface.co/manueldeprada/FactCC
paper: https://arxiv.org/pdf/1910.12840
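A sketch of scoring a (source, claim) pair with the checkpoint linked above, using the standard `transformers` sequence-classification API. The exact input formatting and label mapping (which class means "consistent") are assumptions here and should be confirmed against the model card; the texts are placeholders.

```python
# Sketch: factual-consistency scoring with the linked FactCC checkpoint.
# NOTE: input formatting and label mapping are assumptions -- check the model card.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "manueldeprada/FactCC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

source = "The company reported a 10% increase in revenue in 2023."
summary = "Revenue grew by ten percent in 2023."

# Encode the (source, claim) pair; truncate long documents to the model's limit.
inputs = tokenizer(source, summary, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
print(probs)  # probability distribution over the consistency labels
```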
SummaC
SummaC uses sentence-level natural language inference (NLI) models to detect inconsistencies in generated text: by evaluating logical entailment between document and summary sentences, it identifies contradictions and supports the coherence of the generated content. The approach computes an entailment score for each pair of input-document and summary sentences and aggregates these sentence-level scores into a single consistency score. Two model variants differ in how they aggregate: SummaC-ZS performs zero-shot aggregation by combining sentence-level scores with max and mean operators, while SummaC-Conv is a trained model consisting of a single learned convolution layer that compiles the distribution of entailment scores over all document sentences into a single score. SummaC-ZS is directly interpretable, while SummaC-Conv generally performs better.
site: https://github.com/tingofurro/summac
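A sketch of the interface from the repository linked above, assuming the `summac` package is installed; the constructor arguments follow the project's README and may change between versions, and the texts are placeholders.

```python
# Sketch of the SummaC interface (assumes `pip install summac`; arguments follow
# the repository README and may differ between versions).
from summac.model_summac import SummaCZS, SummaCConv

document = "Scientists discovered a new species of frog in the Amazon rainforest."
summary = "A new frog species was found in the Amazon."

model_zs = SummaCZS(granularity="sentence", model_name="vitc", device="cpu")
model_conv = SummaCConv(models=["vitc"], bins="percentile", granularity="sentence",
                        nli_labels="e", device="cpu", start_file="default", agg="mean")

print(model_zs.score([document], [summary])["scores"])    # zero-shot (max/mean aggregation)
print(model_conv.score([document], [summary])["scores"])  # learned convolutional aggregation
```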
DiscoScore
DiscoScore is a discourse-level evaluation metric that assesses text coherence using BERT embeddings. It models discourse coherence from different perspectives, driven by Centering theory, and has been shown to correlate strongly with human ratings of coherence and factual consistency. DiscoScore has two variants, DS-Focus (FocusDiff) and DS-SENT (SentGraph), and two focus choices, noun (NN) and semantic entity (Entity). For comparison, the discourse metrics RC, LC, Entity Graph, and Lexical Chain are also considered. RC and LC, which use discourse features, outperform most non-discourse metrics and coherence models (Entity Graph and Lexical Chain) without access to source texts and references, but both fall short of DS-Focus and DS-SENT. On SummEval, DS-Focus performs better than DS-SENT, and DS-Focus (NN) outperforms DS-Focus (Entity) with respect to coherence, consistency, fluency, and relevance.
site: https://github.com/AIPHES/DiscoScore
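A sketch of the `DiscoScorer` interface as shown in the repository README linked above; the method names and arguments are taken from that README and may differ between versions, and the texts are placeholders.

```python
# Sketch of the DiscoScorer interface (install from the AIPHES/DiscoScore repo;
# method names follow its README and may change between versions).
from disco_score import DiscoScorer

scorer = DiscoScorer(device="cpu", model_name="bert-base-uncased")

system = "paul merson was brought on with only seven minutes remaining."
references = ["paul merson was brought on in the final minutes of the match."]

s = system.lower()
refs = [r.lower() for r in references]
print(scorer.DS_Focus_NN(s, refs))  # FocusDiff variant
print(scorer.DS_SENT_NN(s, refs))   # SentGraph variant
print(scorer.RC(s, refs))           # discourse-feature baseline
```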
Future Directions
One key question is whether we need separate evaluation metrics for abstractive vs extractive summaries. An abstractive summary often paraphrases or synthesizes information, so it may look quite different from the original text. In contrast, an extractive summary directly uses phrases or sentences from the source. This difference suggests that a single metric might not suit both styles equally well. For example, a purely extractive summary will naturally score higher on word-overlap metrics, whereas an abstractive summary might convey the same meaning with different words and get undervalued by those same overlap-based metrics.
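A toy illustration of this gap, reusing the `evaluate` wrappers from above on placeholder sentences: an extractive copy scores at the ceiling on ROUGE-L, while a faithful paraphrase is penalized by word overlap even though an embedding-based metric rates the two much more closely.

```python
# Toy illustration (placeholder texts): overlap metrics reward extraction,
# embedding-based metrics are more forgiving of paraphrase.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

reference = ["the committee approved the budget after a long debate"]
extractive = ["the committee approved the budget after a long debate"]              # copied span
abstractive = ["following lengthy discussion, the panel passed the spending plan"]  # paraphrase

for name, pred in [("extractive", extractive), ("abstractive", abstractive)]:
    r = rouge.compute(predictions=pred, references=reference)["rougeL"]
    b = bertscore.compute(predictions=pred, references=reference, lang="en")["f1"][0]
    print(name, round(r, 3), round(b, 3))
```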
In multi-document summarization, evaluation becomes even trickier. A good summary is expected to combine information from multiple sources, which means it won't mirror any one source document exactly. The summary might omit details from each individual document while still being a correct synthesis of the overall content. Traditional metrics that compare a summary to a single reference can struggle here. If we simply compare a multi-doc summary to one source at a time, the overlaps will be low, yet the summary could still be perfectly valid. We may consider metrics that account for the union of information across all source documents.
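One simple way to approximate such a union-of-sources comparison is to score the summary against each source separately and aggregate (e.g., maximum or mean over sources), or to treat the concatenation of all sources as a single pseudo-reference. The helper below is hypothetical and only sketches that idea with ROUGE-L; the documents and summary are placeholders.

```python
# Hypothetical helper: score a multi-document summary against each source and
# aggregate, rather than against any single document (sketch only).
import evaluate

rouge = evaluate.load("rouge")

def multi_source_rouge_l(summary: str, sources: list[str]) -> dict:
    per_source = [
        rouge.compute(predictions=[summary], references=[src])["rougeL"]
        for src in sources
    ]
    # Also compare against the concatenation as a rough "union" reference.
    union = rouge.compute(predictions=[summary], references=[" ".join(sources)])["rougeL"]
    return {"max": max(per_source), "mean": sum(per_source) / len(per_source), "union": union}

sources = [
    "Document one reports the event from the organizer's perspective.",
    "Document two adds attendance figures and quotes from participants.",
]
print(multi_source_rouge_l("The event drew many participants, organizers said.", sources))
```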