Notes for ORSUM Paper

1 Introduction

  • Discussions in scientific documents are nuanced and multi-faceted
    • Multiple viewpoints may coexist and no single opinion dominates
  • Most abstractive product opinion summarization datasets are synthetic
    • Redundant cut-and-paste extracts built by combining extracted snippets or by sampling
  • Scientific meta-reviews: controversies, consensuses, decision making
  • Distinct challenges: decision consistency, discussion involvement
  • Proposed prompting method: Checklist-guided Iterative Introspection
  • Proposed evaluation framework: reference-free LLM-based metrics on key quality aspects of meta-reviews

2 Related Work

  • Three stages of opinion summarization:
    • aspect extraction: identifies specific features
    • polarity identification: assesses the sentiment towards each aspect
    • summary generation: compiles aspects and sentiments
  • Unsupervised abstractive approaches have been shown to be more fluent, informative, coherent, and concise than traditional extractive summaries
  • Synthetic pseudo-summaries in the product review domain: detached from real-world distributions, possibly irrelevant or inconsistent with input documents, and ignoring important underlying details
  • MReD focuses on structure-controlled text generation, while ORSUM provides a prompting-based solution, with broader evaluations

3 Task Formulation

  • Task of Scientific Opinion Summarization
    • Given: a research paper’s title, abstract, and set of reviews
    • Generate: a meta-review that summarizes the reviewers’ opinions and makes a decision recommendation for acceptance or rejection (see the input/output sketch after this list)
    • Entails:
      • Summarizing the paper’s key strengths and weaknesses
      • Explicitly evaluating whether the strengths outweigh the weaknesses
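
A minimal sketch of the task's input/output shape in Python; the class, field, and function names here are illustrative assumptions, not the repository's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Submission:
    """Input side of the task: the paper plus its official reviews."""
    title: str
    abstract: str
    reviews: List[str]      # one free-text review per official reviewer

@dataclass
class MetaReview:
    """Output side: an opinion summary plus a decision recommendation."""
    summary: str            # synthesizes strengths/weaknesses and (dis)agreements
    recommendation: str     # "accept" or "reject"

def summarize_opinions(submission: Submission) -> MetaReview:
    """Placeholder for any scientific opinion summarization system,
    e.g. a prompting pipeline over an LLM."""
    raise NotImplementedError
```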

4 Dataset: ORSUM

  • Collection
    • Open-source papers and human-written meta-reviews collected from OpenReview
    • 15,062 meta-reviews and 57,536 reviews from 47 conference venues
      • Excluded papers with meta-reviews shorter than 20 tokens and comments by non-official reviewers
    • train/validation/test split: 9,890/549/550
    • Each entry includes: URL, title, abstract, decision, the area chair’s meta-review, and reviews from individual reviewers
  • Comparison
    • A higher percentage of novel 4-grams indicates greater abstractiveness (a computation sketch follows this section)
    • A lower Normalized Inverse of Diversity (NID) signifies lower redundancy, indicating that many reviews address distinct aspects
  • Composition Analysis [ngaeghy: the reasoning of this section doesn't make much sense]
    • Meta-reviews assessed by humans for discussion involvement: advantages/disadvantages and agreements/disagreements
    • Low percentage of comprehensive meta-reviews: a gap in coverage and thoroughness that may affect the performance and reliability of models trained on the data
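
A rough sketch of the novel 4-gram abstractiveness measure referenced above, assuming naive lowercased whitespace tokenization (the paper's exact tokenization is not given in these notes).

```python
from typing import Iterable, List, Set, Tuple

def ngrams(tokens: List[str], n: int = 4) -> Set[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(meta_review: str, reviews: Iterable[str], n: int = 4) -> float:
    """Percentage of the meta-review's n-grams that never appear in the
    concatenated source reviews; higher means more abstractive."""
    source_ngrams = ngrams(" ".join(reviews).lower().split(), n)   # naive tokenization
    target_ngrams = ngrams(meta_review.lower().split(), n)
    if not target_ngrams:
        return 0.0
    novel = [g for g in target_ngrams if g not in source_ngrams]
    return 100.0 * len(novel) / len(target_ngrams)
```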

5 Prompting Method: Checklist-guided Iterative Introspection

  • Break the task into multiple steps, consistently requesting evidence for each.
  • A checklist-guided self-feedback mechanism, in which feedback is derived from questions in a predefined checklist (the full pipeline is sketched after this section)
  • Initial Run generates a draft in four steps:
    • (1) prompt to extract and rank opinions, while including sentiment, aspect, and evidence; each review truncated to 300 tokens
    • (2) prompt to list the most important advantages and disadvantages, their evidence, and their reviewers
    • (3) prompt to list the consensuses and controversies, their evidence, and their reviewers
    • (4) given the acceptance/rejection decision, prompt to write a meta-review based on information from steps (1)–(3)
  • Iterative Run: iteratively poses questions, obtains self-feedback, requests further refinement
    • select an assessment question from a pre-constructed list of questions, covering the four most crucial aspects of meta-reviews
      • checklist can be expanded and adapted to other complex text generation tasks
      • collect the refinement suggestions, use as prompts to generate a revised version of the meta-review
      • checklist questions are posed sequentially within one iterative run; the number of iterations is a hyper-parameter
  • Benefits:
    • eliminates the need for external scoring functions that demand training data or human annotations
    • provides a general solution for employing LLMs as black boxes in complex text generation tasks
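
A hedged sketch of the CGI2 pipeline as described above: the prompt and checklist wording are paraphrased assumptions, and `llm` stands in for any black-box LLM call rather than a specific vendor API.

```python
from typing import Callable, List

LLM = Callable[[str], str]   # placeholder for any chat-completion call

# Assumed checklist wording, paraphrasing the four key quality aspects in these notes;
# the checklist can be expanded or adapted to other complex generation tasks.
CHECKLIST = [
    "Does the meta-review discuss the paper's main strengths and weaknesses?",
    "Does it cover agreements and disagreements among the reviewers?",
    "Does it avoid contradicting the reviewers' opinions?",
    "Is its recommendation consistent with the final decision?",
]

def initial_run(llm: LLM, title: str, abstract: str,
                reviews: List[str], decision: str) -> str:
    """Four-step draft generation; prompts are paraphrased, not the paper's exact wording."""
    # Very rough character-based stand-in for truncating each review to ~300 tokens.
    reviews_text = "\n\n".join(r[:1500] for r in reviews)
    opinions = llm("Extract and rank the opinions in these reviews, giving the "
                   f"sentiment, aspect, and evidence for each:\n{reviews_text}")
    pros_cons = llm("List the most important advantages and disadvantages, their "
                    f"evidence, and which reviewers raised them:\n{opinions}")
    consensus = llm("List the consensuses and controversies, their evidence, and "
                    f"the reviewers involved:\n{opinions}")
    return llm(f"The decision is '{decision}'. Based on the information below, write a "
               f"meta-review for the paper '{title}' (abstract: {abstract}).\n"
               f"{pros_cons}\n{consensus}")

def iterative_run(llm: LLM, draft: str, iterations: int = 2) -> str:
    """Checklist-guided self-feedback: pose each question, collect suggestions, revise."""
    for _ in range(iterations):           # number of iterations is a hyper-parameter
        for question in CHECKLIST:        # questions posed sequentially within one run
            feedback = llm(f"Meta-review draft:\n{draft}\n\n"
                           f"{question} Suggest concrete refinements.")
            draft = llm("Revise the meta-review draft according to the suggestions.\n"
                        f"Draft:\n{draft}\nSuggestions:\n{feedback}")
    return draft
```

Because the feedback comes from the checklist questions themselves, no external scoring function or human annotation is needed, which is the main selling point of the method.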

6 Evaluation

  • Task evaluation should be multifaceted and go beyond n-gram similarity
    • existing metrics are inadequate for this task
  • Proposed comprehensive evaluation framework combines standard evaluation metrics with LLM-based evaluation metrics
  • Standard metrics in NLG:
    • relevance: ROUGE-L (longest common subsequence), BERTScore (contextualized embeddings)
    • factual consistency: FACTCC (target claim consistent with source facts), SummaC (sentence-level inference models for inconsistency detection)
    • semantic coherence: DiscoScore (average over six BERT-based discourse-coherence variants)
  • LLM-based Metrics
    • Supplementary, reference-free LLM-based measures (a scoring sketch follows this section)
    • Key aspects:
      • Discussion Involvement: strengths and weaknesses, agreements and disagreements amongst reviewers
      • Opinion Faithfulness: whether the meta-review contradicts reviewers’ opinions
      • Decision Consistency: whether the meta-review accurately reflects the final decision
    • G-Eval and GPTLikert show high correlation with human judgments
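
A hedged sketch of the two halves of the framework: a reference-based relevance score using the rouge-score and bert-score packages, and a reference-free Likert-style LLM rating for one key aspect (Decision Consistency). The prompt wording and the `llm` callable are illustrative assumptions, not the paper's exact setup.

```python
from rouge_score import rouge_scorer            # pip install rouge-score
from bert_score import score as bert_score      # pip install bert-score

def reference_based_scores(prediction: str, reference: str) -> dict:
    """Relevance of a generated meta-review against the human-written one."""
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, prediction)["rougeL"].fmeasure
    _, _, f1 = bert_score([prediction], [reference], lang="en")
    return {"rougeL": rouge_l, "bertscore_f1": float(f1[0])}

# Illustrative Likert-style prompt; not the paper's exact wording.
LIKERT_PROMPT = (
    "You are given a paper's reviews and a generated meta-review.\n"
    "On a 1-5 scale, rate how consistent the meta-review's recommendation is with "
    "the final decision '{decision}'. Answer with a single integer.\n\n"
    "Reviews:\n{reviews}\n\nMeta-review:\n{meta_review}\n"
)

def decision_consistency(llm, reviews: str, meta_review: str, decision: str) -> int:
    """Reference-free LLM rating for one key quality aspect."""
    answer = llm(LIKERT_PROMPT.format(decision=decision, reviews=reviews,
                                      meta_review=meta_review))
    return int(answer.strip()[0])    # naive parse of the 1-5 rating
```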

7 Experiments

  • Automatic Evaluation:
    • Reference-based metrics biased towards the reference
    • LLM-based metrics favor specific dimensions given in their prompts
    • Human meta-reviews in the dataset scored among the lowest in all categories
  • Human Evaluation
    • Informativeness, Soundness, Self-Consistency, and Faithfulness
    • Prompting-based method exhibits less hallucination due to the evidence requirements
    • Hallucinations in LLMs are more likely when discussing consensuses and controversies
  • General observations from case study
    • Hallucination is alleviated with CGI2 because the model is repeatedly asked for evidence
    • CGI2 sometimes produces redundant summary sentences
    • The vanilla prompting baseline offers no recommendation or discussion, failing to grasp the complex task requirements
    • Iterative refinement sometimes improves the concreteness of opinion discussion
    • Two problems with iterative refinement
      • Suggestions provided by the LLM are usually generic and of limited use for further refinement
      • More self-refinement iterations cause the model to forget the initial instructions for opinion extraction and discussion

8 Conclusions, Future Work, Limitations

  • Human-written summaries do not always satisfy the criteria of an ideal meta-review
  • The combination of task decomposition and iterative self-refinement shows promise
  • The majority of meta-reviews come from Machine Learning venues and skew toward accepted papers, so findings may not transfer to datasets in other domains
  • Author rebuttals not included as input
  • Future extensions
    • Incorporation of author rebuttals into the input
    • Effective and efficient hallucination detection tool