Notes for ORSUM Paper

1 Introduction

  • Discussions in scientific documents are nuanced and multi-faceted
    • Multiple viewpoints may coexist and no single opinion dominates
  • Most abstractive product opinion summarization datasets are synthetic
    • Redundant cut-and-paste extracts built by combining extracted snippets or by sampling
  • Scientific meta-reviews: controversies, consensuses, decision making
  • Distinct challenges: decision consistency, discussion involvement
  • Proposed prompting method: Checklist-guided Iterative Introspection
  • Proposed evaluation framework: reference-free LLM-based metrics on key quality aspects of meta-reviews

2 Related Work

  • Three stages of opinion summarization:
    • aspect extraction: identifies specific features
    • polarity identification: assesses the sentiment towards each aspect
    • summary generation: compiles aspects and sentiments
  • Unsupervised abstractive approaches have been shown to be more fluent, informative, coherent, and concise than traditional extractive summaries
  • Synthetic pseudo-summaries in the product review domain: detached from real-world distributions, possibly irrelevant or inconsistent with input documents, and ignoring important underlying details
  • MReD focuses on structure-controlled text generation, while ORSUM provides a prompting-based solution, with broader evaluations

3 Task Formulation

  • Task of Scientific Opinion Summarization
    • Given: a research paper’s title, abstract, and set of reviews
    • Generate: a meta-review that summarizes the reviewers’ opinions and makes a decision recommendation for acceptance or rejection (see the input/output sketch after this list)
    • Entails:
      • Summarizing the paper’s key strengths and weaknesses
      • Explicitly evaluating whether the strengths outweigh the weaknesses
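
A minimal sketch of the task's input/output shape in Python; the class, field, and function names here are illustrative assumptions, not the repository's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Submission:
    """Input side of the task: the paper plus its official reviews."""
    title: str
    abstract: str
    reviews: List[str]      # one free-text review per official reviewer

@dataclass
class MetaReview:
    """Output side: an opinion summary plus a decision recommendation."""
    summary: str            # synthesizes strengths/weaknesses and (dis)agreements
    recommendation: str     # "accept" or "reject"

def summarize_opinions(submission: Submission) -> MetaReview:
    """Placeholder for any scientific opinion summarization system,
    e.g. a prompting pipeline over an LLM."""
    raise NotImplementedError
```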

4 Dataset: ORSUM

  • Collection
    • Open-source papers and human-written meta-reviews collected from OpenReview
    • 15,062 meta-reviews and 57,536 reviews from 47 conference venues
      • Excluded papers with meta-reviews shorter than 20 tokens and comments by non-official reviewers
    • train/validation/test split: 9,890/549/550
    • Each entry includes: URL, title, abstract, decision, the area chair’s meta-review, and reviews from individual reviewers
  • Comparison
    • A higher percentage of novel 4-grams indicates greater abstractiveness (a computation sketch follows this section)
    • A lower Normalized Inverse of Diversity (NID) signifies lower redundancy, indicating that many reviews address distinct aspects
  • Composition Analysis [ngaeghy: the reasoning of this section doesn't make much sense]
    • Meta-reviews assessed by humans for discussion involvement: advantages/disadvantages and agreements/disagreements
    • Low percentage of comprehensive meta-reviews: a gap in coverage and thoroughness that may affect the performance and reliability of models trained on the data
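
A rough sketch of the novel 4-gram abstractiveness measure referenced above, assuming naive lowercased whitespace tokenization (the paper's exact tokenization is not given in these notes).

```python
from typing import Iterable, List, Set, Tuple

def ngrams(tokens: List[str], n: int = 4) -> Set[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(meta_review: str, reviews: Iterable[str], n: int = 4) -> float:
    """Percentage of the meta-review's n-grams that never appear in the
    concatenated source reviews; higher means more abstractive."""
    source_ngrams = ngrams(" ".join(reviews).lower().split(), n)   # naive tokenization
    target_ngrams = ngrams(meta_review.lower().split(), n)
    if not target_ngrams:
        return 0.0
    novel = [g for g in target_ngrams if g not in source_ngrams]
    return 100.0 * len(novel) / len(target_ngrams)
```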

5 Prompting Method: Checklist-guided Iterative Introspection

  • Break the task into multiple steps, consistently requesting evidence for each.
  • A checklist-guided self-feedback mechanism, in which feedback is derived from questions in a predefined checklist (the full pipeline is sketched after this section)
  • Initial Run generates a draft in four steps:
    • (1) prompt to extract and rank opinions, while including sentiment, aspect, and evidence; each review truncated to 300 tokens
    • (2) prompt to list the most important advantages and disadvantages, their evidence, and their reviewers
    • (3) prompt to list the consensuses and controversies, their evidence, and their reviewers
    • (4) given the acceptance/rejection decision, prompt to write a meta-review based on information from steps (1)–(3)
  • Iterative Run: iteratively poses questions, obtains self-feedback, requests further refinement
    • select an assessment question from a pre-constructed list of questions, covering the four most crucial aspects of meta-reviews
      • checklist can be expanded and adapted to other complex text generation tasks
      • collect the refinement suggestions, use as prompts to generate a revised version of the meta-review
      • checklist questions are posed sequentially within one iterative run; the number of iterations is a hyper-parameter
  • Benefits:
    • eliminates the need for external scoring functions that demand training data or human annotations
    • provides a general solution for employing LLMs as black boxes in complex text generation tasks
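
A hedged sketch of the CGI2 pipeline as described above: the prompt and checklist wording are paraphrased assumptions, and `llm` stands in for any black-box LLM call rather than a specific vendor API.

```python
from typing import Callable, List

LLM = Callable[[str], str]   # placeholder for any chat-completion call

# Assumed checklist wording, paraphrasing the four key quality aspects in these notes;
# the checklist can be expanded or adapted to other complex generation tasks.
CHECKLIST = [
    "Does the meta-review discuss the paper's main strengths and weaknesses?",
    "Does it cover agreements and disagreements among the reviewers?",
    "Does it avoid contradicting the reviewers' opinions?",
    "Is its recommendation consistent with the final decision?",
]

def initial_run(llm: LLM, title: str, abstract: str,
                reviews: List[str], decision: str) -> str:
    """Four-step draft generation; prompts are paraphrased, not the paper's exact wording."""
    # Very rough character-based stand-in for truncating each review to ~300 tokens.
    reviews_text = "\n\n".join(r[:1500] for r in reviews)
    opinions = llm("Extract and rank the opinions in these reviews, giving the "
                   f"sentiment, aspect, and evidence for each:\n{reviews_text}")
    pros_cons = llm("List the most important advantages and disadvantages, their "
                    f"evidence, and which reviewers raised them:\n{opinions}")
    consensus = llm("List the consensuses and controversies, their evidence, and "
                    f"the reviewers involved:\n{opinions}")
    return llm(f"The decision is '{decision}'. Based on the information below, write a "
               f"meta-review for the paper '{title}' (abstract: {abstract}).\n"
               f"{pros_cons}\n{consensus}")

def iterative_run(llm: LLM, draft: str, iterations: int = 2) -> str:
    """Checklist-guided self-feedback: pose each question, collect suggestions, revise."""
    for _ in range(iterations):           # number of iterations is a hyper-parameter
        for question in CHECKLIST:        # questions posed sequentially within one run
            feedback = llm(f"Meta-review draft:\n{draft}\n\n"
                           f"{question} Suggest concrete refinements.")
            draft = llm("Revise the meta-review draft according to the suggestions.\n"
                        f"Draft:\n{draft}\nSuggestions:\n{feedback}")
    return draft
```

Because the feedback comes from the checklist questions themselves, no external scoring function or human annotation is needed, which is the main selling point of the method.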

6 Evaluation

  • Task evaluation should be multifaceted and go beyond n-gram similarity
    • existing metrics are inadequate for this task
  • Proposed comprehensive evaluation framework combines standard evaluation metrics with LLM-based evaluation metrics
  • Standard metrics in NLG:
    • relevance: ROUGE-L (longest common subsequence), BERTScore (contextualized embeddings)
    • factual consistency: FACTCC (target claim consistent with source facts), SummaC (sentence-level inference models for inconsistency detection)
    • semantic coherence: DiscoScore (average over six BERT-based discourse-coherence variants)
  • LLM-based Metrics
    • Supplementary, reference-free LLM-based measures (a scoring sketch follows this section)
    • Key aspects:
      • Discussion Involvement: strengths and weaknesses, agreements and disagreements amongst reviewers
      • Opinion Faithfulness: whether the meta-review contradicts reviewers’ opinions
      • Decision Consistency: whether the meta-review accurately reflects the final decision
    • G-Eval and GPTLikert show high correlation with human judgments
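
A hedged sketch of the two halves of the framework: a reference-based relevance score using the rouge-score and bert-score packages, and a reference-free Likert-style LLM rating for one key aspect (Decision Consistency). The prompt wording and the `llm` callable are illustrative assumptions, not the paper's exact setup.

```python
from rouge_score import rouge_scorer            # pip install rouge-score
from bert_score import score as bert_score      # pip install bert-score

def reference_based_scores(prediction: str, reference: str) -> dict:
    """Relevance of a generated meta-review against the human-written one."""
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, prediction)["rougeL"].fmeasure
    _, _, f1 = bert_score([prediction], [reference], lang="en")
    return {"rougeL": rouge_l, "bertscore_f1": float(f1[0])}

# Illustrative Likert-style prompt; not the paper's exact wording.
LIKERT_PROMPT = (
    "You are given a paper's reviews and a generated meta-review.\n"
    "On a 1-5 scale, rate how consistent the meta-review's recommendation is with "
    "the final decision '{decision}'. Answer with a single integer.\n\n"
    "Reviews:\n{reviews}\n\nMeta-review:\n{meta_review}\n"
)

def decision_consistency(llm, reviews: str, meta_review: str, decision: str) -> int:
    """Reference-free LLM rating for one key quality aspect."""
    answer = llm(LIKERT_PROMPT.format(decision=decision, reviews=reviews,
                                      meta_review=meta_review))
    return int(answer.strip()[0])    # naive parse of the 1-5 rating
```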

7 Experiments

  • Automatic Evaluation:
    • Reference-based metrics biased towards the reference
    • LLM-based metrics favor specific dimensions given in their prompts
    • Human meta-reviews in the dataset scored among the lowest in all categories
  • Human Evaluation
    • Informativeness, Soundness, Self-Consistency, and Faithfulness
    • Prompting-based method exhibits less hallucination due to the evidence requirements
    • Hallucinations in LLMs are more likely when discussing consensuses and controversies
  • General observations from case study
    • Hallucination is alleviated with CGI2 because the model is repeatedly asked for evidence
    • CGI2 sometimes produces redundant summary sentences
    • The vanilla prompting baseline offers no recommendation or discussion, failing to grasp the complex task requirements
    • Iterative refinement sometimes improves the concreteness of opinion discussion
    • Two problems with iterative refinement
      • Suggestions provided by the LLM are usually generic and of limited use for further refinement
      • More self-refinement iterations cause the model to forget the initial instructions for opinion extraction and discussion

8 Conclusions, Future Work, Limitations

  • Human-written summaries do not always satisfy the criteria of an ideal meta-review
  • The combination of task decomposition and iterative self-refinement shows promise
  • The majority of meta-reviews come from Machine Learning venues and skew toward accepted papers, so findings may not transfer to datasets in other domains
  • Author rebuttals not included as input
  • Future extensions
    • Incorporation of author rebuttals into the input
    • Effective and efficient hallucination detection tool