Proposed evaluation framework: reference-free LLM-based metrics on key quality aspects of meta-reviews
2 Related Work
Three stages of opinion summarization:
aspect extraction: identifies specific features
polarity identification: assesses the sentiment towards each aspect
summary generation: compiles aspects and sentiments
Unsupervised abstractive approaches have been shown to produce summaries that are more fluent, informative, coherent, and concise than traditional extractive summaries
Synthetic pseudo-summaries in the product review domain: detached from real-world distributions, possibly irrelevant or inconsistent with input documents, and ignoring important underlying details
MReD focuses on structure-controlled text generation, while ORSUM provides a prompting-based solution, with broader evaluations
3 Task Formation
Task of Scientific Opinion Summarization
Given: a research paper’s title, abstract, and set of reviews
Generate: a meta-review that summarizes the reviews’ opinions and makes a recommendation to accept or reject the paper
Entails:
Summarizing the paper’s key strengths and weaknesses
Explicitly evaluating whether the strengths surpass the weaknesses
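The task's input and output can be sketched as simple types. This is an illustrative sketch only: the class names, field names, and the `format_input` helper are hypothetical and not taken from the paper, and the flattened text layout is just one plausible way to present a paper and its reviews to a model.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class SummarizationInput:
    """What the task is given (hypothetical field names)."""
    title: str
    abstract: str
    reviews: List[str]


@dataclass
class MetaReview:
    """What the task should produce (hypothetical field names)."""
    text: str                               # summary of strengths and weaknesses
    decision: Literal["accept", "reject"]   # recommendation implied by the summary


def format_input(sample: SummarizationInput) -> str:
    """Flatten the task input into one text block (assumed layout)."""
    parts = [f"Title: {sample.title}", f"Abstract: {sample.abstract}"]
    for i, review in enumerate(sample.reviews, start=1):
        parts.append(f"Review {i}: {review}")
    return "\n".join(parts)
```

Keeping the decision as a separate field from the free text makes it easy to check whether a generated meta-review's recommendation matches the reference decision.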
4 Dataset: ORSUM
Collection
Open-source papers and human-written meta-reviews from OpenReview
15,062 meta-reviews and 57,536 reviews from 47 conference venues
Excluded papers with meta-reviews shorter than 20 tokens and comments by non-official reviewers
train/validation/test split: 9,890/549/550
Fields per sample: URL, title, abstract, decision, meta-review from the area chair, and reviews from individual reviewers
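The two exclusion rules above can be sketched as a small filtering pass. This is a minimal sketch under stated assumptions, not the authors' pipeline: the dictionary keys (`meta_review`, `reviews`, `is_official`) are hypothetical, and whitespace splitting stands in for whatever tokenizer was actually used to count the 20 tokens.

```python
from typing import Dict, List

MIN_META_REVIEW_TOKENS = 20  # threshold stated in the notes


def keep_sample(sample: Dict) -> bool:
    """Keep a paper only if its meta-review has at least 20 tokens.

    Assumption: tokens are counted by whitespace splitting.
    """
    return len(sample["meta_review"].split()) >= MIN_META_REVIEW_TOKENS


def drop_unofficial_comments(sample: Dict) -> Dict:
    """Return a copy of the sample with only official reviews.

    Assumption: each review is a dict with a boolean 'is_official' flag.
    """
    official = [r for r in sample["reviews"] if r.get("is_official", False)]
    return {**sample, "reviews": official}


def clean(samples: List[Dict]) -> List[Dict]:
    """Apply both exclusion rules to a list of samples."""
    return [drop_unofficial_comments(s) for s in samples if keep_sample(s)]
```

For example, `clean(raw_samples)` would drop any paper whose meta-review is under 20 whitespace tokens and strip non-official comments from the rest.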
Comparison
Higher percentage of novel 4-grams indicates greater abstractiveness
Lower Normalized Inverse of Diversity signifies lower redundancy, indicating that many reviews address distinct aspects
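Both comparison statistics can be sketched in a few lines. This is a hedged sketch: the notes do not give the formulas, so the novel n-gram percentage here counts unique summary n-grams that never appear in the source, and the Normalized Inverse of Diversity is assumed to be 1 - H(D) / log(|D|), where H(D) is the entropy of the unigram distribution and |D| is the token count. Function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of unique n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_percentage(source_tokens: Sequence[str],
                           summary_tokens: Sequence[str],
                           n: int = 4) -> float:
    """Percentage of unique summary n-grams that are absent from the source."""
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngrams(source_tokens, n)
    return 100.0 * len(novel) / len(summary_ngrams)


def normalized_inverse_diversity(tokens: Sequence[str]) -> float:
    """Assumed definition: 1 - unigram entropy / log(number of tokens).

    0 means every token is distinct; 1 means one token repeated throughout.
    """
    total = len(tokens)
    if total < 2:
        return 0.0  # log(1) == 0, so the ratio is undefined; treat as no redundancy
    counts = Counter(tokens)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(total)
```

Here the source tokens would be the concatenated reviews and the summary tokens the meta-review; a higher novel 4-gram percentage points to a more abstractive meta-review, and a lower diversity score on the reviews points to less repeated content across them.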
Composition Analysis [ngaeghy: the reasoning of this section doesn't make much sense]
Meta-reviews assessed by human annotators on discussion involvement: advantages/disadvantages, agreements/disagreements.
Low percentage of comprehensive meta-reviews: a gap in coverage and thoroughness that may affect the performance and reliability of trained models.