Notes for MReD Paper

1 Introduction

  • A fully-annotated meta-review dataset that enables better use of domain knowledge for text generation.
    • In-depth understanding of the structure of meta-reviews in a peer-reviewing system, namely ICLR's OpenReview system.
  • A new task of controllable generation that focuses on controlling passage-level macro structure.
    • Controls not only the intent of a single generated sentence but also the structure of the whole generated passage.
  • Simple yet effective control methods independent of the model architecture.

2 Data

  • ICLR meta-reviews, 2018-2021, from OpenReview.
  • 7,089 meta-reviews (45,929 sentences in total), corresponding to 23,675 reviews.
  • Each sentence is labelled with one of 9 pre-defined intent categories: abstract, strength, weakness, suggestion, rebuttal process, rating summary, area chair (AC) disagreement, decision, and miscellaneous (misc). A sketch of one possible record layout follows this list.
    • abstract, strength, and weakness sentences are easily summarized from the reviewers’ comments.
    • Meta-reviews for papers with middle-range scores tend to be long; those for clearly high or low scores tend to be short.
    • Common structural patterns, e.g. abstract at the beginning, suggestion and decision at the end.
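
As a minimal sketch, one annotated meta-review can be thought of as a list of (sentence, label) pairs. The record layout below is an assumption for illustration, not the dataset's actual file format:

```python
# Hypothetical record layout for one annotated meta-review (an
# illustration, NOT the dataset's actual file format): each sentence
# is paired with one of the 9 pre-defined intent categories.
LABELS = [
    "abstract", "strength", "weakness", "suggestion",
    "rebuttal process", "rating summary", "ac disagreement",
    "decision", "misc",
]

meta_review = [
    ("This paper proposes a new method for ...", "abstract"),
    ("The experiments are thorough and well presented.", "strength"),
    ("However, the novelty over prior work is limited.", "weakness"),
    ("The final decision is to reject.", "decision"),
]

# The ordered labels of a meta-review double as a control sequence.
control_sequence = [label for _, label in meta_review]
assert all(label in LABELS for label in control_sequence)
```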

3 Task & Methods

  • Task definition of structure-controllable text generation: given the text input (i.e., reviews) and a control sequence of the output structure, a model should generate a meta-review that is derivable from the reviews and presents the required structure.
  • Explored how to re-organize the input reviews and the control sequence into a single input sequence for the encoder (see the input-construction sketch after this list).
    • Prepend the control sequence to the input text.
    • Linearize multiple review inputs into a single input:
      • rate-concat
      • rate-merge
      • longest-review (baseline)
    • Control methods of different granularity:
      • sent-ctrl: one intent label per target sentence
      • seg-ctrl: one label per segment of consecutive same-intent sentences
      • unctrl: no control sequence
  • Model: bart-large-cnn with PyTorch & HF Transformers.
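
A minimal sketch of the input construction under sent-ctrl, assuming the control sequence is prepended to the linearized reviews. The `" | "` label separator and the `"==>"` delimiter are illustrative assumptions, not necessarily the paper's exact special tokens:

```python
def build_model_input(reviews, control_labels):
    """Prepend a sent-ctrl control sequence to linearized reviews.

    The " | " label separator and the "==>" delimiter are assumptions
    for illustration; the paper's exact formatting may differ.
    """
    control_seq = " | ".join(control_labels)
    # Naive linearization: concatenate the reviews in the given order.
    # rate-concat would instead order the reviews by reviewer rating,
    # and rate-merge would merge reviews sharing the same rating.
    source = " ".join(reviews)
    return f"{control_seq} ==> {source}"

example = build_model_input(
    reviews=["Review 1 text ...", "Review 2 text ..."],
    control_labels=["abstract", "strength", "weakness", "decision"],
)
print(example)
```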

4 Experiments

  • Baselines

    • Extractive: MMR, LexRank, TextRank, each run in unctrl and sent-ctrl settings (an MMR sketch follows this list).
      • sent-ctrl: an LSTM-CRF tagger, trained on the labeled meta-reviews, predicts a label for each input review sentence.
    • Generic: generic sentences obtained from the training data, either from the meta-review references (i.e., target side) or from the input reviews (i.e., source side).
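
Below is a minimal re-implementation sketch of the MMR (Maximal Marginal Relevance) extractive baseline using TF-IDF similarity; the paper's exact implementation may differ. Under sent-ctrl, k is simply the number of labels in the control sequence:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_extract(sentences, k, lambda_=0.7):
    """Select k sentences by MMR: trade off relevance to the whole
    document against redundancy with already-selected sentences."""
    vecs = TfidfVectorizer().fit_transform(sentences)
    doc_vec = np.asarray(vecs.mean(axis=0))           # document centroid
    relevance = cosine_similarity(vecs, doc_vec).ravel()
    sim = cosine_similarity(vecs)                     # pairwise sentence similarity
    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        if selected:
            redundancy = sim[:, selected].max(axis=1)
        else:
            redundancy = np.zeros(len(sentences))
        scores = lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(candidates, key=lambda i: scores[i])
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]

review_sentences = [
    "The paper studies controllable meta-review generation.",
    "Results on the benchmark are strong.",
    "The method section is hard to follow.",
    "Overall I lean towards acceptance.",
]
# Under sent-ctrl, k equals the number of labels in the control sequence.
control = ["abstract", "strength", "weakness", "decision"]
print(mmr_extract(review_sentences, k=len(control)))
```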
  • Setting

    • Filtered to 6,693 source-target pairs, randomly split into train, validation, and test by 8:1:1.
    • Outputs are evaluated against the reference with F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L.
    • For the extractive baselines with sent-ctrl, the number of extracted sentences k is set equal to the number of labels in the control sequence; the same k is used for the generic baselines.
    • Load the pretrained bart-large-cnn model and fine-tune on MReD on a single V100 GPU (a configuration sketch follows this list):
      • batch size 1, gradient accumulation step 1, 3 epochs, seed 0
      • source truncation lengths of 1024, 2048, and 3072 tokens; target length 20 to 400
      • Adam optimizer (momentum terms 0.9 and 0.999), learning rate 5e-5, no warm-up steps, no weight decay
      • decoding: beam size 4, length penalty 2
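
The following is a hedged sketch of this configuration with Hugging Face's Seq2SeqTrainer. Dataset tokenization is elided (`train_ds`/`val_ds` are placeholders), and note that stock bart-large-cnn has only 1024 positional embeddings, so the 2048/3072 source lengths would require extending them:

```python
from transformers import (
    BartForConditionalGeneration,
    BartTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Hyperparameters as listed above; output_dir is a placeholder.
args = Seq2SeqTrainingArguments(
    output_dir="mred-bart",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_train_epochs=3,
    learning_rate=5e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    warmup_steps=0,
    weight_decay=0.0,
    seed=0,
)

train_ds = val_ds = None  # placeholders for tokenized MReD splits
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()

# Decoding with beam size 4, length penalty 2, target length 20-400:
# outputs = model.generate(input_ids, num_beams=4, length_penalty=2.0,
#                          min_length=20, max_length=400)

# ROUGE F1 evaluation against the reference (rouge-score package):
# from rouge_score import rouge_scorer
# scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
#                                   use_stemmer=True)
# scores = scorer.score(reference, generated)  # each entry has .fmeasure
```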
  • Results

    • All controlled methods outperform their unctrl settings.
    • For bart-large-cnn, sent-ctrl better than seg-ctrl.
    • bart-large-cnn far outperforms the extractive and generic baselines: meta-review writing differs from the input reviews, and the transformer model is capable of capturing content-specific information.
    • rate-concat > merge > rate-merge
    • For each generated sentence, the corresponding control token receives the highest attention weights; the model can correctly extract relevant information from the source sentences; different control sequences generate varied outputs (see the attention-inspection sketch below).
    • Human evaluation focused on fluency, content relevance, structure similarity, and decision correctness.
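
As a rough illustration of how the attention claim could be probed, the sketch below generates with cross-attentions returned and sums each decoding step's attention over the control-token positions. The input string, the control-token span, and the tensor-shape assumptions are illustrative and can vary across transformers versions:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

text = "abstract | strength | decision ==> Review 1 text ... Review 2 text ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

out = model.generate(
    **inputs,
    num_beams=1,                 # greedy keeps the attention shapes simple
    max_length=60,
    output_attentions=True,
    return_dict_in_generate=True,
)

# out.cross_attentions: one entry per generated token; each entry is a
# tuple over decoder layers of (batch, heads, 1, source_len) tensors.
last_layer = [step[-1] for step in out.cross_attentions]
attn = torch.cat([a.mean(dim=1) for a in last_layer], dim=1)  # (1, tgt, src)

ctrl_span = range(0, 8)  # hypothetical positions of the control tokens
print(attn[0, :, list(ctrl_span)].sum(dim=-1))  # per-step mass on control tokens
```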