# The Gigaword Sentence Summarization Corpus
This page serves to outline this repository's treatment of the Gigaword Summarization Corpus, as well as any information one should be aware of when working with the corpus.
The Gigaword Summarization Corpus (we call it GGWS for short, though this is not canon) is a collection of headlines and first sentences of news articles filtered from the English Gigaword Corpus. The task, pioneered by Rush et al. (2015), is to predict the headline from the first sentence. Performance is evaluated using the ROUGE metric. While the original paper evaluated on the DUC 2003 and DUC 2004 test sets, these days only ROUGE scores on the held-out GGWS training and test sets are reported.
Relatively small and unstructured, GGWS is perfect for proof-of-concept work.
Though the first version of the GGWS can be found here, we assume the UniLM version of the database has been downloaded from here.
Though the original Gigaword Corpus is licensed by the LDC, the GGWS appears to be MIT-licensed. Caveat emptor.
The primary interface for GGWS is through `ggws.py`. As far as I can tell, there is no "standard" setup for the task: various authors use different-sized vocabularies or even subwords. The commands below create word-level sequences consisting of all the word types in the training set (about 124k words, excluding the words pre-filtered by the task developers).
```
GGWS=/path/to/unilm/version/of/ggws
python pytorch-database-prep/ggws.py data/ preamble $GGWS
python pytorch-database-prep/ggws.py data/ init_word
python pytorch-database-prep/ggws.py data/ torch_dir  # WARNING: takes a long time!
```
Upon completion, the `data/` directory will look something like
```
data/
    local/
    train/
        feat/
        ref/
    dev/
        feat/
        ref/
    test/
        feat/
    ext/
        id2token.txt
        token2id.txt
        train.sent.trn
        train.head.trn
        dev.sent.trn
        dev.head.trn
        test.sent.trn
        test.head.trn
        ...
```
Files in `feat/` and `ref/` have the file paths `(feat|ref)/sent_<partition>_<sent_no>.pt`, where `partition` is one of `train`, `dev`, or `test`, and `sent_no` is the 0-indexed sentence number from the original text files. Files in `feat/` are the first sentences of the articles; files in `ref/` are the headlines. The sentence files (in `feat/`) are token sequences of shape `(S, 1)`, where `S` is the sentence length; the headline files (in `ref/`) are also token sequences, but of shape `(T,)`, where `T` is the headline length. The additional dimension in the sentence sequences allows `data/` to be batched using `pydrobert-pytorch`'s `SpectDataTrainingLoader` and `SpectDataEvaluationLoader`. However, there is no requirement to use these loaders - feel free to use your own. `token2id.txt` and `id2token.txt` map between token ids (numerical) and token types (strings).
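For example, here is a minimal sketch of inspecting one training pair. The file names follow the pattern above; the assumption that each line of `id2token.txt` is an `<id> <token>` pair is mine, so check the file before relying on it.

```python
import torch

# ASSUMPTION: id2token.txt holds one "<id> <token>" pair per line.
id2token = {}
with open("data/ext/id2token.txt") as f:
    for line in f:
        idx, token = line.strip().split()
        id2token[int(idx)] = token

# Load the first training pair (sentence and headline tensors).
sent = torch.load("data/train/feat/sent_train_0.pt")  # shape (S, 1)
head = torch.load("data/train/ref/sent_train_0.pt")   # shape (T,)

print("sentence:", " ".join(id2token[int(i)] for i in sent.flatten()))
print("headline:", " ".join(id2token[int(i)] for i in head))
```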
You can generate a decent baseline by merely extracting the first 75-ish characters' worth of words from the sentence and treating that as the headline. To do so, call
```
python pytorch-database-prep/ggws.py data/ prefix_baseline prefix.trn
```
Dev and test "summaries" are generated by default. ROUGE recall scores are about one point lower than those reported by Rush et al. (2015), probably because we use UniLM data instead of the Harvard NLP data.
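If you're curious what that baseline amounts to, here is a rough sketch of the rule as I understand it: keep whole words from the start of the sentence until about 75 characters are used up. The exact cutoff logic in `ggws.py` may differ.

```python
def prefix_baseline(sentence: str, max_chars: int = 75) -> str:
    """Keep whole words from the start of the sentence up to ~max_chars."""
    words, length = [], 0
    for word in sentence.split():
        # +1 accounts for the joining space before every word but the first
        extra = len(word) + (1 if words else 0)
        if length + extra > max_chars:
            break
        length += extra
        words.append(word)
    return " ".join(words)

print(prefix_baseline(
    "japan 's nikkei stock average rose #.## percent on monday as "
    "investors bought back recently battered technology shares"))
```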
For evaluation, we provide a simple script that, given one or more TRN files, produces a directory that can be processed by ROUGE-1.5.5 (N.B. the original appears to be lost to time, but you can find a version here).
```
# generate the folder
python pytorch-database-prep/ggws.py data/ rouge_dir data/local/rouge prefix.trn my-awesome-network-output.trn
# run ROUGE
cd /path/to/rouge/install
perl ROUGE-1.5.5.pl -e data -a -n 2 /path/to/data/local/rouge/settings.xml
```
If you don't want to deal with the annoying setup of Perl and its dependencies, you can use my wrapper around py-rouge.
```
python pytorch-database-prep/rouge-1.5.5.py -a -n 2 data/local/rouge/test/settings.xml
```
`rouge-1.5.5.py` requires NLTK for tokenizing/stemming. I find that this implementation produces scores that differ from the Perl implementation's by about a hundredth of a percent. It might be worth sticking with the official version just for consistency if you're battling for state of the art.
Py-rouge is a fabulous package that is Apache-licensed. Please go show the developer some love!
- `UNK` has been replaced with `<unk>` in the test set. `<unk>` was used in training anyway.
- The character `_` has been replaced with `&#95;` in transcriptions so that we can use the underscore for subwords (a small un-escaping sketch follows this list).
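Since that escaping lives in the transcriptions themselves, model output should be un-escaped before being shown to a human. Something like the following one-liner (the function name is mine) does the trick:

```python
def restore_underscores(text: str) -> str:
    # Undo the "&#95;" escaping applied to transcriptions.
    return text.replace("&#95;", "_")

assert restore_underscores("new&#95;york") == "new_york"
```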