Corpora - STS-NTNU/STS13 GitHub Wiki
A non-exhaustive list of corpora with semantically similar text.
ILK Headline Paraphrase Corpus
The ILK Headline Paraphrase Corpus, consisting of automatically crawled and aligned headlines, contains 7,400,144 pairs of similar news headlines. Examples:
- man wanted for kidnapping lugoff girl arrested , charged with rape
- man charged with raping girl , 14
- suspect in kidnap charged with rape
- man wanted for kidnapping teen arrested , charged with rape
- girl found after mom gets text message
- kidnapped girl rescued after text-messaging her mom
- kidnapped girl rescued by text message to mom
- man charged with kidnap of girl who was found after messaging mom
PWKP
The Parallel Wikipedia Corpus has over 108k aligned sentence pairs from Wikipedia and Simple Wikipedia, although some of them have one-to-many sentence mappings. Examples:
- Austria has been a member of the United Nations since 1955, joined the European Union in 1995, and is a founder of the OECD.
- Austria is in the United Nations since 1955 and in the European Union since 1995.
- Austria is a largely mountainous country due to its location in the Alps.
- Austria is a largely mountainous country since it is in the Alps.
- The high mountainous Alps in the west of Austria flatten somewhat into low lands and plains in the east of the country.
- The Alps of western Austria give way somewhat into low lands and plains in the eastern part of the country.
- Austria has been the birthplace of many famous composers such as Wolfgang Amadeus Mozart, Joseph Haydn, Franz Schubert, Anton Bruckner, Johann Strauss, Sr., Joh
- There are Wolfgang Amadeus Mozart, Joseph Haydn, Franz Schubert, Anton Bruckner, Johann Strauss, Sr., Johann Strauss, Jr. and Gustav Mahler.
- In modern times there were Arnold Schoenberg, Anton Webern and Alban Berg, who belonged to the Second Viennese School.
- Typical Austrian dishes include Wiener Schnitzel, Apfelstrudel, Schweinsbraten, Kaiserschmarren, Knödel, Sachertorte and Tafelspitz.
- Famous Austrian dishes are Wiener Schnitzel, Kaiserschmarren, Knödel, Sachertorte and Tafelspitz.
See Zhemin Zhu, Delphine Bernhard and Iryna Gurevych, A Monolingual Tree-based Translation Model for Sentence Simplification, COLING 2010, August 2010.
Webis-CPC-11
The Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11) contains 7,859 candidate paraphrases obtained from Mechanical Turk crowdsourcing. The corpus is made up of 4,067 accepted paraphrases, 3,792 rejected non-paraphrases, and the original texts. Original/paraphrase example:
- "I dipped into these pages, and as I read for the first time some of the odes of The Unknown Eros, I seemed to have made a great discovery: here was a whole glittering and peaceful tract of poetry which was like a new world to me."
- "I pored through these pages, and as I perused the lyrics of The Unknown Eros that I had never read before, I appeared to have found out something wonderful: there before me was an entire shining and calming extract of verses that were like a new universe to me."
- On the very day that this treaty was signed, Bismarck, in answer to an Austrian despatch, wrote insisting that he had no intention of entering on an offensive war against Austria. In private conversation he was more open; to Benedetti he said: "I have at last succeeded in determining a King of Prussia to break the intimate relations of his House with that of Austria, to conclude a treaty of alliance with Italy, to accept arrangements with Imperial France; I am proud of the result."
- On the same day of the treat was signed Bismark wrote to an Austrian delegation as answer that he had no plan to attack Austria. But in a private conversation with Benedetti he opened his mind and said that he had succeeded to decide a King of Prussia to break his relations with Autstria, thus he can finalize a treaty with Italy, to accept arrangements with Imperial France; and he said he was proud of the outcome.
Much of the corpus consists of passages longer than one sentence, but these can perhaps be aligned automatically at the sentence level.
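One possible way to attempt such an alignment is sketched below: split both passages into sentences and pair each original sentence with the paraphrase sentence that shares the most words (Jaccard overlap). This is only an illustrative baseline, not a method used by the corpus authors. The function names, the naive sentence splitter and the 0.2 threshold are assumptions, and the two passages are abridged from the Bismarck example above.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Set of lowercased word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two sentences, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align(original, paraphrase, threshold=0.2):
    """Pair each original sentence with its most similar paraphrase sentence.

    Greedy and one-directional: a paraphrase sentence may be reused, and
    pairs scoring below `threshold` (an arbitrary value) are dropped.
    """
    targets = split_sentences(paraphrase)
    pairs = []
    for src in split_sentences(original):
        best = max(targets, key=lambda t: jaccard(src, t), default=None)
        if best is not None and jaccard(src, best) >= threshold:
            pairs.append((src, best))
    return pairs


# Abridged from the Webis-CPC-11 example above (spelling kept as in the corpus).
original = (
    "On the very day that this treaty was signed, Bismarck wrote insisting "
    "that he had no intention of entering on an offensive war against Austria. "
    "In private conversation he was more open."
)
paraphrase = (
    "On the same day of the treat was signed Bismark wrote that he had no "
    "plan to attack Austria. But in a private conversation with Benedetti "
    "he opened his mind."
)

pairs = align(original, paraphrase)
for src, tgt in pairs:
    print(src, "<->", tgt)
```

Plain word overlap is brittle for heavily reworded paraphrases, so a real alignment would probably need a stronger similarity measure and a proper sentence splitter.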
Microsoft Research Paraphrase Corpus (MSRP)
The Microsoft Research Paraphrase Corpus (MSRP) contains 5,800 pairs of sentences extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.
See also the SEMILAR corpus.
Cohn, Lapata & Callison-Burch Paraphrase Corpus
A paraphrase corpus containing 900 paraphrastic sentence pairs with human-annotated word/phrase alignments, drawn at random from the following corpora:
- the multi-translation Chinese corpus (mtc)
- Jules Verne's 20,000 Leagues Under the Sea (novels)
- MSR paraphrase corpus (news2); includes some non-paraphrase sentences
Examples:
- japan still remains the number one trade partner followed by u.s. and hong kong in the second and third position .
- japan is still the largest trading partner , followed by the united states and hong kong .
- " yes , captain , although by this time i ought to have accustomed myself to be surprised at nothing since i have been on board your boat . "
- " yes , captain , although since i've been aboard your vessel , i should have formed the habit of not being amazed by anything ! "
Abstractive corpus
The abstractive corpus by Cohn & Lapata contains 575 pairs of original and compressed sentences. Examples:
- Thousands of Namibians stood in long queues , stretching nearly a mile in some cases , as they waited to vote on the first day of Namibia 's five-day pre-independence election yesterday .
- Thousands queued to vote yesterday on the first of five days in Namimbia 's pre-independence election .
- The length of the queues in Windhoek and Katutura , two of the most densely populated urban areas in Namibia , triggered conjecture that the poll might have to be extended an extra day .
- The queues in Windhoek and Katutura triggered conjecture that the poll might be extended .
- Namibia 's more than 700,000 residents are voting for a 72-member constituent assembly to draw up an independence constitution and prepare the way for full independence next year , perhaps as early as April .
- The 700,000 Namimbians are voting for a 72-member assembly to draft a constitution preparing for full independence next year .
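As a rough illustration of how much such pairs shorten the text, the sketch below computes a token-level compression ratio for one of the example pairs above. The function name and the choice of whitespace tokenisation are assumptions made for this example; the corpus itself does not prescribe a measure.

```python
def compression_ratio(original, compressed):
    """Length of the compressed sentence divided by the original length,
    counting whitespace-separated tokens (punctuation is already split off
    in the examples above)."""
    n_original = len(original.split())
    if n_original == 0:
        raise ValueError("original sentence is empty")
    return len(compressed.split()) / n_original


original = (
    "The length of the queues in Windhoek and Katutura , two of the most "
    "densely populated urban areas in Namibia , triggered conjecture that "
    "the poll might have to be extended an extra day ."
)
compressed = (
    "The queues in Windhoek and Katutura triggered conjecture that the poll "
    "might be extended ."
)

ratio = compression_ratio(original, compressed)
print(f"compression ratio: {ratio:.2f}")
```

A lower ratio means a more aggressive compression. Because these pairs are abstractive, the compressed side may also contain words that do not occur in the original, which a length ratio alone does not capture.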
TO BE EXPLORED
- Barzilay & McKeown (2001), translations of books, 26,201 aligned sentences
- Pang et al. (2003), translations of news articles, 109,203 sentences
- Quirk et al. (2004), clusters of news articles, 153,403 sentences
- Clarke & Lapata (2008), extractive sentence compression corpus, 1433 + 1370 sentences