cnn dailymail stats info - lngvietthang/das GitHub Wiki

# CNN/Daily Mail Dataset Statistical Information

| Statistic | CNN | Daily Mail |
|---|---:|---:|
| Number of documents | 92454 | 219484 |
| Content vocabulary size | 281460 | 495313 |
| Highlights vocabulary size | 87696 | 169412 |
| Average number of words in content | 672 | 717 |
| Average number of words in highlight | 45 | 54 |
| Average number of sentences in content | 34 | 35 |
| Average number of sentences in highlight | 4 | 4 |
| Average number of new words in highlight | 9 | 8 |

Statistics reported for each dataset:

- Number of documents
- Vocabulary size
- Average number of words
- Average number of sentences
- Average number of new words
- Distribution of content length
- Distribution of highlight length
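
The statistics listed above could be computed along the following lines. This is only a minimal sketch, not the script used to produce the numbers on this page: the tokenizer, the sentence splitter, the function names (`tokenize`, `split_sentences`, `corpus_stats`), and the reading of "new words" as distinct highlight words that do not occur in the content are all assumptions.

```python
import re
from collections import Counter
from statistics import mean


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens; the wiki does not name its tokenizer.
    return re.findall(r"[a-z0-9']+", text.lower())


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def corpus_stats(docs):
    """Compute summary statistics for a list of (content, highlight) string pairs."""
    content_vocab, highlight_vocab = set(), set()
    content_lens, highlight_lens = [], []
    content_sents, highlight_sents = [], []
    new_words = []

    for content, highlight in docs:
        c_tok, h_tok = tokenize(content), tokenize(highlight)
        content_vocab.update(c_tok)
        highlight_vocab.update(h_tok)
        content_lens.append(len(c_tok))
        highlight_lens.append(len(h_tok))
        content_sents.append(len(split_sentences(content)))
        highlight_sents.append(len(split_sentences(highlight)))
        # Assumption: "new words" = distinct highlight words absent from the content.
        new_words.append(len(set(h_tok) - set(c_tok)))

    return {
        "num_documents": len(docs),
        "content_vocab_size": len(content_vocab),
        "highlight_vocab_size": len(highlight_vocab),
        "avg_words_content": mean(content_lens),
        "avg_words_highlight": mean(highlight_lens),
        "avg_sentences_content": mean(content_sents),
        "avg_sentences_highlight": mean(highlight_sents),
        "avg_new_words_highlight": mean(new_words),
        # Length distributions as {length_in_words: document_count}.
        "content_length_dist": Counter(content_lens),
        "highlight_length_dist": Counter(highlight_lens),
    }


# Toy input for illustration only; not taken from CNN or Daily Mail.
example_docs = [
    ("The cat sat on the mat. It was happy.", "A cat sat."),
    ("Dogs bark loudly!", "Dogs bark."),
]
example_stats = corpus_stats(example_docs)
```

A different tokenizer or sentence splitter will change the vocabulary sizes and averages, so results from this sketch should not be expected to match the table exactly.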