cnn dailymail stats info - lngvietthang/das GitHub Wiki
# CNN/Dailymail Dataset Statistical Information
| Statistic | CNN | Dailymail |
|---|---|---|
| Number of documents | 92454 | 219484 |
| Content vocabulary size | 281460 | 495313 |
| Highlights vocabulary size | 87696 | 169412 |
| Average number of words in content | 672 | 717 |
| Average number of words in highlights | 45 | 54 |
| Average number of sentences in content | 34 | 35 |
| Average number of sentences in highlights | 4 | 4 |
| Average number of new words in highlights | 9 | 8 |

Further statistical information about the datasets (seven figures).
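
The table does not say how each figure was computed, so the following is only a minimal sketch of one plausible way to derive the same kinds of statistics. It assumes a corpus given as `(content, highlights)` string pairs, a simple regex word tokenizer, a naive sentence splitter, and that "new words" means word types that occur in the highlights but not in the corresponding content. The function names `tokenize`, `split_sentences`, and `corpus_stats` are hypothetical and do not come from the repository.

```python
import re
from statistics import mean


def tokenize(text):
    """Lowercased word tokens (an assumed tokenizer, not the wiki's own)."""
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def corpus_stats(pairs):
    """Compute table-style statistics for (content, highlights) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("corpus_stats needs at least one document")

    content_vocab, highlight_vocab = set(), set()
    content_words, highlight_words = [], []
    content_sents, highlight_sents = [], []
    new_words = []

    for content, highlights in pairs:
        c_tokens = tokenize(content)
        h_tokens = tokenize(highlights)
        content_vocab.update(c_tokens)
        highlight_vocab.update(h_tokens)
        content_words.append(len(c_tokens))
        highlight_words.append(len(h_tokens))
        content_sents.append(len(split_sentences(content)))
        highlight_sents.append(len(split_sentences(highlights)))
        # Assumed definition: word types in the highlights that are
        # absent from this document's content.
        new_words.append(len(set(h_tokens) - set(c_tokens)))

    return {
        "num_documents": len(pairs),
        "content_vocab_size": len(content_vocab),
        "highlights_vocab_size": len(highlight_vocab),
        "avg_words_content": mean(content_words),
        "avg_words_highlights": mean(highlight_words),
        "avg_sentences_content": mean(content_sents),
        "avg_sentences_highlights": mean(highlight_sents),
        "avg_new_words_highlights": mean(new_words),
    }


if __name__ == "__main__":
    toy_corpus = [
        ("The cat sat. The dog ran!", "A cat sat."),
        ("Rain fell today.", "Rain fell."),
    ]
    for name, value in corpus_stats(toy_corpus).items():
        print(f"{name}: {value}")
```

A different tokenizer or sentence splitter would shift the vocabulary sizes and averages, so numbers produced this way are not expected to match the table exactly.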






