Dataset "Google's N Grams" - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
- Proposer: Imke Helene Drave - @ImkeHelene - [email protected]
- Team:
  1. Imke Helene Drave @ImkeHelene
  2. Bibigul Shektybaeva @brosequartz
- Votes:
  1. @brosequartz
  2. @magiob
  3. @avradips
  4. @muhammadasad1
  5. @vivek-sethia
## Summary
A publicly available dataset containing all N-grams, for N from 1 to 5, extracted from the Google Books corpus.
N-gram models can be used for various tasks such as spelling correction, text recognition, word breaking, text summarization, and even machine learning.
In this course, the team will work on the British English Version 2 uni- and bigram sub-datasets of the Google N-Gram database. These will be combined with a dataset that lists all dialogue from Shakespeare's plays.
The final Python scripts of this project can be found here:
https://github.com/ImkeHelene/NGrams_Final_Repo
Final Presentation: https://prezi.com/jqgsfldqk6jh/google039s-n-grams-dataset/
## Prediction / Description Goals
- How did diction change over the years?
- Analysis of word replacements
- Categorization of theatrical plays into the categories history, tragedy, and comedy
- Given a word, which words is it likely to be followed or preceded by? (A minimal sketch of this idea follows below.)
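As a first illustration of the last goal, here is a minimal sketch of successor prediction from aggregated bigram counts. The words and counts below are made up, and the team's actual algorithm may differ:

```python
# Hypothetical aggregated bigram counts: (first_word, second_word) -> match_count.
bigram_counts = {
    ("thou", "art"): 1200,
    ("thou", "shalt"): 950,
    ("thou", "hast"): 700,
    ("my", "lord"): 2100,
}

def most_likely_successors(word, counts, top=3):
    """Return the `top` most frequent words following `word`."""
    successors = [(w2, c) for (w1, w2), c in counts.items() if w1 == word]
    return sorted(successors, key=lambda pair: pair[1], reverse=True)[:top]

print(most_likely_successors("thou", bigram_counts))
# -> [('art', 1200), ('shalt', 950), ('hast', 700)]
```

Predicting predecessors works the same way, just filtering on the second word of each bigram instead of the first.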
## Weekly Progress
### Week 1 (W45-46, 9.11.-16.11.)
- The Shakespeare Plays dataset is proposed for text analysis based on Google N-Grams
- This project will focus on British uni- and bi-grams
- Uni-grams were extracted from Shakespeare Plays dataset
- As a first test subset, uni-grams starting with the letter t were explored
- Visualisations for Google N-Grams and Shakespeare Plays datasets were constructed
### Week 2 (W46, 16.11.-23.11.)
- Formulation of semester project tasks:
  - Evolution of diction over time
  - Analysis of word replacements
  - Classification of Shakespeare's plays
  - Prediction of predecessors and successors of words
### Week 3 (W47, 23.11.-30.11.)
- Enhancement of the cleaning strategy
- Count normalization of uni- and bi-grams (a sketch of the idea follows below)
- New part-of-speech attribute for each n-gram
- Category attributes for the Shakespeare n-grams
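One common way to normalize counts is to divide each year's match count by the total number of matches for that year; whether the team used exactly this scheme is an assumption. A minimal sketch with made-up numbers:

```python
from collections import defaultdict

# Hypothetical (ngram, year) -> match_count pairs.
counts = {("cow", 1978): 335, ("grape", 1978): 120, ("cow", 1979): 261}

# Total matches per year, used as the normalization denominator.
totals = defaultdict(int)
for (ngram, year), c in counts.items():
    totals[year] += c

# Relative frequency of each n-gram within its year.
normalized = {(ngram, year): c / totals[year] for (ngram, year), c in counts.items()}
print(normalized[("cow", 1978)])  # 335 / 455 ~= 0.736
```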
### Week 4-5 (W48-49, 30.11.-14.12.)
- Downloading and cleaning all uni-grams <- difficulties here.
- Implementation of characterization algorithm for Shakespeare's plays
- One group member left the team
### Week 6 (W50-51, 14.12.-21.12.)
- Downloading and cleaning all uni-grams <- Script started on server.
- Restructured the uni- and bigram CSV files
- Categorization by scenes
### Week 7 (W2, 4.01.-11.01.)
- Categorization by scenes <- Results
- Downloading and cleaning <- Problems with access to server
- Following word prediction <- Ideas
### Week 8 (W3, 11.01.-18.01.)
- Work in progress: GUI & algorithm for "Following word prediction"
- Categorization: test the algorithm based on bi-grams & divide the dataset by paragraphs (speeches); a rough illustration of bigram-based scoring follows below
- Downloading and cleaning: processes are killed -> probably a memory issue? Solution: only download a few sets that contain interesting words.
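The team's categorization algorithm is not spelled out here; as a rough illustration of one bigram-based approach, a passage can be assigned the category whose (hypothetical) bigram frequencies give it the highest log-score:

```python
import math

# Hypothetical relative bigram frequencies per category.
category_freqs = {
    "tragedy": {("o", "death"): 0.004, ("my", "lord"): 0.010},
    "comedy":  {("my", "lord"): 0.006, ("good", "sir"): 0.008},
}

def log_score(bigrams, freqs, floor=1e-6):
    """Sum log-frequencies of a passage's bigrams, with a floor for unseen ones."""
    return sum(math.log(freqs.get(bg, floor)) for bg in bigrams)

def categorize(bigrams):
    """Pick the category under which the passage's bigrams score highest."""
    return max(category_freqs, key=lambda cat: log_score(bigrams, category_freqs[cat]))

passage = [("my", "lord"), ("o", "death")]
print(categorize(passage))  # -> 'tragedy'
```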
### Week 9 (W4, 18.01.-24.01.)
- Word prediction: Work in progress
- Categorization: prediction of play given several lines
- Downloading and cleaning: new attempt using Python's csv reader instead of the pandas library (see the streaming sketch below)
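A sketch of the idea: stream a (gzipped, tab-separated) unigram file with Python's csv module instead of loading it whole with pandas, so only matching rows are ever held in memory. The file name and word list are placeholders:

```python
import csv
import gzip

INTERESTING = {"love", "death", "king"}  # placeholder word list

kept = []
# Stream the file row by row rather than reading it all at once.
with gzip.open("googlebooks-eng-gb-all-1gram-20120701-l.gz", "rt", newline="") as f:
    for ngram, year, match_count, volume_count in csv.reader(f, delimiter="\t"):
        # Strip any part-of-speech suffix like "_NOUN" before matching.
        if ngram.split("_")[0].lower() in INTERESTING:
            kept.append((ngram, int(year), int(match_count)))
```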
### Week 10 (W5, 24.01.-1.02.)
- Downloading and cleaning: strategy working.
- Play writing: Multiple plays written.
- Categorization: Categorization of the newly written plays
- Word prediction: development of a GUI that predicts the successor of a given word
## Long Description
An N-gram is a sequence of N consecutive tokens of a text. For example, for the sentence "Everyday friendly cows eat grapes.", all the 2-grams that can be found are:
- Everyday friendly
- friendly cows
- cows eat
- eat grapes
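A minimal sketch of this extraction in Python, assuming simple whitespace tokenization:

```python
def ngrams(text, n):
    """Return all n-grams (as word tuples) of a whitespace-tokenized text."""
    words = text.rstrip(".").split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("Everyday friendly cows eat grapes.", 2))
# -> [('Everyday', 'friendly'), ('friendly', 'cows'), ('cows', 'eat'), ('eat', 'grapes')]
```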
One can now search for all N-grams in an entire book or any other text corpus. The result is what is called an N-gram model, which can be used for the various tasks mentioned above.
In 2009, Google started to collect all 1- to 5-grams from the Google Books corpus and built a huge database from them.
The dataset contains multiple tables for various languages, including British and American English, English fiction, German, Chinese, Hebrew, and others.
The tables are sorted by N (1 to 5) and alphabetically. Within the tables, the N-grams are sorted chronologically, and the records of the Version 2 database have the following structure:
_ngram_ TAB _year_ TAB _match_count_ TAB _volume_count_
So, if we have, for example, these two lines from a unigram model:
circumvallate 1978 335 91
circumvallate 1979 261 91
we can read the first as: in 1978, the word "circumvallate" appeared a total of 335 times in 91 distinct volumes from Google Books. Similarly for 1979.
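Parsing such records in Python is straightforward; a minimal sketch, assuming the tab-separated Version 2 layout described above:

```python
def parse_record(line):
    """Split one tab-separated n-gram record into typed fields."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return ngram, int(year), int(match_count), int(volume_count)

print(parse_record("circumvallate\t1978\t335\t91"))
# -> ('circumvallate', 1978, 335, 91)
```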
The entire dataset is very large (2.2 TB), but it would not make sense to analyze it across languages. For the prediction questions it also makes sense to pick just one or two values of N, which yields a reasonably sized database (e.g., the British English bigram data is approximately 100-120 GB in size).
## Complementary Dataset
The William Shakespeare Plays database will be used as a complementary dataset for analysis and classification. It is a single table in CSV format that lists, for each dialogue part, a running count per chapter, the chapter in which the dialogue part appears, the speaker, and the dialogue text itself. The dataset can be found here: https://datahub.io/dataset/william-shakespeare-plays
## Links
The Google NGrams datasets can be downloaded here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Shakespeare Plays dataset can be downloaded here: https://datahub.io/dataset/william-shakespeare-plays
A Python package exists that enables downloading user-defined subsets: https://pypi.python.org/pypi/google-ngram-downloader/
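According to the package's documentation, usage looks roughly as follows; the exact API (in particular `readline_google_store`) should be treated as an assumption, since the package may have changed:

```python
from google_ngram_downloader import readline_google_store

# Iterate over the unigram files in the Google store; each item is a
# (file_name, url, record_iterator) triple (per the package's docs).
fname, url, records = next(readline_google_store(ngram_len=1))
print(fname)
record = next(records)  # fields: ngram, year, match_count, volume_count
print(record.ngram, record.year, record.match_count)
```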