Dataset "Google's N Grams" - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
- Proposer: Imke Helene Drave - @ImkeHelene - [email protected]
- Team:
  1. Imke Helene Drave @ImkeHelene
  2. Bibigul Shektybaeva @brosequartz
- Votes:
  1. @brosequartz
  2. @magiob
  3. @avradips
  4. @muhammadasad1
  5. @vivek-sethia
## Summary
A publicly available dataset containing all N-grams, for N from 1 to 5, extracted from the Google Books corpus.
N-gram models can be used for various tasks such as spelling correction, text recognition, word breaking, text summarization, and even machine learning.
In this course, the team will work on the British English Version 2 uni- and bigram sub-datasets of the Google N-Gram database. These will be combined with a dataset that lists all dialogue from Shakespeare's plays.
The final Python scripts of this project can be found here:
https://github.com/ImkeHelene/NGrams_Final_Repo
Final Presentation: https://prezi.com/jqgsfldqk6jh/google039s-n-grams-dataset/
## Prediction / Description Goals
- How did diction change over the years?
- Analysis of word replacements
- Categorization of theatrical plays into the categories history, tragedy, and comedy
- Given a word, which words is it likely to be followed or preceded by? (A minimal sketch of this idea follows below.)
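As a first illustration of the last goal, here is a minimal sketch of successor prediction from aggregated bigram counts. The words and counts below are made up, and the team's actual algorithm may differ:

```python
# Hypothetical aggregated bigram counts: (first_word, second_word) -> match_count.
bigram_counts = {
    ("thou", "art"): 1200,
    ("thou", "shalt"): 950,
    ("thou", "hast"): 700,
    ("my", "lord"): 2100,
}

def most_likely_successors(word, counts, top=3):
    """Return the `top` most frequent words following `word`."""
    successors = [(w2, c) for (w1, w2), c in counts.items() if w1 == word]
    return sorted(successors, key=lambda pair: pair[1], reverse=True)[:top]

print(most_likely_successors("thou", bigram_counts))
# -> [('art', 1200), ('shalt', 950), ('hast', 700)]
```

Predicting predecessors works the same way, just filtering on the second word of each bigram instead of the first.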
## Weekly Progress
### Week 1 (W45-46, 9.11.-16.11.)
- The Shakespeare Plays dataset is proposed for text analysis based on Google N-Grams
- This project will focus on British uni- and bi-grams
- Uni-grams were extracted from Shakespeare Plays dataset
- As a first test subset, uni-grams starting with the letter t were explored
- Visualisations for Google N-Grams and Shakespeare Plays datasets were constructed
### Week 2 (W46, 16.11.-23.11.)
- Formulation of semester project tasks:
  - Evolution of diction over time
  - Analysis of word replacements
  - Classification of Shakespeare's plays
  - Prediction of predecessors and successors of words
### Week 3 (W47, 23.11.-30.11.)
- Enhancement of the cleaning strategy
- Count normalization of uni- and bi-grams (a sketch of the idea follows below)
- New part-of-speech attribute for each n-gram
- Category attributes for the Shakespeare n-grams
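One common way to normalize counts is to divide each year's match count by the total number of matches for that year; whether the team used exactly this scheme is an assumption. A minimal sketch with made-up numbers:

```python
from collections import defaultdict

# Hypothetical (ngram, year) -> match_count pairs.
counts = {("cow", 1978): 335, ("grape", 1978): 120, ("cow", 1979): 261}

# Total matches per year, used as the normalization denominator.
totals = defaultdict(int)
for (ngram, year), c in counts.items():
    totals[year] += c

# Relative frequency of each n-gram within its year.
normalized = {(ngram, year): c / totals[year] for (ngram, year), c in counts.items()}
print(normalized[("cow", 1978)])  # 335 / 455 ~= 0.736
```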
### Week 4-5 (W48-49, 30.11.-14.12.)
- Downloading and cleaning all uni-grams <- difficulties here.
- Implementation of characterization algorithm for Shakespeare's plays
- One group member left the team
### Week 6 (W50-51, 14.12.-21.12.)
- Downloading and cleaning all uni-grams <- Script started on server.
- Restructured the uni- and bigram CSV files
- Categorization by scenes
### Week 7 (W2, 4.01.-11.01.)
- Categorization by scenes <- Results
- Downloading and cleaning <- Problems with access to server
- Following word prediction <- Ideas
### Week 8 (W3, 11.01.-18.01.)
- Work in progress: GUI & algorithm for "Following word prediction"
- Categorization: test the algorithm based on bi-grams & divide the dataset by paragraphs (speeches); a rough illustration of bigram-based scoring follows below
- Downloading and cleaning: processes are killed -> probably a memory issue? Solution: only download a few sets that contain interesting words.
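The team's categorization algorithm is not spelled out here; as a rough illustration of one bigram-based approach, a passage can be assigned the category whose (hypothetical) bigram frequencies give it the highest log-score:

```python
import math

# Hypothetical relative bigram frequencies per category.
category_freqs = {
    "tragedy": {("o", "death"): 0.004, ("my", "lord"): 0.010},
    "comedy":  {("my", "lord"): 0.006, ("good", "sir"): 0.008},
}

def log_score(bigrams, freqs, floor=1e-6):
    """Sum log-frequencies of a passage's bigrams, with a floor for unseen ones."""
    return sum(math.log(freqs.get(bg, floor)) for bg in bigrams)

def categorize(bigrams):
    """Pick the category under which the passage's bigrams score highest."""
    return max(category_freqs, key=lambda cat: log_score(bigrams, category_freqs[cat]))

passage = [("my", "lord"), ("o", "death")]
print(categorize(passage))  # -> 'tragedy'
```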
### Week 9 (W4, 18.01.-24.01.)
- Word prediction: Work in progress
- Categorization: prediction of play given several lines
- Downloading and cleaning: new attempt using Python's csv reader instead of the pandas library (see the streaming sketch below)
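A sketch of the idea: stream a (gzipped, tab-separated) unigram file with Python's csv module instead of loading it whole with pandas, so only matching rows are ever held in memory. The file name and word list are placeholders:

```python
import csv
import gzip

INTERESTING = {"love", "death", "king"}  # placeholder word list

kept = []
# Stream the file row by row rather than reading it all at once.
with gzip.open("googlebooks-eng-gb-all-1gram-20120701-l.gz", "rt", newline="") as f:
    for ngram, year, match_count, volume_count in csv.reader(f, delimiter="\t"):
        # Strip any part-of-speech suffix like "_NOUN" before matching.
        if ngram.split("_")[0].lower() in INTERESTING:
            kept.append((ngram, int(year), int(match_count)))
```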
### Week 10 (W5, 24.01.-1.02.)
- Downloading and cleaning: strategy working.
- Play writing: Multiple plays written.
- Categorization: Categorization of the newly written plays
- Word prediction: development of a GUI that predicts the successor of a given word
## Long Description
An N-gram is a sequence of N consecutive tokens of a text. For example, for the sentence "Everyday friendly cows eat grapes.", all the 2-grams that can be found are:
- Everyday friendly
- friendly cows
- cows eat
- eat grapes
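A minimal sketch of this extraction in Python, assuming simple whitespace tokenization:

```python
def ngrams(text, n):
    """Return all n-grams (as word tuples) of a whitespace-tokenized text."""
    words = text.rstrip(".").split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("Everyday friendly cows eat grapes.", 2))
# -> [('Everyday', 'friendly'), ('friendly', 'cows'), ('cows', 'eat'), ('eat', 'grapes')]
```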
One can now search for all N-grams in an entire book or any other text corpus. The result is what is called an N-gram model, which can be used for the various tasks mentioned above.
In 2009, Google started to collect all 1- to 5-grams from the Google Books corpus and built a huge database from them.
The dataset contains multiple tables for various languages, including British and American English, English fiction, German, Chinese, Hebrew, and others.
The tables are sorted by N (1 to 5) and alphabetically. Within the tables, the N-grams are sorted chronologically, and the records of the Version 2 database have the following structure:
_ngram_ TAB _year_ TAB _match_count_ TAB _volume_count_
So, if we have, for example, these two lines from a unigram model:
circumvallate 1978 335 91
circumvallate 1979 261 91
we can read the first as: in 1978, the word "circumvallate" appeared a total of 335 times in 91 distinct volumes from Google Books. Similarly for 1979.
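Parsing such records in Python is straightforward; a minimal sketch, assuming the tab-separated Version 2 layout described above:

```python
def parse_record(line):
    """Split one tab-separated n-gram record into typed fields."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return ngram, int(year), int(match_count), int(volume_count)

print(parse_record("circumvallate\t1978\t335\t91"))
# -> ('circumvallate', 1978, 335, 91)
```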
The entire dataset is very large (2.2 TB), but it would not make sense to analyze it across languages. For the prediction questions it also makes sense to pick just one or two values of N, which yields a reasonably sized database (e.g., the British English bigram data is approximately 100-120 GB in size).
## Complementary Dataset
The William Shakespeare Plays database will be used as a complementary dataset for analysis and classification. It is a single table in CSV format that lists, for each dialogue part, a running count per chapter, the chapter in which the dialogue part appears, the speaker, and the dialogue text itself. The dataset can be found here: https://datahub.io/dataset/william-shakespeare-plays
## Links
The Google NGrams datasets can be downloaded here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Shakespeare Plays dataset can be downloaded here: https://datahub.io/dataset/william-shakespeare-plays
A Python package exists that enables downloading user-defined subsets: https://pypi.python.org/pypi/google-ngram-downloader/
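According to the package's documentation, usage looks roughly as follows; the exact API (in particular `readline_google_store`) should be treated as an assumption, since the package may have changed:

```python
from google_ngram_downloader import readline_google_store

# Iterate over the unigram files in the Google store; each item is a
# (file_name, url, record_iterator) triple (per the package's docs).
fname, url, records = next(readline_google_store(ngram_len=1))
print(fname)
record = next(records)  # fields: ngram, year, match_count, volume_count
print(record.ngram, record.year, record.match_count)
```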