Week 3 (W47, 23.11-30.11): Google's N-Grams

###Summary:

  • Enhancement of the cleaning strategy: Google's databases are very large and contain many NaW (Not-a-Word) entries, so an elaborate cleaning strategy is needed. During this week a three-level method was developed and implemented.
  • Comparability of total counts: The total counts of uni- and bigrams in the Google database are strongly tied to the volume count, since a high volume count leads to a high total count. It was therefore decided to normalize the counts by dividing the total count by the volume count.
  • Categorization of words and plays: A script now exists that, while parsing the uni- and bigrams from Shakespeare, assigns each play to one of the three categories. In the Google databases, the words often have a type attached, such as "tuke_VERB". A script now exists that separates the gram from its type and stores the type in the following column.
  • Scripts: All Python scripts can now be found here: https://github.com/brosequartz/GoogleNGramDataset

###Cleaning strategy

The team found that the English language does contain words without vowels, such as "spy". The earlier strategy of cleaning Shakespeare's unigrams by removing words without vowels therefore does not make sense and was replaced.

The new cleaning strategy consists of three functions, one for each kind of cleaning (a minimal sketch of all three passes follows the list):

  • Stopword cleaning: A list of stopwords containing articles, pronouns, etc. (the words mentioned in the summary of Week 1) was retrieved from the internet. The Shakespeare unigram database also contains speaker names, since speakers in a play sometimes say them. Because the names of all speakers can be found in the 4th column of the original Shakespeare dataset, a list of those names was built and added to the existing list of stopwords.
  • Not-a-Word cleaning: Cleaning the datasets of not-words (call them NaW). A database containing an English dictionary was found here: https://github.com/dwyl/english-words. This dictionary is compared to the Shakespeare unigram database after the latter has been cleaned with the stopword list. All words of that cleaned set that do not appear in the dictionary are added to it, because it has to be ensured that every word Shakespeare used is present in the dictionary. Once the dictionary is complete, Google's uni- and bigrams are compared to it, and any word not found in the dictionary is cleaned out. This ensures that no not-words remain in the Google databases.
  • Frequency cleaning: Words in the Shakespeare and Google databases that pass the first two steps (they are not stopwords and they appear in the dictionary) but whose counts are implausibly high relative to the size of the database and the highest word counts are also cleaned out.
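
A minimal sketch of the three passes, assuming the grams are held in a dict mapping word to count (function names and the frequency threshold are illustrative; the actual implementations are in the scripts repository linked above):

```python
def clean_stopwords(counts, stopwords):
    """Pass 1: drop stopwords and speaker names."""
    return {w: c for w, c in counts.items() if w.lower() not in stopwords}

def clean_not_a_word(counts, dictionary):
    """Pass 2: drop NaW entries, i.e. grams not in the English dictionary
    (the dwyl word list extended with all remaining Shakespeare unigrams)."""
    return {w: c for w, c in counts.items() if w.lower() in dictionary}

def clean_by_frequency(counts, max_share=0.01):
    """Pass 3: drop words whose count is implausibly high relative to the
    database size; the 1% threshold is an assumption, not the project's value."""
    total = sum(counts.values())
    return {w: c for w, c in counts.items() if c <= max_share * total}

def clean(counts, stopwords, dictionary):
    """Apply all three cleaning passes in order."""
    counts = clean_stopwords(counts, stopwords)
    counts = clean_not_a_word(counts, dictionary)
    return clean_by_frequency(counts)
```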

When the uni- and bigram databases of the Shakespeare plays are generated, stage directions are not counted. They are easy to detect because they do not have a number in the 4th column, so such lines are simply skipped during parsing.
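
A minimal sketch of that skip, assuming a delimited text file in which the 4th column holds a number only for spoken lines (the delimiter and column layout are assumptions about the dataset):

```python
import csv

def spoken_lines(path, delimiter=';'):
    """Yield only spoken lines of the Shakespeare dataset,
    skipping stage directions (no number in the 4th column)."""
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f, delimiter=delimiter):
            if len(row) > 3 and any(ch.isdigit() for ch in row[3]):
                yield row
```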

###Normalization of total counts

In order to visualize the evolution of diction over time, the differing volume counts need to be taken into account. For example, if "thou" appears 10 times in 2 distinct books in the year 1563 and 1000 times in 100 different books in another year, those raw numbers cannot be compared. The numbers were therefore normalized by dividing the total_count by the volume_count to make them comparable. So if "thou" appeared 10 times in 5 distinct books in 1563, its normalized count is 10/5 = 2.

This also makes it possible to "lowercase" every unigram. Continuing the previous example, if "Thou" appeared 3 times in 1 book in 1563, "thou" and "Thou" can be treated as one unigram by adding their normalized counts, so "thou" in 1563 is counted 2 + 3/1 = 5 times.
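
A minimal sketch of this normalization and case merging, assuming the per-year entries are (word, year, total_count, volume_count) tuples as in Google's ngram file format (the helper name is illustrative):

```python
from collections import defaultdict

def normalized_counts(entries):
    """Merge case variants per (word, year) by summing total_count / volume_count."""
    merged = defaultdict(float)
    for word, year, total_count, volume_count in entries:
        merged[(word.lower(), year)] += total_count / volume_count
    return merged

# The running example from above: "thou" 10x in 5 books, "Thou" 3x in 1 book.
entries = [("thou", 1563, 10, 5), ("Thou", 1563, 3, 1)]
assert normalized_counts(entries)[("thou", 1563)] == 5.0
```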

###Visualization for normalized counts

Thou: As per The Columbia Guide to Standard American English, "Most modern English speakers encounter 'thou' predominantly in the works of Shakespeare and in the King James Bible". This is indeed visible in the graph below: the use of the word "thou" is clearly high in the mid-17th century.

Thy: The word "thy", one of the top-ten words in the Shakespeare dataset, was plotted against Google's unigram dataset. It is a rather old word; its usage is high at the end of the 16th century, with a small spike again at the start of the 19th century.

Thee: The word "thee", also a top-ten word in the Shakespeare dataset, shows the same phenomenon: it is a rather old word and its usage has declined significantly.

Enter: "Enter" seems to be a regular word with steady usage over time. The steep spike around the year 1590 looks like an outlier, although Shakespeare's plays were written in that period. We are not sure about the reason for this spike and leave it to the linguists to ponder.

Would: "Would" shows roughly the same average usage over time; it is indeed a commonly used word.

Shall: The usage of "shall" is high until the 18th century in the plot and declines thereafter.

Good: The usage of "good" is likewise high until the 18th century and declines thereafter.

Man: Surprisingly, the usage of "man" is quite high until the 18th century and then declines for the remaining years.

Lord: The usage of "lord" is quite high in the 18th century and then declines.

Love: Strangely, "love" is declining over time! The word "love" was plotted against Google's unigram dataset, and its usage has declined.
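
For reference, a normalized per-year series such as the ones above could be plotted roughly as follows (a sketch only; it assumes the merged dict produced by the normalization sketch above, and the real plots were generated by the scripts in the linked repository):

```python
import matplotlib.pyplot as plt

def plot_word(merged, word):
    """Plot the normalized count of `word` over the years.
    `merged` maps (word, year) to total_count / volume_count."""
    years = sorted(year for (w, year) in merged if w == word)
    values = [merged[(word, year)] for year in years]
    plt.plot(years, values)
    plt.xlabel("year")
    plt.ylabel("total_count / volume_count")
    plt.title('Normalized usage of "%s"' % word)
    plt.show()
```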

###Extraction of types

In the Google databases, the uni- and bigrams sometimes have types attached; e.g. "tuke_VERB" appears in Google's unigram file for words starting with "t". A Python script separates the words from the types and saves the type in another column. The possible tags are Google's part-of-speech tags: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, and X.
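
A minimal sketch of the separation step (splitting on the last underscore and accepting only known tags; the actual script is in the linked repository):

```python
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", "X"}

def split_type(gram):
    """Split a gram like 'tuke_VERB' into ('tuke', 'VERB');
    grams without a recognised tag get an empty type column."""
    word, _, tag = gram.rpartition("_")
    if tag in POS_TAGS and word:
        return word, tag
    return gram, ""

print(split_type("tuke_VERB"))  # ('tuke', 'VERB')
print(split_type("thou"))       # ('thou', '')
```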

###Issues

  • Google databases: The Google Ngram databases are very large (~4 billion lines). Downloading the tables, storing them and running any kind of Python script on them is therefore very time- and memory-consuming. The team is still working on a proper downloading strategy and on speedups.
  • Shakespeare dataset: Extracting all bigrams together with the types of the plays they appear in is also very time-consuming. The team is working on speeding up the parser.

###This week's presentation

http://prezi.com/ffsto97a3wki/?utm_campaign=share&utm_medium=copy