Week 1 W45 46 9.11 16.11 Google's N Grams - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

0 - Summary:

During the first week the team focused on finding a second dataset to match against the n-grams. We found a dataset that lists all of Shakespeare's plays, including the dialogue. Since the main dataset, Google's N-Gram database, focuses on words and word combinations, the team decided to analyze how Shakespeare's diction evolved over time. We defined tasks that use the N-Gram database to analyze how words used by Shakespeare are used today, how their usage has changed over time, and whether they have been replaced by other words in text corpora. Shakespeare's plays can be categorized into histories, tragedies and comedies. The goal of mining the dataset is now to build an application that can categorize plays by Shakespeare (and possibly other plays) according to their usage of uni- and bigrams.

The team also decided to analyze the British English uni- and bigram databases, version 2 (from July 2012). These datasets were downloaded and transferred to our cluster accounts.

A few Python scripts were written this week to parse the Shakespeare database and extract all unigrams together with their counts and play-counts. A play-count is the number of times a unigram appears in one play.

Further discussions involved how to best visualize the datasets.

The team picked the "t" table from the British English unigrams to produce some first visualizations and to brainstorm. This table was chosen only as an example, to collect first ideas on what and how to visualize and to find tasks. DURING THE SEMESTER ALL OTHER TABLES WILL BE EXAMINED.

1 - Weekly work

1.1 - Dataset organization & description

1.1.1 - Dataset summary

N Gram database

Google's N-gram database is a collection of tables organized by language, by N (ranging from one to five), by version of publication (Version 1 was launched in July 2009, Version 2 in July 2012) and by initial letter. Within the tables, entries are ordered by upper and lower case as well as by type of word such as NOUN, PRON, etc. Each row lists the n-gram, the year in which it appeared, how many times it appeared in Google Books that year, on how many pages of Google Books objects it appeared and in how many different Google Books objects it appeared that year. As mentioned in the summary, the team chose to work on version 2 (dated 20120701) of the British English uni- and bigrams.
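As a rough illustration of the row layout described above, the following sketch parses one tab-separated line. The field order (n-gram, year, match count, then one or two further counts) and the names `NgramRow` and `parse_ngram_line` are assumptions made for illustration; they should be checked against the downloaded files.

```python
from typing import NamedTuple, Optional


class NgramRow(NamedTuple):
    """One row of an n-gram table (hypothetical container for this sketch)."""
    ngram: str
    year: int
    match_count: int
    page_count: Optional[int]
    volume_count: Optional[int]


def parse_ngram_line(line: str) -> NgramRow:
    """Parse a tab-separated row: ngram, year, match count, further counts.

    Assumption: with two trailing counts they are (page count, volume count);
    with a single trailing count it is treated as the volume count.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-separated fields: {line!r}")
    ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
    extra = [int(value) for value in fields[3:]]
    if len(extra) >= 2:
        page_count, volume_count = extra[0], extra[1]
    elif len(extra) == 1:
        page_count, volume_count = None, extra[0]
    else:
        page_count, volume_count = None, None
    return NgramRow(ngram, year, match_count, page_count, volume_count)
```

Rows with fewer trailing counts are accepted on purpose, so that a missing count shows up as `None` instead of aborting the parse.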

William Shakespeare Plays database

The database is a single table in CSV format. It lists the count of dialogue parts per chapter, the chapter in which each dialogue part appears, who speaks it and the dialogue part itself. Stage directions are also included. The team decided to parse this dataset, extract all unigrams (and later all bigrams) from the dialogue and count their overall occurrences. The dataset can be found here: https://datahub.io/dataset/william-shakespeare-plays

A Python script was written to parse the CSV table and extract all unigrams from the dialogue. For each unigram, the script counts the total number of occurrences across all dialogue, records the plays in which it appears and counts how many times it appears within each play. We refer to this new database as the transformed Shakespeare database. It is stored in CSV format with tab-separated columns, and each row has the following form: Unigram TAB Totalcount TAB Play1 TAB Playcount1 TAB Play2 TAB Playcount2 TAB ...
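The transformation described above could look roughly like the following sketch. It is not the team's actual script: the column positions (play in the second column, dialogue text in the sixth), the tokenization rule and all function names are assumptions for illustration, and the sample rows are made up.

```python
import csv
import re
from collections import Counter, defaultdict

# Assumption: a unigram is a run of lowercase letters or apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def count_unigrams(rows):
    """Return {unigram: Counter({play: count})} for an iterable of CSV rows.

    Assumed column order: row count, play, dialogue-part count, chapter,
    speaker, dialogue part.
    """
    per_play = defaultdict(Counter)
    for row in rows:
        if len(row) < 6:
            continue  # skip malformed rows
        play, text = row[1], row[5]
        for token in TOKEN_RE.findall(text.lower()):
            per_play[token][play] += 1
    return per_play


def count_unigrams_in_file(path):
    """Read the CSV file at `path` and count its unigrams."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # assumption: the first row is a header
        return count_unigrams(reader)


def format_row(unigram, play_counts):
    """Build one row: Unigram TAB Totalcount TAB Play1 TAB Playcount1 ..."""
    fields = [unigram, str(sum(play_counts.values()))]
    for play, count in sorted(play_counts.items()):
        fields += [play, str(count)]
    return "\t".join(fields)


# Made-up sample rows in the assumed column order.
sample_rows = [
    ["1", "Hamlet", "1", "3.1.56", "HAMLET", "To be, or not to be"],
    ["2", "Macbeth", "1", "5.5.19", "MACBETH", "To-morrow, and to-morrow"],
]
counts = count_unigrams(sample_rows)
print(format_row("to", counts["to"]))  # to, 4, Hamlet, 2, Macbeth, 2 (tab-separated)
```

Keeping one `Counter` per unigram makes both the total count and the per-play counts available from a single pass over the file.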

1.1.2 - Dataset description

Google's N-Grams

  • Google's UniGram Size: 3 GB
  • Google's BiGram Size: ~60 GB
  • Total Size: ~64 GB
  • Attributes: N-Gram, year, total-count, page-count, volume-count
  • Rows: > 4 Million
  • Format: Text. The team is considering conversion to CSV.

William Shakespeare Plays

  • Size: 10 MB
  • Attributes: CSV-row count, Play, dialogue-part count, chapter, speaker, dialogue-part
  • Rows: 111396
  • Format: CSV

Transformed Shakespeare Plays

  • Size: 3.6 MB
  • Attributes: Unigram, totalcount, plays, playcounts
  • Rows: 27625
  • Format: CSV

1.2 - Descriptive analysis of the instances and features

1.2.1 - Numeric Attributes

N-Gram database

The numeric attributes of both datasets are simple integer counts representing frequencies. For each n-gram and year, Google's N-gram database records how often the n-gram occurred and in how many distinct books it appeared.

1.2.3 - Text Attributes

Some of the n-grams carry one of the following tags, which mark the type of the word or the context in which it was used. The tagging is not consistent across the data. We are still discussing how to use this information; for instance, pronouns and determiners are of no use to us and could be removed in the data cleaning phase.

NOUN Noun

VERB Verb

ADJ Adjective

ADV Adverb

PRON Pronoun

DET Determiner or Article

ADP Adposition (preposition or postposition)

NUM Numeral

CONJ Conjunction

PRT Particle

ROOT Root of the parse tree

START Start of a sentence

END End of a sentence
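To make use of the tags during cleaning, a tagged entry first has to be separated into word and tag. The sketch below assumes that tagged entries are written as the word followed by an underscore and the tag (for example `table_NOUN`); the helper name `split_pos_tag` is hypothetical.

```python
# Word-class tags from the list above (ROOT, START and END mark positions,
# not word classes, so they are left out here).
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT"}


def split_pos_tag(token):
    """Split 'word_TAG' into (word, tag); return (token, None) if untagged."""
    word, separator, tag = token.rpartition("_")
    if separator and word and tag in POS_TAGS:
        return word, tag
    return token, None


print(split_pos_tag("table_NOUN"))  # ('table', 'NOUN')
print(split_pos_tag("table"))       # ('table', None)
```

With the tag separated, entries tagged PRON or DET could be dropped in the cleaning phase without touching untagged rows.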

1.3 - Missing values

Google's N-Grams

The documentation of this dataset mentions a volume count. However, this value is missing for many of the uni- and bigrams. So far this information is not relevant to our tasks, so we ignore it.

William Shakespeare Plays

The Shakespeare database also includes headings such as ACT I that are not spoken by characters, so these lines lack the dialogue count, chapter and speaker information. However, so far we do not use this information.

1.4 - Outliers

Google's N-Grams

At first sight the N-Gram database seemed fairly clean. However, the tables include many outliers, such as abbreviations. Since English words always contain at least one vowel, we decided to use a vowel filter as a first rough cleaning step. There are also very common words (unigrams) that appear many times in any text corpus. Such words are

articles and demonstratives:

  • the
  • this
  • that
  • these
  • those

pronouns (possessive, reflexive, relative and personal):

  • I, me, my, mine, myself
  • you, your, yours, yourself
  • he, him, his, himself
  • she, her, hers, herself
  • it, its, itself
  • we, us, our, ours, ourselves
  • you, your, yours, yourselves
  • they, them, their, theirs, themselves

prepositions:

  • a, an ("per"), about, above, across, against, among, like, as, at, of, for, except, ...

conjunctions:

  • and, because, but, for, if, or, when, ...

determiners:

  • every, many, ...

For an overview of English word classes, see: https://en.oxforddictionaries.com/grammar/word-classes-or-parts-of-speech

Although the documentation mentions that the datasets have been cleaned, the extent of the cleaning is not enough for our purposes. Google only kept n-grams that occur at least 40 times in the corpus, which removes rare n-grams but not the very common words listed above.

Every regular word of the English language contains a vowel. We do not consider abbreviations such as Mr. or Mrs. because our focus lies on diction (https://en.oxforddictionaries.com/definition/diction) and on the evolution of interesting words over time. By interesting we mean certain nouns, verbs or adjectives that were used or invented by Shakespeare and that are unknown, unused or used in different contexts today. We also do not want to investigate bigrams composed of the abbreviations and very common words described above.
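The vowel filter and the common-word lists described above could be combined as in the following sketch. The stop-word set is only an excerpt of the lists in this section, and the function names as well as the choice of which letters count as vowels are assumptions for illustration.

```python
# Excerpt of the common words listed in this section, lowercased.
STOPWORDS = {
    "the", "this", "that", "these", "those",
    "i", "me", "my", "mine", "myself",
    "you", "your", "yours", "yourself",
    "a", "an", "about", "above", "across", "against",
    "and", "because", "but", "for", "if", "or", "when",
    "every", "many",
}


def has_vowel(word, vowels="aeiou"):
    """True if the word contains at least one of the given vowel letters."""
    return any(letter in vowels for letter in word.lower())


def keep_unigram(word, vowels="aeiou"):
    """Keep a unigram only if it has a vowel and is not a listed common word."""
    return has_vowel(word, vowels) and word.lower() not in STOPWORDS


print(keep_unigram("Mr."))      # False: no vowel
print(keep_unigram("the"))      # False: common word
print(keep_unigram("Tempest"))  # True
```

Note that with only a, e, i, o and u counted as vowels, words such as "thy" or "why" are filtered out as well; passing `vowels="aeiouy"` keeps them. Which variant suits the Shakespeare data better is left open here.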

William Shakespeare Plays

After the uni- and bigrams are parsed from the Shakespeare plays, this dataset also contains many words of the types mentioned above, which need to be filtered.

1.5 - Insights and ideas

Related Variables:

  • Match words and year / decade
  • Match bigrams and year / decade
  • Match Shakespeare grams and type of play
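Matching n-grams to decades amounts to summing the yearly counts within each decade. A minimal sketch, assuming the rows have already been reduced to (n-gram, year, count) tuples; the function name and the sample numbers are made up for illustration:

```python
from collections import defaultdict


def counts_per_decade(rows):
    """Sum counts per (ngram, decade) for an iterable of (ngram, year, count)."""
    totals = defaultdict(int)
    for ngram, year, count in rows:
        decade = (year // 10) * 10  # e.g. 1809 -> 1800
        totals[(ngram, decade)] += count
    return dict(totals)


# Made-up sample rows.
rows = [("thou", 1801, 5), ("thou", 1809, 7), ("thou", 1810, 1)]
print(counts_per_decade(rows))  # {('thou', 1800): 12, ('thou', 1810): 1}
```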

Clustering:

  • Cluster unigrams according to the word type: Nouns, adjectives, pronouns...
  • Cluster uni- and bigrams according to decade
  • Cluster Shakespeare plays according to type of play
  • Cluster Uni- and Bigrams based on their appearance in play-types

Possible prediction tasks:

  • Given a certain word, predict which words are most likely to follow it. Possibly predict from a bigram the context in which it could appear.
  • Given a play by Shakespeare, predict its type based on the uni- and bigrams used in it. (We want to write an application that could possibly do this for Google Books. Since we do not have access to that database, we will try our application on the Shakespeare "subset" of Google Books.)
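The first prediction task can be approached with plain bigram counts: for a given word, rank the second words of all bigrams that start with it. The following is a minimal sketch under that assumption, not a settled design; the function names and the toy counts are made up.

```python
from collections import Counter, defaultdict


def build_followers(bigram_counts):
    """Map each first word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for (first, second), count in bigram_counts.items():
        followers[first][second] += count
    return followers


def predict_next(followers, word, k=3):
    """Return up to k most frequent followers of `word`, most frequent first."""
    return [follower for follower, _ in followers.get(word, Counter()).most_common(k)]


# Toy bigram counts, invented for this example.
bigram_counts = {("my", "lord"): 10, ("my", "love"): 4, ("good", "night"): 3}
followers = build_followers(bigram_counts)
print(predict_next(followers, "my"))       # ['lord', 'love']
print(predict_next(followers, "unknown"))  # []
```

Ranking by raw counts ignores how common the follower is overall, so very frequent words will dominate unless they are filtered out first.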

Visualization:

Google UniGram "T" data

An initial visualization of the T dataset has been produced; unsurprisingly, the most common word among the T unigrams is "The". Again, we picked this dataset pragmatically for brainstorming and first visualization purposes. THE REST OF THE DATASET WILL ALSO BE EXAMINED. The following visualizations are based on our Google unigram T dataset.

  • Most of the data under "T" is occupied by pronouns

  • The most common word among Google's T unigrams is "The"

William Shakespeare Plays

  • The most commonly used words across all plays will be displayed in word clouds.

  • The most commonly used words in specific plays:

  • Romeo and Juliet ![Romeo and Juliet](http://imgur.com/GwdAbTl.png)

  • A Winters Tale ![A Winters Tale](http://i.imgur.com/iRWuWFO.png)

  • Richard II ![Richard II](http://i.imgur.com/6NYfFDj.png)

  • Frequency distribution chart for the top 10 (cleaned) words used in Shakespeare's plays

  • Evolution of diction over time will be displayed in graphs and bar graphs.

Tasks:

For next week:

  • Find cleaning algorithms for N-Gram and transformed Shakespeare database.
  • Remove meaningless ngrams by comparing them to a British dictionary
  • Cluster Shakespeare plays and grams by type of play

Other:

1.6 - Weekly Presentation

https://docs.google.com/presentation/d/1D2s0koGpOXJ8rD-AitHzWjmHKBR5pxwojWkh6cpd7S0/edit#slide=id.g192322b868_0_14