Week 10 W5 24.01. - 1.02. Google's N Grams - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

###Summary:

  • General: It has been found that the python pandas library is not well suited to handling our datasets, due to their size. The python csv-reader has been used instead.
  • Downloading and cleaning: The issues have been resolved. Unfortunately, the script takes a long time to download and clean all tables, so cleaning was started on the relevant datasets only.
  • Automatic Play Writer: Multiple plays have been written and categorized by the categorization algorithm.
  • Word prediction: A GUI has been set up that works on the cleaned datasets.
  • Categorization: The new plays written by the Automatic Play Writer were classified into genres and plays using the algorithm from the previous weeks.

In the beginning the group used the python pandas library to parse all csv tables. However, pandas creates a dataframe out of each table, which means that the entire file is loaded into memory at once; for tables of this size that consumes too much memory. The python csv-reader, in contrast, reads a file line by line without ever loading it entirely. The downloading and cleaning scripts have therefore been changed to use the csv-reader, which resolved the memory issues we were having before and resulted in a working script.
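The difference can be sketched as follows. This is a minimal example, not the group's actual script; the file name and the tab-separated four-column layout are assumptions about the cleaned NGram tables:

```python
import csv

def filter_rows(path, prefix):
    """Stream a large NGram table line by line with the csv-reader,
    keeping only rows whose first column starts with `prefix`.
    Unlike pandas.read_csv, only one row is in memory at a time."""
    kept = []
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for row in reader:          # one row at a time, no full DataFrame
            if row and row[0].startswith(prefix):
                kept.append(row)
    return kept
```

With pandas, `pd.read_csv(path)` would first materialize the whole table before any filtering could happen; the generator-style loop above keeps memory usage constant regardless of file size.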

###Downloading and cleaning: During this last week the issues have finally been resolved, and a working python script now exists that downloads the NGram datasets from the website and cleans them automatically. Unfortunately, since the course is now almost over, there is not enough time left to run the script on all datasets. The group therefore decided to download and clean only the datasets that include the "significant words" from the first description task.
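The download step can be sketched roughly as below. The URL pattern is an assumption based on the publicly hosted Google Books NGram exports (2012 version); the actual script may use a different version or mirror:

```python
import shutil
import urllib.request

# Assumed base URL and naming scheme of the Google Books NGram exports.
BASE = "http://storage.googleapis.com/books/ngrams/books/"

def ngram_url(prefix, n=2, version="20120701", corpus="eng-all"):
    """Build the download URL for one gzipped NGram table,
    e.g. the 2-gram table for words starting with 'th'."""
    return f"{BASE}googlebooks-{corpus}-{n}gram-{version}-{prefix}.gz"

def download(prefix, dest):
    """Stream one table to disk without holding it in memory."""
    with urllib.request.urlopen(ngram_url(prefix)) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
```

Downloading only the prefixes of the significant words ('th', 'lo', 'ma', ...) keeps the runtime manageable compared to fetching every table.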

###Automatic Play Writer: The automatic play writer wrote 10 different plays that the team was able to categorize using the categorization algorithm. For completeness, we want to explain the strategy again: First, a python script parses the bigrams from the Shakespeare play database. An extract of the table is depicted here: http://i.imgur.com/2P1MqUX.png

This table is then parsed and a tree is generated. Each node of the tree holds a word, all of its successors found in the Shakespeare plays, the number of times each successor occurs, and a reference to its parent. Here, you can see a picture of the tree for the example "Everyday friendly cows eat grapes. Cows eat grapes every day. Cows stand on lawn all day.": http://i.imgur.com/pvpnFE7.png
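The core of that structure can be sketched as follows. This is a hypothetical reconstruction of the described data structure, not the repository's actual code:

```python
from collections import defaultdict

class Node:
    """One word in the tree: the word itself, its successors with
    occurrence counts, and a reference back to a parent node."""
    def __init__(self, word, parent=None):
        self.word = word
        self.parent = parent
        self.successors = defaultdict(int)  # successor word -> count

def build_tree(bigrams):
    """Build the word/successor structure from (word, successor) pairs
    parsed out of the bigram table."""
    nodes = {}
    for w1, w2 in bigrams:
        node = nodes.setdefault(w1, Node(w1))
        node.successors[w2] += 1
        nodes.setdefault(w2, Node(w2, parent=node))
    return nodes
```

Running this on the example text above would, for instance, record that "cows" is followed by "eat" twice and by "stand" once.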

In order to write a play that somehow resembles the existing Shakespeare plays, we needed some statistics on the plays. The python script that retrieves the information can be found here: https://github.com/ImkeHelene/NGrams_Final_Repo/blob/master/Shakespeare_scripts/PlayStatistics.py. The results are:

  • Average number of speakers per play = 26 (variance in [-17, 26])
  • Average number of speakers per scene = 2 (variance in [0, 11])
  • Average number of scenes per play = 21 (variance in [-11, 23])
  • Average number of speeches per scene = 15 (variance in [-10, 36])
  • Average number of sentences per speech = 1 (variance in [0, 2])
  • Average number of words per sentence = 13 (variance in [-11, 20])

These statistics have been retrieved from the Shakespeare play database by extracting arrays, taking the mean values and calculating a variance vector (the range of deviations from the mean). Some of the results are depicted as graphics here:

Number of scenes per Play: http://i.imgur.com/7JCfZeS.png

Number of Speakers per Play: http://i.imgur.com/uU03leY.png

Average number of speeches per speaker in each Shakespeare play: http://i.imgur.com/ggczgtm.png

Average number of words per speaker in each Shakespeare play: http://i.imgur.com/E3wNOdl.png
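The statistics above can be reproduced with a small helper. This is a sketch under the assumption that the quoted "variance" intervals are the minimum and maximum deviation from the mean, which matches the asymmetric ranges listed:

```python
def mean_and_range(values):
    """Return the rounded mean of a per-play array together with the
    [min, max] deviation from that mean -- the 'variance' interval
    quoted in the statistics above (assumed interpretation)."""
    avg = sum(values) / len(values)
    deviations = [v - avg for v in values]
    return round(avg), [round(min(deviations)), round(max(deviations))]
```

For example, speaker counts of 10, 20 and 33 would give an average of 21 with deviations in [-11, 12].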

First, the speakers for the play to be written are picked; their number is the average number of speakers per play plus or minus a random value within the variance range. Within loops, the writer then produces 21 scenes, again plus or minus a random value within the variance range. The speakers of each scene are picked from the play's speaker array, and the length of this selection is again the average plus or minus a random value from the variance range. A scene is composed of speeches, whose number is determined the same way, and the speaker of each speech is picked at random from the array containing this scene's speakers. Each speech contains the average number of sentences per speech plus or minus a random value from the variance range.

The sentences themselves are written using the tree. To prevent all of the play's sentences from starting with the same word, a random word is picked from the root node's children. From there, we always pick the child of the current node with the highest count. Sometimes a node has a child that in turn has its parent as a child; to prevent ending up in an infinite loop, we then pick the child with the second highest count, and if this would also result in a loop, a random child is picked. Finally, to prevent very long sentences, each sentence is stopped after a number of words equal to the average number of words per sentence plus or minus a random value from the variance range. Without this cutoff, sentences would become very long, because '.' (the "end of sentence" node) does not have a high count among any single word's children; many sentences end with many different words.
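The sentence-writing walk described above can be sketched like this. It is a simplified illustration, assuming the tree is represented as a plain mapping from each word to its successor counts:

```python
import random

def write_sentence(tree, avg_len, var_range, rng=random):
    """Greedy walk over the successor counts: start from a random word,
    then always take the most frequent successor, falling back to the
    second-most frequent (or a random one) when the top choice would
    bounce straight back to the previous word.  The sentence is cut off
    after avg_len plus a random value from var_range words."""
    max_len = avg_len + rng.randint(var_range[0], var_range[1])
    word = rng.choice(list(tree))
    sentence = [word]
    prev = None
    while word != "." and len(sentence) < max_len:
        succs = tree.get(word)
        if not succs:
            break
        ranked = sorted(succs, key=succs.get, reverse=True)
        nxt = ranked[0]
        if nxt == prev and len(ranked) > 1:   # would loop back to parent
            nxt = ranked[1]
            if nxt == prev:                   # still a loop: pick randomly
                nxt = rng.choice(ranked)
        prev, word = word, nxt
        sentence.append(word)
    return " ".join(sentence)
```

The length cutoff is what keeps the walk from wandering for hundreds of words before it happens to reach the '.' node.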

Using this strategy, 10 different plays were generated, which we were also able to classify.

###Word Prediction: A similar algorithm is now used to predict the successor of a word. Due to the time issues with downloading and cleaning, the algorithm only works for words found in the bigram tables of words starting with 'th', 'lo', 'ma', 'sh', 'wo', 'go' and 'en'. These tables were chosen because they include the significant words from the Shakespeare plays that we picked in the beginning. Using the algorithm described above, a tree with words, successors and successor probabilities is written into a file. In a GUI the user can enter a word, and the 3 successors that are most common within Google's NGrams according to the tree are returned.
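The lookup behind the GUI amounts to a top-3 query on the successor counts. A minimal sketch, again assuming the tree is stored as a word-to-successor-counts mapping:

```python
def predict(tree, word, k=3):
    """Return the k most common successors of `word` according to the
    successor counts built from the cleaned bigram tables; returns an
    empty list for words that are not in the loaded tables."""
    succs = tree.get(word, {})
    return sorted(succs, key=succs.get, reverse=True)[:k]
```

So for an entry like `{"king": 10, "queen": 7, "lord": 5, "dog": 1}` under "the", the GUI would display "king", "queen" and "lord".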

###Categorization:

The new plays written by the Automatic Play Writer were divided into samples of 3 rows each. The samples were categorized into genres and plays using the Shakespeare dataset as training data. The figures below show the results. For demonstration purposes, a GUI was developed in which a play and a sample can be chosen; the sample's text as well as its predicted genre and play are then displayed.
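The sample splitting is straightforward; a sketch of the assumed chunking (the trailing chunk of a play may be shorter than 3 rows):

```python
def make_samples(rows, size=3):
    """Split a generated play into samples of `size` consecutive rows,
    as done before handing them to the classifier."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]
```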

http://i.imgur.com/KX56tAf.png http://i.imgur.com/0e61gSU.png

To make the lower values easier to see, the two highest numbers were omitted in the following image.

http://i.imgur.com/iqRVKAv.png

###Presentation:

https://prezi.com/jqgsfldqk6jh/google039s-n-grams-dataset/