Week 8 W3 11.01 18.04 Google's N-Grams

###Summary:

* Categorization by scenes: The dataset was divided into paragraphs (speeches). The categorization algorithm was tested on classifying scenes and paragraphs from the Shakespeare play dataset based on bi-grams.
* Downloading and cleaning: Unfortunately, the processes are still being killed, probably due to memory issues. Solution: pick significant words and download and clean only the N-Gram datasets that contain these words.
* Word prediction: Developing the algorithm to predict following words and integrating it into a GUI is work in progress. The idea for the algorithm has been refined and is now being implemented.

###Categorization by scenes:

Since categorization by scenes showed 90% correct results, we decided to divide the dataset into even smaller texts. This way the number of instances increases, but the number of attributes per instance decreases. The dataset was divided into paragraphs, one paragraph being one person's speech; in total there are 29999 speeches. The original dataset already contains a column with character names. Using uni-grams and a Multinomial Naive Bayes classifier, 60% of the paragraphs were classified correctly. However, if we merge each short speech of one or two lines with the speech following it, the classifier's performance improves to 75%. As a result we have 11250 paragraphs/instances. The resulting confusion matrices are shown below (rows represent the true category, columns the predicted category).
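As an illustration, here is a minimal sketch of the uni-gram Multinomial Naive Bayes setup using scikit-learn; the toy speeches below merely stand in for the real speech and character columns, and the project's actual code may differ:

```python
# Sketch: classify speeches by character with uni-gram counts and
# Multinomial Naive Bayes (scikit-learn). Toy data for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

train_speeches = ["To be, or not to be, that is the question",
                  "Alas, poor Yorick! I knew him, Horatio",
                  "O Romeo, Romeo, wherefore art thou Romeo",
                  "Parting is such sweet sorrow"]
train_characters = ["HAMLET", "HAMLET", "JULIET", "JULIET"]
test_speeches = ["The rest is silence", "Good night, good night"]
test_characters = ["HAMLET", "JULIET"]

vectorizer = CountVectorizer()                       # uni-gram bag of words
X_train = vectorizer.fit_transform(train_speeches)
X_test = vectorizer.transform(test_speeches)

clf = MultinomialNB().fit(X_train, train_characters)
predicted = clf.predict(X_test)

# Rows are the true character, columns the predicted one.
print(confusion_matrix(test_characters, predicted, labels=["HAMLET", "JULIET"]))
```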

We are also working on classification based on bi-grams, again using the Multinomial Naive Bayes classifier. Unfortunately the results were better when using only uni-grams than when using both uni-gram and bi-gram data: 83% for scenes and 70% for speeches. The code can be found here. Another problem is that extracting bi-grams from the test set is more time-consuming than extracting uni-grams. The resulting confusion matrices are shown below.
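The bi-gram features can be obtained from the same kind of pipeline by widening the vectorizer's n-gram range; a small sketch (not necessarily the extraction code linked above):

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = ["to be or not to be"]

# Uni-grams only: the setup that has scored best so far.
uni = CountVectorizer(ngram_range=(1, 1)).fit(sample)

# Uni-grams plus bi-grams: the much larger, sparser feature space is a
# plausible reason why extraction is slower and accuracy dropped.
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(sample)

print(sorted(uni.vocabulary_))     # ['be', 'not', 'or', 'to']
print(sorted(uni_bi.vocabulary_))  # adds 'be or', 'not to', 'or not', 'to be'
```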

Next week, classification of scenes and paragraphs by play will be attempted.

Histograms for visualization:

http://i.imgur.com/MEVgLe6.png

http://i.imgur.com/T6tu525.png

http://i.imgur.com/yQVJYJX.png

http://i.imgur.com/BJQnLLP.png

###Downloading and cleaning:

The processes started on the server are "killed" without any error message. According to discussions found on the web, this happens when the server runs out of memory because of the memory usage of our download-and-clean script. Since there is not enough time to develop a completely new download-and-clean strategy, one solution would be to pick significant words based on Task 1 and download only the datasets that contain those words onto the team members' computers. This way we could predict succeeding words, but not predecessors.
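A rough sketch of what such a filter could look like; the file names, the assumed tab-separated 2-gram format, and the word list are placeholders rather than the actual script:

```python
# Stream one gzipped Google Books 2-gram file and keep only the rows whose
# first token is one of the significant words, so memory use stays flat.
import gzip

significant_words = {"love", "death", "king"}   # e.g. frequent words from Task 1

with gzip.open("googlebooks-eng-all-2gram-sample.gz", "rt", encoding="utf-8") as src, \
        open("filtered-2grams.tsv", "w", encoding="utf-8") as dst:
    for line in src:
        ngram = line.split("\t", 1)[0]                   # "word1 word2"
        first_word = ngram.split(" ")[0].split("_")[0]   # drop POS tag if present
        if first_word.lower() in significant_words:
            dst.write(line)
```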

###Word Prediction:

During the feedback on last week's presentation it was suggested to automatically write a new Shakespeare play based on next-word prediction. The algorithm explained here: https://sookocheff.com/post/nlp/ngram-modeling-with-markov-chains/ uses a dictionary that maps each key word to the list of its succeeding words. To build sentences it randomly picks a word from such a list, so words that occur more often are more likely to be picked than rarer ones. However, we would like to always pick the word that is most likely to be the successor. Therefore we decided to change the algorithm so that it uses a tree where each node has its successors as children and each node is assigned a probability. This way we can pick the word with the highest probability among a node's children. For a start we will only look at total counts and calculate

P(word follows sample_word) = (# of times word follows sample_word) / (total # of successors of sample_word).
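A minimal sketch of this count-based successor probability, using a dictionary of counters instead of an explicit tree (the names are illustrative, not the project's actual code):

```python
from collections import defaultdict, Counter

def build_successor_counts(words):
    """For every word, count how often each following word occurs."""
    successors = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        successors[current][following] += 1
    return successors

def most_likely_successor(successors, word):
    """Return the successor with the highest probability and that probability."""
    counts = successors[word]
    total = sum(counts.values())
    best, best_count = counts.most_common(1)[0]
    return best, best_count / total

tokens = "to be or not to be that is the question".split()
successors = build_successor_counts(tokens)
print(most_likely_successor(successors, "to"))   # ('be', 1.0): "be" always follows "to" here
```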

This week's presentation:

https://prezi.com/ye07eq5ljfhg/google039s-ngrams-dataset/