Week 7 W2 4.01 11.04 Google's N Grams - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
###Summary:
- Categorization by scenes : The categorization algorithm was tested on classifying scenes from the Shakespeare play dataset.
- Downloading and cleaning : Unfortunately, we have no results yet due to problems accessing the server last week and the unexpected termination of the screens that were running the scripts. We have now restarted the downloading and cleaning scripts.
- Word prediction : We started working on an algorithm for predicting the following word.
###Categorization by scenes:
From last week we have the Shakespeare dataset divided by scenes and tagged with genres; in total there are 702 scenes across all plays. As before, we split the dataset into training and test parts (80% and 20% respectively) and used the same algorithm with the two distance measures explained in the previous weeks. We again create a profile for each category and for each scene in the test set, and compare each scene profile with all category profiles. This week the two distance measures were combined into one by a weighted sum, with the weights chosen to bring both measures to the same scale. We ran the algorithm 5 times, each time using a different part as the training set. The script can be seen here. In total, around 70% of the scenes were classified correctly.
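The profile-comparison step with the combined distance could be sketched as below. The two distance measures and the weight are placeholders (the actual measures were defined in previous weeks); profiles are assumed to be dictionaries mapping n-grams to relative frequencies.

```python
# Hypothetical sketch: classify a scene by comparing its profile against
# each category profile, combining two distance measures by a weighted sum.
# The concrete measures below (L1 and overlap) are illustrative assumptions.

def l1_distance(p, q):
    """Sum of absolute frequency differences over the union of n-grams."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def overlap_distance(p, q):
    """1 minus the fraction of the scene's n-grams also found in the category."""
    if not p:
        return 1.0
    return 1.0 - len(set(p) & set(q)) / len(p)

def classify(scene_profile, category_profiles, weight=0.5):
    """Pick the category with the smallest combined (weighted-sum) distance.

    `weight` rescales the first measure so both contribute on the same scale.
    """
    def combined(cat):
        q = category_profiles[cat]
        return (weight * l1_distance(scene_profile, q)
                + overlap_distance(scene_profile, q))
    return min(category_profiles, key=combined)
```

A scene profile would then be assigned to, e.g., "comedy" or "tragedy" depending on which category profile lies closest under the combined measure.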
As a comparison, we tried the built-in classic classifiers and text feature extractors from the scikit-learn Python library on the same "raw" dataset (containing only the text and genre of each scene, not uni- and bi-grams). More about the library can be found here. The text of the scenes was converted into a bag-of-words vector using the sklearn.feature_extraction.text module. Two different classifiers were then used: Multinomial Naive Bayes and a Linear Support Vector Classifier. They identified approximately 91% and 75% of the scenes correctly, respectively. The script can be found here. We can thus conclude that our classifier has a comparable success rate.
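In outline, the scikit-learn comparison looks like the sketch below. The tiny example scenes and labels are placeholders, not the real dataset; the vectorizer and classifiers are the ones named above.

```python
# Minimal sketch of the scikit-learn comparison: bag-of-words features
# plus Multinomial Naive Bayes and a Linear Support Vector Classifier.
# The four example "scenes" here are placeholders for the real data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

scenes = [
    "o happy dagger this is thy sheath there rust and let me die",
    "if music be the food of love play on give me excess of it",
    "a plague o both your houses they have made worms meat of me",
    "some are born great some achieve greatness",
]
genres = ["tragedy", "comedy", "tragedy", "comedy"]

vectorizer = CountVectorizer()          # text -> bag-of-words count vectors
X = vectorizer.fit_transform(scenes)

nb = MultinomialNB().fit(X, genres)     # Multinomial Naive Bayes
svc = LinearSVC().fit(X, genres)        # Linear Support Vector Classifier

unseen = vectorizer.transform(["let me die of this plague"])
print(nb.predict(unseen)[0])
```

On the real data the fit would of course use only the 80% training split, with accuracy measured on the held-out 20%.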
In the coming weeks, the algorithm will be modified to use the bi-gram data in addition to uni-grams. For demonstration purposes, we will also work on a GUI in which a user can enter a scene from a Shakespeare play and get a genre as output.
###Downloading and cleaning:
Downloading is still in progress on the server. Unfortunately, due to memory issues, the running processes were killed and had to be restarted. Since the error occurred while we were unable to log in, no cleaned tables are available yet.
###Word Prediction:
We decided to predict the word that follows a sample word in the two different contexts of our dataset: first, we will use our algorithm to predict which word most likely follows a sample word in a Shakespeare play, and second, we will predict the following word according to Google's NGrams.
We will write the algorithm in Python using appJar (http://appjar.info/) and create a GUI in which a user can type a word and see the following word in both contexts.
Our first attempt at implementing the algorithm will be based on Markov chains. The idea is explained very well here: https://sookocheff.com/post/nlp/ngram-modeling-with-markov-chains/. The algorithm calculates the probabilities of the following words based on bigram counts and chooses the one with the highest probability; if several words have equal probability, it chooses randomly among them. We will first try the algorithm on the Shakespeare play dataset, and then adapt it for Google's NGrams.
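The Markov-chain idea just described can be sketched as follows: build bigram counts from a token sequence, then return the most frequent successor of a word, breaking ties at random. Function names and the toy corpus are illustrative.

```python
# Sketch of next-word prediction from bigram counts (Markov chain, order 1).
# Ties between equally probable successors are broken at random, as described.
import random
from collections import Counter, defaultdict

def bigram_model(tokens):
    """Count, for each word, how often each successor follows it."""
    model = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        model[w1][w2] += 1
    return model

def predict_next(model, word):
    """Return the most probable successor of `word` (random among ties)."""
    followers = model.get(word)
    if not followers:
        return None        # word never seen, or has no recorded successor
    best = max(followers.values())
    return random.choice([w for w, c in followers.items() if c == best])
```

For Google's NGrams the same prediction step would work directly on the stored bigram counts, without first tokenizing a corpus.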
This algorithm can be used for our Tasks 4) and 2):
In task 2) we want to analyze which words could have replaced Shakespeare's words according to Google's N-Grams. Our first idea was to always take the second word of a bi-gram and search for all words that precede it in Google's N-Grams, but this would be extremely time-consuming. With the Markov algorithm we can instead build a probability model of which words most likely precede and follow our sample in both contexts, the Shakespeare plays and Google's N-Gram database, and then analyze the replacements over time.
In task 4) we want to predict the predecessors and successors of samples. We will therefore fuse these two tasks and program a GUI that gives results for both.
This week's presentation
http://prezi.com/aajft3a-uftv/?utm_campaign=share&utm_medium=copy&rc=ex0share