Week 9 W4 18.01 24.04 Google's N Grams - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
###Summary:
- Categorization : identify the play given several lines from the dataset
- Downloading and cleaning : The issue from last week arose from high memory consumption. Using the Python csv parser instead of pandas.DataFrame could solve the problem.
- Word prediction : Work in progress: Tree algorithm is being implemented, no results yet
- Automatic Poem Writer: Function that returns rhymed words is being implemented
###Categorization:
Based on last week's discussion, we decided to test a Multinomial Naive Bayes classifier on unigrams for categorizing text by play. The Shakespeare plays dataset contains 36 plays, and each play is one category. The dataset was divided into samples of n lines each; we tested n = 5, 10 and 20. The last sample of a play is an exception when the play's total number of lines is not divisible by n: if the remainder contains more than n/2 lines, it is treated as a separate sample; if it contains fewer than n/2 lines, it is appended to the previous sample. In total we obtained 21030, 10515 and 5257 samples for n = 5, 10 and 20 respectively. However, the results were not very good: for n = 5, 40% of the test samples were correctly classified, and for n = 10 and 20 the scores were 50% and 55% respectively.
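The sampling rule and the classifier described above can be sketched in plain Python. This is an illustration only, not the script used for the reported numbers: `split_into_samples` and `UnigramNaiveBayes` are hypothetical names, the smoothing value `alpha=1.0` and the whitespace tokenization are assumptions, and a remainder of exactly n/2 lines is merged into the previous sample because the text does not specify that case.

```python
import math
from collections import Counter, defaultdict


def split_into_samples(lines, n):
    """Split one play's lines into samples of n lines each."""
    samples = [lines[i:i + n] for i in range(0, len(lines), n)]
    # Assumption: a remainder of exactly n/2 lines is merged as well;
    # the text only specifies "more than" and "less than" n/2.
    if len(samples) > 1 and len(samples[-1]) <= n / 2:
        tail = samples.pop()
        samples[-1] = samples[-1] + tail
    return samples


class UnigramNaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated unigrams."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed value)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [w for w in text.lower().split() if w in self.vocab]
        total = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total)  # log prior
            for w in tokens:
                score += math.log((counts[w] + self.alpha) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy usage: 12 lines with n = 5 leave a remainder of 2 lines (<= n/2),
# so the remainder is merged into the second sample.
play_lines = ["line %d" % i for i in range(12)]
print([len(s) for s in split_into_samples(play_lines, 5)])  # [5, 7]
```

In practice each sample's lines would be joined into one string before being passed to `fit` and `predict`, with the play title as the label.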
The next idea is two-step classification. The first step classifies a sample by genre; the second step classifies it by play, considering only the plays in the predicted genre. Since categorization by genre is quite accurate, classification by play should become more accurate: instead of classifying into 36 categories, we classify into 14 (comedies), 12 (tragedies) or 10 (histories) categories, depending on the outcome of the first step. However, if the output of the first step is wrong, the second step is meaningless. With this strategy, the accuracy increased to 44%, 54% and 62% for n = 5, 10 and 20 respectively. The new script can be found here.
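The routing logic of the two-step approach can be sketched as follows. The function name `classify_two_step` and the keyword-based toy classifiers are hypothetical stand-ins; in the real setup each classifier would be a trained model, one for genres and one per genre for its plays.

```python
def classify_two_step(sample, genre_classifier, play_classifiers):
    """Predict the genre first, then the play within that genre.

    genre_classifier: callable mapping a text sample to a genre label.
    play_classifiers: dict mapping each genre label to a callable that
                      only chooses among the plays of that genre.
    """
    genre = genre_classifier(sample)
    play = play_classifiers[genre](sample)
    return genre, play


# Toy stand-ins (assumptions for illustration, not trained models).
def toy_genre(sample):
    return "history" if "king" in sample.lower() else "comedy"


toy_plays = {
    "history": lambda s: "Henry V" if "harry" in s.lower() else "Richard III",
    "comedy": lambda s: "Twelfth Night" if "music" in s.lower() else "As You Like It",
}

print(classify_two_step("Cry God for Harry, England and the king",
                        toy_genre, toy_plays))
```

Because the second step only sees plays from the predicted genre, a wrong genre prediction can never be corrected later, so the overall accuracy is bounded above by the accuracy of the genre step.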
###Downloading and cleaning:
During last week's presentation it became apparent that the downloading issue is due to high memory consumption. The pandas library, which has been used in almost all scripts so far, creates a pandas.DataFrame of the entire csv file and therefore consumes a lot of memory. The Python csv parser instead parses a csv file line by line rather than pulling the entire file into memory and reading from there. The script has therefore been changed to use the Python csv parser (more info: https://docs.python.org/2/library/csv.html).
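A minimal sketch of the line-by-line approach, written for Python 3 and using only the standard library. The function name `count_rows_per_key` and the column names `Play` and `Line` are assumptions for illustration; the actual script and file layout are not shown here.

```python
import csv
import io
from collections import Counter


def count_rows_per_key(csv_file, key_column):
    """Count rows per value of key_column, reading one row at a time."""
    counts = Counter()
    # DictReader pulls rows lazily from the file object, so memory use
    # stays roughly constant regardless of the file size.
    for row in csv.DictReader(csv_file):
        counts[row[key_column]] += 1
    return counts


# Toy in-memory file standing in for a large csv file on disk.
data = "Play,Line\nHamlet,To be\nHamlet,or not to be\nMacbeth,Tomorrow\n"
print(count_rows_per_key(io.StringIO(data), "Play"))
```

With a real file, the same function can be called as `count_rows_per_key(open(path, newline=""), "Play")`. Only the running counts are kept in memory, never the full table.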
###Automatic Poem Writer:
Since we would like to create an automatic poem writer, we need the lines to rhyme with each other. As a first simple approach, each line could have the same number of words (~8 words), and the last words of each pair of lines could rhyme. A function returning the rhyming words for a given word was written. It uses the CMU pronunciation dictionary from the NLTK Python library. The function looks up the pronunciation of a word and finds the words whose pronunciation has the same ending. This function can be combined with the next-word prediction algorithm so that the last words of each pair of lines rhyme. The script can be found here. The code was adapted from this page.
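The rhyme lookup could look roughly like the sketch below. It is not the project's script: the function name `rhymes` and the parameter `level` (how many trailing phonemes must match) are assumptions, and a small hand-written dictionary stands in for the CMU data so the example runs without downloading the corpus. The toy dictionary has the same shape as the mapping returned by `nltk.corpus.cmudict.dict()`, where each word maps to a list of pronunciations and each pronunciation is a list of phoneme strings.

```python
def rhymes(word, pronunciations, level=2):
    """Return words whose last `level` phonemes match those of `word`."""
    word = word.lower()
    # A word may have several pronunciations; collect all their endings.
    endings = {tuple(p[-level:]) for p in pronunciations.get(word, [])}
    found = set()
    for other, prons in pronunciations.items():
        if other == word:
            continue
        if any(tuple(p[-level:]) in endings for p in prons):
            found.add(other)
    return sorted(found)


# Toy excerpt in CMU dictionary format (digits mark vowel stress).
toy_dict = {
    "night": [["N", "AY1", "T"]],
    "light": [["L", "AY1", "T"]],
    "bright": [["B", "R", "AY1", "T"]],
    "note": [["N", "OW1", "T"]],
    "day": [["D", "EY1"]],
    "way": [["W", "EY1"]],
}

print(rhymes("night", toy_dict))         # ['bright', 'light']
print(rhymes("day", toy_dict, level=1))  # ['way']
```

The choice of `level` matters: with `level=2`, "day" and "way" do not match because their initial consonants differ, while `level=1` compares only the final vowel. With NLTK installed and the corpus downloaded, `toy_dict` could be replaced by `nltk.corpus.cmudict.dict()`.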