# Week 4+5 (W48-49, 30.11-14.12): Google's N-Grams (Rostlab/DM_CS_WS_2016-17 Wiki)
### Summary
- Downloading script for n-grams: a script was finalized this week that downloads the tables and cleans them on the fly, using the cleaning strategies developed over the last weeks.
- Challenge: running the script fails with `IOError: [Errno 122] Disk quota exceeded`. The solution proposed for the issue "Upload big amounts of data to the server" will be tried out this week.
- Cleaning strategy: the strategy developed over the last weeks is now fully implemented and included in the download-and-clean script.
- Development and implementation of a classification algorithm: a classifier was developed that categorizes Shakespeare's plays.
The script uses the "google_ngram_downloader" API to download the tables from Google NGrams. Iteratively, the zip file of each table is downloaded, extracted and saved; the zip file is deleted afterwards, so minimal disk space is needed to store the tables. Running this script to download and clean ALL uni- and bigrams of British English will be very time-consuming. However, a solution to the issue https://github.com/Rostlab/DM_CS_WS_2016-17/issues/13 was only published on Sunday evening and has therefore not been used yet; this will be a goal for next week. The current script can be found in this group's repository (https://github.com/brosequartz/GoogleNGramDataset/blob/master/_Cleaning.py).
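The download-extract-delete loop can be sketched as follows. Since the real tables are fetched over the network via `google_ngram_downloader`, this minimal example demonstrates only the extract-then-delete pattern on a locally created archive; the file names are hypothetical stand-ins:

```python
import os
import zipfile

def extract_and_discard(zip_path, out_dir):
    """Extract every member of the archive, then delete the archive,
    so only the uncompressed table remains on disk."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    os.remove(zip_path)

# Build a stand-in for one downloaded n-gram table (hypothetical names).
os.makedirs("tables", exist_ok=True)
with zipfile.ZipFile("tables/googlebooks-eng-gb-1gram-a.zip", "w") as zf:
    zf.writestr("googlebooks-eng-gb-1gram-a.csv", "love\t1800\t42\t7\n")

extract_and_discard("tables/googlebooks-eng-gb-1gram-a.zip", "tables")
print(os.path.exists("tables/googlebooks-eng-gb-1gram-a.csv"))  # True: table kept
print(os.path.exists("tables/googlebooks-eng-gb-1gram-a.zip"))  # False: archive deleted
```

Because each archive is removed right after extraction, the disk footprint at any moment is roughly one compressed table plus the already-cleaned output.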
The cleaning strategy that the group developed up to last week is now fully implemented. There now exists a class with four cleaning functions:
- rough cleaning: the first cleaning step compares the table against a stop-word list and an English dictionary. The stop-word list consists of the one from Python's nltk library (see http://www.nltk.org/) plus the names of all speakers in the Shakespeare plays; grams on that list are removed from the table. Afterwards, the table is compared against a dictionary to get rid of NaWs (Not-a-Word) or, in the case of bigrams, NaTs (Not-a-Term).
- frequency cleaning: to leave out very common grams, a threshold is set for each table, determined as the length of the table divided by the maximum count. All grams whose total counts are higher than the threshold are cleaned out.
- letters: the n-gram tables list many words case-sensitively, i.e. a unigram that appears in both upper and lower case occupies separate rows. The normalized counts (normalized: division by volume count) of such variants are added up and merged into a single row; note that this also reduces the size of the table.
- types: this is not an actual cleaning method. Rather, the word types that are often attached to the grams after an "_" are extracted and written to a separate column for later use.
Of course the class has a constructor, which takes a filename as input and reads it into a pandas DataFrame (see http://pandas.pydata.org/ for a reference of the pandas library). The threshold described above is computed and stored as an attribute, as is the nltk stop-word list.
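A minimal sketch of such a class is shown below. The method, column, and variable names are ours, and the stop-word set and dictionary are tiny illustrative stand-ins; the real script builds its stop-word list from nltk plus the Shakespeare speaker names and uses a full English dictionary, and its constructor reads the table from a file:

```python
import pandas as pd

# Illustrative stand-ins for the real stop-word list and dictionary.
STOPWORDS = {"the", "and", "of"}
DICTIONARY = {"love", "crown", "battle"}

class NGramCleaner:
    """Sketch of the cleaning class; names and column layout are assumptions."""

    def __init__(self, frame):
        # The real constructor takes a filename and reads it with pandas;
        # a DataFrame is accepted here to keep the sketch self-contained.
        self.df = frame
        # Threshold: length of the table divided by the maximum count.
        self.threshold = len(self.df) / float(self.df["count"].max())

    def types(self):
        # Not real cleaning: split the word type attached after "_"
        # (e.g. "love_NOUN") into its own column for later use.
        split = self.df["gram"].str.split("_", n=1, expand=True)
        self.df["gram"], self.df["type"] = split[0], split[1]
        return self

    def rough_cleaning(self):
        # Remove stop words, then everything not in the dictionary
        # (NaWs, or NaTs in the bigram case).
        lower = self.df["gram"].str.lower()
        self.df = self.df[~lower.isin(STOPWORDS) & lower.isin(DICTIONARY)]
        return self

    def frequency_cleaning(self):
        # Grams with total counts above the per-table threshold are cleaned out.
        self.df = self.df[self.df["count"] <= self.threshold]
        return self

    def letters(self):
        # Merge case variants: sum the normalized counts (count divided by
        # volume count) and keep a single lower-case row per word.
        norm = self.df["count"] / self.df["volumes"]
        self.df = (self.df.assign(gram=self.df["gram"].str.lower(), norm=norm)
                          .groupby("gram", as_index=False)["norm"].sum())
        return self

# Tiny demonstration table (threshold = 4 rows / max count 2 = 2.0).
table = pd.DataFrame({
    "gram": ["Love_NOUN", "love_NOUN", "the_DET", "crown_NOUN"],
    "count": [2, 1, 2, 1],
    "volumes": [1, 1, 1, 1],
})
cleaner = NGramCleaner(table).types().rough_cleaning().frequency_cleaning().letters()
print(sorted(cleaner.df["gram"]))  # "the" is gone; "Love"/"love" are merged
```

Calling `types` first matters in this sketch: the dictionary lookup in `rough_cleaning` only works once the `_NOUN`-style suffixes have been stripped off.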
The Shakespeare plays were classified by Shakespeare into three categories: comedy, history and tragedy. This week our team implemented an algorithm that classifies plays into these categories based on uni-grams. We think the classification algorithm can be extended to larger datasets with different content. The Shakespeare dataset contains only 36 plays, so they were divided into a training set of 30 plays and a test set of 6. The classifier is not given the categories of the plays in the test set and classifies them based on the plays in the training set. The test set contains 6 plays, 2 randomly chosen from each category:
- History: "King John","Richard II"
- Comedy: "Twelfth Night", "The Winter's Tale"
- Tragedy: "Hamlet", "King Lear"
The training set contains all other plays. The idea of the classifier is to create a uni-gram profile of each category from the training set and compare it to the profile of the play to be classified. A category profile contains all uni-grams from the plays in that category, i.e. the words and their total counts; similarly, the play profile contains all uni-grams from that play. The uni-grams were cleaned of stop words beforehand. Different measures can be used to compare the profiles; we decided to start with a simple one, the out-of-order distance [1]. First, the uni-grams in both the category and play profiles are sorted by total count in descending order. Then, for each uni-gram in the play profile, the out-of-order distance to each category is calculated as follows:
d = |ind_play - ind_cat|
So the out-of-order distance for a uni-gram is the absolute difference between its indices in the play and category profiles. For example, if the word "love" is at position 3 in the play profile, meaning it is the third most frequent word in this play, and at position 10 in the history category, then the out-of-order distance from the play to history is 7. If a uni-gram from the play is not contained in the category, the distance is set to the maximum, i.e. the length of the category profile. Finally, the out-of-order distances of all uni-grams from the play are added up, and the classifier outputs the category with the smallest total distance. With this classifier 5 plays were classified correctly; one play, "The Winter's Tale", was incorrectly classified as tragedy.
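The steps above can be sketched as follows. The word lists are invented for illustration and the function names are ours; profiles are simplified to lists of words already sorted by total count, descending:

```python
def out_of_order_distance(play_profile, category_profile):
    """Both profiles are lists of uni-grams sorted by total count, descending.
    Each gram of the play contributes the absolute difference of its positions;
    a gram missing from the category contributes the maximum distance,
    i.e. the length of the category profile."""
    position = {word: i for i, word in enumerate(category_profile)}
    return sum(abs(i - position[word]) if word in position else len(category_profile)
               for i, word in enumerate(play_profile))

def classify(play_profile, category_profiles):
    """Return the category whose profile has the smallest total distance."""
    return min(category_profiles,
               key=lambda cat: out_of_order_distance(play_profile,
                                                     category_profiles[cat]))

# Toy profiles: "love" is 3rd in the play and 10th in history,
# so it alone contributes |2 - 9| = 7 to that distance.
play = ["crown", "battle", "love"]
history = ["king", "war", "crown", "duke", "battle",
           "sword", "throne", "york", "france", "love"]
tragedy = ["death", "love", "crown", "grave", "battle"]

print(out_of_order_distance(play, history))  # prints 12 (2 + 3 + 7)
print(classify(play, {"history": history, "tragedy": tragedy}))
```

Note that indices here are 0-based, but the distance is unchanged, since both profiles shift by the same offset.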
To improve the classifier, we tried another distance measure that takes the total counts of the uni-grams into account. The same category and play profiles were used. The new measure is the frequency distance [2]:
d = (2 * (Count_play - Count_cat) / (Count_play + Count_cat))^2
This measure has the form of a relative distance, where Count is the total count of a given uni-gram in the play or category profile. With this measure, again 5 plays were classified correctly, but this time the misclassified play was "King John": it was classified as comedy instead of history. For code, please look here.
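A sketch of the frequency distance is shown below. Two details are our assumptions, not taken from the source: the per-gram distances are summed over the play's uni-grams (mirroring the out-of-order classifier), and a gram absent from the category is capped at 4, the value the formula approaches as the category count goes to 0:

```python
def frequency_distance(play_counts, category_counts):
    """Counts are dicts mapping uni-gram -> total count. Each shared gram
    contributes (2*(c_play - c_cat)/(c_play + c_cat))**2; summing over the
    play's grams and capping missing grams at 4 are our assumptions."""
    total = 0.0
    for word, c_play in play_counts.items():
        if word in category_counts:
            c_cat = category_counts[word]
            total += (2.0 * (c_play - c_cat) / (c_play + c_cat)) ** 2
        else:
            total += 4.0
    return total

# One shared gram: (2 * (3 - 1) / (3 + 1))**2 = 1.0
print(frequency_distance({"love": 3}, {"love": 1}))
```

As with the out-of-order measure, the classifier would then pick the category with the smallest total frequency distance.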
For the next week our goals are:
- test the existing classifiers by dividing the dataset differently, for example by cross-validation.
- combine the two measures and classify based on both of them
- extend classifiers to use bi-gram data
- Wordcloud for "History": ![History](http://i.imgur.com/8f2Wh7e.png)
- Wordcloud for "Comedy": ![Comedy](http://i.imgur.com/g3nKZiQ.png)
- Wordcloud for "Tragedy": ![Tragedy](http://i.imgur.com/CAV4Fc8.png)
References:
[2] [Graovac, Jelena. "Text categorization using n-gram based language independent technique." 35th Anniversary of Computational Linguistics in Serbia, Book of Abstracts, to appear in the Proceedings of the Conference, 2013.](http://poincare.matf.bg.ac.rs/~jgraovac/publications/jg35CLS.pdf)