Week 2 (W46, 16.11–23.11): Google's N-Grams - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
During this week the team focused on finishing the brainstorming and defining the tasks that make up the semester project.
The semester project of our group will be composed of two descriptive and two predictive tasks.
Descriptive Tasks
1 Evolution of Diction over Time
In this task we want to look at the evolution of Shakespeare's diction over time.
a) After cleaning the Shakespeare unigram database, which we created last week, of the outliers mentioned (articles, pronouns, stop words, ...), some significant words are to be picked. By visualizing the unigrams in a word cloud we can see which unigrams appear often; we want to pick the ten unigrams with the highest counts. After extracting the significant words from the Shakespeare unigram database we will look for those unigrams in the British English Version 2 unigram database. Since in that dataset the unigrams and their counts are sorted by the year they appeared, we can plot their appearance counts over time and see whether they appear more or less often in the following and preceding years. It should be mentioned here that Shakespeare's plays were written between 1589 and 1613. The Google Books database contains 8,116,746 books written between 1500 and 2008, so we can also examine diction going in the other direction of time.
b) Once the unigrams have been evaluated, we also want to depict the evolution of bigrams that contain our significant words from a) and visualize their usage in the same way as for the unigrams.
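The workflow of task 1a) can be sketched with Pandas. The tiny inline DataFrames below are toy stand-ins; in the real scripts the data would come from our Shakespeare unigram CSV and the Google British English Version 2 unigram files, whose column names here are assumptions:

```python
import pandas as pd

# Toy stand-ins for the real CSV databases (names/columns are assumptions).
shakespeare = pd.DataFrame({
    "word":  ["thou", "thy", "shall", "love", "tree"],
    "count": [4431, 3390, 3078, 1892, 12],
})
google = pd.DataFrame({
    "word":  ["thou", "thou", "thy", "thy", "tree"],
    "year":  [1600, 1900, 1600, 1900, 1900],
    "match_count": [500, 20, 400, 30, 999],
})

# Step 1: the most frequent cleaned Shakespeare unigrams
# (three here, ten on the real data).
top = shakespeare.nlargest(3, "count")["word"].tolist()

# Step 2: their yearly counts in the Google data, one column per word,
# ready to be plotted over time.
evolution = (google[google["word"].isin(top)]
             .pivot(index="year", columns="word", values="match_count"))
print(evolution)
```

Each column of `evolution` is one significant word's appearance count per year, which is exactly the series we want to plot for the diction-over-time graphs.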
2 Analysis of Word Replacements
In this task we want to analyze whether any of the significant words chosen in task 1a) have been replaced by other words. Our strategy to answer this question is to extract the second words of all bigrams from 1b), find all bigrams in the Google British English 2-Gram Version 2 database that have the same second word, and look at their first words. We will extract the three most common predecessors of the successors of our significant Shakespeare words for each century. To visualize those, word clouds will be generated again. By analyzing the predecessors of the successors of our significant words we can then state whether the significant words have been replaced, or are used in different contexts nowadays or in earlier times, respectively.
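The "three most common predecessors per century" step can be sketched as a grouped aggregation. The column layout of the bigram rows below is a toy assumption, not the real 2-gram file format:

```python
import pandas as pd

# Toy 2-gram rows: first word, second word, year, count (layout assumed).
bigrams = pd.DataFrame({
    "first":  ["thy", "your", "his", "your"],
    "second": ["lord", "lord", "lord", "lord"],
    "year":   [1610, 1610, 1850, 1850],
    "count":  [40, 10, 25, 30],
})

# Bucket the years into centuries, then keep, for every successor word
# and century, the three most common predecessors.
bigrams["century"] = bigrams["year"] // 100 * 100
top_predecessors = (
    bigrams.groupby(["second", "century", "first"], as_index=False)["count"].sum()
           .sort_values("count", ascending=False)
           .groupby(["second", "century"])
           .head(3)              # three most common predecessors per group
)
print(top_predecessors)
```

Comparing the top predecessors of, say, "lord" across centuries is what lets us argue whether a significant word like "thy" was replaced by "your".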
Predictive Tasks
3 Classification of Plays Based on 1- and 2-Gram Counts
In general, Shakespeare's plays can be categorized into three groups: tragedies, comedies, and histories. This categorization goes back to the First Folio of 1623. We want to analyze whether the plays can be assigned to these three groups by looking at the appearance counts of their uni- and bigrams. We will label the plays of our Shakespeare training sub-database based on this page's categorization: https://en.wikipedia.org/wiki/Shakespeare's_plays. Then, for each category, we will extract all uni- and bigrams of the plays in that category, clustering our training uni- and bigrams by play category. The result will be three databases in the format of the Shakespeare uni- and bigram databases, built from the training subset only. We will then find an algorithm that categorizes the plays in our test dataset according to the appearance counts of their uni- and bigrams. The goal is to write this algorithm so that, given an extended training set, it could categorize any play into one of those categories.
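The algorithm itself is still open; one candidate we might try is a nearest-profile classifier that compares a play's n-gram counts against the per-category count profiles by cosine similarity. The counts below are made-up illustrations, not real Shakespeare figures:

```python
from collections import Counter
import math

# Toy training profiles: n-gram counts per category (on the real data these
# would come from the three per-category Shakespeare n-gram databases).
training = {
    "tragedy": Counter({"death": 50, "blood": 30, "love": 10}),
    "comedy":  Counter({"love": 60, "jest": 25, "death": 5}),
    "history": Counter({"king": 70, "england": 40, "death": 15}),
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(play_counts):
    """Assign the category whose n-gram profile is most similar."""
    return max(training, key=lambda cat: cosine(play_counts, training[cat]))

print(classify(Counter({"king": 12, "england": 7})))  # → history
```

Cosine similarity is attractive here because it compares the shape of the count distributions, so a short play is not penalized simply for having smaller absolute counts.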
4 Prediction of Predecessors and Successors of Words
In this task we want to predict the predecessor and the successor of a word with a fixed certainty c. To do this, we will need to establish a probability model P such that P('word') = ('predecessor', 'successor'), meaning that 'word' is preceded by 'predecessor' and followed by 'successor' with probability at least c.
Possibly we can combine this task with task 3) and predict the predecessor and successor of a unigram given the Shakespeare play in which it appears.
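A minimal sketch of such a model, assuming the conditional probabilities are estimated from bigram counts (the toy counts and the function name `predict` are illustrations, not the final design):

```python
from collections import Counter

# Toy bigram counts (first word, second word) -> count; on the real data
# these come from the Google British English 2-gram Version 2 database.
bigram_counts = {
    ("thy", "lord"): 40, ("my", "lord"): 90,
    ("lord", "hath"): 30, ("lord", "is"): 10,
}

def predict(word, min_certainty=0.5):
    """Most likely predecessor and successor of `word`, each returned only
    if its conditional probability reaches the required certainty c."""
    preds = Counter({w1: c for (w1, w2), c in bigram_counts.items() if w2 == word})
    succs = Counter({w2: c for (w1, w2), c in bigram_counts.items() if w1 == word})
    result = []
    for counts in (preds, succs):
        total = sum(counts.values())
        best, n = counts.most_common(1)[0] if counts else (None, 0)
        result.append(best if total and n / total >= min_certainty else None)
    return tuple(result)

print(predict("lord"))  # → ('my', 'hath')
```

Returning `None` when no candidate reaches the threshold makes the fixed certainty c explicit: a prediction is only made when the model is confident enough.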
General Tasks
Python Scripts:
- Script to extract uni- and bigrams from the Shakespeare play database. The result should be two CSV files, one containing all unigrams from Shakespeare's plays, the other all bigrams. These will be called the Shakespeare unigram and the Shakespeare bigram databases.
- Script to detect stop words. The stop word list will include words that appear more than a certain number of times, for both datasets. This threshold will differ between the Shakespeare set and the Google set and will be adjusted to the sizes of the datasets. The list will be extended by the outliers mentioned above.
- Scripts for Visualization: Wordclouds to find significant Shakespeare words, Graphs for evolution of diction and replacement of words.
- Need to find: Visualization for task 4).
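The extraction script from the first bullet can be sketched as follows. The tokenisation (lowercase alphabetic tokens only) is a simplifying assumption; the real script will need to handle the play texts' formatting:

```python
import csv
import re
from collections import Counter

def extract_ngrams(text):
    """Count unigrams and bigrams in a play text.
    Tokenisation is deliberately simple: lowercase alphabetic tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def write_counts(counts, path):
    """Write an n-gram Counter to a CSV file as (ngram, count) rows,
    producing the Shakespeare uni-/bigram database files."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for ngram, count in counts.most_common():
            key = " ".join(ngram) if isinstance(ngram, tuple) else ngram
            writer.writerow([key, count])

uni, bi = extract_ngrams("To be or not to be")
print(uni["to"], bi[("to", "be")])  # → 2 2
```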
This Week's Achievements
The rest of this week was spent working on the descriptive tasks, i.e., Shakespeare's diction over time.
Python Scripts:
* Script to extract all uni- and bigrams from the Shakespeare database
First, the full text of Shakespeare's plays is transformed into unigrams to make it usable for analysis. We now have the unigrams and their counts in CSV format. But this data is not clean and contains many pronouns and other common words. With the help of the NLTK stop word list we have removed the common words that are not of interest to us.
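The stop word removal step amounts to filtering the unigram counts against NLTK's list. The real script uses `nltk.corpus.stopwords.words("english")`; a small hard-coded stand-in keeps this sketch self-contained:

```python
from collections import Counter

# Stand-in for nltk.corpus.stopwords.words("english") so the sketch
# runs without NLTK installed.
STOPWORDS = {"i", "you", "of", "the", "and", "to", "a", "is"}

def clean_unigrams(unigrams):
    """Drop stop words from a unigram Counter."""
    return Counter({w: c for w, c in unigrams.items() if w not in STOPWORDS})

raw = Counter({"the": 120, "thou": 44, "of": 90, "love": 19})
print(clean_unigrams(raw))  # keeps only "thou" and "love"
```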
* Script to search the Shakespeare uni- and bigram databases as well as Google's British English uni- and bigram Version 2 databases for given uni- or bigrams.
* Searching for the most common words in Shakespeare's dataset after removing stop words
Having removed the stop words, we now want to analyze the top ten most common words that Shakespeare used in his playwriting. We used the Pandas library to analyze the data frame; the following are the most commonly used words.
| Word  | Frequency |
|-------|-----------|
| thou  | 4431      |
| thy   | 3390      |
| shall | 3078      |
| thee  | 3013      |
| good  | 2221      |
| lord  | 2097      |
| would | 2003      |
| enter | 1927      |
| man   | 1926      |
| love  | 1892      |
* Searching for the most common Shakespeare words in the Google unigrams
After getting the most common words from the Shakespeare dataset, we want to analyze how their usage has changed. The analysis will help us understand how word usage increased or decreased over time. For this purpose we need Google's N-gram Version 2 British English dataset. As William Shakespeare was British, it makes sense to search the British English Google N-grams. The most frequent words of William Shakespeare are thou, thy, shall, thee, good, lord, would, enter, man, love. We searched for these 10 words in the corresponding letter files of the British English Version 2 database. The following are our findings (one plot per word):
- Thou in Google's British English Version 2 "T" UniGram
- Thy in Google's British English Version 2 "T" UniGram
- Thee in Google's British English Version 2 "T" UniGram
- Shall in Google's British English Version 2 "S" UniGram
- Good in Google's British English Version 2 "G" UniGram
- Lord in Google's British English Version 2 "L" UniGram
- Would in Google's British English Version 2 "W" UniGram
- Enter in Google's British English Version 2 "E" UniGram
- Man in Google's British English Version 2 "M" UniGram
- Love in Google's British English Version 2 "L" UniGram
* Problems in Shakespeare Plays dataset
The Shakespeare dataset contains the text of all plays, including words of no interest to us. These words are usually prepositions, pronouns and other words with little meaning, for example I, you, of, etc. The Shakespeare dataset was cleaned using the list of common English stop words available in the Natural Language Toolkit. Another cleaning step removed the names of characters, places and divisions, which are usually written in capital letters, for example ACT, SCENE, ROMEO, etc. However, some stop words specific to this dataset still remain, for example Enter (a stage direction for the actors) or Page (page number). Also, characters with no name, such as Nurse or Servant, are not identifiable from the text, yet they appear quite often in dialogues. We tried cleaning with a threshold on the word frequency in the dataset. However, that way some non-stop-words are also deleted, such as thou (archaic "you"). Another technique could be to extend the stop word list manually, but that requires a lot of time.
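The problem with the frequency-threshold approach can be seen in a few lines. The threshold value and counts below are illustrative only:

```python
from collections import Counter

# Frequency-threshold cleaning as tried above: every word appearing more
# often than the threshold is treated as a stop word. The toy example
# shows the problem: "thou" (archaic "you") is deleted along with the
# dataset-specific stop word "enter".
unigrams = Counter({"enter": 1927, "thou": 4431, "banish'd": 19})
THRESHOLD = 1000

cleaned = Counter({w: c for w, c in unigrams.items() if c <= THRESHOLD})
print(list(cleaned))  # → ["banish'd"]
```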
Another issue is finding "significant" words, for example words that were invented by Shakespeare or words that became obsolete. The most common words used by Shakespeare were common in everyday use in his time and still are now, for example love, man, good, etc. These words are not specific to Shakespeare and do not represent his uniqueness. The team is in the process of developing another measure to find the words "significant" to Shakespeare.
Databases:
- Shakespeare uni- and bigram database, built by the above-mentioned python script.
This Week's Presentation slides:
http://prezi.com/bkssq4aerjm8/?utm_campaign=share&utm_medium=copy