Week 6 W48 49 14.12 21.12 Google's N Grams - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
### Summary:
- Start Download and Clean on server: The DownloadAndClean Python script has now been started on the server. We expect it to finish downloading and cleaning the Uni- and Bigrams after Christmas.
- Change of Shakespeare Uni- and Bigram CSV file: To process Shakespeare's grams more easily, we decided to change the structure of the CSV files. The Python script that parses the Shakespeare CSV file has been changed accordingly.
- Enhancement of partition of Shakespeare Grams: As a first step, we will partition the dialogues of the original Shakespeare play CSV by scene and count the grams per scene.
The Python script to download and clean the NGram tables has now been started on the server. Methods were also added to the script to determine how many lines are cleaned out. After each table it prints the percentage of lines that were cleaned out, and at the very end it averages these percentages.
The script runs in two screens and is currently downloading the unigrams and the bigrams from Google's database. The team hopes that the download will be finished by next week; further estimates of how long the download will take in total cannot be given at this point.
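The per-table percentage and the final average could be computed along the following lines. This is a minimal sketch, not the actual DownloadAndClean code: the function names and the `(table_name, lines_before, lines_after)` input format are assumptions made for illustration.

```python
def cleaned_percentage(lines_before, lines_after):
    """Percentage of lines removed while cleaning one table."""
    if lines_before == 0:
        return 0.0
    return 100.0 * (lines_before - lines_after) / lines_before


def summarize_cleaning(table_counts):
    """Print the cleaned-out percentage per table and return their mean.

    table_counts: list of (table_name, lines_before, lines_after) tuples
    (a hypothetical format chosen for this sketch).
    """
    percentages = []
    for name, before, after in table_counts:
        pct = cleaned_percentage(before, after)
        print(f"{name}: {pct:.2f}% of lines cleaned out")
        percentages.append(pct)
    # Unweighted mean of the per-table percentages.
    return sum(percentages) / len(percentages) if percentages else 0.0
```

Note that averaging the percentages weights every table equally, so the result can differ from the overall share of removed lines when the tables differ a lot in size.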
The CSV files containing the Unigrams and Bigrams had already been created. However, to categorize them and to use them further in our descriptive and predictive tasks, we came up with a better structure. The Python script now parses the Shakespeare play database and writes the unigrams and their information to a table structured as follows:
GRAM TAB TOTAL_COUNT TAB PLAY_COUNT TAB HISTORY TAB COMEDY TAB TRAGEDY TAB PLAY1_NAME TAB PLAY2_NAME ...
Meaning:
- Gram: Uni- or Bigram (Bigrams take two columns)
- Total_count: Number of times Gram appears in the Shakespeare database
- Play_count: Number of plays the Gram appears in
- History: Number of times the Gram appears in History plays
- Comedy: Number of times the Gram appears in Comedy plays
- Tragedy: Number of times the Gram appears in Tragedy plays
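- Play1_Name: Column named after the first play, e.g. "Henry IV"; holds the number of times the Gram appears in Henry IV
- Play2_Name: Column named after the second play, e.g. "Henry VI, Part1"; holds the number of times the Gram appears in Henry VI, Part1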
- ...
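A table with the columns listed above could be built roughly as follows. This is only a sketch under stated assumptions: the input format (a list of `(play_name, genre, text)` tuples with unique play names), the simple regex tokenizer, and the function names are all hypothetical and not taken from the team's script.

```python
import re
from collections import Counter, defaultdict

GENRES = ("History", "Comedy", "Tragedy")


def tokenize(text):
    """Very simple tokenizer (an assumption): lowercase words and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_unigram_rows(plays):
    """plays: list of (play_name, genre, text). Returns (header, rows)."""
    play_names = [name for name, _, _ in plays]
    per_play = defaultdict(Counter)   # gram -> {play_name: count}
    per_genre = defaultdict(Counter)  # gram -> {genre: count}
    for name, genre, text in plays:
        for gram in tokenize(text):
            per_play[gram][name] += 1
            per_genre[gram][genre] += 1

    header = ["GRAM", "TOTAL_COUNT", "PLAY_COUNT",
              "HISTORY", "COMEDY", "TRAGEDY"] + play_names
    rows = []
    for gram in sorted(per_play):
        counts = per_play[gram]
        row = [gram, sum(counts.values()), len(counts)]
        row += [per_genre[gram][g] for g in GENRES]
        row += [counts[name] for name in play_names]  # 0 if gram is absent
        rows.append(row)
    return header, rows


def to_tsv(header, rows):
    """Join header and rows into tab-separated text."""
    lines = ["\t".join(header)]
    lines += ["\t".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)
```

For bigrams, the same approach would apply with two gram columns instead of one.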
For the categorization algorithm, we discussed clustering the Grams into finer subsets than just plays. We decided to also split them by scene, since the Shakespeare_play dataset contains the keyword SCENE <> whenever a new scene starts. The new CSV structure explained above may therefore need to be extended to list scenes as well. However, the team is still thinking about a way to avoid very high column counts.
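Splitting by scene and counting grams per scene might look like the sketch below. It assumes the dialogue is available as a list of text lines and that a scene starts at any line beginning with `SCENE`; the exact marker format in the dataset, the tokenizer, and the function name are assumptions.

```python
import re
from collections import Counter


def count_grams_per_scene(lines, n=1):
    """Split lines into scenes at 'SCENE' markers and count n-grams per scene.

    Returns a list of Counters, one per scene, in order of appearance.
    """
    scenes = []
    current = None
    for line in lines:
        if line.strip().startswith("SCENE"):
            # A new scene starts; the marker line itself is not counted.
            current = []
            scenes.append(current)
            continue
        if current is None:
            continue  # text before the first scene marker is skipped
        current.extend(re.findall(r"[a-z']+", line.lower()))

    counters = []
    for tokens in scenes:
        # n-grams are built over the whole scene, so they may span line breaks.
        grams = zip(*(tokens[i:] for i in range(n)))
        counters.append(Counter(" ".join(g) for g in grams))
    return counters
```

One possible way to keep the column count fixed, offered here only as an option to consider, would be to store one row per (play, scene, gram) combination instead of one column per scene.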