Progress report 3 - ShelinaGobardhan/FOTD-Portfolio- GitHub Wiki

Making frequency lists of the words in the data sets was easy. While scanning through the words we already came up with some nice ideas for comparing words between the book and the movie script.

The first idea was to compare the characters. Obviously Harry occurs most often in both the book and the script, but what about the other characters? Do Ron, Hermione, Draco, Voldemort, Sirius, etc. occur as many times in the book as in the movie script? And what does this tell us about the screen time of the actors?
The second idea was to look at the data from a gender point of view. What is the balance between he/she, men/women and, more interestingly, wizard/witch?
The last idea was to compare the locations, for example Hogsmeade, the Shrieking Shack and the Whomping Willow. What does this tell us about the balance between the locations in the book and the scene locations in the movie?

Creating the script

But then the more difficult part, the comparison itself, started. How do we create a script that runs over both data sets at the same time, shows them next to each other, and includes only the words we want to compare? After a couple of hours of struggling we managed.

With this command we created a side-by-side comparison of the frequency lists of both data sets:

sdiff azkabanbook-wordfreq.txt azkabanscript-wordfreq.txt > wordfreqdif.txt

cat wordfreqdif.txt

Zooming in on the data, however, the focus in this case lies on finding out whether the appearance of characters in the book matches their appearance on screen. For example, if the book focuses most on Harry, then on Ron, and least on Hagrid, is the same true of the movie? Or does perhaps Dumbledore get the least screen time? To answer this question, the occurrences of the character names in both the book and the movie script will be compared.

With the current data sets the following command was used to test whether the desired output was given:

grep -E 'harry|hogwarts|voldemort' azkabanbook-wordfreq.txt azkabanscript-wordfreq.txt

azkabanbook-wordfreq.txt: 9 harry
azkabanbook-wordfreq.txt: 2 harrys
azkabanbook-wordfreq.txt: 1 boggartvoldemort
azkabanbook-wordfreq.txt: 1852 harry
azkabanbook-wordfreq.txt: 178 harrys
azkabanbook-wordfreq.txt: 92 hogwarts
azkabanbook-wordfreq.txt: 37 voldemort
azkabanbook-wordfreq.txt: 10 voldemorts
azkabanscript-wordfreq.txt: 690 harry
azkabanscript-wordfreq.txt: 2 harryhermioneron
azkabanscript-wordfreq.txt: 81 harrys
azkabanscript-wordfreq.txt: 32 hogwarts
azkabanscript-wordfreq.txt: 1 skyhogwarts
azkabanscript-wordfreq.txt: 6 voldemort
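
The listing above also contains compound tokens such as boggartvoldemort, harryhermioneron and skyhogwarts, because grep matches the pattern anywhere in a line. As a rough sketch, assuming a grep that supports the -w (whole word) option, as GNU and BSD grep do, the match could be limited to complete words. The two file names below are hypothetical stand-ins filled with a few counts copied from the listing, not the real data sets.

```shell
# Small stand-in frequency lists in the same "count word" layout (hypothetical file names).
tmpdir=$(mktemp -d)
printf '%s\n' '1852 harry' '178 harrys' '1 boggartvoldemort' '37 voldemort' '92 hogwarts' > "$tmpdir/book-wordfreq.txt"
printf '%s\n' '690 harry' '2 harryhermioneron' '1 skyhogwarts' '32 hogwarts' '6 voldemort' > "$tmpdir/script-wordfreq.txt"

# -w keeps a line only if the match is a whole word,
# so "harrys" and "boggartvoldemort" are no longer reported.
whole_words=$(cd "$tmpdir" && grep -wE 'harry|hogwarts|voldemort' book-wordfreq.txt script-wordfreq.txt)
printf '%s\n' "$whole_words"
```

Whether possessives such as harrys should be counted together with harry is a separate decision; this sketch simply leaves them out.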

The script did not give the desired output, so another script was used with the original data sets:

cat azkabanbook.txt | tr -sc '[A-Z][a-z]' '[\012*]' | sort | uniq -c | awk '{print $2 "\t" $1}' > azkabanbook.word_freq.csv

cat azkabanscript.txt | tr -sc '[A-Z][a-z]' '[\012*]' | sort | uniq -c | awk '{print $2 "\t" $1}' > azkabanscript.word_freq.csv

cat azkabanscript.word_freq.csv azkabanbook.word_freq.csv | grep -E 'Hogwarts|Harry|Voldemort' | column -t

Harry 543
Hogwarts 13
Voldemort 6
Harry 2034
Hogwarts 92
Voldemort 48
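
In this output the first three lines come from the script and the last three from the book, since the script file is passed to cat first. To get the two counts next to each other on one line per word, something like the following awk sketch might work. It assumes the tab-separated "word, count" layout produced above; the temporary file names and the `words` variable are made up for illustration, and the sample values are the six counts shown above.

```shell
# Stand-in files in the tab-separated "word<TAB>count" layout (hypothetical names).
tmpdir=$(mktemp -d)
printf 'Harry\t2034\nHogwarts\t92\nVoldemort\t48\n' > "$tmpdir/book.word_freq.csv"
printf 'Harry\t543\nHogwarts\t13\nVoldemort\t6\n' > "$tmpdir/script.word_freq.csv"

side_by_side=$(awk -F'\t' -v words='Harry Hogwarts Voldemort' '
  # Build a lookup of the words we want to compare.
  BEGIN { n = split(words, list, " "); for (i = 1; i <= n; i++) want[list[i]] = 1 }
  # FNR restarts at 1 for every input file, so this numbers the files 1, 2.
  FNR == 1 { file++ }
  # Only exact matches on the word column are stored.
  ($1 in want) { count[$1, file] = $2 }
  END {
    print "word\tbook\tscript"
    for (i = 1; i <= n; i++) {
      w = list[i]
      # Adding 0 turns a missing entry into 0.
      print w "\t" (count[w, 1] + 0) "\t" (count[w, 2] + 0)
    }
  }' "$tmpdir/book.word_freq.csv" "$tmpdir/script.word_freq.csv")
printf '%s\n' "$side_by_side"
```

Because the lookup is on the exact word, this also avoids the partial matches that a plain grep pattern allows.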

It worked! Now we can also use the script to run the other comparisons.
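
As an illustration of how one of the other comparisons could be run, here is a hedged sketch for the gender idea (he/she, wizard/witch). It is not the script we used: it lowercases the text first so that "Wizard" and "wizard" are counted together, and it uses grep -x to keep only lines that equal one of the words exactly. The sample sentence and file name are invented; in practice the input would be azkabanbook.txt or azkabanscript.txt.

```shell
# Invented one-line sample standing in for the real text file.
tmpdir=$(mktemp -d)
printf '%s\n' 'He said the witch and the wizard left. She saw the Wizard.' > "$tmpdir/sample.txt"

# One word per line, lowercased, filtered to the exact words of interest, then counted.
gender_counts=$(tr -sc 'A-Za-z' '\n' < "$tmpdir/sample.txt" \
  | tr 'A-Z' 'a-z' \
  | grep -xE 'he|she|wizard|witch' \
  | sort | uniq -c | awk '{print $2 "\t" $1}')
printf '%s\n' "$gender_counts"
```

The same pipeline could be pointed at character or location names by changing the pattern, although multi-word names such as Shrieking Shack would need a different approach because the text is split into single words first.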