Lesson 02 ETL Tutorial 01 Amazon Movie Reviews - adparker/GADSLA_1403 GitHub Wiki
Let's go over a simple ETL session with Amazon Movie Review Data.
Get the data:
andrews-mbp:data_science andrew$ curl 'https://raw.github.com/adparker/GADSLA_1403/master/src/lesson02/movies.small.txt' > movies.small.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 6383k 100 6383k 0 0 1437k 0 0:00:04 0:00:04 --:--:-- 1437k
Take a look at the data. Here's a way to look at the first 20 lines, with each line truncated to 60 characters:
andrews-mbp:data_science andrew$ head -n 20 movies.small.txt | cut -c -60
product/productId: B003AI2VGA
review/userId: A141HP4LYPWMSR
review/profileName: Brian E. Erland "Rainbow Sphinx"
review/helpfulness: 7/7
review/score: 3.0
review/time: 1182729600
review/summary: "There Is So Much Darkness Now ~ Come For Th
review/text: Synopsis: On the daily trek from Juarez, Mexico
product/productId: B003AI2VGA
review/userId: A328S9RN3U5M68
review/profileName: Grady Harp
review/helpfulness: 4/4
review/score: 3.0
review/time: 1181952000
review/summary: Worthwhile and Important Story Hampered by P
review/text: THE VIRGIN OF JUAREZ is based on true events su
product/productId: B003AI2VGA
review/userId: A1I7QGUDP043DG
andrews-mbp:data_science andrew$
Can you describe in words the data we're looking at? How is it formatted?
It's a list of movie reviews. Each review consists of a number of attributes, each attribute per line. Importantly, the review/text
attribute is a single line. Thank goodness! This should be easy to parse.
This is a really quick sanity check to make sure we understand the data format in Python. We will go over this again step-by-step. But for now, let's take a look at this:
# Pretty printer
import pprint
list_of_dicts = []
dict_of_data = {}
# Opens a handle to the file, but does not read anything yet.
fh = open('movies.small.txt')
# This is more memory efficient.
# Reads one line at a time.
for line in fh:
# Strip off leading and trailing white-space.
line = line.strip()
if line != '':
# Assume it's safe to split:
label, data = line.split(': ', 1)
dict_of_data[label] = data
else:
# We're at the end of a record. Everything we want is in the
# dictionary.
list_of_dicts.append(dict_of_data)
# Look at the first three items.
pprint.pprint(list_of_dicts[0:3])
Open a file-handle. By default, it's in "read only mode".
>>> fh = open('movies.small.txt')
Read in the entire file as a list of lines. Only do this if you know you have enough memory on your computer. We're doing it this way to make the rest of the tutorial easier.
>>> lines = fh.readlines()
In production, you want to make sure to close any file handles you open. If you're opening lots of files in a long-running process, not closing file handles can slowly leak memory. :(
# Not super necessary in this case, but in general you should close things you open. :)
>>> fh.close()
Take a look at the first line:
>>> lines[0]
'product/productId: B003AI2VGA\n'
How about the first 9 lines, from item 0 until but not including item 10?
>>> lines[0:10]
['product/productId: B003AI2VGA\n', 'review/userId: A141HP4LYPWMSR\n', 'review/profileName: Brian E. Erland "Rainbow Sphinx"\n', 'review/helpfulness: 7/7\n', 'review/score: 3.0\n', 'review/time: 1182729600\n', 'review/summary: "There Is So Much Darkness Now ~ Come For The Miracle"\n', 'review/text: Synopsis: On the daily trek from Juarez, Mexico to El Paso, Texas an ever increasing number of female workers are found raped and murdered in the surrounding desert. Investigative reporter Karina Danes (Minnie Driver) arrives from Los Angeles to pursue the story and angers both the local police and the factory owners who employee the undocumented aliens with her pointed questions and relentless quest for the truth.<br /><br />Her story goes nationwide when a young girl named Mariela (Ana Claudia Talancon) survives a vicious attack and walks out of the desert crediting the Blessed Virgin for her rescue. Her story is further enhanced when the "Wounds of Christ" (stigmata) appear in her palms. She also claims to have received a message of hope for the Virgin Mary and soon a fanatical movement forms around her to fight against the evil that holds such a stranglehold on the area.<br /><br />Critique: Possessing a lifelong fascination with such esoteric matters as Catholic mysticism, miracles and the mysterious appearance of the stigmata, I was immediately attracted to the \'05 DVD release `Virgin of Juarez\'. The film offers a rather unique storyline blending current socio-political concerns, the constant flow of Mexican migrant workers back and forth across the U.S./Mexican border and the traditional Catholic beliefs of the Hispanic population. I must say I was quite surprised by the unexpected route taken by the plot and the means and methods by which the heavenly message unfolds.<br /><br />`Virgin of Juarez\' is not a film that you would care to watch over and over again, but it was interesting enough to merit at least one viewing. Minnie Driver delivers a solid performance and Ana Claudia Talancon is perfect as the fragile and innocent visionary Mariela. Also starring Esai Morales and Angus Macfadyen (Braveheart).\n', '\n', 'product/productId: B003AI2VGA\n']
This is kind of ugly. Let's pretty print it.
>>> pprint.pprint(lines[0:10])
['product/productId: B003AI2VGA\n',
'review/userId: A141HP4LYPWMSR\n',
'review/profileName: Brian E. Erland "Rainbow Sphinx"\n',
'review/helpfulness: 7/7\n',
'review/score: 3.0\n',
'review/time: 1182729600\n',
'review/summary: "There Is So Much Darkness Now ~ Come For The Miracle"\n',
'review/text: Synopsis: On the daily trek from Juarez, Mexico to El Paso, Texas an ever increasing number of female workers are found raped and murdered in the surrounding desert. Investigative reporter Karina Danes (Minnie Driver) arrives from Los Angeles to pursue the story and angers both the local police and the factory owners who employee the undocumented aliens with her pointed questions and relentless quest for the truth.<br /><br />Her story goes nationwide when a young girl named Mariela (Ana Claudia Talancon) survives a vicious attack and walks out of the desert crediting the Blessed Virgin for her rescue. Her story is further enhanced when the "Wounds of Christ" (stigmata) appear in her palms. She also claims to have received a message of hope for the Virgin Mary and soon a fanatical movement forms around her to fight against the evil that holds such a stranglehold on the area.<br /><br />Critique: Possessing a lifelong fascination with such esoteric matters as Catholic mysticism, miracles and the mysterious appearance of the stigmata, I was immediately attracted to the \'05 DVD release `Virgin of Juarez\'. The film offers a rather unique storyline blending current socio-political concerns, the constant flow of Mexican migrant workers back and forth across the U.S./Mexican border and the traditional Catholic beliefs of the Hispanic population. I must say I was quite surprised by the unexpected route taken by the plot and the means and methods by which the heavenly message unfolds.<br /><br />`Virgin of Juarez\' is not a film that you would care to watch over and over again, but it was interesting enough to merit at least one viewing. Minnie Driver delivers a solid performance and Ana Claudia Talancon is perfect as the fragile and innocent visionary Mariela. Also starring Esai Morales and Angus Macfadyen (Braveheart).\n',
'\n',
'product/productId: B003AI2VGA\n']
Better! Notice that each line ends in a new line, one of the lines is simply '\n'. Let's strip all the lines of leading and trailing white-space.
>>> lines[0]
'product/productId: B003AI2VGA\n'
>>> lines[0].strip()
'product/productId: B003AI2VGA'
What happens to the blank lines?
>>> lines[8]
'\n'
>>> lines[8].strip()
'' # This is a string that is zero length. The Empty String.
Use a list comprehension to strip all the lines!
# Create a cleaned version.
>>> cleanlines = [ line.strip() for line in lines ]
More verbosely, and slowly we could have done this:
cleanlines = []
for line in lines:
cleaned = line.strip()
cleanlines.append(cleaned)
Take a look:
>>> cleanlines[0]
'product/productId: B003AI2VGA'
>>> cleanlines[8]
''
Now group them into a list of lists!
list_of_lists = []
group = []
for line in cleanlines:
if line != '':
# Add to the current group
group.append(line)
else:
# Starting a new group.
# Append the group to the grouped lines and then
list_of_lists.append(group)
# Point 'group' to a newly created list.
# This does not affect the list we just appended to groupedlines
group = []
Look at the first group in the list:
>>> from pprint import pprint ## Saves some typing later
>>> pprint(list_of_lists[100])
['product/productId: B000063W1R',
'review/userId: A2CU4L3OK00M23',
'review/profileName: C. Reich',
'review/helpfulness: 1/1',
'review/score: 5.0',
'review/time: 1189036800',
'review/summary: A classic movie',
'review/text: There is nothing bad to say about this movie. The story, the acting, everything is well done. I bought it to watch over and over.']
Now let's get a list_of_dictionaries where the key is the label. Here's an example of how we can extract a key/value pair from each line:
>>> line = list_of_lists[0][3]
>>> line
'review/helpfulness: 7/7'
>>> line.split(': ', 1)
['review/helpfulness', '7/7']
So let's make a list of dictionaries where each dictionary has an item that corresponds to:
['review/helpfulness', '7/7']
Which we turn into a dictionary item like this:
[ { ... }, { ... }, # bunch of dictionaries
{ ... 'review/helpfulness', '7/7', ... }, # a specific item in a dictionary
{ ... } ]
Here we go:
list_of_dicts = []
for group in list_of_lists:
# Create a new dict for each group.
group_dict = {}
for line in group:
# Split on ': ' once, giving me two pieces.
longkey, value = line.split(': ', 1)
# Get the second part of product/productId
shortkey = longkey.split('/')[1]
group_dict[shortkey] = value
# We finished the group. Save it into the list_of_dicts.
list_of_dicts.append(group_dict)
Everything go OK? Quick check.
>>> len(list_of_dicts)
5554
>>> len(list_of_lists)
5554
>>> list_of_dicts[0]
>>> pprint(list_of_dicts[0])
{'helpfulness': '7/7',
'productId': 'B003AI2VGA',
'profileName': 'Brian E. Erland "Rainbow Sphinx"',
'score': '3.0',
'summary': '"There Is So Much Darkness Now ~ Come For The Miracle"',
'text': 'Synopsis: On the daily trek from Juarez, Mexico to El Paso, Texas an ever increasing number of female workers are found raped and murdered in the surrounding desert. Investigative reporter Karina Danes (Minnie Driver) arrives from Los Angeles to pursue the story and angers both the local police and the factory owners who employee the undocumented aliens with her pointed questions and relentless quest for the truth.<br /><br />Her story goes nationwide when a young girl named Mariela (Ana Claudia Talancon) survives a vicious attack and walks out of the desert crediting the Blessed Virgin for her rescue. Her story is further enhanced when the "Wounds of Christ" (stigmata) appear in her palms. She also claims to have received a message of hope for the Virgin Mary and soon a fanatical movement forms around her to fight against the evil that holds such a stranglehold on the area.<br /><br />Critique: Possessing a lifelong fascination with such esoteric matters as Catholic mysticism, miracles and the mysterious appearance of the stigmata, I was immediately attracted to the \'05 DVD release `Virgin of Juarez\'. The film offers a rather unique storyline blending current socio-political concerns, the constant flow of Mexican migrant workers back and forth across the U.S./Mexican border and the traditional Catholic beliefs of the Hispanic population. I must say I was quite surprised by the unexpected route taken by the plot and the means and methods by which the heavenly message unfolds.<br /><br />`Virgin of Juarez\' is not a film that you would care to watch over and over again, but it was interesting enough to merit at least one viewing. Minnie Driver delivers a solid performance and Ana Claudia Talancon is perfect as the fragile and innocent visionary Mariela. Also starring Esai Morales and Angus Macfadyen (Braveheart).',
'time': '1182729600',
'userId': 'A141HP4LYPWMSR'}
Are we happy with this?
It would be nice if the review/score were a float type. Also, maybe we should split out the review/helpfulness value into a pair. Finally, we could convert 'review/time' into something readable.
# Modify in place.
import datetime
for review in list_of_dicts:
# Convert to float
review['score'] = float(review['score'])
# Convert to readable time
int_timestamp = int(review['time'])
datetime_obj = datetime.datetime.fromtimestamp(int_timestamp)
str_iso_datetime = datetime_obj.isoformat()
review['time'] = str_iso_datetime
# Modify in place some more!!
# Convert helpfulness to a float.
for review in list_of_dicts:
helpful_str, numberof_str = review['helpfulness'].split('/')
helpful_int = int(helpful_str)
numberof_int = int(numberof_str)
if numberof_int == 0: # Why is this needed?
numberof_int = 1.0
review['helpfulness'] = float(helpful_int)/numberof_int
review['helpful'] = helpful_int
review['total'] = numberof_int
Check out our change
>>> pprint(list_of_dicts[501])
{'helpful': 1,
'helpfulness': 0.5,
'productId': '0790747324',
'profileName': 'Third Grade Teacher',
'score': 5.0,
'summary': 'Travel in Time with HG Wells',
'text': "Several 3rd grade classes in my school viewed this movie (rated G)after reading a version of the book. We'd read it chapter by chapter and reviewed with EOG-style questions. Wow-talk about a page-turner! The children could not wait to read the next chapter just as soon as we finished the last. Watching the movie was a fun way to see the images we'd created in our minds come to life. This was the original version (I think the best) and because it is rated G, I felt very comfortable showing it to my students.",
'time': '2009-05-24T17:00:00',
'total': 2,
'userId': 'A2QP12447GKEDU'}
This is good enough. Now let's output as TSV file.
writer = csv.DictWriter(open('small.tsv', 'w'), # file to write to
['productId', 'helpful', 'score', 'profileName'], # columns to output
delimiter='\t', # delimiter (separator)
extrasaction='ignore') # ignore columns in the dict if the are skipped
Write the header and print out each row.
writer.writeheader()
for review in list_of_dicts:
writer.writerow(review)
Now take a look at the output on the command line!
andrews-mbp:data_science andrew$ head -n 10 small.tsv
productId helpful score profileName
B003AI2VGA 7 3.0 "Brian E. Erland ""Rainbow Sphinx"""
B003AI2VGA 4 3.0 Grady Harp
B003AI2VGA 8 5.0 "Chrissy K. McVay ""Writer"""
B003AI2VGA 1 3.0 golgotha.gov
B003AI2VGA 1 3.0 "KerrLines """Movies,Music,Theatre""""
B003AI2VGA 0 2.0 "abra ""a devoted reader"""
B003AI2VGA 3 1.0 Charles R. Williams
B00006HAXW 64 5.0 Anthony Accordino
B00006HAXW 26 5.0 Joseph P. Aiello
Congratulations! We're done! There are lots of tools that can now read in this TSV file!
Actually, this file is large. Let's use the first 100 lines and try that instead.
head -n 100 small.tsv > tiny.tsv