Programming for Everybody: Assignment 10.2 Tuples and Sorting - edorlando07/datasciencecoursera GitHub Wiki

###Python Data Structures

10.2 Write a program to read through the mbox-short.txt and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon.

From [email protected] Sat Jan 5 09:14:16 2008

Once you have accumulated the counts for each hour, print out the counts, sorted by hour as shown below.
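The two-step split described above can be tried on a single line first. This is a minimal sketch in Python 3 syntax; the sender address `user@example.com` is a placeholder chosen for illustration, while the date and time follow the sample line.

```python
# Hypothetical sender address; the date and time match the sample line above.
line = "From user@example.com Sat Jan  5 09:14:16 2008"

fields = line.split()            # split on any run of whitespace
timestamp = fields[5]            # sixth field (index 5): '09:14:16'
hour = timestamp.split(":")[0]   # piece before the first colon: '09'

print(hour)
```

Calling `split()` with no argument treats the double space in `Jan  5` as a single separator, so the time always lands at index 5 for lines in this format.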

A sample section of the text file is listed below:

From [email protected] Sat Jan  5 09:14:16 2008
Return-Path: <[email protected]>
Received: from murder (mail.umich.edu [141.211.14.90])
	     by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
     Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
     by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
     Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
    by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
    Sat, 5 Jan 2008 09:14:15 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
    BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; 
    5 Jan 2008 09:14:10 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
   by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
   Sat,  5 Jan 2008 14:10:05 +0000 (GMT)
Message-ID: <[email protected]>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Received: from prod.collab.uhi.ac.uk ([194.35.219.182])
          by paploo.uhi.ac.uk (JAMES SMTP Server 2.1.3) with SMTP ID 899
          for <[email protected]>;
          Sat, 5 Jan 2008 14:09:50 +0000 (GMT)
impl/impl/src/java/org/sakaiproject/content/impl
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
Content-Type: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Sat Jan  5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

###The actual code starts below:

fname = raw_input("Enter file name: ")
if len(fname) < 1 : fname = "mbox-short.txt"

counts = dict()

fh = open(fname)
for line in fh:

    line = line.rstrip()                     #strips whitespace at the end of each line
    if not line.startswith('From '):continue #skips every line that does not start
                                             #with 'From ' (note the trailing space)
    words = line.split()                     #splits the line into a list of words

###If words were printed now, the output would include the following:

Enter file name: mbox-short.txt
['From', '[email protected]', 'Sat', 'Jan', '5', '09:14:16', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '18:10:48', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '16:10:39', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '15:46:24', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '15:03:18', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '14:50:18', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '11:37:30', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '11:35:08', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '11:12:37', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '11:11:52', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '11:11:03', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '11:10:22', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '10:38:42', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '10:17:43', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '10:04:14', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '09:05:31', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '07:02:32', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '06:08:27', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '04:49:08', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '04:33:44', '2008']
['From', '[email protected]', 'Fri', 'Jan', '4', '04:07:34', '2008']
['From', '[email protected]', 'Thu', 'Jan', '3', '19:51:21', '2008']
['From', '[email protected]', 'Thu', 'Jan', '3', '17:18:23', '2008']
['From', '[email protected]', 'Thu', 'Jan', '3', '17:07:00', '2008']
['From', '[email protected]', 'Thu', 'Jan', '3', '16:34:40', '2008']
['From', '[email protected]', 'Thu', 'Jan', '3', '16:29:07', '2008']
['From', '[email protected]', 'Thu', 'Jan', '3', '16:23:48', '2008']
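The filter step can be checked in isolation. The sketch below, in Python 3 syntax, runs the same `rstrip`, `startswith` and `split` calls over a few hand-written lines. The lines and the address `user@example.com` are hypothetical; they assume that an mbox file also carries a `From:` header line, which the trailing space in `'From '` is meant to exclude.

```python
# Hypothetical lines modelled on an mbox file; the address is a placeholder.
sample_lines = [
    "From user@example.com Sat Jan  5 09:14:16 2008\n",
    "Return-Path: <user@example.com>\n",
    "From: user@example.com\n",
    "X-DSPAM-Processed: Sat Jan  5 09:14:16 2008\n",
]

envelope_fields = []
for line in sample_lines:
    line = line.rstrip()
    # 'From ' with a trailing space matches the envelope line only,
    # not the 'From:' header.
    if not line.startswith("From "):
        continue
    envelope_fields.append(line.split())

print(envelope_fields)
```

Only the first line survives the filter, and it splits into seven items with the time at index 5.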

###The next piece of the code is continued below:

    words = words[5]               #grabs the 6th item in the list (index 5),
                                   #which is the time stamp HH:MM:SS

###If words were printed now, the output would include the following:

Enter file name: mbox-short.txt
09:14:16
18:10:48
16:10:39
15:46:24
15:03:18
14:50:18
11:37:30
11:35:08
11:12:37
11:11:52
11:11:03
11:10:22
10:38:42
10:17:43
10:04:14
09:05:31
07:02:32
06:08:27
04:49:08
04:33:44
04:07:34
19:51:21
17:18:23
17:07:00
16:34:40
16:29:07
16:23:48
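Because list indexes start at 0, `words[5]` is the sixth item, which is easy to miscount. One alternative, shown here only as a sketch and not the approach used in this solution, is to unpack the fields into named variables. The variable names and the address `user@example.com` are assumptions made for illustration, and the code uses Python 3 syntax.

```python
# Hypothetical sender address; the remaining fields follow the sample line.
line = "From user@example.com Sat Jan  5 09:14:16 2008"

# Unpack the seven whitespace-separated fields into named variables.
# This raises ValueError if a line does not have exactly seven fields.
marker, sender, weekday, month, day, timestamp, year = line.split()

# The same unpacking works for the three parts of the time stamp.
hours, minutes, seconds = timestamp.split(":")

print(timestamp, hours)
```

Unpacking trades flexibility for readability: it documents what each position holds, but it fails on any line with a different number of fields, whereas indexing with `[5]` only needs at least six.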

###The next piece of the code is continued below:

    hour = words.split(":")  #splits the HH:MM:SS into
                             #a list of three strings
    hour = hour[0]           #grabs the 1st item, HH
    
    counts[hour] = counts.get(hour,0) + 1  #counts number of times
                                           #each hour is found in the
                                           #data set
                         
for key in sorted(counts):                 #loops over the keys in sorted order
    print key, counts[key]

###The final output is delivered below:

Enter file name: mbox-short.txt
04 3
06 1
07 1
09 2
10 3
11 6
14 1
15 2
16 4
17 2
18 1
19 1
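
The pieces above can be gathered into one function. The following is a sketch in Python 3 syntax rather than the Python 2 syntax used in the walkthrough; the function name `count_hours` and the sample lines with the address `user@example.com` are hypothetical. It sorts the `(hour, count)` tuples returned by `items()` instead of sorting the keys alone.

```python
def count_hours(lines):
    """Return a dict mapping two-digit hour strings to message counts."""
    counts = {}
    for line in lines:
        line = line.rstrip()
        if not line.startswith("From "):
            continue
        timestamp = line.split()[5]        # e.g. '09:14:16'
        hour = timestamp.split(":")[0]     # e.g. '09'
        # get() returns 0 the first time an hour is seen.
        counts[hour] = counts.get(hour, 0) + 1
    return counts


# Hypothetical input lines; any iterable of strings works here.
sample = [
    "From user@example.com Sat Jan  5 09:14:16 2008\n",
    "From: user@example.com\n",
    "From user@example.com Fri Jan  4 18:10:48 2008\n",
    "From user@example.com Fri Jan  4 09:05:31 2008\n",
]

counts = count_hours(sample)

# items() yields (hour, count) tuples; tuples sort by their first element.
for hour, count in sorted(counts.items()):
    print(hour, count)
```

An open file handle can be passed in place of `sample`, since iterating over a file yields its lines. The hours are zero-padded two-digit strings, so sorting them as text gives the same order as sorting them as numbers.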