Programming for Everybody: Assignment 09.4 Dictionaries - edorlando07/datasciencecoursera GitHub Wiki

###Python Data Structures

9.4 Write a program to read through the mbox-short.txt and figure out who has sent the greatest number of mail messages. The program looks for 'From ' lines and takes the second word of those lines as the person who sent the mail. The program creates a Python dictionary that maps the sender's mail address to a count of the number of times they appear in the file. After the dictionary is produced, the program reads through the dictionary using a maximum loop to find the most prolific committer.

A sample section of the text file is listed below:

From [email protected] Sat Jan  5 09:14:16 2008
Return-Path: <[email protected]>
Received: from murder (mail.umich.edu [141.211.14.90])
	     by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
     Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
     by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
     Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
    by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
    Sat, 5 Jan 2008 09:14:15 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
    BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; 
    5 Jan 2008 09:14:10 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
   by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
   Sat,  5 Jan 2008 14:10:05 +0000 (GMT)
Message-ID: <[email protected]>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Received: from prod.collab.uhi.ac.uk ([194.35.219.182])
          by paploo.uhi.ac.uk (JAMES SMTP Server 2.1.3) with SMTP ID 899
          for <[email protected]>;
          Sat, 5 Jan 2008 14:09:50 +0000 (GMT)
impl/impl/src/java/org/sakaiproject/content/impl
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
Content-Type: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Sat Jan  5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772  
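
Only the envelope line that begins with `From ` (note the trailing space) carries the sender as its second word; header lines such as `From:` have to be skipped. A minimal Python 3 sketch of that filter is shown here. The helper name `sender_of` and the address `alice@example.com` are hypothetical placeholders, not values taken from mbox-short.txt.

```python
def sender_of(line):
    """Return the sender address from a 'From ' envelope line, else None."""
    line = line.rstrip()
    # The trailing space matters: it rejects 'From:' header lines.
    if not line.startswith('From '):
        return None
    # Splitting on whitespace makes the address the second word.
    return line.split()[1]


# Hypothetical sample lines, shaped like the ones in the mailbox file
envelope = 'From alice@example.com Sat Jan  5 09:14:16 2008\n'
header = 'From: alice@example.com\n'

print(sender_of(envelope))
print(sender_of(header))
```

The first call yields the address and the second yields `None`, because `'From:'` does not start with `'From '`.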

The actual code is listed below:

fname = raw_input("Enter file name: ")
if len(fname) < 1 : fname = "mbox-short.txt"

FinalList = list()           #Sets up empty list to be used later to concatenate 
                             #all emails into one list
counts = dict()              #Sets up an empty dictionary to be used later to 
                             #count the number of emails from each person

fh = open(fname)
for line in fh:              #starts the loop that reads each line
    line = line.rstrip()     #strips whitespace at the end of each line
    if not line.startswith('From '): continue   #skips every line that is not a 'From ' line
    words = line.split()
    email = words[1]         #grabs the second word in the line, which is the email address
    FinalList.append(email)  #adds the address (a string) to the end of FinalList

If FinalList were printed at this point, it would look like this:

['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',     
'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',   
'[email protected]', '[email protected]', '[email protected]', '[email protected]', 
'[email protected]', '[email protected]', '[email protected]', '[email protected]', 
'[email protected]', '[email protected]', '[email protected]', '[email protected]', 
'[email protected]', '[email protected]', '[email protected]', '[email protected]']  

###A note on counts.get
The dictionary get() method takes an optional second argument: a default that is returned when the key does not exist. So counts.get(name,0) returns 0 if name is not yet in counts, and the stored count otherwise.

for name in FinalList:
    counts[name] = counts.get(name,0) + 1  
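
As a small illustration of the idiom, the Python 3 sketch below tallies a short hypothetical list with `get` and then checks the result against `collections.Counter` from the standard library. `Counter` is a different technique from the one the assignment asks for, but it produces the same tallies. The list `sample`, the dictionary `tally`, and the addresses are made-up placeholders.

```python
from collections import Counter

# Hypothetical addresses, not taken from mbox-short.txt
sample = ['a@example.com', 'b@example.com', 'a@example.com']

tally = dict()
for name in sample:
    # get() returns 0 the first time a name is seen, so no KeyError is raised
    tally[name] = tally.get(name, 0) + 1

print(tally)
# Counter builds the same mapping in one call
print(tally == dict(Counter(sample)))
```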

Printing the counts dictionary built from FinalList at this point gives the following:

{'[email protected]': 1, '[email protected]': 3, '[email protected]': 5, '[email protected]': 
1, '[email protected]': 2, '[email protected]': 3, '[email protected]': 4, '[email protected]': 1, 
'[email protected]': 4, '[email protected]': 2, '[email protected]': 1}  

The next section of code finds the email address with the highest count in the dictionary:

maxValue = None
maxKey = None

#Loops through the dictionary to find the max Value within
#the dictionary

for Key,Value in counts.items() :
    if maxValue is None or Value > maxValue:
        maxValue = Value
        maxKey = Key

print maxKey, maxValue

The input/output for the code listed above is the following:

Enter file name: mbox-short.txt
[email protected] 5
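
The listing above is written for Python 2 (`raw_input` and the `print` statement). As a hedged sketch rather than the author's solution, the same steps might look like this in Python 3. The function name `most_prolific_sender` and the sample lines are hypothetical. The sketch counts while reading, so the intermediate list is not needed, and it uses the built-in `max` with `key=counts.get` in place of the explicit maximum loop.

```python
def most_prolific_sender(lines):
    """Return (address, count) for the most frequent 'From ' sender."""
    counts = {}
    for line in lines:
        line = line.rstrip()
        if not line.startswith('From '):
            continue
        email = line.split()[1]
        counts[email] = counts.get(email, 0) + 1
    if not counts:
        # No 'From ' lines were found
        return None, None
    # max() walks the keys and compares them by their stored counts
    top = max(counts, key=counts.get)
    return top, counts[top]


# Hypothetical input standing in for the lines of a mailbox file
sample_lines = [
    'From a@example.com Sat Jan  5 09:14:16 2008\n',
    'From: a@example.com\n',
    'From b@example.com Sat Jan  5 09:20:00 2008\n',
    'From a@example.com Sat Jan  5 09:30:00 2008\n',
]
sender, count = most_prolific_sender(sample_lines)
print(sender, count)
```

Because a file object yields its lines when iterated, an open file such as `open('mbox-short.txt')` could be passed in place of `sample_lines`.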