Programming for Everybody: Assignment 09.4 Dictionaries - edorlando07/datasciencecoursera GitHub Wiki
###Python Data Structures
9.4 Write a program to read through mbox-short.txt and figure out who has sent the greatest number of mail messages. The program looks for 'From ' lines and takes the second word of those lines as the person who sent the mail. The program creates a Python dictionary that maps the sender's mail address to a count of the number of times they appear in the file. After the dictionary is produced, the program reads through the dictionary using a maximum loop to find the most prolific committer.
A sample section of the text file is listed below:
From [email protected] Sat Jan 5 09:14:16 2008
Return-Path: <[email protected]>
Received: from murder (mail.umich.edu [141.211.14.90])
by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
Sat, 5 Jan 2008 09:14:15 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ;
5 Jan 2008 09:14:10 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
Sat, 5 Jan 2008 14:10:05 +0000 (GMT)
Message-ID: <[email protected]>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Received: from prod.collab.uhi.ac.uk ([194.35.219.182])
by paploo.uhi.ac.uk (JAMES SMTP Server 2.1.3) with SMTP ID 899
for <[email protected]>;
Sat, 5 Jan 2008 14:09:50 +0000 (GMT)
impl/impl/src/java/org/sakaiproject/content/impl
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
Content-Type: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Sat Jan 5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
The actual code is listed below:
fname = raw_input("Enter file name: ")
if len(fname) < 1 : fname = "mbox-short.txt"
FinalList = list()  # Empty list that will collect every sender address
counts = dict()     # Empty dictionary that will map each address to its message count
fh = open(fname)
for line in fh:                                 # Read the file one line at a time
    line = line.rstrip()                        # Strip trailing whitespace, including the newline
    if not line.startswith('From '): continue   # Skip every line that is not a 'From ' line
    words = line.split()
    words = words[1]                            # The second word on the line is the email address
    FinalList.append(words)                     # Add the address (a string) to FinalList
If FinalList were printed at this point, the data would look like this:
['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
'[email protected]', '[email protected]', '[email protected]', '[email protected]',
'[email protected]', '[email protected]', '[email protected]', '[email protected]',
'[email protected]', '[email protected]', '[email protected]', '[email protected]',
'[email protected]', '[email protected]', '[email protected]', '[email protected]']
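The filtering and splitting step can be tried on its own. The following is a minimal sketch written for Python 3 (the listing above uses Python 2's raw_input). The three sample lines and the address sender@example.com are hypothetical placeholders, not data from mbox-short.txt. It also shows why the test is for 'From ' with a trailing space: a 'From:' header line does not match.

```python
# Hypothetical sample lines; the address is a placeholder, not taken from mbox-short.txt.
sample_lines = [
    "From sender@example.com Sat Jan  5 09:14:16 2008\n",
    "From: sender@example.com\n",
    "X-DSPAM-Result: Innocent\n",
]

senders = []
for line in sample_lines:
    line = line.rstrip()
    # Only lines that begin with 'From ' (note the trailing space) are kept.
    if not line.startswith("From "):
        continue
    words = line.split()       # split() with no argument splits on any run of whitespace
    senders.append(words[1])   # the second word is the sender's address

print(senders)
```

Only the first sample line passes the filter, so the list ends up with a single address.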
###A note on counts.get
The dictionary get() method takes an optional second argument: a default value that is returned when the key doesn't exist. So counts.get(name,0) gives you 0 if name is not yet in counts, and the stored count otherwise.
for name in FinalList:
    counts[name] = counts.get(name,0) + 1
If the counts dictionary were printed at this point, the following would be printed:
{'[email protected]': 1, '[email protected]': 3, '[email protected]': 5, '[email protected]':
1, '[email protected]': 2, '[email protected]': 3, '[email protected]': 4, '[email protected]': 1,
'[email protected]': 4, '[email protected]': 2, '[email protected]': 1}
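To see the counts.get idiom in isolation, here is a small sketch written for Python 3. The list of addresses is hypothetical and used only for illustration. As a cross-check, it compares the result with collections.Counter from the standard library, which builds the same kind of mapping; this is an alternative, not the approach the assignment asks for.

```python
from collections import Counter

# Hypothetical addresses, used only for illustration.
names = ["a@example.com", "b@example.com", "a@example.com", "a@example.com"]

counts = {}
for name in names:
    # get() returns the current count, or 0 the first time a name is seen.
    counts[name] = counts.get(name, 0) + 1

print(counts)

# Counter builds the same address-to-count mapping in one call.
assert counts == dict(Counter(names))
```

Without the default of 0, the first lookup of each new address would return None, and adding 1 to it would fail.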
The next section of code finds the email address with the highest message count in the dictionary:
maxValue = None
maxKey = None
# Loop through the dictionary to find the largest value
# and the key (email address) it belongs to
for Key, Value in counts.items():
    if maxValue is None or Value > maxValue:
        maxValue = Value
        maxKey = Key
print maxKey, maxValue
The input/output for the code listed above is the following:
Enter file name: mbox-short.txt
[email protected] 5
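For comparison, here is a sketch of the maximum loop written for Python 3, next to the built-in max() function. The counts dictionary is hypothetical and used only for illustration. In Python 3, comparing an integer with None raises a TypeError, so the loop checks for None explicitly before comparing.

```python
# Hypothetical counts, used only for illustration.
counts = {"a@example.com": 3, "b@example.com": 5, "c@example.com": 1}

# Explicit maximum loop, guarding the first comparison against None.
max_key = None
max_value = None
for key, value in counts.items():
    if max_value is None or value > max_value:
        max_key, max_value = key, value

# Built-in alternative: max() over the keys, ranked by their counts.
best = max(counts, key=counts.get)

print(max_key, max_value)
print(best, counts[best])
```

Both approaches pick the same address here. The explicit loop matches what the assignment describes, while max() with a key function is a shorter way to write the same search.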