Programming for Everybody: Assignment 08.5 Lists - edorlando07/datasciencecoursera Wiki

### Python Data Structures

8.5 Open the file mbox-short.txt and read it line by line. When you find a line that starts with 'From ' like the following line:

From [email protected] Sat Jan 5 09:14:16 2008

You will parse the From line using split() and print out the second word in the line (i.e. the entire address of the person who sent the message). Then print out a count at the end. Hint: make sure not to include the lines that start with 'From:'.
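
As a rough illustration of the parsing step, the sketch below splits one 'From ' line on whitespace and picks out the second word. It is written for Python 3, and the address `someone@example.com` is a made-up placeholder rather than data taken from the file.

```python
# Placeholder line in the same shape as a 'From ' line; the address is hypothetical.
line = "From someone@example.com Sat Jan  5 09:14:16 2008"

# split() with no arguments splits on runs of whitespace,
# so the double space before "5" does not produce an empty item.
words = line.split()

sender = words[1]  # second word: the sender's address
print(sender)
print(len(words))
```

Because the split is on whitespace runs, the index of the address stays at 1 regardless of how the date portion is padded.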

You can download the sample data at http://www.pythonlearn.com/code/mbox-short.txt

A section of the sample file is listed below:

From [email protected] Sat Jan  5 09:14:16 2008
Return-Path: <[email protected]>
Received: from murder (mail.umich.edu [141.211.14.90])
 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
 Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
 by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
 Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
Sat, 5 Jan 2008 09:14:15 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; 
 5 Jan 2008 09:14:10 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
Sat,  5 Jan 2008 14:10:05 +0000 (GMT)
Message-ID: <[email protected]>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Received: from prod.collab.uhi.ac.uk ([194.35.219.182])
      by paploo.uhi.ac.uk (JAMES SMTP Server 2.1.3) with SMTP ID 899
      for <[email protected]>;
      Sat, 5 Jan 2008 14:09:50 +0000 (GMT)
Received: from nakamura.uits.iupui.edu (nakamura.uits.iupui.edu [134.68.220.122])
by shmi.uhi.ac.uk (Postfix) with ESMTP id A215243002
for <[email protected]>; Sat,  5 Jan 2008 14:13:33 +0000 (GMT)
Received: from nakamura.uits.iupui.edu (localhost [127.0.0.1])
by nakamura.uits.iupui.edu (8.12.11.20060308/8.12.11) with ESMTP id m05ECJVp010329
for <[email protected]>; Sat, 5 Jan 2008 09:12:19 -0500
Received: (from [email protected])
by nakamura.uits.iupui.edu (8.12.11.20060308/8.12.11/Submit) id m05ECIaH010327
for [email protected]; Sat, 5 Jan 2008 09:12:18 -0500
Date: Sat, 5 Jan 2008 09:12:18 -0500
X-Authentication-Warning: nakamura.uits.iupui.edu: apache set sender to [email protected] using -f
To: [email protected]
From: [email protected]
Subject: [sakai] svn commit: r39772 - content/branches/sakai_2-5-x/content-impl/impl/src/java/org/sakaiproject/content/impl
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
Content-Type: text/plain; charset=UTF-8  
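
To see why the hint matters, the following Python 3 sketch runs a `startswith('From ')` check over a few header lines shaped like the ones above. The addresses are placeholders, and `is_from_line` is a helper name invented for this illustration.

```python
def is_from_line(line):
    """Return True only for lines that begin with 'From ' (note the trailing space)."""
    return line.startswith("From ")


sample = [
    "From someone@example.com Sat Jan  5 09:14:16 2008",  # matches
    "From: someone@example.com",                          # colon instead of space, no match
    "Received: from murder ([unix socket])",              # 'from' is not at the start
    "Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])",
]

matches = [line for line in sample if is_from_line(line)]
print(len(matches))
```

Only the first line passes the check: the 'From:' header fails on the character after "From", and the 'Received:' lines fail because the match is anchored at the start of the line and is case-sensitive.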

The actual code is listed below:

fname = raw_input("Enter file name: ")
if len(fname) < 1 : fname = "mbox-short.txt"    # default to the sample file

count = 0

fh = open(fname)
for line in fh:
    line = line.rstrip()
    # Skip everything except 'From ' lines; the trailing space excludes 'From:' headers
    if not line.startswith('From '): continue
    # split() with no arguments splits on runs of whitespace and returns a list of words
    words = line.split()
    print words[1]    # the second word is the sender's email address
    count = count + 1
fh.close()

print "There were", count, "lines in the file with From as the first word"
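
The listing above uses Python 2 syntax (`raw_input` and the `print` statement). A possible Python 3 rendering of the same logic is sketched below. Splitting it into the functions `extract_senders` and `report` is a choice made here for illustration and is not part of the original solution.

```python
def extract_senders(lines):
    """Return the second word of every line that starts with 'From '."""
    senders = []
    for line in lines:
        line = line.rstrip()
        if not line.startswith("From "):
            continue
        words = line.split()
        senders.append(words[1])
    return senders


def report(fname):
    """Print each sender found in the file, followed by the total count."""
    # 'with' closes the file even if an error occurs while reading
    with open(fname) as fh:
        senders = extract_senders(fh)
    for sender in senders:
        print(sender)
    print("There were", len(senders), "lines in the file with From as the first word")
```

Calling `report(input("Enter file name: ") or "mbox-short.txt")` should reproduce the interactive behavior, falling back to the sample file when the user just presses Enter.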

The input/output of the code is listed below:

Enter file name: mbox-short.txt
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
There were 27 lines in the file with From as the first word