Programming for Everybody: Assignment 12.0: Retrieving Data from Web - edorlando07/datasciencecoursera GitHub Wiki

###Parsing HTML with Beautiful Soup

A copy of the entire HTML page being scanned is listed below.
The web page can be found at www.dr-chuck.com.

<html>
<head>
  <title>Dr. Charles R. Severance Home Page</title>
  <meta name="verify-v1" content="WQuA2ZPREiCyTlgNh/fv0jvzKJxrpzlagjiPaakSNH0=" />
<style type="text/css">
body { background: black; font-family: Arial,Helvetica,Verdana,Sans-Serif; color: white;}
body table { font-size: 11pt; }
a:link, a:visited, a:active { color: gray; text-decoration: none; font-weight: bold}
a:hover { color:orange }
strong {color: orange; }
em {color: yellow; font-style: normal;}

#twitter_div {
color: yellow;
text-align: center;
}
#twitter_div ul {
display: inline;
fornt-size: 75%;
margin:0px 0px 0px 0px;
padding:0px 0px 0px 0px;
list-style-type: none;
}
#twitter_div p {
margin:0px 0px 0px 0px;
text-align: center;
}
</style>
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.dr-chuck.com/csev-blog/index.rdf" />
<link rel="alternate" type="application/atom+xml" title="Atom" href="http://www.dr-chuck.com/csev-blog/atom.xml" />
<meta name="google-translate-customization" content="502d2c1a267d1206-8efe060c714e194c-g94a06c6c571083ae-11"></meta>
<script type="text/javascript" src="http://gc.kis.v2.scr.kaspersky-labs.com/CC57686F-4379-E64C-8211-3D222726DB61/main.js" charset="UTF-8"></script></head>
<body>
<table border=0>
<tr>
<td align=center valign=top width=180>
<a href="http://www.dr-chuck.com/csev-blog/">
<img align="center" src="csev_ian_dolphin_small.jpg" width="160" alt="Photo Credit: Ian Dolphin">
</a>
<br>
<br> <a href="http://www.si.umich.edu/" target="_blank">School of Information</a>
<br/>
<a href="http://www.ratemyprofessors.com/ShowRatings.jsp?tid=1159280" target="_blank">
<img src="/images/rate-my-professor.jpg" width="140" alt="Rate This Professor"></a>
<br> &nbsp
<br> <a href=http://www.dr-chuck.com/csev-blog/>Blog</a>
<br> <a href=http://www.twitter.com/drchuck/ target=_blank>@drchuck Twitter</a>
<br> <a href="http://www.dr-chuck.com/dr-chuck/resume/speaking.htm" target=_blank>Keynote Speaker</a>
<br/> <a href="http://www.slideshare.net/csev" target="_blank">Slideshare</a>
<br> <a href=/dr-chuck/resume/index.htm target=_blank>Resume and Bio</a>
<br>
<a target="_blank" href="http://amzn.to/1K5Q81K">Amazon Author Page</a>
<br> <a href="http://afs.dr-chuck.com/papers/" target=_blank>Chuck's Papers</a>
<br> <a href="https://itunes.apple.com/us/podcast/computing-conversations/id731495760" target="_blank">IEEE Audio     
Podcast</a>
<br> <a href="http://www.youtube.com/playlist?list=PLHJB2bhmgB7dFuY7HmrXLj5BmHGKTD-3R" target="_blank">IEEE Video    
Interviews</a>
<br> <a href="http://developers.imsglobal.org/" target="_blank">IMS LTI</a>
<br> <a href="http://www.youtube.com/user/csev" target=_blank>YouTube Channel </a>
<br> <a href="http://vimeo.com/drchuck/videos" target=_blank>Video on Vimeo</a>
<br> <a href="https://backpack.openbadges.org/share/4f76699ddb399d162a00b89a452074b3/" target="_blank">My Open    
Badges</a>
<br>
&nbsp; 
<br/>
<a href="http://www.linkedin.com/pub/chuck-severance/2/92a/3a8" >
<img src="http://www.linkedin.com/img/webpromo/btn_viewmy_120x33.png" 
    width="120" height="33" border="0" alt="View Chuck Severance's profile on LinkedIn">
</a>
<br/>
<a title="Charles R. Severance" href="https://www.researchgate.net/profile/Charles_Severance/" target="_blank"><img    
src="https://www.researchgate.net/images/public/profile_share_badge.png" alt="Charles R. Severance" /></a>
</td>
<td valign=top>
<div style="float: right">
<script data-gittip-username="drchuck"
        data-gittip-widget="button"
        src="//gttp.co/v1.js"></script>
</div>
<strong>
New:
</strong>
<a href="http://www.tsugi.org/" target="_blank">Tsugi: A PHP framework for IMS LTI Tools</a> <br/>
<strong>
New:
</strong>
<a href="http://youtu.be/slscHD40r78" target="_blank">MOOCs: Charles Severance at TEDxKalamazoo</a> 
<p>
Free Courses / Educational Material:
<br> &nbsp;
<a href="https://www.coursera.org/course/pythonlearn" target="_blank">Coursera: Programming for Everybody</a> (Python)
<br> &nbsp;
<a href="https://www.coursera.org/course/insidetheinternet" target="_blank">Coursera: Internet History, Technnology    
and Security</a>
<br> &nbsp;
<a href="http://open.umich.edu/education/si/si502/winter2009/" target=_blank>SI 502 - Networked Computing</a> 
<br> &nbsp; See also <a href=http://www.pythonlearn.com target=_blank>www.pythonlearn.com</a>,
<a href="http://www.php-intro.com/" target="_blank">
www.php-intro.com</a>
and
<a href=http://www.appenginelearn.com/>
www.appenginelearn.com</a>
</p>
<p>
Books
<br> &nbsp; <a href="http://www.pythonlearn.com/" /target="_blank">Python For Informatics: Exploring Information</a>       
(2010, 2014)
<br> &nbsp; <a href="/sakai-book">Sakai: Building an Open Source Community</a> (2011, 2014)
<br> &nbsp; <a href="http://www.amazon.com/gp/product/1624311393/ref=as_li_ss_tl?   
ie=UTF8&camp=1789&creative=390957&creativeASIN=1624311393&linkCode=as2&tag=drchu02-20" target="_blank">Raspberry Pi     
(21st Century Skills Innovation Library)</a> (2013)
<br> &nbsp; <a href="http://www.amazon.com/gp/product/059680069X/ref=as_li_ss_tl?   
ie=UTF8&camp=1789&creative=390957&creativeASIN=059680069X&linkCode=as2&tag=drchu02-20" target="_blank">Using Google    
App Engine</a> (O'Reilly 2009)
<br> &nbsp; <a href="http://www.amazon.com/Performance-Computing-Architectures-Optimization-Benchmarks/dp/156592312X/"    
target="_blank">High Performance Computing</a> (<a href="http://oreilly.com/catalog/9781565923126/"    
target="_blank">O'Reilly 1998</a>,  <a href="http://cnx.org/content/col11136/latest/" target=_blank>Connexions    
2010</a>)
</p>
Web/Multimedia sites
<br> &nbsp;
<a href="http://www.youtube.com/playlist?list=PLHJB2bhmgB7dFuY7HmrXLj5BmHGKTD-3R" target="_blank">
IEEE Computer - Computing Conversations Interviews</a>
(2011-present)
<br> &nbsp;
<a href=http://www.vimeo.com/17207620 target="_blank">
Dr. Chuck sings the blues </a> (2008)
<br> &nbsp;
<a href="http://www.youtube.com/watch?v=BVKpW02hsrU" target="_blank">
Dr. Chuck goes motocross racing</a> (2007)
<br> &nbsp;
<a href="http://www.youtube.com/watch?v=sa2WsgCvn7c" target="_blank">
A Film About Brent and His ATV
</a> (2005)
<br> &nbsp;
<a href="http://www.vimeo.com/17213019" target="_blank">
Audition Tape</a> (2003) for TechTV which was
rejected :(.
<br> &nbsp;
<a href="http://www.youtube.com/watch?v=FJ078sO35M0" target="_blank">
Dr. Chuck goes stock car racing</a> (2002)
<br> &nbsp;
<a href="http://afs.dr-chuck.com/citoolkit" target=_blank>
The Community Information Toolkit</a> - A project to provide public libraries and 
other organizations a start on using Internet in Commmunity Networking. (1999)
<p>
Software
<br> &nbsp;
<a href="http://www.sakaiproject.org/" target="_blank">The Sakai Collaboration
and Learning Environment</a>
<br> &nbsp;
<a href="http://www.tsugi.org/ target="_blank">Tsugi: A PHP framework for IMS LTI Tools</a>
<br> &nbsp;
<a href="http://developers.imsglobal.org/" target="_blank">IMS Learning Tools
Interoperability</a>
<br> &nbsp;
<a href="/obi-sample" target=_blank>My Reference Implementation of Mozilla Open Badges in PHP</a>
<br> &nbsp;
</td>
<td valign=top align=center width=180>
<div id="google_translate_element"></div><script type="text/javascript">
function googleTranslateElementInit() {
  new google.translate.TranslateElement({pageLanguage: 'en', layout:    
google.translate.TranslateElement.InlineLayout.SIMPLE, gaTrack: true, gaId: 'UA-423997-1'},    
'google_translate_element');
}
</script><script type="text/javascript" src="//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit"></script>
<br/>&nbsp;<br/>
<a class="twitter-timeline" href="https://twitter.com/drchuck" data-widget-id="282172185219567616">Tweets by    
@drchuck</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)   
[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id))   
{js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}     
(document,"script","twitter-wjs");</script>
</td>
</tr>
</table>
<br> &nbsp;
<script type="text/javascript">

  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-423997-1']);
  _gaq.push(['_setDomainName', 'dr-chuck.com']);
  _gaq.push(['_trackPageview']);  

   

</script>
</body>
</html>  

###Important note:
Make sure the Python script is saved in the following location:

C:\Python27\Lib\beautifulsoup4-4.5.3.tar\dist\beautifulsoup4-4.5.3\bs4

###The actual code is listed below:

from bs4 import BeautifulSoup
import urllib

url = raw_input('Enter website: ')

# Download the page and parse it (Python 2.7)
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve a list of anchor tags.
# Each tag behaves like a dictionary of its HTML attributes.
# "a" is the name of the anchor tag; other tag names,
# such as "b" or "p", could be passed instead.

tags = soup('a')

# Loop through every anchor tag on the page and print
# its href attribute (None if the tag has no href)
for tag in tags:
    print tag.get('href', None)
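
The script above targets Python 2.7. For readers on Python 3, where `raw_input`, `urllib.urlopen`, and the `print` statement are not available, the same idea can be sketched with the standard library alone. This is a minimal sketch rather than the course's code: it swaps Beautiful Soup for `html.parser.HTMLParser`, and the class name `HrefCollector` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == 'a':
            # Mirror tag.get('href', None): store None when href is missing
            self.hrefs.append(dict(attrs).get('href'))


# Sample markup modelled on a few anchors from the page listed earlier
sample = '''
<a href="http://www.dr-chuck.com/csev-blog/">Blog</a>
<a href="/sakai-book">Sakai: Building an Open Source Community</a>
<a name="top">An anchor without an href</a>
'''

collector = HrefCollector()
collector.feed(sample)

for href in collector.hrefs:
    print(href)
```

To run this against a live page, the sample string could be replaced with text downloaded via `urllib.request.urlopen(url).read().decode()`, assuming the page decodes as UTF-8.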

###The output for the code includes the following:

http://www.dr-chuck.com/csev-blog/
http://www.si.umich.edu/
http://www.ratemyprofessors.com/ShowRatings.jsp?tid=1159280
http://www.dr-chuck.com/csev-blog/
http://www.twitter.com/drchuck/
http://www.dr-chuck.com/dr-chuck/resume/speaking.htm
http://www.slideshare.net/csev
/dr-chuck/resume/index.htm
http://amzn.to/1K5Q81K
http://afs.dr-chuck.com/papers/
https://itunes.apple.com/us/podcast/computing-conversations/id731495760
http://www.youtube.com/playlist?list=PLHJB2bhmgB7dFuY7HmrXLj5BmHGKTD-3R
http://developers.imsglobal.org/
http://www.youtube.com/user/csev
http://vimeo.com/drchuck/videos
https://backpack.openbadges.org/share/4f76699ddb399d162a00b89a452074b3/
http://www.linkedin.com/pub/chuck-severance/2/92a/3a8
https://www.researchgate.net/profile/Charles_Severance/
http://www.tsugi.org/
http://youtu.be/slscHD40r78
https://www.coursera.org/course/pythonlearn
https://www.coursera.org/course/insidetheinternet
http://open.umich.edu/education/si/si502/winter2009/
http://www.pythonlearn.com
http://www.php-intro.com/
http://www.appenginelearn.com/
http://www.pythonlearn.com/
/sakai-book
http://www.amazon.com/gp/product/1624311393/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=1624311393&linkCode=as2&tag=drchu02-20
http://www.amazon.com/gp/product/059680069X/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=059680069X&linkCode=as2&tag=drchu02-20
http://www.amazon.com/Performance-Computing-Architectures-Optimization-Benchmarks/dp/156592312X/
http://oreilly.com/catalog/9781565923126/
http://cnx.org/content/col11136/latest/
http://www.youtube.com/playlist?list=PLHJB2bhmgB7dFuY7HmrXLj5BmHGKTD-3R
http://www.vimeo.com/17207620
http://www.youtube.com/watch?v=BVKpW02hsrU
http://www.youtube.com/watch?v=sa2WsgCvn7c
http://www.vimeo.com/17213019
http://www.youtube.com/watch?v=FJ078sO35M0
http://afs.dr-chuck.com/citoolkit
http://www.sakaiproject.org/
http://www.tsugi.org/ target=
http://developers.imsglobal.org/
/obi-sample
https://twitter.com/drchuck
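
Some entries in the output, such as `/sakai-book` and `/obi-sample`, are relative links: they only make sense when combined with the address of the page they came from. A minimal Python 3 sketch of resolving them with the standard library's `urljoin` follows. The base URL `http://www.dr-chuck.com/` is an assumption based on the page address given earlier, and the short `hrefs` list is a hand-picked sample of the output rather than the full list.

```python
from urllib.parse import urljoin

# Assumed address of the scanned page (the http scheme is a guess)
base_url = 'http://www.dr-chuck.com/'

# A few href values copied from the output above
hrefs = ['http://www.si.umich.edu/', '/sakai-book', '/obi-sample']

# urljoin leaves absolute URLs unchanged and resolves relative
# ones against the base URL
absolute = [urljoin(base_url, href) for href in hrefs]

for link in absolute:
    print(link)
```

Absolute links pass through untouched, so the same call can be applied to every value the script prints, provided `None` entries are skipped first.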