10 01 Importing data from the Internet

Importing flat files from the web

The urllib package

  • Provides interface for fetching data across the web
  • urlopen() : accepts URLs instead of file names
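
For instance, a minimal sketch (assuming network access; the URL below is only an illustration):

# Open a URL just like a file and read the first bytes of the response
from urllib.request import urlopen

with urlopen('https://www.python.org') as response:
    print(response.read(100))  # the body arrives as bytes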

Automate file download in Python

  • urlretrieve(url, 'name.csv') : downloads the file at url and saves it locally as 'name.csv'
# Import package
from urllib.request import urlretrieve

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urlretrieve(url, 'winequality-red.csv')
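
Once the file is saved, it can be read back like any local file; a quick check (this dataset is semicolon-delimited):

# Read the downloaded file into a DataFrame and inspect it
import pandas as pd

df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())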

Read flat files from the web directly

# Import package
import pandas as pd

# Read file into a DataFrame: df (this file is semicolon-delimited)
df = pd.read_csv(url, sep=';')

Importing non-flat files from the web

  • pd.read_excel() to import an Excel spreadsheet
    • sheet_name=None : import all sheets
    • Returns a dictionary, where keys are the sheet names and values are the corresponding DataFrames
# Import package
import pandas as pd

# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xls
xls = pd.read_excel(url, sheet_name=None)

# Print the sheetnames to the shell
print(xls.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xls['1700'].head())
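
Because xls is a dictionary of DataFrames, you can iterate over it; a small sketch:

# Loop over the dictionary: sheet names map to DataFrames
for sheet_name, sheet_df in xls.items():
    print(sheet_name, sheet_df.shape)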

HTTP requests to import files from the web

URL

  • Uniform/Universal Resource Locator
  • References to web resources
  • Focus: web addresses
  • Ingredients:
    • Protocol identifier - http:
    • Resource name - datacamp.com
    • Together specify web addresses uniquely

HTTP

  • HyperText Transfer Protocol
  • Foundation of data communication for the web
  • HTTPS - more secure form of HTTP
  • Going to a website = sending HTTP request
    • GET request
  • urlretrieve() performs a GET request
  • HTML - HyperText Markup Language

GET requests

using urllib

# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

# This packages the request: request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Print the datatype of response
print(type(response))  # <class 'http.client.HTTPResponse'>

# Extract the response: html
html = response.read()

# Be polite and close the response!
response.close()
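
The response is a file-like object, so a with statement is a tidy alternative that closes it automatically; a minimal sketch (assuming the page is UTF-8 encoded):

# Same request using a context manager, which closes the response for you
from urllib.request import urlopen, Request

with urlopen(Request(url)) as response:
    html = response.read()       # read() returns bytes
    text = html.decode('utf-8')  # decode to str if you need text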

using the requests package

# Import package
import requests

# Specify the url: url
url = "http://www.datacamp.com/teach/documentation" 

# Packages the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response: text
text = r.text
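
Before trusting r.text, it can help to inspect the response status; a short check:

# Check the HTTP status of the response
print(r.status_code)   # e.g. 200 for OK
r.raise_for_status()   # raises requests.HTTPError on 4xx/5xx responses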

Scraping the web in Python

HTML

  • Mix of unstructured and structured data
  • Structured data:
    • Has pre-defined data model, or
    • Organized in a defined manner
  • Unstructured data: neither of these properties

BeautifulSoup

  • Parse and extract structured data from HTML
  • Make tag soup beautiful and extract information
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
# (naming the parser explicitly avoids a "no parser specified" warning)
soup = BeautifulSoup(html_doc, 'html.parser')

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)

Turning a webpage into data using BeautifulSoup: getting the text

  • soup.title
  • soup.get_text()
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Get the title of Guido's webpage: guido_title
guido_title = soup.title

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()

# Print Guido's text to the shell
print(guido_text)

Turning a webpage into data using BeautifulSoup: getting the hyperlinks

  • find_all()
    • finds all hyperlinks in soup; hyperlinks are defined by the HTML tag <a>, which is passed to find_all() as 'a', without angle brackets
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
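
Some href values are relative; a small sketch using urljoin() from the standard library to make them absolute (url here is the page address requested above):

# Resolve relative links against the page URL; skip <a> tags with no href
from urllib.parse import urljoin

for link in a_tags:
    href = link.get('href')
    if href:
        print(urljoin(url, href))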