4.5.3.Webscraping - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki

Objectives

  • Define Webscraping
  • Beautiful Soup Objects
  • find_all
  • Webscraping a website

Webscraping

Webscraping is a process that automatically extracts information from a website. With the right tools, it can be accomplished in minutes rather than hours.

Beautiful Soup

from bs4 import BeautifulSoup

html="<!DOCTYPE html><html><head><title>Page ......"

soup = BeautifulSoup(html, 'html5lib')
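A minimal, runnable sketch of the parsing step above. The sample HTML is hypothetical (the string in these notes is truncated), and the sketch assumes Python's built-in "html.parser" so that no extra parser has to be installed; the notes themselves use 'html5lib', which is a separate package.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, not the HTML used in the course notes.
html = (
    "<!DOCTYPE html><html><head><title>Team Roster</title></head>"
    "<body><h3>Player A</h3><p>Salary: 100</p></body></html>"
)

# Assumption: "html.parser" (standard library) instead of 'html5lib'.
soup = BeautifulSoup(html, "html.parser")

# The soup object exposes the document as a tree of tags.
page_title = soup.title.string
print(page_title)
```

The resulting soup object is the entry point for everything that follows: tags are reached as attributes (soup.title, soup.h3) or through search methods such as find_all.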

Beautiful Soup Objects

tag_object = soup.title

tag_object = soup.h3

tag_child = tag_object.b

parent_tag = tag_child.parent

sibling_1 = tag_object.next_sibling

sibling_2 = sibling_1.next_sibling

tag_child.attrs
tag_child.string
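The navigation attributes above can be tried end to end with a small sketch. The HTML and the player names are made up for illustration, and "html.parser" is assumed as the parser. The sample has no whitespace between tags on purpose: if the markup contains newlines between elements, next_sibling can return a whitespace text node instead of the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, written without whitespace between tags.
html = (
    "<html><head><title>Team Roster</title></head><body>"
    "<h3><b id='boldest'>Player A</b></h3><p>Salary: 100</p>"
    "<h3>Player B</h3><p>Salary: 200</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

tag_object = soup.h3                  # first <h3> in the document
tag_child = tag_object.b              # <b> nested inside that <h3>
parent_tag = tag_child.parent         # back up to the <h3>
sibling_1 = tag_object.next_sibling   # the <p> that follows the <h3>
sibling_2 = sibling_1.next_sibling    # the second <h3>

print(tag_child.attrs)    # attributes of <b> as a dictionary
print(tag_child.string)   # text inside <b>
print(sibling_1.name, sibling_2.name)
```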

find_all

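# Assumes `table` is the Tag for a <table> element, e.g. table = soup.find("table")
table_rows = table.find_all(name='tr')

# Each element of the result is a Tag object
first_row = table_rows[0]
first_row.td  # first <td> cell of the first row

# Iterate over the rows, then over the cells in each row
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all("td")

    for j, cell in enumerate(cells):
        print("column", j, "cell", cell)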
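As a self-contained sketch of the same row-and-cell loop, the example below builds a small table and collects the cell text into a list of lists. The table contents are invented for illustration, "html.parser" is an assumed parser choice, and get_text(strip=True) is used here to keep only the text of each cell rather than the whole tag.

```python
from bs4 import BeautifulSoup

# Hypothetical table, not taken from the course notes.
html = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td></tr>"
    "<tr><td>1</td><td>Florida</td></tr>"
    "<tr><td>2</td><td>Texas</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# find_all returns a list-like result of every matching <tr> tag.
table_rows = table.find_all(name="tr")

data = []
for row in table_rows:
    cells = row.find_all("td")
    # Keep only the text of each cell, stripped of surrounding whitespace.
    data.append([cell.get_text(strip=True) for cell in cells])

print(data)
```

A list of lists like this can be passed on to other tools for analysis; the first inner list holds the header cells in this sample because the hypothetical table uses td for its header row.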