Web Scraping - robbiehume/CS-Notes GitHub Wiki
- Beautiful Soup is a widely used Python library for web scraping and parsing HTML and XML documents
- It offers a simple, flexible way to navigate a parsed page and extract data from it, which makes it a common choice for gathering and analyzing data from the web
- Beautiful Soup can handle various parsing tasks, such as searching for and manipulating tags, attributes, and text within HTML documents
- Basic example:

      from bs4 import BeautifulSoup

      with open('page.html', 'r') as f:
          html_doc = f.read()

      soup = BeautifulSoup(html_doc, 'html.parser')  # Parse using BeautifulSoup
      print(soup.body.div.p.text)                    # print the p element content
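- NOTE: when passing a `requests` response to Beautiful Soup, prefer `response.content` (the raw bytes) over `response.text` (a string that `requests` has already decoded using its own encoding guess). Given bytes, Beautiful Soup handles the decoding itself and can take a charset declared in the document into account.

      import requests
      from bs4 import BeautifulSoup

      response = requests.get("https://quotes.toscrape.com/")
      soup = BeautifulSoup(response.content, 'html.parser')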
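A minimal offline sketch of why bytes are the safer input. The page body here is hypothetical, and the Latin-1 decode only stands in for a wrong encoding guess made before parsing; it is not a claim about what `requests` will do for any particular site.

```python
from bs4 import BeautifulSoup

# Hypothetical page body as raw bytes, the way response.content would deliver it
raw = '<html><head><meta charset="utf-8"></head><body><p>café</p></body></html>'.encode('utf-8')

# Bytes in: Beautiful Soup decodes the document itself
soup_from_bytes = BeautifulSoup(raw, 'html.parser')
text_from_bytes = soup_from_bytes.p.get_text()

# String in: the (wrong) decoding has already happened and cannot be undone by the parser
soup_from_text = BeautifulSoup(raw.decode('latin-1'), 'html.parser')
text_from_bad_decode = soup_from_text.p.get_text()

print(text_from_bytes)       # café
print(text_from_bad_decode)  # cafÃ©
```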
| Method | Purpose | Example Usage |
|---|---|---|
| `BeautifulSoup(html, parser)` | Create a BeautifulSoup object | `soup = BeautifulSoup(html, 'html.parser')` |
| `find(name, attrs)` | Find first matching tag | `soup.find('p', class_='title')` |
| `find_all(name, attrs)` | Find all matching tags | `soup.find_all('a')` |
| `select(css_selector)` | Find tags using CSS selectors | `soup.select('div > p.title')` |
| `select_one(css_selector)` | Find first match using CSS selectors | `soup.select_one('p.content')` |
| `tag.text` or `tag.get_text()` | Extract text inside a tag | `title.get_text()` |
| `tag['attribute']` | Get a specific attribute | `link['href']` |
| `tag.attrs` | Get all attributes of a tag | `tag.attrs` |
| `tag.parent` | Access the parent tag | `p_tag.parent` |
| `tag.children` | Iterate over direct children | `for child in div_tag.children` |
| `tag.descendants` | Iterate over all nested children | `for descendant in div_tag.descendants` |
| `tag.next_sibling` | Go to the next sibling | `p_tag.next_sibling` |
| `tag.previous_sibling` | Go to the previous sibling | `p_tag.previous_sibling` |
| `tag.string.replace_with('new text')` | Replace text inside a tag | `h1.string.replace_with('New Title')` |
| `tag.decompose()` | Remove the tag from the tree | `p_tag.decompose()` |
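A short sketch combining a few of the entries above (`find_all`, attribute access, `tag.attrs`). The HTML snippet is made up for illustration. It also shows a detail the summary does not spell out: `tag['attribute']` raises `KeyError` when the attribute is absent, while `tag.get('attribute')` returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the last <a> has no href
html = '''
<ul>
  <li><a href="/page/1" class="nav first">One</a></li>
  <li><a href="/page/2">Two</a></li>
  <li><a>No link</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')

# .get() tolerates a missing attribute; links[2]['href'] would raise KeyError
hrefs = [a.get('href') for a in links]

# .attrs is a dict; multi-valued attributes such as class come back as lists
first_attrs = links[0].attrs

print(hrefs)        # ['/page/1', '/page/2', None]
print(first_attrs)  # {'href': '/page/1', 'class': ['nav', 'first']}
```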
- Initialize tree parser: `soup = BeautifulSoup(html_doc, 'html.parser')`
- Get element(s): `soup.body.div ...`
- View formatted element(s): `soup.prettify()`; `soup.body.div.prettify()`
- Get element attribute: `soup.body.div.a.get('href')`
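The quick-reference calls above, sketched on a made-up document. Dotted access returns only the first tag with that name and gives `None` when there is no such tag, so a longer chain can fail with `AttributeError` if an intermediate step is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two <div> elements
html_doc = (
    '<html><body>'
    '<div><a href="/about">About</a></div>'
    '<div><a href="/contact">Contact</a></div>'
    '</body></html>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

first_div = soup.body.div            # only the first <div>
href = soup.body.div.a.get('href')   # attribute of the first <a> inside it
missing = soup.body.table            # no <table> in the document, so None

print(first_div.prettify())          # indented, one node per line
print(href)                          # /about
print(missing)                       # None
```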
- `.find(name, attrs, recursive, string)`: finds the first matching tag
- `.find_all(name, attrs, recursive, string)`: finds all matching tags (returns a list)
- `.select(css_selector)`: finds tags using CSS selectors (powerful)
- `.select_one(css_selector)`: finds the first match via CSS selector
- `.get_text()`: extracts all text inside a tag

- `tag.string.replace_with('new text')`: replaces text inside a tag
- `tag.insert(position, new_tag)`: inserts a new tag
- `tag.decompose()`: removes a tag from the tree
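The three modification calls above, sketched in sequence on a made-up snippet. `soup.new_tag()` is used to build the tag that gets inserted; the tag name and text are illustrative.

```python
from bs4 import BeautifulSoup

html = '<div><h1>Old Title</h1><p class="ad">Buy now</p><p>Body</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Replace the text inside <h1>
soup.h1.string.replace_with('New Title')

# Build a new <span> and insert it at position 1 of the <div>'s children,
# i.e. right after <h1>
note = soup.new_tag('span')
note.string = 'Note'
soup.div.insert(1, note)

# Remove the advert paragraph from the tree entirely
soup.find('p', class_='ad').decompose()

result = str(soup)
print(result)  # <div><h1>New Title</h1><span>Note</span><p>Body</p></div>
```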
- Example:

      from bs4 import BeautifulSoup

      html = '<div><p class="title">Title</p><p class="content">Content</p></div>'
      soup = BeautifulSoup(html, 'html.parser')

      title = soup.find('p', class_='title')   # find a <p> with class="title"
      print(title.text)                        # Output: Title

      content = soup.select_one('p.content')   # select using CSS selector
      print(content.get_text())                # Output: Content
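The `recursive` and `string` parameters of `find` / `find_all` are not exercised in the example above, so here is a small sketch of them on a made-up snippet, with the CSS child combinator shown as the `select` counterpart of `recursive=False`.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="root">'
    '<p>Intro</p>'
    '<section><p>Nested</p></section>'
    '<p>Outro</p>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
root = soup.find('div', id='root')

# Default: search the whole subtree
all_p = [p.get_text() for p in root.find_all('p')]

# recursive=False: only direct children of root
direct_p = [p.get_text() for p in root.find_all('p', recursive=False)]

# string=...: match on the tag's text content
by_text = root.find('p', string='Nested')

# CSS child combinator gives the same result as recursive=False here
css_direct = [p.get_text() for p in soup.select('#root > p')]

print(all_p)              # ['Intro', 'Nested', 'Outro']
print(direct_p)           # ['Intro', 'Outro']
print(by_text.get_text()) # Nested
print(css_direct)         # ['Intro', 'Outro']
```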
- `.tag` (for example `soup.body` or `soup.div`)
  - Returns the first tag with that name as an object
  - It can be chained to reach a specific tag by following its children
- `.contents` vs `.children`
  - A tag's direct children are available as a list in `.contents`. Instead of retrieving the list, we can use the `.children` iterator to loop over a tag's children.
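A minimal sketch of the difference, on a made-up list: `.contents` is a real list that can be indexed and measured, while `.children` is meant for looping.

```python
from bs4 import BeautifulSoup

html = '<ul><li>One</li><li>Two</li><li>Three</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

# .contents is a plain list, so indexing and len() work
as_list = ul.contents
second = ul.contents[1].get_text()

# .children is iterated over rather than indexed
names = [child.get_text() for child in ul.children]

print(len(as_list))  # 3
print(second)        # Two
print(names)         # ['One', 'Two', 'Three']
```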
- `.descendants`
  - Recursively yields all of the tag's children, their children, and so on (the whole subtree under the tag), including text nodes
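To make the contrast with direct children concrete, a small sketch on a made-up snippet. Note that text nodes count as descendants too, so the list is longer than the number of tags.

```python
from bs4 import BeautifulSoup, Tag

html = '<div><p>Hello <b>world</b></p></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

children = list(div.children)        # just the <p>
descendants = list(div.descendants)  # <p>, 'Hello ', <b>, 'world'

# Keep only the tags to see the nesting order
tag_names = [d.name for d in descendants if isinstance(d, Tag)]

print(len(children))     # 1
print(len(descendants))  # 4
print(tag_names)         # ['p', 'b']
```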
- `.strings` vs `.stripped_strings`
  - `.strings` yields every string inside the tag (or the whole document when called on `soup`), including whitespace-only strings and strings nested within child tags
  - `.stripped_strings` yields only the strings that contain visible text, with leading and trailing whitespace stripped from each
- `.parent` vs `.parents`
  - `.parent` returns the immediate parent of the current tag
  - `.parents` returns an iterator over all the parents (ancestors) of the current tag
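A small sketch on a made-up document. The walk upward ends at the `BeautifulSoup` object itself, whose name is `[document]`.

```python
from bs4 import BeautifulSoup

html = '<html><body><div><p><b>deep</b></p></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')
b = soup.b

parent_name = b.parent.name                 # the enclosing <p>
ancestor_names = [p.name for p in b.parents]  # nearest ancestor first

print(parent_name)     # p
print(ancestor_names)  # ['p', 'div', 'body', 'html', '[document]']
```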
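- `.next_sibling` vs `.previous_sibling`
  - `.next_sibling` returns the node that follows the current tag at the same level of the tree
  - `.previous_sibling` returns the node that precedes the current tag at the same level
  - The sibling is not always a tag: whitespace between tags is parsed as a text node, and the result is `None` when there is no sibling on that side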
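A sketch of the whitespace pitfall with siblings, on a made-up list whose items are separated by newlines. `find_next_sibling()` is used as one way to skip over text nodes and land on the next tag.

```python
from bs4 import BeautifulSoup

html = '<ul>\n<li>One</li>\n<li>Two</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
first = soup.li

next_node = first.next_sibling            # the newline text node, not the next <li>
next_tag = first.find_next_sibling('li')  # skips text nodes, returns the next <li>
after_last = next_tag.next_sibling        # nothing follows the last <li>

print(repr(next_node))      # '\n'
print(next_tag.get_text())  # Two
print(after_last)           # None
```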
- `.next_element` vs `.previous_element`
  - `.next_element` returns the next element in the parse tree after the current element
  - `.previous_element` returns the previous element in the parse tree before the current element
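The difference from the sibling attributes is easiest to see side by side. In this sketch (made-up snippet), the parse order is `<div>`, `<p>`, `'One '`, `<b>`, `'bold'`, `<p>`, `'Two'`, and `.next_element` simply follows that order, stepping into and out of tags as needed.

```python
from bs4 import BeautifulSoup

html = '<div><p>One <b>bold</b></p><p>Two</p></div>'
soup = BeautifulSoup(html, 'html.parser')
first_p = soup.p
b = soup.b

sibling = first_p.next_sibling     # same level: the second <p>
element = first_p.next_element     # parse order: the text inside the first <p>

inside_b = b.next_element          # the text 'bold'
after_b_text = inside_b.next_element  # leaves <b> and the first <p>: the second <p>
before_b = b.previous_element      # the text parsed just before <b>

print(sibling)         # <p>Two</p>
print(repr(element))   # 'One '
print(after_b_text)    # <p>Two</p>
print(repr(before_b))  # 'One '
```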