Web Scraping - robbiehume/CS-Notes GitHub Wiki

Overview

Beautiful Soup

General

Beautiful Soup is a widely used Python library for parsing HTML and XML documents, which makes it a common choice for web scraping
It provides a simple, flexible way to navigate a parsed page and extract data from it
It supports searching for and modifying tags, attributes, and text within a document

Initiate tree

from bs4 import BeautifulSoup

with open('page.html', 'r') as f:
    html_doc = f.read()
soup = BeautifulSoup(html_doc, 'html.parser')  # Parse using BeautifulSoup
print(soup.body.div.p.text)  # print the p element content

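NOTE: when parsing a requests response, prefer passing response.content (raw bytes) rather than response.text; Beautiful Soup can then detect the document's encoding itself instead of relying on the encoding requests inferred from the HTTP headers
```
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')
```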
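The note above can be tried without a network call. This is a minimal sketch, assuming a made-up byte string (`raw`) that stands in for what `response.content` would return; it shows Beautiful Soup decoding bytes on its own and reporting the encoding it used via `original_encoding`.

```python
from bs4 import BeautifulSoup

# Hypothetical response body: UTF-8 bytes, standing in for response.content.
raw = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>caf\u00e9</p></body></html>'
).encode('utf-8')

# Bytes in: Beautiful Soup works out the encoding and decodes the markup itself.
soup = BeautifulSoup(raw, 'html.parser')
text = soup.p.get_text()

print(text)                    # the decoded paragraph text
print(soup.original_encoding)  # the encoding Beautiful Soup settled on
```

When a string is passed instead of bytes, no detection happens and `original_encoding` is `None`.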

Methods

| Method | Purpose | Example Usage |
| --- | --- | --- |
| `BeautifulSoup(html, parser)` | Create a BeautifulSoup object | `soup = BeautifulSoup(html, 'html.parser')` |
| `find(name, attrs)` | Find first matching tag | `soup.find('p', class_='title')` |
| `find_all(name, attrs)` | Find all matching tags | `soup.find_all('a')` |
| `select(css_selector)` | Find tags using CSS selectors | `soup.select('div > p.title')` |
| `select_one(css_selector)` | Find first match using CSS selectors | `soup.select_one('p.content')` |
| `tag.text` or `tag.get_text()` | Extract text inside a tag | `title.get_text()` |
| `tag['attribute']` | Get a specific attribute | `link['href']` |
| `tag.attrs` | Get all attributes of a tag | `tag.attrs` |
| `tag.parent` | Access the parent tag | `p_tag.parent` |
| `tag.children` | Iterate over direct children | `for child in div_tag.children` |
| `tag.descendants` | Iterate over all nested children | `for descendant in div_tag.descendants` |
| `tag.next_sibling` | Go to the next sibling | `p_tag.next_sibling` |
| `tag.previous_sibling` | Go to the previous sibling | `p_tag.previous_sibling` |
| `tag.string.replace_with('new text')` | Replace text inside a tag | `h1.string.replace_with('New Title')` |
| `tag.decompose()` | Remove the tag from the tree | `p_tag.decompose()` |

Retrieve elements

Initialize tree parser: soup = BeautifulSoup(html_doc, 'html.parser')
Get element(s): soup.body.div ...
View formatted element(s): soup.prettify(); soup.body.div.prettify()
Get element attribute: soup.body.div.a.get('href')
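The retrieval steps above can be sketched end to end as follows. The markup in `html_doc` is invented for illustration; the calls themselves (dotted tag access, `.get()`, `.prettify()`) are the ones listed above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html_doc = '<html><body><div><a href="/about">About</a></div></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')

link = soup.body.div.a           # each dotted step returns the first tag with that name
href = link.get('href')          # value of the href attribute
missing = link.get('title')      # .get() returns None when the attribute is absent

pretty = soup.body.div.prettify()  # indented string with one node per line
print(href)
print(pretty)
```

Using `link['title']` instead of `link.get('title')` would raise a `KeyError` here, so `.get()` is the safer choice when an attribute may be missing.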

Searching the tree:

.find(name, attrs, recursive, string): finds the first matching tag (returns None if nothing matches)
.find_all(name, attrs, recursive, string): finds all matching tags (returns a list, empty if nothing matches)
.select(css_selector): finds all tags matching a CSS selector (returns a list)
.select_one(css_selector): finds the first tag matching a CSS selector
.get_text(): extracts all text inside a tag, including text from nested tags
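A short sketch of the less obvious search arguments (`attrs`, `recursive`, `string`) may help here. The HTML snippet and variable names are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = (
    '<div id="main">'
    '<p class="title">Title</p>'
    '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
    '<p>Footer</p>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

hrefs = [a['href'] for a in soup.find_all('a')]     # every <a> in the document

main = soup.find('div', attrs={'id': 'main'})       # match on an attribute dict
direct_ps = main.find_all('p', recursive=False)     # direct children only
second = soup.find('a', string='Second')            # match on the tag's text
nested = soup.select('ul > li > a')                 # CSS child combinators
missing = soup.find('table')                        # no match, so None

all_text = main.get_text(' ', strip=True)           # joined text of the subtree
print(hrefs, len(direct_ps), second['href'], all_text)
```

Because `find()` returns `None` on a miss, it is worth checking the result before chaining further attribute access onto it.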

Modifying the tree:

tag.string.replace_with('new text'): replace text inside a tag
tag.insert(position, new_tag): insert a new tag at the given index among the tag's children
tag.decompose(): remove a tag from the tree
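The three modifications above can be combined in one small sketch. The markup is invented for illustration, and `soup.new_tag()` is used to create the tag that gets inserted.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<div><h1>Old Title</h1><p class="ad">Buy now</p><p>Body</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Replace the text inside <h1>.
soup.h1.string.replace_with('New Title')

# Build a new tag and insert it at index 1 of the div's children (right after <h1>).
note = soup.new_tag('span')
note.string = 'Note'
soup.div.insert(1, note)

# Remove the unwanted paragraph from the tree entirely.
soup.find('p', class_='ad').decompose()

result = str(soup)
print(result)
```

After `decompose()` the removed tag and its contents are destroyed, so it should not be reused afterwards.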

Mini example:

from bs4 import BeautifulSoup

html = '<div><p class="title">Title</p><p class="content">Content</p></div>'
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('p', class_='title')     # find a <p> with class="title"
print(title.text)                          # Output: Title

content = soup.select_one('p.content')      # select using CSS selector
print(content.get_text())                   # Output: Content

Soup HTML tree

.tag
- Accessing a tag name as an attribute (e.g. soup.body) returns the first tag with that name beneath the current object
- It can be chained to reach a specific tag by following its children (e.g. soup.body.div.p)
.contents vs .children
- A tag's direct children are available in the .contents list. Instead of retrieving the list, we may use the .children generator to iterate through a tag's children.
.descendants
- Recursively iterates over all the children and their children (the whole subtree) of the tag
.strings vs .stripped_strings
- .strings yields every string beneath the tag, including whitespace-only strings and strings nested within child tags
- .stripped_strings yields only the non-empty strings, with leading and trailing whitespace stripped from each.
.parent vs .parents
- .parent returns the immediate parent of the current tag
- .parents returns an iterator that allows iterating over all the parents of the current tag.
.next_sibling vs .previous_sibling
- .next_sibling returns the next node at the same level of the tree; this can be a whitespace string rather than a tag
- .previous_sibling returns the previous node at the same level of the tree.
.next_element vs .previous_element
- .next_element returns whatever was parsed immediately after the current element (often its first child)
- .previous_element returns whatever was parsed immediately before the current element.
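The navigation attributes above are easier to compare side by side on a small tree. This is a sketch over made-up markup; `find_next_sibling()` is not covered in the list above and is included only to show one way of skipping whitespace nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with indentation, so whitespace strings appear in the tree.
html = """<div id="box">
  <p>One</p>
  <p>Two <b>bold</b></p>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

n_contents = len(div.contents)  # whitespace strings count as children too
child_tags = [c.name for c in div.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]

raw_strings = list(div.strings)             # includes whitespace-only strings
clean_strings = list(div.stripped_strings)  # non-empty, stripped strings only

ancestors = [p.name for p in soup.b.parents]  # walks up to the document root

first_p = div.p
sibling = first_p.next_sibling                 # the whitespace between the two <p> tags
next_p = first_p.find_next_sibling('p')        # skips strings, returns the next <p>
first_inside = first_p.next_element            # the string inside the first <p>

print(child_tags, descendant_tags, clean_strings, ancestors)
```

Filtering with `isinstance(node, Tag)` is one way to ignore the whitespace strings that an indented document introduces between tags.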
