Web Scraping - robbiehume/CS-Notes GitHub Wiki
- Beautiful Soup is a widely used Python library for web scraping and parsing HTML and XML documents
- It provides a simple, flexible API for navigating web pages and extracting data from them, which makes it a common choice for gathering and analyzing data from the internet
- Beautiful Soup can handle various parsing tasks, such as searching for and manipulating tags, attributes, and text within HTML documents
- Parse a local HTML file:

      from bs4 import BeautifulSoup

      with open('page.html', 'r') as f:
          html_doc = f.read()

      soup = BeautifulSoup(html_doc, 'html.parser')  # parse using BeautifulSoup
      print(soup.body.div.p.text)                    # print the p element content
- NOTE: when parsing a requests response with Beautiful Soup, prefer response.content (the raw bytes) over response.text (an already-decoded string), so that Beautiful Soup can detect the document's encoding itself:

      import requests
      from bs4 import BeautifulSoup

      response = requests.get("https://quotes.toscrape.com/")
      soup = BeautifulSoup(response.content, 'html.parser')
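The note above can be illustrated offline. This is a minimal sketch, assuming a made-up Latin-1 page stored in `raw_bytes` as a stand-in for `response.content`; no network request is made, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: Latin-1 bytes that declare their own charset.
raw_bytes = (
    '<html><head><meta charset="iso-8859-1"></head>'
    '<body><p>café</p></body></html>'
).encode('iso-8859-1')

# Given bytes, Beautiful Soup can read the declared charset and decode correctly.
soup = BeautifulSoup(raw_bytes, 'html.parser')
text_from_bytes = soup.p.get_text()

# Decoding with the wrong codec first (as a bad guess would) corrupts the text
# before Beautiful Soup ever sees it.
wrong_text = raw_bytes.decode('utf-8', errors='replace')
text_from_wrong_str = BeautifulSoup(wrong_text, 'html.parser').p.get_text()

print(text_from_bytes)             # café
print(soup.original_encoding)      # the encoding Beautiful Soup settled on
print(text_from_wrong_str)         # contains a replacement character
```

When a string is passed instead of bytes, `soup.original_encoding` is `None`, because no decoding step was left for Beautiful Soup to perform.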
| Method | Purpose | Example Usage |
| --- | --- | --- |
| BeautifulSoup(html, parser) | Create a BeautifulSoup object | soup = BeautifulSoup(html, 'html.parser') |
| find(name, attrs) | Find first matching tag | soup.find('p', class_='title') |
| find_all(name, attrs) | Find all matching tags | soup.find_all('a') |
| select(css_selector) | Find tags using CSS selectors | soup.select('div > p.title') |
| select_one(css_selector) | Find first match using CSS selectors | soup.select_one('p.content') |
| tag.text or tag.get_text() | Extract text inside a tag | title.get_text() |
| tag['attribute'] | Get a specific attribute | link['href'] |
| tag.attrs | Get all attributes of a tag | tag.attrs |
| tag.parent | Access the parent tag | p_tag.parent |
| tag.children | Iterate over direct children | for child in div_tag.children |
| tag.descendants | Iterate over all nested children | for descendant in div_tag.descendants |
| tag.next_sibling | Go to the next sibling | p_tag.next_sibling |
| tag.previous_sibling | Go to the previous sibling | p_tag.previous_sibling |
| tag.string.replace_with('new text') | Replace text inside a tag | h1.string.replace_with('New Title') |
| tag.decompose() | Remove the tag from the tree | p_tag.decompose() |
- Initialize tree parser: soup = BeautifulSoup(html_doc, 'html.parser')
- Get element(s): soup.body.div ...
- View formatted element(s): soup.prettify(); soup.body.div.prettify()
- Get element attribute: soup.body.div.a.get('href')
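The basic operations above can be tried together on a small document. The following is a sketch only; the HTML string and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

html_doc = '<html><body><div><a href="/about">About</a><p>Hello</p></div></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')

first_div = soup.body.div              # dotted access returns the first matching tag
link_href = first_div.a.get('href')    # value of the href attribute
missing = first_div.a.get('title')     # .get() returns None for an absent attribute

print(first_div.name)        # div
print(link_href)             # /about
print(missing)               # None
print(first_div.prettify())  # the <div> subtree, one node per line, indented
```

Using `.get('href')` rather than `tag['href']` avoids a `KeyError` when the attribute may be missing.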
- .find(name, attrs, recursive, string): finds the first matching tag
- .find_all(name, attrs, recursive, string): finds all matching tags (returns a list)
- .select(css_selector): finds tags using CSS selectors (powerful)
- .select_one(css_selector): finds the first match via CSS selector
- .get_text(): extracts all text inside a tag
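The `recursive` and `string` arguments listed above are easy to overlook. Here is a small sketch of how they change a search, using an invented menu snippet; the `id` value and the links are assumptions made for the example.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="menu">'
    '<a href="/home">Home</a>'
    '<span><a href="/docs">Docs</a></span>'
    '<a href="/blog">Blog</a>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
menu = soup.find('div', attrs={'id': 'menu'})

all_links = menu.find_all('a')                      # searches every nesting level
direct_links = menu.find_all('a', recursive=False)  # direct children only
docs_link = menu.find('a', string='Docs')           # match on the tag's text
css_links = soup.select('#menu > a')                # CSS child combinator

print([a['href'] for a in all_links])     # ['/home', '/docs', '/blog']
print([a['href'] for a in direct_links])  # ['/home', '/blog']
print(docs_link['href'])                  # /docs
print([a['href'] for a in css_links])     # ['/home', '/blog']
```

The CSS selector `#menu > a` and `find_all('a', recursive=False)` express the same "direct children only" restriction in two different styles.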
- tag.string.replace_with('new text'): replace text inside a tag
- tag.insert(position, new_tag): insert a new tag
- tag.decompose(): remove a tag from the tree
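The three modification methods above can be combined in one pass over a document. This is a minimal sketch on an invented snippet; `soup.new_tag()` is used here to build the tag that gets inserted.

```python
from bs4 import BeautifulSoup

html = '<div><h1>Old Title</h1><p class="ad">Buy now</p><p>Body</p></div>'
soup = BeautifulSoup(html, 'html.parser')

soup.h1.string.replace_with('New Title')  # swap the text node inside <h1>

note = soup.new_tag('em')                 # build a new <em> tag
note.string = 'Note'
soup.div.insert(1, note)                  # position 1 is right after <h1>

soup.find('p', class_='ad').decompose()   # remove the tag and its contents

result = str(soup)
print(result)  # <div><h1>New Title</h1><em>Note</em><p>Body</p></div>
```

The position passed to `insert()` is an index into the parent's `.contents` list, so text nodes count as positions too.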
- Example:

      from bs4 import BeautifulSoup

      html = '<div><p class="title">Title</p><p class="content">Content</p></div>'
      soup = BeautifulSoup(html, 'html.parser')

      title = soup.find('p', class_='title')    # find a <p> with class="title"
      print(title.text)                         # Output: Title

      content = soup.select_one('p.content')    # select using CSS selector
      print(content.get_text())                 # Output: Content
- .tag
  - Returns the first tag with that name as a Tag object (e.g. soup.body)
  - It can be chained to reach a specific tag by following its children (e.g. soup.body.div.p)
- .contents vs .children
  - A tag's direct children are available in the .contents list. Instead of retrieving the whole list, you can iterate over them with the .children generator.
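To make the difference between `.contents` and `.children` concrete, here is a short sketch on an invented two-item list; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>one</li><li>two</li></ul>', 'html.parser')
ul = soup.ul

as_list = ul.contents                                # a real list: len() and indexing work
child_names = [child.name for child in ul.children]  # iterate without indexing

print(len(as_list))     # 2
print(as_list[1].text)  # two
print(child_names)      # ['li', 'li']
```

Use `.contents` when you need a position or a count, and `.children` when you only need to loop.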
- .descendants
  - Recursively returns all the children of the tag and their children (every nested subtree)
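A quick way to see what `.descendants` adds over `.children` is to list both for the same tag. The snippet below is a sketch on invented markup; text nodes are shown as plain strings and tags by name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup('<div><p>Hi <b>there</b></p></div>', 'html.parser')
div = soup.div

children = [c.name for c in div.children]

# Tags are listed by name, text nodes by their string value.
descendants = [d.name if isinstance(d, Tag) else str(d) for d in div.descendants]

print(children)     # ['p']
print(descendants)  # ['p', 'Hi ', 'b', 'there']
```

Note that text nodes are included among the descendants, not just tags.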
- .strings vs .stripped_strings
  - .strings yields every string inside the tag (or the whole document, when called on the soup object), including whitespace-only strings and strings nested within child tags
  - .stripped_strings yields only the strings that contain visible text, with leading and trailing whitespace stripped from each one
- .parent vs .parents
  - .parent returns the immediate parent of the current tag
  - .parents returns an iterator over all the ancestors of the current tag
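As a sketch of the two attributes side by side, the following walks up from a `<p>` tag in an invented document.

```python
from bs4 import BeautifulSoup

html = '<html><body><div><p>text</p></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

parent_name = p.parent.name                  # immediate parent only
ancestor_names = [a.name for a in p.parents] # every ancestor, innermost first

print(parent_name)     # div
print(ancestor_names)  # ['div', 'body', 'html', '[document]']
```

The last ancestor is the BeautifulSoup object itself, which reports its name as `[document]`.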
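- .next_sibling vs .previous_sibling
  - .next_sibling returns the node that follows the current tag at the same level of the tree
  - .previous_sibling returns the node that precedes the current tag at the same level of the tree
  - Either one may be a whitespace string rather than a tag when the HTML contains line breaks or indentation between tags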
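A common surprise with sibling navigation is that whitespace between tags counts as a sibling. The sketch below, on an invented list, shows this and uses `find_next_sibling()` / `find_previous_sibling()` to skip straight to the neighbouring tag.

```python
from bs4 import BeautifulSoup

html = '<ul>\n<li id="a">A</li>\n<li id="b">B</li>\n</ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li', id='a')
second = soup.find('li', id='b')

raw_next = first.next_sibling               # the newline between the two <li> tags
next_tag = first.find_next_sibling('li')    # skips text nodes, returns a tag
prev_tag = second.find_previous_sibling('li')

print(repr(raw_next))  # '\n'
print(next_tag['id'])  # b
print(prev_tag['id'])  # a
```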
- .next_element vs .previous_element
  - .next_element returns the next element in the parse tree after the current element
  - .previous_element returns the previous element in the parse tree before the current element
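Because `.next_element` follows parse order rather than tree level, it often differs from `.next_sibling`. Here is a minimal sketch contrasting the two on an invented snippet.

```python
from bs4 import BeautifulSoup

html = '<div><p>one <b>two</b></p><span>three</span></div>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

sibling = p.next_sibling              # same level of the tree: the <span>
element = p.next_element              # next thing parsed: the text inside <p>
before_span = soup.span.previous_element  # last thing parsed before <span>

print(sibling.name)       # span
print(repr(element))      # 'one '
print(repr(before_span))  # 'two'
```

In other words, `.next_element` descends into a tag's contents first, while `.next_sibling` steps over them.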