Web Scraping - robbiehume/CS-Notes GitHub Wiki

Overview

Beautiful Soup

General

  • Beautiful Soup is a widely used Python library for web scraping and parsing HTML and XML documents
  • It provides a simple, flexible API for navigating a parsed document and extracting data from it, which makes it a common choice for collecting and analyzing data from web pages
  • Beautiful Soup can handle various parsing tasks, such as searching for and manipulating tags, attributes, and text within HTML documents

Initiate tree

  • from bs4 import BeautifulSoup
    
    with open('page.html', 'r') as f:
        html_doc = f.read()
    soup = BeautifulSoup(html_doc, 'html.parser')  # Parse using BeautifulSoup
    print(soup.body.div.p.text)  # print the p element content
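  • NOTE: when parsing a requests response, pass response.content (raw bytes) rather than response.text (a string that requests has already decoded); given bytes, Beautiful Soup detects the encoding itself, including any charset declared in the document
    • import requests
      from bs4 import BeautifulSoup
      
      response = requests.get("https://quotes.toscrape.com/")
      soup = BeautifulSoup(response.content, 'html.parser')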
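
The note above can be tried without a network call. This is a minimal sketch: raw_bytes is a hypothetical stand-in for response.content, holding ISO-8859-1 bytes with the charset declared in a meta tag.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.content: raw ISO-8859-1 bytes,
# with the encoding declared inside the document itself.
raw_bytes = (
    b'<html><head><meta charset="iso-8859-1"></head>'
    b'<body><p>caf\xe9</p></body></html>'
)

# Given bytes, Beautiful Soup works out the encoding on its own.
soup = BeautifulSoup(raw_bytes, 'html.parser')
text = soup.p.get_text()

print(text)
print(soup.original_encoding)  # the encoding Beautiful Soup settled on
```

If the same bytes had been decoded with the wrong codec before parsing, the accented character would already be damaged by the time Beautiful Soup saw it.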

Methods

  • | Method | Purpose | Example usage |
    |---|---|---|
    | BeautifulSoup(html, parser) | Create a BeautifulSoup object | soup = BeautifulSoup(html, 'html.parser') |
    | find(name, attrs) | Find first matching tag | soup.find('p', class_='title') |
    | find_all(name, attrs) | Find all matching tags | soup.find_all('a') |
    | select(css_selector) | Find tags using CSS selectors | soup.select('div > p.title') |
    | select_one(css_selector) | Find first match using CSS selectors | soup.select_one('p.content') |
    | tag.text or tag.get_text() | Extract text inside a tag | title.get_text() |
    | tag['attribute'] | Get a specific attribute | link['href'] |
    | tag.attrs | Get all attributes of a tag | tag.attrs |
    | tag.parent | Access the parent tag | p_tag.parent |
    | tag.children | Iterate over direct children | for child in div_tag.children |
    | tag.descendants | Iterate over all nested children | for descendant in div_tag.descendants |
    | tag.next_sibling | Go to the next sibling | p_tag.next_sibling |
    | tag.previous_sibling | Go to the previous sibling | p_tag.previous_sibling |
    | tag.string.replace_with('new text') | Replace text inside a tag | h1.string.replace_with('New Title') |
    | tag.decompose() | Remove the tag from the tree | p_tag.decompose() |
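
A short sketch of the attribute-related methods listed above, run against a made-up HTML snippet (the markup and variable names are illustrative only). It also shows two details that are easy to miss: multi-valued attributes such as rel and class come back as lists, and tag.get() returns None for a missing attribute where tag['...'] would raise KeyError.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from a real page.
html = """
<div id="main">
  <p class="title">Intro</p>
  <a href="/page/1" rel="next">Next</a>
  <a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')                  # every <a> tag, in document order
hrefs = [a['href'] for a in links]          # one attribute per tag
first_attrs = links[0].attrs                # all attributes as a dict
no_title = links[1].get('title')            # None instead of KeyError

title = soup.find('p', class_='title')
parent_id = title.parent['id']              # step up to the enclosing <div>

print(hrefs)
print(first_attrs)
print(parent_id)
```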

Retrieve elements

  • Initialize tree parser: soup = BeautifulSoup(html_doc, 'html.parser')
  • Get element(s): soup.body.div ...
  • View formatted element(s): soup.prettify(); soup.body.div.prettify()
  • Get element attribute: soup.body.div.a.get('href')
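
The dotted access shown above can be sketched as follows, using an invented two-div document. Dotted access only ever returns the first matching tag, and it returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Illustrative document with two sibling <div> elements.
html_doc = """
<html><body>
  <div><a href="https://example.com/one">One</a><p>First</p></div>
  <div><a href="https://example.com/two">Two</a></div>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

first_div = soup.body.div            # first <div> only, never the second
href = first_div.a.get('href')       # attribute lookup on the nested <a>
missing = soup.body.span             # no <span> exists, so this is None

print(first_div.prettify())          # indented, one tag per line
print(href)
```

Because a missing tag yields None, a longer chain such as soup.body.span.a raises AttributeError; find() or select_one() with an explicit None check is safer on pages whose structure is not guaranteed.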

Searching the tree:

  • .find(name, attrs, recursive, string): finds the first matching tag (returns None if nothing matches)
  • .find_all(name, attrs, recursive, string): finds all matching tags (returns a list, empty if nothing matches)
  • .select(css_selector): finds all tags matching a CSS selector (returns a list)
  • .select_one(css_selector): finds the first tag matching a CSS selector
  • .get_text(): extracts all text inside a tag, including the text of nested tags
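
A minimal sketch of how the search arguments above change the result, using an invented snippet with one nested div. The ids and class names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one <p> is nested a level deeper than the others.
html = """
<div id="outer">
  <p class="title">Title</p>
  <div id="inner"><p class="content">Nested</p></div>
  <p class="content">Content</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div', id='outer')

all_p = outer.find_all('p')                               # any depth
direct_p = outer.find_all('p', recursive=False)           # direct children only
by_attrs = soup.find_all('p', attrs={'class': 'content'}) # filter by attribute
by_string = soup.find('p', string='Nested')               # filter by exact text
css = soup.select('#outer > p.content')                   # CSS child combinator
one = soup.select_one('#inner p')                         # first CSS match
missing = soup.find('table')                              # no match

print([p.get_text() for p in all_p])
print([p.get_text() for p in direct_p])
print([p.get_text() for p in css])
```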

Modifying the tree:

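  • tag.string.replace_with('new text'): replace text inside a tag (tag.string is None when the tag has more than one child, so this only works on tags that contain a single string)
  • tag.insert(position, new_tag): insert new_tag as a child of tag at the given index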
  • tag.decompose(): remove a tag from the tree
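
The three modifications above can be sketched together on a small invented snippet. soup.new_tag() is used here to build the element to insert; the "ad" class is an assumption made for the example.

```python
from bs4 import BeautifulSoup

html = '<div><h1>Old Title</h1><p class="ad">Buy now</p><p>Body</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Replace the text of a tag that holds exactly one string.
soup.h1.string.replace_with('New Title')

# Build a new tag and insert it as the second child of <div> (index 1).
new_tag = soup.new_tag('span')
new_tag.string = 'inserted'
soup.div.insert(1, new_tag)

# Remove an unwanted tag, together with everything inside it.
soup.find('p', class_='ad').decompose()

result = str(soup)
print(result)
```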

Mini example:

  • from bs4 import BeautifulSoup
    
    html = '<div><p class="title">Title</p><p class="content">Content</p></div>'
    soup = BeautifulSoup(html, 'html.parser')
    
    title = soup.find('p', class_='title')     # find a <p> with class="title"
    print(title.text)                          # Output: Title
    
    content = soup.select_one('p.content')      # select using CSS selector
    print(content.get_text())                   # Output: Content

Soup HTML tree

  • .tag
    • Accessing a tag name as an attribute (e.g. soup.body) returns the first Tag object with that name
    • It can be chained to reach a specific tag by following its children (e.g. soup.body.div.p)
  • .contents vs .children
    • A tag's direct children are available in the .contents list. To iterate over them without building the list, use the .children generator instead.
  • .descendants
    • Recursively returns all the children and their children (all the sub-HTML trees) of the tag
  • .strings vs .stripped_strings
    • .strings is a generator that yields every string inside the tag (or the whole document, when called on the soup), including whitespace-only strings and strings nested within child tags
    • .stripped_strings yields the same strings with leading and trailing whitespace removed, skipping any that are empty after stripping.
  • .parent vs .parents
    • .parent returns the immediate parent of the current tag
    • .parents returns an iterator that allows iterating over all the parents of the current tag.
  • .next_sibling vs .previous_sibling
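    • .next_sibling returns the node that follows the current tag under the same parent; this is often a whitespace string rather than a tag
    • .previous_sibling returns the node that precedes the current tag under the same parent, again possibly a string.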
  • .next_element vs .previous_element
    • .next_element returns the next element in the parse tree after the current element
    • .previous_element returns the previous element in the parse tree before the current element.
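
The navigation attributes above can be compared side by side on one small invented snippet. The single space between the two p tags is deliberate: it shows up as a string node in .contents, .strings, and the sibling attributes. find_next_sibling() is not covered above; it is included only to show one way to skip string nodes.

```python
from bs4 import BeautifulSoup, Tag

html = '<div><p>One</p> <p>Two <b>bold</b></p></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div
first_p = div.p
bold = soup.b

# Direct children: a list (.contents) or a lazy iterator (.children).
n_contents = len(div.contents)                         # <p>, ' ', <p>
child_tags = [c.name for c in div.children if isinstance(c, Tag)]

# Everything below the tag, tags and strings alike.
n_descendants = len(list(div.descendants))

# Text nodes, with and without whitespace handling.
strings = list(div.strings)
stripped = list(div.stripped_strings)

# Upward navigation: one step, or all the way to the document root.
parent_name = bold.parent.name
ancestor_names = [p.name for p in bold.parents]

# Sideways navigation: the raw sibling here is the whitespace string.
raw_sibling = first_p.next_sibling
next_tag = first_p.find_next_sibling('p')              # skips string nodes

# Parse-order navigation: the next element can be the tag's own child.
after_first_p = first_p.next_element
before_bold = bold.previous_element

print(strings)
print(stripped)
print(ancestor_names)
```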

Scrapy

Selenium
