Getting Started with Web Scraping

Introduction to Web Scraping

Web scraping allows you to programmatically extract data from websites. This guide introduces HTML (the code behind webpages), common scraping techniques in Python, and a workflow for integrating generative AI into the process.


HTML Basics

Most websites are built with HTML (HyperText Markup Language), which structures the content you see on a webpage. Scraping involves parsing this HTML to locate and extract the information you want. HTML is made up of elements that act like building blocks, each wrapped in tags that describe their purpose and position on the page.

Key HTML Elements

  • <tag>...</tag>: The general form of an element; an opening tag and a matching closing tag wrap the content.
  • <div>: A container element often used for layout.
  • <table>: Defines a table.
  • <tr>: Table row.
  • <td>: Table data (cell).
  • <a>: Anchor tag (used for hyperlinks).
  • <span>, <p>, <h1>–<h6>: Inline text, paragraphs, and headings.

You can right-click on a webpage and select “Inspect” to open the browser’s Developer Tools and explore the HTML structure.
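
For example, a simplified page fragment using several of the elements above might look like this (an invented illustration, not taken from a real site):

<div class="results">
  <table>
    <tr>
      <td>Widget</td>
      <td><a href="https://example.com/widget">Details</a></td>
    </tr>
  </table>
  <p><span>Updated daily.</span></p>
</div>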


Requests

Requests is a Python library for sending HTTP requests and receiving responses. It is commonly used in web scraping to download HTML content from webpages.

In a typical scraping workflow, requests.get() retrieves a page by URL, and the response object provides access to the HTML (response.text), status code, headers, and more. This HTML can then be parsed with a tool like BeautifulSoup. In addition to retrieving data, requests.post() can be used to send data to a server—useful for submitting forms or interacting with APIs.

Common Methods

  • requests.get(url) – Sends a GET request to the specified URL and returns a response.
  • response.status_code – Returns the HTTP status code (e.g., 200 OK, 400 Bad Request, 403 Forbidden, 404 Not Found).
  • response.text – Returns the response content as a string (usually HTML).
  • response.content – Returns the response content as bytes.
  • response.headers – Returns the response headers as a dictionary.
  • response.json() – Parses the response as JSON (if applicable).
  • requests.post(url, data=...) – Sends a POST request with form data.

Basic requests example code can be found here.
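
As a minimal sketch (using the placeholder URL https://example.com), a basic GET request looks roughly like this:

import requests

url = "https://example.com"              # placeholder URL
response = requests.get(url)

print(response.status_code)              # e.g., 200 if the request succeeded
print(response.headers["Content-Type"])  # response headers behave like a dictionary
html = response.text                     # the page's HTML as a string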

Headers, Cookies, and Sessions

Many websites expect requests to include headers — especially a User-Agent — to identify the client making the request. Without these, a request might be blocked or return different content. Headers can be added using the headers argument in requests.get() or requests.post().

Cookies are small pieces of data that websites store in the browser to maintain sessions (e.g., for login). The requests library automatically handles cookies, but they can also be set manually.

To persist headers, cookies, and other settings across multiple requests, use a requests.Session() object. This is useful for scraping websites that require login or maintain state.
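
A sketch of these ideas (the header value, cookie, and URLs below are placeholders, not requirements of any particular site):

import requests

headers = {"User-Agent": "Mozilla/5.0 (compatible; lab-scraper/0.1)"}  # placeholder User-Agent

# One-off request with custom headers and a manually set cookie
response = requests.get("https://example.com", headers=headers,
                        cookies={"session_id": "abc123"})              # placeholder cookie

# A Session persists headers and cookies across requests
session = requests.Session()
session.headers.update(headers)
login_page = session.get("https://example.com/login")   # placeholder URL
data_page = session.get("https://example.com/data")     # reuses the same cookies automatically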

Handling Errors and Timeouts

When making HTTP requests, it's common to encounter issues such as broken links, server errors, or long delays. The requests library provides tools to handle these situations gracefully.

Use a try/except block to catch exceptions such as connection errors, timeouts, or invalid responses. The raise_for_status() method raises an exception for HTTP error codes (like 403 or 500) so they can be handled explicitly, and the timeout parameter prevents requests from hanging indefinitely.
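
For instance, a request wrapped in basic error handling might look like this (the URL and timeout value are illustrative):

import requests

try:
    response = requests.get("https://example.com", timeout=10)  # give up after 10 seconds
    response.raise_for_status()         # raises requests.HTTPError for 4xx/5xx status codes
except requests.exceptions.Timeout:
    print("The request timed out.")
except requests.exceptions.HTTPError as err:
    print(f"HTTP error: {err}")
except requests.exceptions.RequestException as err:
    print(f"Request failed: {err}")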

Example code with error handling, headers, and sessions can be found here.


BeautifulSoup

BeautifulSoup is a Python library that parses HTML or XML and turns it into a searchable tree structure. It provides tools to find tags, extract text, and retrieve attributes using simple Python commands. BeautifulSoup is commonly used to navigate a webpage’s structure and locate specific elements such as headlines, links, or table rows.

In a typical data scraping workflow, the process often starts by using a tool like requests to download the HTML content of a webpage. That HTML can then be passed to BeautifulSoup to enable structured searching and extraction. Once the desired data is extracted, it can be formatted and saved to a CSV file or another storage format.
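
A rough sketch of that workflow, assuming a hypothetical page whose links we want to save (the URL and output filename are placeholders):

import csv
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")        # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text and target of every link on the page
rows = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])
    writer.writerows(rows)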

Common Methods

  • soup.find(tag) – Finds the first occurrence of the specified tag.
  • soup.find_all(tag) – Finds all occurrences of the specified tag.
  • soup.select(css_selector) – Finds elements using CSS selectors.
  • soup.get_text() – Extracts all visible text from the HTML document.
  • element.text – Gets the text inside a specific element.
  • element['href'] – Gets the value of an attribute (e.g., link URL).
  • element.get('class') – Safely gets the value of the class attribute.
  • element.name – Returns the name of the tag (e.g., "a" or "div").
  • element.attrs – Returns all attributes of an element as a dictionary.
  • soup.prettify() – Formats the HTML with indentation for easier reading.

Example code using BeautifulSoup can be found here.
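
As an illustration of the methods above, here is a small sketch that parses an inline HTML snippet (invented for this example) rather than a live page:

from bs4 import BeautifulSoup

html = """
<div class="results">
  <h1>Products</h1>
  <table>
    <tr><td>Widget</td><td><a href="/widget" class="link">Details</a></td></tr>
    <tr><td>Gadget</td><td><a href="/gadget" class="link">Details</a></td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1")                  # first <h1> element
print(title.text)                        # "Products"

for link in soup.find_all("a"):          # every <a> element
    print(link.name, link["href"], link.get("class"), link.attrs)

cells = soup.select("table td")          # CSS selector for all table cells
print([td.get_text() for td in cells])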


Selenium

Selenium is a Python library that automates browser actions like clicking, scrolling, and form submission. It controls a real or headless browser through a WebDriver, allowing interaction with JavaScript-rendered pages that can't be accessed with requests alone.

In a data scraping workflow, Selenium is typically used when the target webpage relies heavily on JavaScript to load content. A WebDriver (e.g., ChromeDriver or GeckoDriver) launches a browser session, and Selenium commands can simulate user behavior to fully load and extract the desired data. Once the page is rendered, the page source can be passed to BeautifulSoup for parsing.

Common Methods

  • webdriver.Chrome() – Launches a new Chrome browser session.
  • driver.get(url) – Loads the specified URL in the browser.
  • driver.page_source – Returns the full HTML of the loaded page.
  • driver.find_element(By.ID, ...) – Locates a single element by ID.
  • driver.find_elements(By.CLASS_NAME, ...) – Locates multiple elements by class name.
  • element.click() – Clicks an element.
  • element.text – Extracts the visible text of an element.
  • element.get_attribute('href') – Gets an attribute from an element.
  • driver.quit() – Closes the browser session.

When to Use Selenium vs. Requests + BeautifulSoup

  • Use requests + BeautifulSoup for static pages where content is directly available in the HTML source.
  • Use Selenium when content is loaded dynamically via JavaScript, such as after scrolling, clicking, or waiting for API responses.
  • Selenium is slower and more resource-intensive but essential for sites that require clicking or scrolling.
  • A common pattern is to use Selenium to load the page, then pass driver.page_source to BeautifulSoup for parsing.

Example code using Selenium can be found here.
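
A minimal sketch of that pattern, assuming Chrome is installed and using a placeholder URL and element ID:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()                          # launches a Chrome session
driver.get("https://example.com")                    # placeholder URL

# Interact with the page, e.g., click a hypothetical "Load more" button if present
buttons = driver.find_elements(By.ID, "load-more")   # placeholder element ID
if buttons:
    buttons[0].click()

# Hand the rendered HTML to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
heading = soup.find("h1")
print(heading.text if heading else "no <h1> found")

driver.quit()                                        # always close the browser session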


Installation

Each of the libraries used in this guide can be installed using pip, which comes with most Python installations. Open a terminal and run the following commands:

Install Required Libraries

pip install requests
pip install beautifulsoup4
pip install lxml html5lib
pip install selenium
  • requests – For sending HTTP requests and retrieving HTML content.
  • beautifulsoup4 – For parsing and navigating HTML documents.
  • lxml, html5lib – Optional HTML parsers used by BeautifulSoup (recommended for better performance and compatibility).
  • selenium – For automating browser interactions and scraping JavaScript-rendered content.

Install WebDriver for Selenium

Selenium requires a WebDriver to control a browser. Download the appropriate driver for your browser and make sure it is installed in a known location or added to your system PATH:

  • ChromeDriver – for Google Chrome.
  • GeckoDriver – for Mozilla Firefox.

Platform Notes

  • Windows: Use Command Prompt or PowerShell to run the above commands.
  • macOS: Use the built-in Terminal app.

If pip is not recognized, make sure Python and pip are correctly installed and added to your system PATH.


Additional Tips for Responsible Scraping

  • Check the site’s robots.txt file to see what’s allowed: https://example.com/robots.txt
  • Add delays between requests (e.g., with time.sleep()) so you don’t overload servers.
  • Avoid scraping personal or sensitive data.

Using Generative AI for Data Scraping

In practice, generative AI can be immensely helpful in writing data-scraping scripts. The following workflow may be useful when using AI to develop code for web scraping:

  1. Outline the goal of the script: which webpages need to be visited, what data needs to be scraped, and how you'd like to store the outputs.
  2. Enter this outline into a generative AI model, along with all relevant HTML code and links to webpages. When copying HTML, use "Copy outerHTML" to ensure you capture all nested elements. Ask the AI to write the script.
  3. Test the script and see if it works.
  4. If the script doesn't work, inspect the sections causing the issue and use AI to troubleshoot by sending it smaller snippets of HTML code relevant to each specific section.