# Introduction
Scraper provides various interfaces to scrape, trace, and parse data from the internet.
💡 Before using this package, review the website's terms of use, robots.txt, and any similar policies that govern scraping its data.
⛁ To help avoid hitting rate limits imposed by the website being scraped, the Scrapable interface provides caching methods to store and retrieve the scraped content.
👉 Either way, it is highly recommended to cache scraped data to prevent frequent visits to the source URI.
Install the library using Composer:

```sh
composer require thewebsolver/scraper
```
Besides the Scrapable interface, this library also provides interfaces to trace and infer data from a DOM node or a plain string. One of them is the Table Tracer, which traces a table structure and infers data from its child structures.
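The kind of traversal a table tracer abstracts can be pictured with PHP's built-in DOM API. The sketch below is not the library's implementation, only a plain-PHP illustration of walking a table's child structures:

```php
<?php
// Plain-PHP illustration of what a table tracer does conceptually;
// the library's Table Tracer interface abstracts traversal like this.
$html = '<table><tr><td>Nepal</td><td>Kathmandu</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML( $html );

$row = [];
foreach ( $dom->getElementsByTagName( 'td' ) as $cell ) {
	$row[] = $cell->textContent; // Infer data from each <td> child.
}

print_r( $row ); // Array ( [0] => Nepal [1] => Kathmandu )
```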
To support maximum interoperability, this library follows the Interface Segregation principle, leaving developers free to implement concrete classes however they want.
To scrape data from a URL, the concrete class must use the ScrapeFrom attribute to define the source URL and the cache filename (with extension; `.html` is recommended).
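The snippet below sketches how such an attribute might be applied. The namespace and constructor parameters are assumptions made for illustration; check the package source for the actual signature:

```php
<?php
// NOTE: namespace and parameter names below are assumed, not verbatim API.
use TheWebSolver\Scraper\Attributes\ScrapeFrom;

// Declares the source URL and the cache filename (.html recommended).
#[ScrapeFrom( url: 'https://example.com/countries', filename: 'countries.html' )]
class CountryScraper /* implements Scrapable */ {
	// ...scraping and cache store/retrieve methods go here.
}
```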
For tracing content that yields a collection of datasets, such as table columns (`<td>`), list items (`<li>`), etc., the CollectFrom attribute may be used to provide indices for mapping the collected data.
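A hypothetical use of this attribute might look like the following, where each index names the position of a column in the traced collection. Again, the namespace and attribute signature are illustrative assumptions:

```php
<?php
// NOTE: namespace and signature below are assumed for illustration only.
use TheWebSolver\Scraper\Attributes\CollectFrom;

// Each index maps a position in the collected <td> values to a named key,
// e.g. ['Nepal', 'Kathmandu'] becomes ['name' => 'Nepal', 'capital' => 'Kathmandu'].
#[CollectFrom( 'name', 'capital' )]
class CountryTableTracer {
	// ...table-tracing implementation.
}
```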
When tracing data, transformers may be used to transform the data currently being traced. Transformers are scope-aware, meaning a transformer can only be used within the object scope where a certain behavior is needed.
Besides the base transformers, aptly named Marshaller (inside the Marshaller directory), other transformers that extend a marshaller's behavior live inside the following directories:
- Decorator: accepts a base transformer and attaches additional behavior to it
- Proxy: uses one or more transformers to handle complex transformations on its own
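To make the Decorator idea concrete, here is a self-contained sketch. The `Transformer` interface and the class names are invented for illustration; they do not mirror the package's actual contracts:

```php
<?php
// Illustrative-only contract; the package's real transformer interface may differ.
interface Transformer {
	public function transform( string $content ): string;
}

// A base transformer (a "marshaller" in this library's terms).
class TrimMarshaller implements Transformer {
	public function transform( string $content ): string {
		return trim( $content );
	}
}

// Decorator: wraps a base transformer to attach additional behavior.
class LowercaseDecorator implements Transformer {
	public function __construct( private Transformer $inner ) {}

	public function transform( string $content ): string {
		return strtolower( $this->inner->transform( $content ) );
	}
}

echo ( new LowercaseDecorator( new TrimMarshaller() ) )->transform( '  HELLO  ' ); // hello
```

A Proxy would differ in that it composes one or more transformers internally and decides on its own how to orchestrate them, rather than simply wrapping a single one.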