Architecture - itsManeka/amazing-scraper GitHub Wiki

Architecture

The library follows Clean Architecture principles: the domain has no external dependencies, ports define interfaces, and infrastructure adapters implement them. Every dependency is injected through a constructor.
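As a rough illustration of the port/adapter split, the sketch below wires a use case to two ports through its constructor. The port methods (`get`, `extractTitle`) and the return shape are simplified assumptions made for this example, not the library's actual signatures.

```typescript
// Hypothetical, simplified ports; the real interfaces in application/ports may differ.
interface HttpClient {
  get(url: string): Promise<string>;
}

interface HtmlParser {
  extractTitle(html: string): string;
}

// Use case: depends only on the port interfaces, never on axios or cheerio directly.
class FetchProduct {
  private readonly http: HttpClient;
  private readonly parser: HtmlParser;

  constructor(http: HttpClient, parser: HtmlParser) {
    this.http = http;
    this.parser = parser;
  }

  async execute(asin: string): Promise<{ asin: string; title: string }> {
    const html = await this.http.get(`/dp/${asin}`);
    return { asin, title: this.parser.extractTitle(html) };
  }
}

// In-memory stand-ins for the infrastructure adapters, useful in unit tests.
const fakeHttp: HttpClient = {
  get: async (url: string) => `<h1>Item at ${url}</h1>`,
};
const fakeParser: HtmlParser = {
  // Naive tag stripping, good enough for the fake HTML above.
  extractTitle: (html: string) => html.replace(/<[^>]+>/g, ""),
};

const fetchProduct = new FetchProduct(fakeHttp, fakeParser);
```

Because the use case only sees the interfaces, swapping the real adapters for fakes like these requires no change to the use case itself.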

Module Structure

src/
  domain/
    entities/          Product, CouponInfo, CouponResult, CouponMetadata, FetchPreSalesResult
    errors/            ScraperError
  application/
    ports/             HttpClient, HtmlParser, Logger, RetryPolicy, UserAgentProvider
    use-cases/         FetchProduct, ExtractCouponProducts, FetchPreSales
  infrastructure/
    http/              AxiosHttpClient (axios + tough-cookie), RotatingUserAgentProvider
    parsers/           CheerioHtmlParser (cheerio)
    logger/            ConsoleLogger
    retry/             ExponentialBackoffRetry
  index.ts             Public API and factory (createScraper)

Layers

Domain

Pure entities and error types with no external dependencies. Defines the core data structures (Product, CouponInfo, CouponResult, CouponMetadata, FetchPreSalesResult) and error codes (ScraperError).

Application

Use cases that orchestrate business logic, together with the port interfaces that define contracts for infrastructure adapters:

FetchProduct — fetches a single product page and extracts structured data
ExtractCouponProducts — paginates through coupon promotion API to collect all participating products
FetchPreSales — paginates through HQ & Manga pre-sale search pages to collect ASINs
Ports — HttpClient, HtmlParser, Logger, RetryPolicy, UserAgentProvider

Infrastructure

Concrete implementations of the port interfaces:

AxiosHttpClient — HTTP client with cookie jar support (axios + tough-cookie)
CheerioHtmlParser — HTML parsing and data extraction (cheerio)
ConsoleLogger — default logger implementation
ExponentialBackoffRetry — retry policy with exponential backoff
RotatingUserAgentProvider — rotates browser User-Agent strings
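To make the retry adapter more concrete, here is a minimal sketch of an exponential backoff policy. The `RetryPolicy` shape, the method names, and the default values are assumptions chosen for illustration; the library's own port and defaults may differ.

```typescript
// Hypothetical port shape; the real RetryPolicy interface may differ.
interface RetryPolicy {
  shouldRetry(attempt: number): boolean;
  delayMs(attempt: number): number;
}

class ExponentialBackoffRetry implements RetryPolicy {
  private readonly maxRetries: number;
  private readonly baseDelayMs: number;
  private readonly maxDelayMs: number;

  // Default values are illustrative only.
  constructor(maxRetries = 3, baseDelayMs = 500, maxDelayMs = 8000) {
    this.maxRetries = maxRetries;
    this.baseDelayMs = baseDelayMs;
    this.maxDelayMs = maxDelayMs;
  }

  // attempt is zero-based: 0 means the first retry.
  shouldRetry(attempt: number): boolean {
    return attempt < this.maxRetries;
  }

  // Doubles the delay on each attempt and caps it at maxDelayMs.
  delayMs(attempt: number): number {
    return Math.min(this.baseDelayMs * Math.pow(2, attempt), this.maxDelayMs);
  }
}

const policy = new ExponentialBackoffRetry();
const delays = [0, 1, 2, 3, 4, 5].map((n) => policy.delayMs(n));
// delays: 500, 1000, 2000, 4000, 8000, 8000
```

The cap keeps a long run of failures from producing unbounded waits; adding random jitter on top is a common refinement that this sketch leaves out.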

Data Flow

fetchProduct

GET /dp/{ASIN} with browser-like headers
Parse HTML for title, price, stock status, coupon link, and more
Return ProductPage (includes couponInfo when coupon is detected)

extractCouponProducts

GET coupon page, extract anti-CSRF token and metadata
POST to /promotion/psp/productInfoList with pagination
Deduplicate ASINs and guard against infinite loops
Return CouponResult with all products and metadata
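The pagination, deduplication, and loop-guard steps above can be sketched as follows. The page shape (`asins`, `nextSortId`) and the `fetchPage` callback are hypothetical stand-ins for the POST to /promotion/psp/productInfoList; the sketch assumes the loop guard works by detecting a cursor that fails to advance.

```typescript
// Hypothetical page shape; the real productInfoList response is not described here.
interface CouponPage {
  asins: string[];
  nextSortId: string | null; // cursor for the next page, null when exhausted
}

type FetchPage = (sortId: string | null) => Promise<CouponPage>;

async function collectCouponAsins(
  fetchPage: FetchPage,
  maxProducts = 1000,
): Promise<string[]> {
  const seen = new Set<string>();
  let sortId: string | null = null;

  while (seen.size < maxProducts) {
    const page: CouponPage = await fetchPage(sortId);

    for (const asin of page.asins) {
      if (seen.size >= maxProducts) break;
      seen.add(asin); // the Set deduplicates ASINs across pages
    }

    // Loop guard: stop when there is no cursor or the cursor did not advance.
    if (page.nextSortId === null || page.nextSortId === sortId) break;
    sortId = page.nextSortId;
  }

  return Array.from(seen);
}
```

Passing the page fetcher in as a function keeps the loop testable without any network access.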

fetchPreSales

Build search URL for HQ & Manga pre-sales category
Extract data-asin from search result elements
Paginate with random delays between requests
Stop on: page limit, empty results, stop-ASIN sentinel, or no next page
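The four stop conditions can be expressed as a small pure function, sketched below. The `PageOutcome` shape, the reason labels, and the order in which the conditions are checked are assumptions made for this example.

```typescript
type StopReason = "page-limit" | "empty-results" | "stop-asin" | "no-next-page";

// Hypothetical snapshot of the pagination state after one search page is parsed.
interface PageOutcome {
  pageNumber: number; // 1-based index of the page just processed
  asins: string[]; // data-asin values found on that page
  hasNextPage: boolean; // whether the page links to a following page
}

// Returns the reason to stop, or null to keep paginating.
function stopReason(
  outcome: PageOutcome,
  maxPages: number,
  stopAsin?: string,
): StopReason | null {
  if (outcome.asins.length === 0) return "empty-results";
  if (stopAsin !== undefined && outcome.asins.indexOf(stopAsin) !== -1) {
    return "stop-asin";
  }
  if (outcome.pageNumber >= maxPages) return "page-limit";
  if (!outcome.hasNextPage) return "no-next-page";
  return null;
}
```

A caller would evaluate this after each page and wait for the random delay only when the result is null.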

Built-in Protections

Random delay between requests (configurable)
CAPTCHA detection (three markers checked in the response body)
403 retry with backoff on initial page
Session refresh on 403 during pagination
Infinite pagination loop guard via sortId comparison
ASIN deduplication across pages
Configurable maxProducts (1000) and maxPages (500) limits
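Two of these protections are simple enough to sketch in a few lines. The function names, parameters, and marker strings below are placeholders invented for this example; in particular, the library's three actual CAPTCHA markers are not listed on this page.

```typescript
// Uniform random delay in [minMs, maxMs]; rng is injectable so tests can be deterministic.
function randomDelayMs(
  minMs: number,
  maxMs: number,
  rng: () => number = Math.random,
): number {
  return Math.floor(minMs + rng() * (maxMs - minMs + 1));
}

// Resolves after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });
}

// Placeholder markers for illustration only, not the library's real ones.
const EXAMPLE_CAPTCHA_MARKERS = [
  "captcha-marker-a",
  "captcha-marker-b",
  "captcha-marker-c",
];

// Case-insensitive substring check of the response body against each marker.
function looksLikeCaptcha(
  body: string,
  markers: string[] = EXAMPLE_CAPTCHA_MARKERS,
): boolean {
  const lower = body.toLowerCase();
  return markers.some((m) => lower.indexOf(m.toLowerCase()) !== -1);
}
```

Between requests, a caller could use `await sleep(randomDelayMs(1000, 2000))` and treat a positive `looksLikeCaptcha` result as a signal to abort rather than parse the page.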