Managing Hubstorage Crawl Frontiers - scrapinghub/shub-workflow GitHub Wiki

Previous Chapter: Graph Managers


Adding stability, crawl resumability and traceability with a crawl frontier

For big crawls, a single spider job is usually problematic. If the spider stops for any reason, you have to recrawl from zero, losing time and wasting resources. If you are reading this documentation, I don't have to explain the pain that interruptions cause in huge crawls run as a single job: you are most probably here because you are looking for a better approach for your project.

A crawl frontier is a store of requests that can be filled and consumed progressively by different processes. ScrapyCloud provides a native crawl frontier, Hubstorage Crawl Frontier (HCF).
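To make the idea concrete, here is a minimal in-memory sketch of a crawl frontier. It is a toy model only: the `ToyFrontier` class, its method names, the batch size, and the duplicate filtering are all assumptions made for illustration, not the HCF API. It shows the property that matters for resumability: a consumer reads a batch of requests, and the batch stays in the store until the consumer explicitly deletes it after processing.

```python
class ToyFrontier:
    """In-memory stand-in for a crawl frontier (hypothetical, not the HCF API)."""

    def __init__(self, batch_size=2):
        self.batch_size = batch_size
        self._seen = set()      # URLs already added, used to skip duplicates
        self._pending = []      # URLs not yet grouped into a batch
        self._batches = {}      # batch id -> list of URLs, in insertion order
        self._next_id = 0

    def add(self, url):
        """Producer side: store a request once; return False for duplicates."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        if len(self._pending) == self.batch_size:
            self.flush()
        return True

    def flush(self):
        """Turn any pending requests into a readable batch."""
        if self._pending:
            self._batches[self._next_id] = self._pending
            self._next_id += 1
            self._pending = []

    def read(self, limit=1):
        """Consumer side: return up to `limit` batches WITHOUT removing them."""
        return list(self._batches.items())[:limit]

    def delete(self, batch_id):
        """Acknowledge a batch once all of its requests were processed."""
        self._batches.pop(batch_id, None)


frontier = ToyFrontier(batch_size=2)
for url in ["https://example.com/a", "https://example.com/b",
            "https://example.com/a", "https://example.com/c"]:
    frontier.add(url)
frontier.flush()

# A consumer that stops before deleting its batch loses nothing:
# the next read returns the same batch again.
interrupted_read = frontier.read()
resumed_read = frontier.read()

# A consumer that finishes its work deletes each batch after processing it.
processed = []
while True:
    batches = frontier.read()
    if not batches:
        break
    for batch_id, urls in batches:
        processed.extend(urls)
        frontier.delete(batch_id)
```

Because reading and deleting are separate steps, a consumer job that is stopped halfway can simply be rescheduled and will pick up the unacknowledged batch, instead of the whole crawl restarting from zero.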

Several related libraries are frequently used together with shub-workflow, because Scrapy spider workflows usually rely on HCF capabilities: Frontera, hcf-backend and scrapy-frontera.

Moreover, hcf-backend provides a crawl manager subclassed from the shub-workflow base crawl manager class. It facilitates the scheduling of consumer spiders (spiders that consume requests from a frontier) and can be one task of a workflow. This tutorial also shows how to use these libraries.

However, you can use HCF without Frontera, scrapy-frontera or hcf-backend at all. In many situations that stack adds too much complexity, and it is not well suited to every case.

In addition, workflows defined with shub-workflow are not limited to HCF. Any storage technology can be used and mixed. In practice, shub-workflow is used to coordinate workflow pipelines of spiders and post-processing scripts running on ScrapyCloud, with storage technologies like S3 or GCS handling massive data exchange between them. The library also provides utilities for working conveniently with those technologies in the context of the workflow pipelines built with it.

So, in the following chapters we will explain different ways to use shub-workflow to control a workflow that relies on HCF.

The first one uses the Frontera suite (Frontera + hcf-backend + scrapy-frontera):

Managing Hubstorage Crawl Frontiers with Frontera

The modern, recommended approach is described starting with the following document:

Managing Hubstorage Crawl Frontiers ‐ The Modern way