Managing Hubstorage Crawl Frontiers with Frontera - scrapinghub/shub-workflow GitHub Wiki

Previous Chapter: Managing Hubstorage Crawl Frontiers

Prerequisites

pip install hcf-backend
pip install scrapy-frontera

Generalities

As mentioned earlier, we provide an HCF backend for Frontera, which in turn is a framework for developing crawl frontier handlers (backends). Frontera establishes both a common protocol and a common interface, and provides some extra facilities. For scrapy spiders, the equation is completed with scrapy-frontera, which provides a scrapy scheduler that supports Frontera without losing any native behavior of the original scrapy scheduler, so migrating a scrapy spider to Frontera becomes painless.

When working with spiders we can identify two distinct roles, a producer role and a consumer role, depending on whether the spider writes requests to a requests queue or reads them from it in order to actually process them. Usual scrapy spiders assume both roles at the same time, transparently: the same spider writes requests to a local memory/disk queue, and later reads them from it. We don't usually think of a scrapy spider as a producer/consumer because it natively works in both ways at the same time. But that is the logic behind the native scrapy scheduler, which takes care of:

Queuing requests coming from the spider, by default into a dual memory/disk queue.
Dequeuing requests into the downloader, under specific prioritization rules.
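
The two responsibilities listed above can be pictured with a toy in-memory queue. This is only an illustrative sketch, not Scrapy's scheduler: the class `ToyScheduler` is hypothetical, it works on plain URL strings, and it ignores the disk queue and deduplication that the real scheduler handles.

```python
import heapq
import itertools


class ToyScheduler:
    """Hypothetical stand-in for a scheduler; not Scrapy's implementation."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so requests with equal priority leave in insertion order.
        self._counter = itertools.count()

    def enqueue_request(self, url, priority=0):
        # Producer role: the spider pushes a request into the queue.
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def next_request(self):
        # Consumer role: the downloader pulls the next request, or None if empty.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue_request("https://example.com/a")
scheduler.enqueue_request("https://example.com/b", priority=10)
first = scheduler.next_request()
```

The same object is written to and read from, which is why a plain scrapy spider is producer and consumer at once.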

On the other hand, the scheduler implemented in the scrapy-frontera package, which is a subclass of the scrapy scheduler, provides an interface to Frontera and allows specific requests to be sent to the Frontera backend, either from spider code or through project settings. In this way, the Frontera backend takes responsibility for queuing, dequeuing and prioritizing the requests sent to or read from it, while the remaining requests follow the usual flow within Scrapy. The scheduler still asks the Frontera backend for requests, adapts them and puts them into its local queues.

So, when using an external frontier like HCF, we can separate the producer and consumer roles into different spiders, which makes this division of roles more evident. Whether we separate the roles or not depends on our specific implementation. As we will see, a scrapy spider can be adapted to work with HCF with minimal effort, both as producer and as consumer.

Each ScrapyCloud project has its own HCF space, separated from the rest. Within a project, HCF can be subdivided via frontier names, which in turn can be subdivided into slots via slot names. Frontier names and slot names are created on the fly by the producer agents. Urls in HCF are organized into batches. All urls of a batch are read and deleted as a block by the consumer. The number of urls per batch is set by the producer, but is limited to 100.
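
As a rough illustration of the batch limit, the following sketch splits a list of urls into batches of at most 100. The helper `make_batches` is hypothetical; it only models the cap described above and says nothing about how hcf-backend builds its batches internally.

```python
def make_batches(urls, batch_size=100):
    """Split urls into consecutive batches of at most batch_size items."""
    # The upper bound of 100 is the HCF limit mentioned in the text.
    if not 1 <= batch_size <= 100:
        raise ValueError("batch_size must be between 1 and 100")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


urls = [f"https://example.com/page/{n}" for n in range(250)]
batches = make_batches(urls)
batch_sizes = [len(batch) for batch in batches]
```

With 250 urls this yields two full batches and a final partial one, each of which a consumer would read and delete as a unit.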

An essential feature of HCF is request deduplication: any request sent to a given slot that has the same fingerprint as a request previously sent to the same slot will be filtered out. HCF ensures that requests with the same fingerprint are sent to the same slot. Also, scrapy-frontera ensures that the fingerprint scheme used is the same one that scrapy uses internally, so deduplication in HCF works exactly as you would expect in a spider that does not use HCF. Moreover, you can use the dont_filter=True flag in your requests, and the request will not be deduplicated. The scrapy-frontera adaptation takes care of that by generating a random fingerprint in this case.
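
The deduplication rule can be sketched with a toy slot that remembers fingerprints. Everything here is an assumption made for illustration: `toy_fingerprint` hashes only the url, while scrapy's real fingerprint considers more of the request, and `ToySlot` is a hypothetical in-memory object, not an HCF client.

```python
import hashlib
import uuid


def toy_fingerprint(url, dont_filter=False):
    # Assumption: a SHA-1 of the url stands in for scrapy's request fingerprint.
    if dont_filter:
        # A random fingerprint never matches an earlier one, so the request survives.
        return uuid.uuid4().hex
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class ToySlot:
    """Hypothetical slot that remembers every fingerprint it has been sent."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def add(self, url, dont_filter=False):
        fingerprint = toy_fingerprint(url, dont_filter)
        if fingerprint in self.seen:
            return False  # duplicate: filtered out
        self.seen.add(fingerprint)
        self.queue.append(url)
        return True


slot = ToySlot()
results = [
    slot.add("https://example.com/a"),
    slot.add("https://example.com/a"),
    slot.add("https://example.com/a", dont_filter=True),
]
```

The second call is rejected as a duplicate, while the third is accepted because its random fingerprint cannot match the stored one.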

Another important feature of HCF is that writing to a specific slot must be done without concurrency. That is, only one producer at a time can write batches to a given slot. This restriction improves HCF performance, but imposes some limitations that have to be considered when designing the project. In most cases, it is enough to have a single producer instance writing to all slots, and multiple consumer instances in parallel, each one consuming from a different slot. If you need multiple producer instances in parallel, you could enforce that each one sends only the urls assigned to a specific slot and ignores the rest. This will, however, result in missing urls that were only seen by a single producer, so a working implementation needs to be more elaborate, for example by sending all urls to a queue processed by a single high-throughput pipeline process, and delegating the writing to HCF to it.

Whether this approach leads to better or worse performance than having a single producer will ultimately depend on the specific application. Most probably, unless you need to implement a very high throughput broad crawler, the single producer approach is faster and uses fewer resources.
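
The "one producer per slot" idea discussed above can be sketched as follows. The mapping used here (SHA-1 of the url modulo the number of slots) is an assumption chosen for the example; the real assignment used by hcf-backend may differ, and both function names are hypothetical.

```python
import hashlib


def slot_for(url, number_of_slots):
    # Assumption: hash the url and take it modulo the slot count, so the same
    # url always maps to the same slot.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % number_of_slots


def urls_for_producer(urls, producer_slot, number_of_slots):
    # Each producer keeps only the urls assigned to its own slot and drops the rest.
    return [url for url in urls if slot_for(url, number_of_slots) == producer_slot]


urls = [f"https://example.com/item/{n}" for n in range(20)]
partitions = [urls_for_producer(urls, slot, 4) for slot in range(4)]
```

In this example every producer sees the full url list, so nothing is lost. The problem described above appears when a url is discovered by only one producer and belongs to a slot owned by another: the discovering producer drops it and no one else ever writes it.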

hcf-backend comes with a handy tool for managing (deleting, listing, dumping, counting, etc) HCF objects: hcfpal.py:

python -m hcf_backend.utils.hcfpal

If you need to make it available on a project deployed on ScrapyCloud you need, as usual, to define your own script, for example scripts/hcfpal.py:

import logging

from hcf_backend.utils.hcfpal import HCFPalScript

if __name__ == '__main__':
    from shub_workflow.utils import get_kumo_loglevel
    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())

    script = HCFPalScript()
    script.run()

See the command line help for usage instructions.

Preparing a Scrapy project for a crawl frontier with `scrapy-frontera`

The first step is to add scrapy-frontera and hcf-backend in your project requirements. Then, configure your project to work with scrapy-frontera:

From the scrapy-frontera README, this is the basic configuration you need to add to your project's settings.py file in order to replace the native Scrapy scheduler with the scrapy-frontera one:

# shut up frontera DEBUG flooding
import logging
logging.getLogger("manager.components").setLevel(logging.INFO)
logging.getLogger("manager").setLevel(logging.INFO)
logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)

SCHEDULER = 'scrapy_frontera.scheduler.FronteraScheduler'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_frontera.middlewares.SchedulerDownloaderMiddleware': 0,
}

SPIDER_MIDDLEWARES = {
    'scrapy_frontera.middlewares.SchedulerSpiderMiddleware': 0,
}

These changes alone will not affect the behavior of a scrapy project, but they make it possible to implement spiders that work with HCF.

Producers and consumers need to be configured with different frontera and scrapy settings. So depending on your needs and taste you have two alternatives:

Method 1: Have two separate spiders, one for producer and another for consumer.
Method 2: Have a single spider for both, configuring the settings at runtime.

Method 1 could be enough for simple cases with a simple workflow. Method 2 allows you to adapt an already developed spider without changing a single line of spider code. It works like plugging an existing spider into a frontier workflow system, but it requires the help of extra scripts. Here we will explain the first method. In the next chapter we will explain the second one.

Method 1: Separated producer and consumer

In this example we will implement two spiders, a producer and a consumer. The producer will crawl the target site, extract links and send them to the frontier. The consumer will read from the frontier and scrape the provided links.

The producer:

from scrapy import Request, Spider


class MySiteLinksSpider(Spider):

    name = 'mysite.com-links'

    frontera_settings = {
        "BACKEND": "hcf_backend.HCFBackend",
        'HCF_PRODUCER_FRONTIER': 'mysite-articles-frontier',
        'HCF_PRODUCER_SLOT_PREFIX': 'test',
        'HCF_PRODUCER_NUMBER_OF_SLOTS': 8,
    }

    custom_settings = {
        # Requests whose callback is named here are sent to the frontier.
        'FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER': ['parse_article'],
    }

    def parse_article(self, response):
        # Placeholder: the real implementation lives in the consumer spider.
        pass

    def parse(self, response):
        # (... link discovery logic ...)
        yield Request(..., callback=self.parse_article)

The consumer:

from scrapy import Spider


class MySiteArticlesSpider(Spider):
    name = 'mysite.com-articles'
    frontera_settings = {
        "BACKEND": "hcf_backend.HCFBackend",
        'HCF_CONSUMER_FRONTIER': 'mysite-articles-frontier',
        'HCF_CONSUMER_MAX_BATCHES': 50,
    }

    def parse_article(self, response):
        # (... actual implementation of the article parsing method ...)
        pass

    (...)

Let's review frontera settings involved in the example:

BACKEND - Sets frontera backend class. In these examples we will always use hcf_backend.HCFBackend
HCF_PRODUCER_FRONTIER and HCF_CONSUMER_FRONTIER - Set the HCF frontier name for producer and consumer. For a producer and a consumer that work together, it must be the same.
HCF_PRODUCER_SLOT_PREFIX - Sets the prefix of the slots that will be generated. If not provided, it is '' by default.
HCF_PRODUCER_NUMBER_OF_SLOTS - Sets the number of slots. Default value is 1. Slots generated by the producer will have names ranged from {slot prefix}0 to {slot prefix}{number of slots - 1}
HCF_CONSUMER_MAX_BATCHES - Sets the limit of frontier batches that the consumer will process.
HCF_CONSUMER_SLOT - Sets the slot from which the consumer will read frontier batches.
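
The slot naming rule described for HCF_PRODUCER_SLOT_PREFIX and HCF_PRODUCER_NUMBER_OF_SLOTS can be written down in a few lines. The helper `slot_names` is hypothetical and only reproduces the naming scheme stated above; it does not call HCF.

```python
def slot_names(prefix="", number_of_slots=1):
    # Defaults mirror the documented ones: empty prefix and a single slot.
    # Names range from {prefix}0 to {prefix}{number_of_slots - 1}.
    return [f"{prefix}{index}" for index in range(number_of_slots)]


# Settings used by the producer in this chapter.
names = slot_names(prefix="test", number_of_slots=8)
```

For the producer above, this gives the slots test0 to test7, which are the values a consumer can receive in HCF_CONSUMER_SLOT.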

Notice that the consumer code does not set HCF_CONSUMER_SLOT anywhere. Hardcoding it would make sense if we had only one slot. However, if we want to run several consumers in parallel, which is the typical use case for a frontier, we need to pass this setting at run time. The way to do this is via the spider argument frontera_settings_json, for example:

$ scrapy crawl mysite.com-articles -a frontera_settings_json='{"HCF_CONSUMER_SLOT": "test3"}'
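
If you launch consumers from a script, the JSON argument is easier to build programmatically than to quote by hand. The sketch below only assembles the command string; the function `consumer_command` is hypothetical and nothing is executed.

```python
import json
import shlex


def consumer_command(spider, slot):
    # Serialize the run-time frontera settings as a JSON object.
    settings_json = json.dumps({"HCF_CONSUMER_SLOT": slot})
    args = ["scrapy", "crawl", spider, "-a", f"frontera_settings_json={settings_json}"]
    # shlex.join quotes each argument so the result can be pasted into a POSIX shell.
    return shlex.join(args)


command = consumer_command("mysite.com-articles", "test3")
```

Splitting the resulting string with `shlex.split` gives back the original argument list, which is a quick way to check the quoting.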

One aspect of the producer that may look odd is its parse_article() method, which does nothing. It is required because the requests that the producer sends to the frontier need to reference a callback with the same name as the one in the mysite.com-articles spider that will process the response. That is, you create the requests in the producer with Request(..., callback=self.parse_article,...), but the actual implementation of parse_article() lives in mysite.com-articles, not in the producer.

For other HCF tuning settings refer to the hcf backend documentation.

The configuration is completed with a setting on the scrapy side (not the frontera side), FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER. This setting tells the scrapy-frontera scheduler that every request whose callback name is in the provided list (except for the requests generated by start_requests()) will be sent to the frontier.

An alternative to this setting is to add the meta key 'cf_store': True to the requests that have to be sent to the frontier:

    (...)

    def parse(self, response):
        (...)
        yield Request(..., meta={'cf_store': True})

This requires, however, more intervention in the spider code and gives less separation between implementation and configuration. It is up to the developer which approach to use. cf_store is really kept for backward compatibility with older versions of scrapy-frontera, and for occasional fine tuning requirements.
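
The two ways of marking a request for the frontier can be summarized as a small decision function. This is a simplified model and not the scrapy-frontera scheduler code: `ToyRequest` and `goes_to_frontier` are hypothetical, and checking the cf_store flag before the start_requests() exception is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in for scrapy.Request with only the fields needed here."""
    url: str
    callback_name: Optional[str] = None
    meta: dict = field(default_factory=dict)


def goes_to_frontier(request, callbacks_to_frontier, from_start_requests=False):
    # Explicit per-request flag (assumed here to be checked first).
    if request.meta.get("cf_store"):
        return True
    # Requests generated by start_requests() are excluded from the callback rule.
    if from_start_requests:
        return False
    # Configuration-driven rule: the callback name is in the configured list.
    return request.callback_name in callbacks_to_frontier


callbacks = ["parse_article"]
article = ToyRequest("https://example.com/article/1", callback_name="parse_article")
listing = ToyRequest("https://example.com/list", callback_name="parse")
flagged = ToyRequest("https://example.com/other", meta={"cf_store": True})
```

Here `article` and `flagged` would be sent to the frontier, while `listing` would stay in the local scrapy queue.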

Notice that, while the hcf settings are added to the frontera_settings dict, FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER is added to the custom_settings dict. This is because the hcf backend resides on the frontera side, while FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER is a scrapy scheduler setting, and so it is read on the scrapy side. Optionally, you can place everything under custom_settings, but separating scopes is good practice.

Running all the consumer jobs manually and periodically, or setting up lots of periodic jobs, is definitely not a convenient approach. Instead we will use the crawl manager provided by hcf-backend. For example, save this code into scripts/hcf_crawlmanager.py:

import logging

from hcf_backend.utils.crawlmanager import HCFCrawlManager


if __name__ == '__main__':
    from shub_workflow.utils import get_kumo_loglevel
    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())

    hcf_crawlmanager = HCFCrawlManager()
    hcf_crawlmanager.run()

This crawl manager is a subclass of the generic shub-workflow crawl manager described in the previous chapter. Command line is the same except for some additional positional arguments and options:

$ python hcf_crawlmanager.py --help

usage:
shub-workflow based script (runs on Zyte ScrapyCloud) for easy managing of consumer spiders from multiple slots. It checks available slots and schedules
a job for each one.

       [-h] [--project-id PROJECT_ID] [--name NAME] [--flow-id FLOW_ID] [--children-tag TAG] [--loop-mode SECONDS] [--max-running-jobs MAX_RUNNING_JOBS] [--spider-args SPIDER_ARGS]
       [--job-settings JOB_SETTINGS] [--units UNITS] [--frontera-settings-json FRONTERA_SETTINGS_JSON]
       spider frontier prefix

positional arguments:
  spider                Spider name
  frontier              Frontier name
  prefix                Slot prefix

optional arguments:
  -h, --help            show this help message and exit
  --project-id PROJECT_ID
                        Either numeric id, or entry keyword in scrapinghub.yml. Overrides target project id.
  --name NAME           Script name.
  --flow-id FLOW_ID     If given, use the given flow id.
  --children-tag TAG             Additional tag added to the scheduled jobs. Can be given multiple times.
  --loop-mode SECONDS   If provided, manager will run in loop mode, with a cycle each given number of seconds. Default: 0
  --max-running-jobs MAX_RUNNING_JOBS
                        If given, don't allow more than the given jobs running at once. Default: 1
  --spider-args SPIDER_ARGS
                        Spider arguments dict in json format
  --job-settings JOB_SETTINGS
                        Job settings dict in json format
  --units UNITS         Set number of ScrapyCloud units for each job
  --frontera-settings-json FRONTERA_SETTINGS_JSON

Here we add two positional arguments, frontier and slot prefix, and the option --frontera-settings-json. By using this crawl manager we don't even need to provide any frontera settings in the consumer code, so we can remove them all and pass everything via the crawl manager:

$ python hcf_crawlmanager.py mysite.com-articles mysite-articles-frontier test --frontera-settings-json='{"HCF_CONSUMER_MAX_BATCHES": 50, "BACKEND": "hcf_backend.HCFBackend"}' --max-running-jobs=8 --loop-mode=60

The manager will take care of scheduling up to 8 jobs in parallel, one per slot, as we also set to 8 the frontera setting HCF_PRODUCER_NUMBER_OF_SLOTS in the producer. Once a job completes the processing of 50 batches, a free slot is available to schedule a new consumer job and the crawl manager will do that and repeat in cycles, until all the batches are consumed.
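
To get a feeling for how many consumer jobs a crawl will need, the arithmetic can be sketched as below. It assumes that each job stops after reading HCF_CONSUMER_MAX_BATCHES batches from its slot and that the producer adds nothing in the meantime; the backlog numbers and the function `consumer_jobs_needed` are made up for the example.

```python
import math


def consumer_jobs_needed(batches_per_slot, max_batches_per_job):
    # Each job reads at most max_batches_per_job batches from a single slot,
    # so a slot holding N batches needs ceil(N / max_batches_per_job) jobs.
    return {
        slot: math.ceil(batches / max_batches_per_job)
        for slot, batches in batches_per_slot.items()
    }


# Hypothetical backlog: pending batches in three of the eight slots.
backlog = {"test0": 120, "test1": 50, "test2": 0}
jobs = consumer_jobs_needed(backlog, max_batches_per_job=50)
```

With this backlog, slot test0 needs three consecutive jobs, test1 needs one and test2 needs none, so the manager would keep rescheduling on test0 after the other slots are already empty.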

When invoked in this way, by default, the spiders will be scheduled in the SC project defined by the default entry in the scrapinghub.yml file. The target project can be overridden either via the PROJECT_ID environment variable or, more explicitly, by adding the option --project-id to the command line invocation. If the script is invoked in Scrapy Cloud itself, by default the spiders will be scheduled in the same project where the script is running, but this can be overridden with the --project-id option, which allows cross project scheduling.

Alternatively you can provide some hard coded default parameters in the hcf crawl manager script itself:

import logging

from hcf_backend.utils.crawlmanager import HCFCrawlManager


class MyHCFCrawlManager(HCFCrawlManager):
    loop_mode = 60
    default_max_jobs = 8


if __name__ == '__main__':
    from shub_workflow.utils import get_kumo_loglevel
    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())

    hcf_crawlmanager = MyHCFCrawlManager()
    hcf_crawlmanager.run()

So the command line can be shorter:

$ python hcf_crawlmanager.py mysite.com-articles mysite-articles-frontier test --frontera-settings-json='{"HCF_CONSUMER_MAX_BATCHES": 50, "BACKEND": "hcf_backend.HCFBackend"}'

Method 2: Single spider, all frontier configuration provided externally at runtime.

Method 1 is mainly for demonstration purposes. In real cases we will instead prefer to write or reuse a spider that knows nothing about the crawl frontier, and just plug it into a workflow in order to reproduce the behavior illustrated in the first method without changing any line of the spider code (or at most, establishing some standard with no reference to the frontier at all). In the next chapter we will explain how to use the graph manager for this purpose.

Next Chapter: Graph Managers with Frontera