Managing Hubstorage Crawl Frontiers with Frontera - scrapinghub/shub-workflow GitHub Wiki
Previous Chapter: Managing Hubstorage Crawl Frontiers
Prerequisites
pip install hcf-backend
pip install scrapy-frontera
Generalities
As previously said, we provide an HCF backend for Frontera, which in turn is a framework for developing crawl frontier handlers (backends). Frontera establishes both a common protocol and a common interface, and provides some extra facilities. For scrapy spiders, the equation is completed with scrapy-frontera, which provides a scrapy scheduler that supports Frontera without losing any native behavior of the original scrapy scheduler, so migrating a scrapy spider to Frontera becomes painless.
When working with spiders we can identify two distinct roles: a producer role and a consumer role, depending on whether the spider writes requests to a requests queue, or reads them from it in order to actually process them. Usual scrapy spiders assume both roles at the same time, transparently: the same spider writes requests to a local memory/disk queue, and later reads them from it. We don't usually think of a scrapy spider as a producer/consumer because it natively works in both ways at the same time. But that is the logic behind the native scrapy scheduler, which takes care of:
- Queuing requests coming from spider, by default into a dual memory/disk queue
- Dequeuing requests into the downloader, under specific prioritization rules.
On the other hand, the scheduler implemented in the scrapy-frontera package, which is a subclass of the scrapy scheduler, provides an interface with Frontera and allows specific requests to be sent to the Frontera backend, either from spider code or through project settings.
In this way, the Frontera backend assumes the responsibility of queuing, dequeuing and prioritizing the requests sent to or read from it, while the remaining requests follow the usual flow within Scrapy. Still, the scheduler asks the Frontera backend for requests, adapts them and puts them into its local queues.
So, when using an external frontier, like HCF, we can separate producer and consumer roles into different spiders, and so this division of roles becomes more evident. But whether we separate roles or not, depends on our specific implementation. It is possible to easily adapt a scrapy spider in order to work with HCF with minimal effort, both as producer and consumer, as we will see.
Each ScrapyCloud project has its own HCF space separated from the rest. Within a project, HCF can be subdivided via frontier names, which in turn can be subdivided into slots via slot names.
Frontier names and slot names are created on the fly by the producer agents. Urls in HCF are organized into batches. All urls of a batch are read and deleted as a block by the consumer. The number of urls per batch is set by the producer, but is limited to 100.
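As an illustration of the batch limit, here is a minimal sketch of how a producer could split a list of urls into batches of at most 100. The helper name make_batches and the constant MAX_URLS_PER_BATCH are hypothetical, they are not part of hcf-backend, and nothing here talks to HCF.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Limit stated in the text above; the constant name is ours, not an HCF setting.
MAX_URLS_PER_BATCH = 100


def make_batches(urls: Iterable[str], batch_size: int = MAX_URLS_PER_BATCH) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` urls."""
    if not 1 <= batch_size <= MAX_URLS_PER_BATCH:
        raise ValueError(f"batch_size must be between 1 and {MAX_URLS_PER_BATCH}")
    iterator = iter(urls)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


urls = [f"https://mysite.com/article/{i}" for i in range(250)]
print([len(batch) for batch in make_batches(urls)])  # [100, 100, 50]
```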
An essential feature of HCF is request deduplication: any request sent to a given slot that has the same fingerprint as a request previously sent to that slot will be filtered out. HCF ensures that requests with the same fingerprint are sent to the same slot.
Also, scrapy-frontera ensures that the fingerprint scheme used is the same one that scrapy uses internally, so deduplication in HCF works exactly as you would expect in a spider that does not use HCF. Moreover, you can use the dont_filter=True flag in your requests, and the request will not be deduplicated. The scrapy-frontera adaptation takes care of that by generating a random fingerprint in this case.
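The deduplication behavior described above can be pictured with a small in-memory model. This is only a sketch under loud assumptions: the fingerprint here is a plain SHA-1 of the url (Scrapy's real fingerprint takes more of the request into account), the rule that maps a fingerprint to a slot is hypothetical, and ToyFrontier is not an hcf-backend class.

```python
import hashlib
import uuid
from typing import Dict, Set


def fingerprint(url: str, dont_filter: bool = False) -> str:
    """Toy fingerprint: SHA-1 of the url, or a random value when dont_filter is set."""
    if dont_filter:
        return uuid.uuid4().hex  # random, so it will not match an earlier request
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def slot_for(fp: str, prefix: str, number_of_slots: int) -> str:
    """Map a fingerprint to a slot name deterministically (hypothetical rule)."""
    return f"{prefix}{int(fp, 16) % number_of_slots}"


class ToyFrontier:
    """In-memory stand-in for one frontier: a set of seen fingerprints per slot."""

    def __init__(self, prefix: str, number_of_slots: int) -> None:
        self.prefix = prefix
        self.number_of_slots = number_of_slots
        self.seen: Dict[str, Set[str]] = {}

    def add(self, url: str, dont_filter: bool = False) -> bool:
        """Return True if the request was stored, False if it was filtered out."""
        fp = fingerprint(url, dont_filter)
        slot = slot_for(fp, self.prefix, self.number_of_slots)
        seen_in_slot = self.seen.setdefault(slot, set())
        if fp in seen_in_slot:
            return False
        seen_in_slot.add(fp)
        return True
```

Because equal fingerprints always map to the same slot, a per-slot check is enough to filter duplicates across the whole frontier, while a request flagged with dont_filter gets a fresh fingerprint and is stored again.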
Another important feature of HCF is that writing to a specific slot must be done without concurrency. That is, only one producer at a time can write batches to a given slot. This restriction gives HCF better performance, but imposes some limitations that have to be considered when designing the project. In most cases, it is enough to have a single producer instance writing to all slots, and multiple consumer instances in parallel, each one consuming from a different slot. But in case you need multiple producer instances in parallel, you could enforce that each one sends only the urls assigned to a specific slot, and ignores the rest. This will, however, result in missing urls, because a url seen only by a producer that does not own its slot is never written. A working implementation needs to be more elaborate, for example by sending all urls to a queue processed by a single high throughput pipeline process, and delegating the writing to HCF to it.
Whether this approach leads to better or worse performance than a single producer will ultimately depend on the specific application. Most probably, unless you need to implement a very high throughput broad crawler, the single producer approach is faster and uses fewer resources.
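The "single writer behind a queue" idea can be sketched with threads standing in for separate producer processes. Everything below is illustrative: the function names, the slot prefix test and the crc32 based slot assignment are assumptions, and a dict of lists replaces the real HCF writes.

```python
import queue
import threading
import zlib
from typing import Dict, Iterable, List

_DONE = object()  # sentinel telling the writer that no more urls will arrive


def discover(urls: Iterable[str], outbox: "queue.Queue") -> None:
    """Producer side: hand every discovered url to the shared queue."""
    for url in urls:
        outbox.put(url)


def write_to_slots(inbox: "queue.Queue", slots: Dict[str, List[str]], number_of_slots: int) -> None:
    """Single writer: the only code that touches the slots."""
    seen = set()
    while True:
        url = inbox.get()
        if url is _DONE:
            return
        if url in seen:
            continue  # already written by this writer
        seen.add(url)
        # Hypothetical slot assignment; a real backend defines its own.
        slot = f"test{zlib.crc32(url.encode('utf-8')) % number_of_slots}"
        slots.setdefault(slot, []).append(url)


def run(producer_inputs: List[List[str]], number_of_slots: int = 8) -> Dict[str, List[str]]:
    """Run several producers in parallel, funnelled through one writer."""
    shared: "queue.Queue" = queue.Queue()
    slots: Dict[str, List[str]] = {}
    writer = threading.Thread(target=write_to_slots, args=(shared, slots, number_of_slots))
    writer.start()
    producers = [threading.Thread(target=discover, args=(urls, shared)) for urls in producer_inputs]
    for thread in producers:
        thread.start()
    for thread in producers:
        thread.join()
    shared.put(_DONE)  # all producers finished, so the writer can stop
    writer.join()
    return slots
```

No url is lost, since every producer forwards everything it sees, and no slot ever has two concurrent writers. The price is the extra queue and the single writer becoming the throughput bottleneck.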
hcf-backend comes with a handy tool for managing (deleting, listing, dumping, counting, etc.) HCF objects, hcfpal.py:
python -m hcf_backend.utils.hcfpal
If you need to make it available in a project deployed on ScrapyCloud you need, as usual, to define your own script, e.g. scripts/hcfpal.py:
import logging

from hcf_backend.utils.hcfpal import HCFPalScript

if __name__ == '__main__':
    from shub_workflow.utils import get_kumo_loglevel
    # Use the log level configured for the ScrapyCloud job.
    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
    script = HCFPalScript()
    script.run()
Follow the command line help for usage instructions.
Preparing a Scrapy project for a crawl frontier with scrapy-frontera
The first step is to add scrapy-frontera and hcf-backend to your project requirements. Then, configure your project to work with scrapy-frontera.
From the scrapy-frontera README, this is the basic configuration you need to add to your project settings.py file in order to replace the native Scrapy scheduler with the scrapy-frontera one:
# shut up frontera DEBUG flooding
import logging
logging.getLogger("manager.components").setLevel(logging.INFO)
logging.getLogger("manager").setLevel(logging.INFO)
logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)

SCHEDULER = 'scrapy_frontera.scheduler.FronteraScheduler'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_frontera.middlewares.SchedulerDownloaderMiddleware': 0,
}

SPIDER_MIDDLEWARES = {
    'scrapy_frontera.middlewares.SchedulerSpiderMiddleware': 0,
}
These changes alone will not affect the behavior of a scrapy project, but they will allow you to implement spiders that work with HCF.
Producers and consumers need to be configured with different frontera and scrapy settings. So depending on your needs and taste you have two alternatives:
- Method 1: Have two separate spiders, one for the producer and another for the consumer.
- Method 2: Have a single spider for both, configuring its settings at runtime.
Method 1 could be enough for simple cases with a simple workflow. Method 2 allows you to adapt an already developed spider without changing a single line of spider code. It works just like plugging an existing spider into a frontier workflow system, but requires the help of extra scripts. Here we will explain the first method. In the next chapter we will explain the second one.
Method 1: Separated producer and consumer
In this example we will implement two spiders, a producer and a consumer. The producer will crawl the target site, extract links and send them to the frontier. The consumer will read from the frontier and scrape the provided links.
The producer:
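from scrapy import Request, Spider


class MySiteLinksSpider(Spider):
    name = 'mysite.com-links'

    # Frontera side: HCF backend configuration for the producer role.
    frontera_settings = {
        "BACKEND": "hcf_backend.HCFBackend",
        'HCF_PRODUCER_FRONTIER': 'mysite-articles-frontier',
        'HCF_PRODUCER_SLOT_PREFIX': 'test',
        'HCF_PRODUCER_NUMBER_OF_SLOTS': 8,
    }

    # Scrapy side: requests with these callbacks are sent to the frontier.
    custom_settings = {
        'FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER': ['parse_article'],
    }

    def parse_article(self, response):
        # Placeholder: the real implementation lives in the consumer spider.
        pass

    def parse(self, response):
        (... link discovery logic ...)
        yield Request(..., callback=self.parse_article)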
The consumer:
from scrapy import Spider


class MySiteArticlesSpider(Spider):
    name = 'mysite.com-articles'

    # Frontera side: HCF backend configuration for the consumer role.
    frontera_settings = {
        "BACKEND": "hcf_backend.HCFBackend",
        'HCF_CONSUMER_FRONTIER': 'mysite-articles-frontier',
        'HCF_CONSUMER_MAX_BATCHES': 50,
    }

    def parse_article(self, response):
        (... actual implementation of the article parsing method ...)

    (...)
Let's review the frontera settings involved in the example:
- BACKEND: Sets the frontera backend class. In these examples we will always use hcf_backend.HCFBackend.
- HCF_PRODUCER_FRONTIER and HCF_CONSUMER_FRONTIER: Set the HCF frontier name for the producer and the consumer. A producer and the consumers that read its requests must use the same frontier name.
- HCF_PRODUCER_SLOT_PREFIX: Sets the prefix of the slots that will be generated. If not provided, it is '' by default.
- HCF_PRODUCER_NUMBER_OF_SLOTS: Sets the number of slots. Default value is 1. Slots generated by the producer will have names ranging from {slot prefix}0 to {slot prefix}{number of slots - 1}.
- HCF_CONSUMER_MAX_BATCHES: Sets the limit of frontier batches that the consumer will process.
- HCF_CONSUMER_SLOT: Sets the slot from which the consumer will read frontier batches.
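To make the slot naming rule concrete, this small sketch lists the slot names that result from a given prefix and number of slots. The helper slot_names is hypothetical (it is not part of hcf-backend); its defaults mirror the defaults described above.

```python
from typing import List


def slot_names(prefix: str = "", number_of_slots: int = 1) -> List[str]:
    """Names from {prefix}0 to {prefix}{number_of_slots - 1}."""
    return [f"{prefix}{index}" for index in range(number_of_slots)]


# Values used by the producer in the example.
print(slot_names("test", 8))
```

With the producer settings of the example, the names go from test0 to test7, which are the values a consumer would receive in HCF_CONSUMER_SLOT.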
Notice that the consumer code does not set HCF_CONSUMER_SLOT anywhere. Hard coding it would make sense if we only had one slot. However, if we want to run several consumers in parallel, which is the typical use case for a frontier, we need to pass this setting at run time. The way to do this is via the spider argument frontera_settings_json, for example:
> scrapy crawl mysite.com-articles -a frontera_settings_json='{"HCF_CONSUMER_SLOT": "test3"}'
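If you launch consumers from a script, the value of frontera_settings_json is easier to build with json.dumps than by hand. The following is a minimal sketch; the helper consumer_command is hypothetical and only formats the command line, it does not run anything.

```python
import json
import shlex


def consumer_command(spider: str, slot: str) -> str:
    """Build a shell-safe `scrapy crawl` command for one consumer slot."""
    settings_json = json.dumps({"HCF_CONSUMER_SLOT": slot})
    args = ["scrapy", "crawl", spider, "-a", f"frontera_settings_json={settings_json}"]
    # shlex.quote protects the braces and double quotes of the JSON value.
    return " ".join(shlex.quote(arg) for arg in args)


for index in range(3):
    print(consumer_command("mysite.com-articles", f"test{index}"))
```

Generating the JSON programmatically avoids quoting mistakes when more frontera settings are added to the dict.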
An aspect you may find weird is that the producer implements a parse_article() method that does nothing. This is required because the requests that the producer sends to the frontier need to reference a callback with the same name as the one in the mysite.com-articles spider that will process the response. That is, the producer creates the requests with Request(..., callback=self.parse_article,...), but the actual implementation of parse_article() is not in the producer, but in mysite.com-articles instead.
For other HCF tuning settings refer to the hcf backend documentation.
The configuration is completed with a setting on the scrapy side (not the frontera side), FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER. This setting tells the scrapy-frontera scheduler that every request whose callback name is in the provided list (except for the requests generated by start_requests()) will be sent to the frontier.
An alternative to this setting is to add the meta key 'cf_store': True to the requests that have to be sent to the frontier:
    (...)
    def parse(self, response):
        (...)
        yield Request(..., meta={'cf_store': True})
This requires, however, more intervention in the spider code and less separation of implementation from configuration. Which approach to use is up to the taste of the developer. cf_store is kept mainly for backward compatibility with older versions of scrapy-frontera, and for occasional fine tuning requirements.
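The two routing rules described above (callback name in the configured list, or the cf_store meta key) can be summarized in a toy decision function. This is not the scrapy-frontera implementation: ToyRequest and goes_to_frontier are hypothetical names, and the way cf_store interacts with start requests is an assumption made only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToyRequest:
    """Minimal stand-in for a scrapy Request, enough for the routing rule."""
    url: str
    callback_name: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)
    from_start_requests: bool = False


def goes_to_frontier(request: ToyRequest, callbacks_to_frontier: List[str]) -> bool:
    """Decide whether a request goes to the frontier or to the local queues."""
    # Assumption: an explicit cf_store flag always wins.
    if request.meta.get("cf_store"):
        return True
    # Requests generated by start_requests() are excluded from the callback rule.
    if request.from_start_requests:
        return False
    return request.callback_name in callbacks_to_frontier


callbacks = ["parse_article"]
article = ToyRequest("https://mysite.com/a", callback_name="parse_article")
listing = ToyRequest("https://mysite.com/page/2", callback_name="parse")
print(goes_to_frontier(article, callbacks), goes_to_frontier(listing, callbacks))  # True False
```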
Notice that, while the hcf settings are added to the frontera_settings dict, FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER is added to the custom_settings dict. This is because the hcf backend resides on the frontera side, while FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER is a scrapy scheduler setting, and so it is read on the scrapy side. Optionally, you can place everything under custom_settings only, but separation of scopes is a good practice.
Running all the consumer jobs manually and periodically, or setting up lots of periodic jobs, is definitely not a convenient approach. Instead we will use the crawl manager provided by hcf-backend. That is, save this code into scripts/hcf_crawlmanager.py:
import logging

from hcf_backend.utils.crawlmanager import HCFCrawlManager

if __name__ == '__main__':
    from shub_workflow.utils import get_kumo_loglevel
    # Use the log level configured for the ScrapyCloud job.
    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
    hcf_crawlmanager = HCFCrawlManager()
    hcf_crawlmanager.run()
This crawl manager is a subclass of the generic shub-workflow crawl manager described in the previous chapter. The command line is the same except for some additional positional arguments and options:
$ python hcf_crawlmanager.py --help
usage:
shub-workflow based script (runs on Zyte ScrapyCloud) for easy managing of consumer spiders from multiple slots. It checks available slots and schedules
a job for each one.
[-h] [--project-id PROJECT_ID] [--name NAME] [--flow-id FLOW_ID] [--children-tag TAG] [--loop-mode SECONDS] [--max-running-jobs MAX_RUNNING_JOBS] [--spider-args SPIDER_ARGS]
[--job-settings JOB_SETTINGS] [--units UNITS] [--frontera-settings-json FRONTERA_SETTINGS_JSON]
spider frontier prefix
positional arguments:
spider Spider name
frontier Frontier name
prefix Slot prefix
optional arguments:
-h, --help show this help message and exit
--project-id PROJECT_ID
Either numeric id, or entry keyword in scrapinghub.yml. Overrides target project id.
--name NAME Script name.
--flow-id FLOW_ID If given, use the given flow id.
--children-tag TAG Additional tag added to the scheduled jobs. Can be given multiple times.
--loop-mode SECONDS If provided, manager will run in loop mode, with a cycle each given number of seconds. Default: 0
--max-running-jobs MAX_RUNNING_JOBS
If given, don't allow more than the given jobs running at once. Default: 1
--spider-args SPIDER_ARGS
Spider arguments dict in json format
--job-settings JOB_SETTINGS
Job settings dict in json format
--units UNITS Set number of ScrapyCloud units for each job
--frontera-settings-json FRONTERA_SETTINGS_JSON
Here we add two positional arguments, frontier and slot prefix, and the option --frontera-settings-json. By using this crawl manager we don't even need to provide any frontera settings in the consumer code, so we can remove all of them and pass everything via the crawl manager:
$ python hcf_crawlmanager.py mysite.com-articles mysite-articles-frontier test --frontera-settings-json='{"HCF_CONSUMER_MAX_BATCHES": 50, "BACKEND": "hcf_backend.HCFBackend"}' --max-running-jobs=8 --loop-mode=60
The manager will take care of scheduling up to 8 jobs in parallel, one per slot, matching the value of 8 that we set for the frontera setting HCF_PRODUCER_NUMBER_OF_SLOTS in the producer.
Once a job completes the processing of 50 batches, its slot becomes free, and the crawl manager schedules a new consumer job for it. This repeats in cycles until all the batches are consumed.
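The scheduling cycle can be illustrated with a pure Python simulation. It is a simplification and not the HCFCrawlManager code: the function name is hypothetical, all jobs of a cycle are assumed to finish before the next cycle starts, and each job is assumed to consume exactly up to the configured maximum number of batches.

```python
from typing import Dict, List


def simulate_crawl_manager(
    batches_per_slot: Dict[str, int], max_running_jobs: int, max_batches: int
) -> List[List[str]]:
    """Return, for each cycle, the slots that got a consumer job scheduled."""
    remaining = dict(batches_per_slot)
    cycles: List[List[str]] = []
    while any(remaining.values()):
        # One job per slot that still has batches, up to the parallelism limit.
        busy = [slot for slot, count in sorted(remaining.items()) if count > 0]
        busy = busy[:max_running_jobs]
        for slot in busy:
            remaining[slot] = max(0, remaining[slot] - max_batches)
        cycles.append(busy)
    return cycles


pending = {"test0": 120, "test1": 30, "test2": 0}
print(simulate_crawl_manager(pending, max_running_jobs=2, max_batches=50))
# [['test0', 'test1'], ['test0'], ['test0']]
```

In this example the slot test0 needs three consumer jobs of at most 50 batches each, test1 needs one, and test2 never gets a job because it is empty.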
When invoked in this way, by default, the spiders will be scheduled in the SC project defined by the default entry in the scrapinghub.yml file. The target project can be overridden either via the PROJECT_ID environment variable, or more explicitly by adding the option --project-id to the command line invocation. If the script is invoked in Scrapy Cloud itself, by default the spiders will be scheduled in the same project where the script is running, but this can be overridden with the --project-id option, which allows cross project scheduling.
Alternatively you can provide some hard coded default parameters in the hcf crawl manager script itself:
import logging

from hcf_backend.utils.crawlmanager import HCFCrawlManager


class MyHCFCrawlManager(HCFCrawlManager):
    # Hard coded defaults, so they don't need to be passed on the command line.
    loop_mode = 60
    default_max_jobs = 8


if __name__ == '__main__':
    from shub_workflow.utils import get_kumo_loglevel
    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
    hcf_crawlmanager = MyHCFCrawlManager()
    hcf_crawlmanager.run()
So the command line can be shorter:
$ python hcf_crawlmanager.py mysite.com-articles mysite-articles-frontier test --frontera-settings-json='{"HCF_CONSUMER_MAX_BATCHES": 50, "BACKEND": "hcf_backend.HCFBackend"}'
Method 2: Single spider, all frontier configuration provided externally at runtime.
Method 1 is mostly for demonstration purposes. In real cases, however, we will prefer to write or reuse a spider that doesn't know anything about the crawl frontier, and just plug it into a workflow in order to reproduce the behavior illustrated in the first method without changing any line of the spider code (or at most, just establishing some standard with no reference to the frontier at all). In the next chapter we will explain how to use the graph manager for this purpose.
Next Chapter: Graph Managers with Frontera