Crawl Managers - scrapinghub/shub-workflow GitHub Wiki
Previous Chapter: Credentials Setup
The simplest workflow can be defined with the CrawlManager class. This class schedules a
single spider job. It is not very useful by itself, but it helps to illustrate the basic concepts.
The first step is to create a crawl manager script in your project repository for deployment in ScrapyCloud. Save the following lines in a file called, for example, scripts/crawlmanager.py:
import logging

from shub_workflow.crawl import CrawlManager

if __name__ == '__main__':
    from shub_workflow.utils import get_kumo_loglevel

    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
    crawlmanager = CrawlManager()
    crawlmanager.run()
Then add a matching scripts entry to your project's setup.py. For example:
from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    scripts=['scripts/crawlmanager.py'],
    entry_points={'scrapy': ['settings = myproject.settings']},
)
Let's analyze the help printed when the script is called from the command line with the -h option:
> python crawlmanager.py -h
usage: You didn't set description for this script. Please set description property accordingly.
       [-h] [--project-id PROJECT_ID] [--name NAME] [--flow-id FLOW_ID] [--children-tag TAG]
       [--loop-mode SECONDS] [--max-running-jobs MAX_RUNNING_JOBS]
       [--spider-args SPIDER_ARGS] [--job-settings JOB_SETTINGS]
       [--units UNITS]
       spider

positional arguments:
  spider                Spider name

optional arguments:
  -h, --help            show this help message and exit
  --project-id PROJECT_ID
                        Either numeric id, or entry keyword in scrapinghub.yml. Overrides target project id.
  --name NAME           Script name.
  --flow-id FLOW_ID     If given, use the given flow id.
  --children-tag TAG    Add given tag to the scheduled jobs. Can be given
                        multiple times.
  --loop-mode SECONDS   If provided, manager will run in loop mode, with a
                        cycle each given number of seconds. Default: 0
  --max-running-jobs MAX_RUNNING_JOBS
                        If given, don't allow more than the given jobs running
                        at once. Default: inf
  --spider-args SPIDER_ARGS
                        Spider arguments dict in json format
  --job-settings JOB_SETTINGS
                        Job settings dict in json format
  --units UNITS         Set number of ScrapyCloud units for each job
Some of the options are inherited from parent classes, and others are added by the CrawlManager
class. The first thing that may grab your attention is the initial
description message: You didn't set description for this script. Please set description property accordingly.
Every script subclassed from the base script class prints this message if no description was defined for it (or for a parent
class). To define one, add the property description to your class. In our example, it could be something like this:
import logging

from shub_workflow.crawl import CrawlManager as SHCrawlManager


class CrawlManager(SHCrawlManager):

    @property
    def description(self):
        return 'Crawl manager for MyProject.'


if __name__ == '__main__':
    from shub_workflow.utils import get_kumo_loglevel

    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
    crawlmanager = CrawlManager()
    crawlmanager.run()
Let's focus on the command line options and arguments. The first options are inherited from the base script class.
--project-id
When a shub-workflow script runs in ScrapyCloud, the project id where it operates is autodetected: by default it is the id of the ScrapyCloud project where the script itself is running.
In the context of a script that schedules other jobs (from now on, a manager script), like our crawl manager, this project id determines the target project where the
children jobs must run. For some applications, however, you may want to run jobs in a project different from the one where the manager is running. For those cases you can provide the --project-id option.
It is also possible to run the manager outside ScrapyCloud. In this case the project id cannot be autodetected from the running job. Instead, when a shub-workflow script is invoked from the command line, it tries to guess the project id from the default
entry in the project's scrapinghub.yml. To override that value, or to provide one when no such entry is available,
use either the PROJECT_ID environment variable or the --project-id command line option.
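The lookup described above can be pictured with the following minimal sketch. It is not the library's code: resolve_project_id and its parameters are hypothetical names used only for illustration, and the precedence shown (command line option, then PROJECT_ID, then the default entry of scrapinghub.yml) is an assumption, so check the shub-workflow source for the actual behaviour.

```python
from typing import Mapping, Optional


def resolve_project_id(
    cli_value: Optional[str],
    environ: Mapping[str, str],
    yml_projects: Mapping[str, int],
) -> Optional[int]:
    """Hypothetical helper that mimics the project id lookup (assumed order)."""
    if cli_value:
        # --project-id accepts either a numeric id or an entry keyword
        # of scrapinghub.yml.
        if cli_value.isdigit():
            return int(cli_value)
        return yml_projects[cli_value]  # raises KeyError for unknown keywords
    env_value = environ.get("PROJECT_ID")
    if env_value:
        return int(env_value)
    # Fall back to the default entry of scrapinghub.yml, if there is one.
    return yml_projects.get("default")
```

For example, resolve_project_id(None, os.environ, {"default": 12345}) would fall back to the default entry when PROJECT_ID is not set (12345 is a made-up project id).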
--name
The --name option overrides the manager attribute name. This attribute assigns a workflow name to the script. The same script can run in the context of many different workflows
(not only instances of the same workflow), and a name is useful in many situations, such as recognizing owned jobs when a workflow is resumed. In particular, any object derived from
WorkFlowManager requires a name, either as a class attribute or passed via the command line. In addition, different scripts that run on the same workflow must have different names.
--flow-id
The flow id identifies a specific instance of a workflow. If this option is not provided, the flow id is autogenerated, added to the job tags of the manager script itself, and propagated to all the children it schedules. In this way different jobs running in ScrapyCloud can be related to the same instance of a workflow, which keeps the jobs running on it consistent in ways that we will see later. You may also want to set the flow id via the command line when resuming jobs, for example, or when manually scheduling jobs associated with a specific workflow instance.
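As a rough mental model of this propagation, consider the sketch below. The helper names and the FLOW_ID=<value> tag format are assumptions made for illustration, not a description of the library's internals.

```python
import uuid
from typing import List, Optional, Tuple

FLOW_ID_PREFIX = "FLOW_ID="  # assumed tag format, for illustration only


def tag_manager_job(manager_tags: List[str], given: Optional[str] = None) -> Tuple[str, List[str]]:
    """Return the flow id (given or autogenerated) and the manager tags including it."""
    flow_id = given if given is not None else str(uuid.uuid4())
    return flow_id, manager_tags + [FLOW_ID_PREFIX + flow_id]


def child_tags(flow_id: str, children_tags: List[str]) -> List[str]:
    """Tags for a scheduled child: the propagated flow id plus any --children-tag values."""
    return [FLOW_ID_PREFIX + flow_id, *children_tags]
```

Because every job of the same workflow instance carries the same tag, the jobs can later be found and related to each other by filtering on it.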
--children-tag
The --children-tag command line option adds custom tags to the children jobs. It can be given multiple times.
--loop-mode
By default, a workflow manager script performs a single loop and exits. The crawl manager, for example, will schedule a spider job and finish. If you set loop mode, it stays alive instead, looping every given number of seconds and checking the status of the scheduled job on each loop. Once the job is finished, the crawl manager finishes too. This is not very useful for this crawl manager. Most workflows, however, need their manager to work in loop mode, in order to schedule new jobs as previous ones finish, monitor the status of the workflow, etc. To make the crawl manager script work in loop mode, you can either:
- In your custom crawl manager class, set the class attribute loop_mode to an integer that determines the number of seconds the manager must sleep between loop executions (loop_mode = 0, which is the default, disables looping).
- Override the default looping value of your class with the command line option --loop-mode.
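The effect of loop mode can be summarized with a simplified model of the manager's main loop. This is only a sketch under assumptions, not the library's implementation: run_manager is a hypothetical function, and workflow_loop stands for whatever the manager does on each cycle, returning False when there is nothing left to do.

```python
import time
from typing import Callable


def run_manager(
    workflow_loop: Callable[[], bool],
    loop_mode: int = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the loop body once, or repeatedly every loop_mode seconds.

    Returns the number of loops executed.
    """
    loops = 0
    while True:
        loops += 1
        keep_going = workflow_loop()
        # loop_mode == 0 disables looping: a single pass and exit.
        if not loop_mode or not keep_going:
            return loops
        sleep(loop_mode)
```

With loop_mode = 0 the body runs exactly once. With loop_mode = 120 it runs every 120 seconds until the body reports that the work is done.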
--max-running-jobs
Another setting inherited from the base workflow manager sets the maximum number of children jobs that can be running at a given moment. By default
there is no maximum. You can limit this number either with the class attribute default_max_jobs or with the command line option --max-running-jobs.
The remaining options, and the main argument, are added by the CrawlManager
class itself and they are self-explanatory, considering the purpose of the crawl manager script.
So, let's illustrate the usage of the crawl manager. Suppose you have a spider called amazon.com that accepts some parameters like department and search_string.
From the command line, assuming you have a fully installed development environment for your project, you can call your script this way:
> python crawlmanager.py amazon.com --spider-args='{"department": "books", "search_string": "winnie the witch"}' --job-settings='{"CONCURRENT_REQUESTS": 2}'
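Since --spider-args and --job-settings expect JSON, quoting mistakes are easy to make when the command is typed or assembled by hand. One way to avoid them, sketched below, is to build the command line from Python dictionaries with the standard json and shlex modules. build_command is a hypothetical helper, not part of shub-workflow, and shlex.join requires Python 3.8 or later.

```python
import json
import shlex
from typing import Any, Dict


def build_command(spider: str, spider_args: Dict[str, Any], job_settings: Dict[str, Any]) -> str:
    """Return a shell-safe command line for the crawl manager script."""
    parts = [
        "python",
        "crawlmanager.py",
        spider,
        "--spider-args=" + json.dumps(spider_args),
        "--job-settings=" + json.dumps(job_settings),
    ]
    # shlex.join quotes each part so the JSON survives shell parsing.
    return shlex.join(parts)


command = build_command(
    "amazon.com",
    {"department": "books", "search_string": "winnie the witch"},
    {"CONCURRENT_REQUESTS": 2},
)
print(command)
```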
All crawl managers support an implicit target spider via the class attribute spider. If it is provided, the spider
command line argument is unavailable:
class MyCrawlManager(...):
    name = 'crawl'
    loop_mode = 120
    spider = "amazon.com"
So the command line call will be the same as before, but without the spider argument:
> python crawlmanager.py [--spider-args=... ...]
The periodic crawl manager is very similar to the simplest one described in the previous section. But instead of scheduling a single spider job and finishing, it checks the status of the job on each loop and, when the job finishes, schedules a new one. To activate this behaviour you need to set loop mode as explained above. Example:
import logging

from shub_workflow.crawl import PeriodicCrawlManager


class CrawlManager(PeriodicCrawlManager):

    name = 'crawl'
    # check every 180 seconds the status of the scheduled job
    loop_mode = 180
    spider = "amazon.com"

    @property
    def description(self):
        return 'Periodic Crawl manager for MyProject.'


if __name__ == '__main__':
    from shub_workflow.utils import get_kumo_loglevel

    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
    crawlmanager = CrawlManager()
    crawlmanager.run()
The generator crawl manager (GeneratorCrawlManager) also schedules a spider periodically (in fact, it is a subclass of PeriodicCrawlManager), but instead of being controlled
by an infinite loop, it is controlled by a generator that provides the arguments for each spider job it will schedule. Once the generator stops
iterating and all scheduled jobs are completed, the crawl manager itself finishes.
The generator method, set_parameters_gen(), is an abstract method that needs to be overridden. It must yield dictionaries of {argument name: argument value} pairs. Each yielded dictionary overrides the base spider arguments already defined on the command line, if any.
On each loop, the manager checks whether the number of running spiders is below the maximum number of jobs allowed (controlled either by the attribute default_max_jobs
or by the command line). If so, it takes as many argument dictionaries from the generator as needed to fill the free slots, and schedules a new job for each one.
For other details see the code.
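The slot-filling step described above can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than the library's code: next_jobs is a hypothetical function, and the maximum number of jobs is assumed to be a finite integer.

```python
from itertools import islice
from typing import Any, Dict, Iterator, List


def next_jobs(
    params_gen: Iterator[Dict[str, Any]],
    base_args: Dict[str, Any],
    running: int,
    max_jobs: int,
) -> List[Dict[str, Any]]:
    """Take from the generator only as many argument sets as there are free slots."""
    free_slots = max(max_jobs - running, 0)
    # Yielded values override the base spider arguments (e.g. from --spider-args).
    return [{**base_args, **params} for params in islice(params_gen, free_slots)]
```

Because islice consumes the generator lazily, argument sets that do not fit in the current loop remain available for the next one.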
The generator crawl manager is useful, for example, when each spider job needs to process a file from an S3 folder. A very simple example:
from shub_workflow.crawl import GeneratorCrawlManager
from shub_workflow.utils.futils import list_folder

INPUT_FOLDER = "s3://mybucket/myinputfolder"


class CrawlManager(GeneratorCrawlManager):

    name = 'crawl'
    loop_mode = 120
    default_max_jobs = 4
    spider = "myspider"
    description = "My generator manager"

    def set_parameters_gen(self):
        for input_file in list_folder(INPUT_FOLDER):
            yield {
                "input_file": input_file,
            }
Here, the attribute spider (or the command line argument for the spider, in case the attribute is not provided) indicates which spider to use by default when scheduling
a new job. In the above example, the spider myspider with argument input_file=<...> will be scheduled for each input file found in the listed folder.
However, the spider name itself can be included in the yielded parameters. Example:
from shub_workflow.crawl import GeneratorCrawlManager
from shub_workflow.utils.futils import list_folder

INPUT_FOLDER = "s3://mybucket/myinputfolder"


class CrawlManager(GeneratorCrawlManager):

    name = 'crawl'
    loop_mode = 120
    default_max_jobs = 4
    spider = "myspider"
    description = "My generator manager"

    def set_parameters_gen(self):
        for input_file in list_folder(INPUT_FOLDER):
            spider = input_file.split("_")[0]
            yield {
                "spider": spider,
                "input_file": input_file,
            }
In the specific example code above the class attribute spider may seem unnecessary. However, if it is not specified (that is, set to None), the command line
will require a spider name as argument, as explained in the first section of this chapter. To prevent this, you can either set it, as in the example, to a default
value (which will be used whenever the yielded parameters don't include a spider explicitly) or set it to an empty string (in which case you always need to include the spider explicitly
in the yielded parameters).
Scrapy Cloud parameters like project_id (for cross project scheduling), units, tags and job_settings can be included in the yielded parameters as well.
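A minimal sketch of a generator that yields both spider arguments and Scrapy Cloud parameters is shown below, written as a standalone function so it can be tried without the library. The function name, the file naming scheme and the concrete values are hypothetical, and the value types used for units, tags and job_settings are assumptions. The sketch takes the spider name from the file name only, which avoids surprises if the listed paths include a bucket or folder name containing an underscore.

```python
import posixpath
from typing import Any, Dict, Iterable, Iterator


def parameters_for(input_files: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dictionary of job parameters per input file."""
    for input_file in input_files:
        # "s3://bucket/folder/amazon.com_001.jl" -> "amazon.com"
        spider = posixpath.basename(input_file).split("_")[0]
        yield {
            "spider": spider,
            "input_file": input_file,
            # Scrapy Cloud parameters (illustrative values and types).
            "units": 2,
            "tags": ["source=s3"],
            "job_settings": {"CONCURRENT_REQUESTS": 2},
        }
```

Inside a manager class, the body of set_parameters_gen() could simply delegate to such a function with the result of listing the input folder.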
When a spider job finishes with an abnormal finish status (outcome), we typically want to react. For example, by raising an alert somewhere, or by retrying
the spider with modified spider arguments. To handle jobs with a bad outcome, you must override the method bad_outcome_hook(), available in all crawl manager classes.
This method is called when a job finishes with any of the outcomes defined in the list attribute self.failed_outcomes
which, by default, are the following ones:
base_failed_outcomes = (
    "failed",
    "killed by oom",
    "cancelled",
    "cancel_timeout",
    "memusage_exceeded",
    "diskusage_exceeded",
    "cancelled (stalled)",
)
These defaults are defined in the WorkFlowManager class. You can append any other custom failed outcome to self.failed_outcomes.
By default, the GeneratorCrawlManager implements a retry logic in this hook, which you can configure with the manager attribute MAX_RETRIES (which defaults to 0).
The retry applies to all failed outcomes except cancelled (as it is assumed that you don't want to retry a manually cancelled job), and only if MAX_RETRIES is
greater than 0.
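The retry rule can be expressed as a small decision function. This is a sketch of the rule as described here, not the library's implementation: should_retry and retries_done are hypothetical names, and treating MAX_RETRIES as an upper bound on the number of retries is an assumption.

```python
def should_retry(outcome: str, retries_done: int, max_retries: int = 0) -> bool:
    """Decide whether a job that finished with a failed outcome is retried."""
    # A manually cancelled job is assumed not to need a retry.
    if outcome == "cancelled":
        return False
    # With the default max_retries == 0 this is always False.
    return retries_done < max_retries
```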
But you can also override the bad outcome hook logic. As a simple example, let's suppose some spiders may finish with memory problems, and in that case you want to retry them with a larger number of units. To do so, you can add the following method to your generator crawl manager:
import logging

LOGGER = logging.getLogger(__name__)


class CrawlManager(GeneratorCrawlManager):
    (...)

    def bad_outcome_hook(self, spider, outcome, job_args_override, jobkey):
        if outcome == "memusage_exceeded" and job_args_override.get("units") == 1:
            LOGGER.info(f"Job {jobkey} failed with {outcome}. Will retry with 6 units.")
            job_args_override["units"] = 6
            self.add_job(spider, job_args_override)
The code above adds a new job with an increased number of units, and all other parameters unchanged, when a spider finishes with outcome memusage_exceeded. The jobs
added with this method run first, before the manager continues processing the set_parameters_gen() generator.
Note: the method add_job() is only available on GeneratorCrawlManager. Its purpose is not compatible with the use cases of CrawlManager and PeriodicCrawlManager.
The CrawlManager base class also provides the method finished_ok_hook(). This method is called when a job finishes without a bad outcome. That is, either bad_outcome_hook()
or finished_ok_hook() will be called, but not both. The default implementation does nothing, but you can implement here any action you want to be executed after each
spider job finishes successfully, including scheduling a script or another spider. The arguments of this hook are the same as those of bad_outcome_hook(), so you
should have all the data you need to implement the desired hook action.
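The relation between the two hooks can be illustrated with a small dispatch function. It is a simplified model, not the library's code: dispatch_outcome is a hypothetical name, the hooks are reduced to single-argument callables for brevity, and "banned" is only an example of a custom failed outcome.

```python
from typing import Callable, Iterable

BASE_FAILED_OUTCOMES = (
    "failed",
    "killed by oom",
    "cancelled",
    "cancel_timeout",
    "memusage_exceeded",
    "diskusage_exceeded",
    "cancelled (stalled)",
)


def dispatch_outcome(
    outcome: str,
    failed_outcomes: Iterable[str],
    bad_outcome_hook: Callable[[str], None],
    finished_ok_hook: Callable[[str], None],
) -> None:
    """Call exactly one of the two hooks, depending on the job outcome."""
    if outcome in failed_outcomes:
        bad_outcome_hook(outcome)
    else:
        finished_ok_hook(outcome)


# Custom failed outcomes can be appended to the defaults.
failed_outcomes = list(BASE_FAILED_OUTCOMES) + ["banned"]
```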
Next Chapter: Graph Managers