Monitors - scrapinghub/shub-workflow GitHub Wiki
Previous Chapter: Graph Managers with Frontera
- Introduction
- Spiders grouping
- Additional spider stats
- Scripts stats
- Stats from script logs
- Extending capabilities
- Spider job hook
- Script job hook
- Additional check methods
- Stats post processing
- Stats ratios
- Stats hooks
- Sentry alerts
- Slack alerts
- History tracking of monitor stats
- Aggregated stats in Crawlmanager
Spidermon is the standard, complete package for monitoring spider jobs, raising alerts and generating stats in the context of a single spider job. But when a workflow involves many spider jobs and scripts, you usually need a broader view of what is going on across the entire workflow, and Spidermon does not provide it. In addition, events that are usually a bad sign within a single job, like zero items extracted, are not necessarily errors in the context of a large number of jobs. Think, for example, of jobs that run searches all the time: some will yield results and others will not.
shub-workflow provides an easily configurable base monitor class that lets you quickly set up a monitor script that aggregates and generates stats from spider and script stats and logs, over any custom period of time. The base monitor class already gathers and aggregates some default stats, like Scrapy downloader responses, scraped items and jobs. For example, this is the most basic monitor you can build:
from shub_workflow.utils.monitor import BaseMonitor

class Monitor(BaseMonitor):
    pass

if __name__ == "__main__":
    import logging
    from shub_workflow.utils import get_kumo_loglevel

    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
    script = Monitor()
    script.run()
Then run the help on the resulting script:
> python monitor.py -h
Couldn't find Dash project settings, probably not running in Dash
usage: You didn't set description for this script. Please set description property accordingly. [-h] [--project-id PROJECT_ID] [--flow-id FLOW_ID] [--children-tag CHILDREN_TAG] [--period PERIOD]
[--end-time END_TIME] [--start-time START_TIME]
options:
-h, --help show this help message and exit
--project-id PROJECT_ID
Either numeric id, or entry keyword in scrapinghub.yml. Overrides target project id.
--flow-id FLOW_ID If given, use the given flow id.
--children-tag CHILDREN_TAG, -t CHILDREN_TAG
Additional tag added to the scheduled jobs. Can be given multiple times.
--load-sc-settings If provided, and running on a local environment, load scrapy cloud settings.
--subject SUBJECT Set alert message subject.
--period PERIOD, -p PERIOD
Time window period in seconds. Default: 86400
--end-time END_TIME, -e END_TIME
End side of the time window. By default it is just now. Format: any string that can be recognized by dateparser.
--start-time START_TIME, -s START_TIME
Starting side of the time window. By default it is the end side minus the period. Format: any string that can be recognized by dateparser.
As you can see from the help, the default time window that the monitor explores is the last 24 hours, ending right now. But let's say you want to print all stats for all the spiders that ran in your project in the last hour:
> python monitor.py --project-id=<project id> -s "one hour ago"
python monitor.py --project-id=production -s "one hour ago"
Couldn't find Dash project settings, probably not running in Dash
2024-08-08 15:46:51,809 shub_workflow.script [WARNING]: SHUB_JOBKEY not set: not running on ScrapyCloud.
2024-08-08 15:46:51,841 shub_workflow.utils.monitor [INFO]: Start time: 2024-08-08 14:46:51.839459
2024-08-08 15:46:51,841 shub_workflow.utils.monitor [INFO]: End time: 2024-08-08 15:46:51.809693
2024-08-08 15:46:51,841 shub_workflow.utils.monitor [INFO]: Checking script_logs...
2024-08-08 15:46:51,841 shub_workflow.utils.monitor [INFO]: Checking scripts_stats...
2024-08-08 15:46:51,841 shub_workflow.utils.monitor [INFO]: Checking spiders...
2024-08-08 15:46:57,525 shub_workflow.script [INFO]: {'downloader/response_count/35photo': 1655,
'downloader/response_count/500px': 6545,
'downloader/response_count/behance': 9976,
'downloader/response_count/deviantart': 28275,
'downloader/response_count/imgur': 13417,
'downloader/response_count/instagram': 11959,
'downloader/response_count/pinterest': 3543,
'jobs/35photo': 8,
'jobs/500px': 7,
'jobs/behance': 1,
'jobs/deviantart': 60,
'jobs/imgur': 301,
'jobs/instagram': 90,
'jobs/pinterest': 4,
'scraped_items/35photo': 849,
'scraped_items/500px': 12,
'scraped_items/behance': 8606,
'scraped_items/deviantart': 10971,
'scraped_items/imgur': 16415,
'scraped_items/instagram': 8232,
'scraped_items/pinterest': 3188,
'scraped_items/total': 48472}
The same for the entire previous day:
> python monitor.py --project-id=production -e "today at 0:00 GMT"
Couldn't find Dash project settings, probably not running in Dash
2024-08-08 16:12:46,632 shub_workflow.script [WARNING]: SHUB_JOBKEY not set: not running on ScrapyCloud.
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: Period set: 1 day, 0:00:00
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: Start time: 2024-08-06 21:00:00
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: End time: 2024-08-07 21:00:00
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: Checking script_logs...
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: Checking scripts_stats...
2024-08-08 16:12:46,668 shub_workflow.utils.monitor [INFO]: Checking spiders...
2024-08-08 16:12:57,949 shub_workflow.utils.monitor [INFO]: Read 1000 jobs
2024-08-08 16:13:01,692 shub_workflow.utils.monitor [INFO]: Read 2000 jobs
2024-08-08 16:13:05,377 shub_workflow.utils.monitor [INFO]: Read 3000 jobs
2024-08-08 16:13:09,081 shub_workflow.utils.monitor [INFO]: Read 4000 jobs
2024-08-08 16:13:12,947 shub_workflow.utils.monitor [INFO]: Read 5000 jobs
2024-08-08 16:13:16,638 shub_workflow.utils.monitor [INFO]: Read 6000 jobs
2024-08-08 16:13:20,439 shub_workflow.utils.monitor [INFO]: Read 7000 jobs
2024-08-08 16:13:24,368 shub_workflow.utils.monitor [INFO]: Read 8000 jobs
2024-08-08 16:13:28,422 shub_workflow.utils.monitor [INFO]: Read 9000 jobs
2024-08-08 16:13:31,799 shub_workflow.utils.monitor [INFO]: Read 10000 jobs
2024-08-08 16:13:35,587 shub_workflow.utils.monitor [INFO]: Read 11000 jobs
2024-08-08 16:13:39,580 shub_workflow.utils.monitor [INFO]: Read 12000 jobs
2024-08-08 16:13:43,327 shub_workflow.utils.monitor [INFO]: Read 13000 jobs
2024-08-08 16:13:47,874 shub_workflow.utils.monitor [INFO]: Read 14000 jobs
2024-08-08 16:13:47,905 shub_workflow.script [INFO]: {'downloader/response_count/35photo': 169251,
'downloader/response_count/500px': 26526,
'downloader/response_count/behance': 567736,
'downloader/response_count/deviantart': 1582627,
'downloader/response_count/fandom': 534506,
'downloader/response_count/imgur': 356781,
'downloader/response_count/instagram': 233300,
'downloader/response_count/pexels': 535933,
'downloader/response_count/pinterest': 126707,
'jobs/35photo': 636,
'jobs/500px': 588,
'jobs/behance': 80,
'jobs/deviantart': 2447,
'jobs/fandom': 6,
'jobs/imgur': 8393,
'jobs/instagram': 1695,
'jobs/pexels': 105,
'jobs/pinterest': 172,
'scraped_items/35photo': 87439,
'scraped_items/500px': 200,
'scraped_items/behance': 496376,
'scraped_items/deviantart': 611786,
'scraped_items/fandom': 509872,
'scraped_items/imgur': 513995,
'scraped_items/instagram': 155828,
'scraped_items/pexels': 252080,
'scraped_items/pinterest': 115061,
'scraped_items/total': 2744043}
As you can see, the default aggregated stats are scraped items per spider (plus a total), responses per spider, and number of jobs per spider, computed over all spider jobs that finished in the selected time window. But if you provide the command line option --flow-id, or the monitor script is scheduled from a graph manager, the monitor selects only the jobs that belong to the same flow id.
You can also see that the -s and -e options are very flexible for defining specific time windows, thanks to their support of any string that dateparser can understand. Notice that the applied time window is printed in the log (here, the printed hour limits differ from the requested ones because the requested end hour is 0:00 GMT and the test was run in a UTC-3 time zone). In addition, the -p option sets the size of the time window (in seconds) for the case where the -s option, which has no default value, is missing. By default, with no time window parameters specified, the end of the time window is just now, and the start is just now minus the period given by -p, which defaults to 1 day (86400 seconds). Whenever -s is provided, it overrides the value of -p. An important notion is that the time window filters the spider and script jobs to collect stats from, based on their finish time. For long-lived scripts, you can instead generate aggregated monitor stats from the script log; in that case the time window is used to select log lines based on their timestamps. The following sections explain how to configure the different capturing modes.
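The precedence between -s, -e and -p described above can be sketched in a few lines. This is only an illustration, not the monitor's actual code: the function name resolve_window is hypothetical, and it takes already-parsed datetime objects, whereas the real options accept any string dateparser understands.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def resolve_window(
    now: datetime,
    period_secs: int = 86400,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the time window (hypothetical helper)."""
    # -e defaults to "just now".
    end = end or now
    # -s, when given, overrides -p; otherwise the window is period_secs long.
    if start is None:
        start = end - timedelta(seconds=period_secs)
    return start, end


now = datetime(2024, 8, 8, 15, 46, 51)
print(resolve_window(now))  # default: the last 24 hours
print(resolve_window(now, start=now - timedelta(hours=1)))  # -s "one hour ago"
print(resolve_window(now, end=datetime(2024, 8, 8, 0, 0)))  # -e only: one day back from the end
```

With only an end given, the start falls one period before it, which is the behaviour seen in the "previous day" run above.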
The basic instantiation of a monitor from the BaseMonitor class aggregates stats, as we saw above, for any spider job found within the selected conditions (time window, and flow id if any). But you may want to capture stats separately for different groups of spiders, and only for specific ones. For these purposes, the monitor has a specialized attribute, target_spider_classes:
(...)
from scrapy import Spider
(...)
class BaseMonitor(BaseScript):
    # a map from spiders classes to check, to a stats prefix to identify the aggregated stats.
    target_spider_classes: Dict[Type[Spider], str] = {Spider: ""}
    (...)
This attribute is a dictionary that maps a spider class to a string, which is added as a prefix to the stats captured from the jobs of the spiders that subclass the indicated class. As you can see above, the default value of this attribute is {Spider: ""}, which means that the stats from all spiders are captured and the resulting aggregated stats are prefixed with an empty string, that is, no prefix is added.
Let's suppose we have a project where we discover video urls from different sources, and a separate spider actually downloads the videos from the discovered urls. We want separate stats for the discovery and for the download. We can define our monitor in this way:
from shub_workflow.utils.monitor import BaseMonitor
from myproject.spiders import BaseDiscoverySpider
from myproject.spiders.downloader import DownloaderSpider

class Monitor(BaseMonitor):
    target_spider_classes = {BaseDiscoverySpider: "discovery", DownloaderSpider: "vdownload"}
    (...)
The resulting stats:
{'discovery/downloader/response_count/tiktok': 183168,
'discovery/downloader/response_count/youtube': 9257802,
'discovery/jobs/tiktok': 6508,
'discovery/jobs/youtube': 10552,
'discovery/scraped_items/tiktok': 156325,
'discovery/scraped_items/total': 8753838,
'discovery/scraped_items/youtube': 8597513,
'vdownload/downloader/response_count/downloader': 322440,
'vdownload/jobs/downloader': 9,
'vdownload/scraped_items/downloader': 319999,
'vdownload/scraped_items/total': 319999}
So discovery spider stats are prefixed with "discovery/", while downloader stats are prefixed with "vdownload/", which lets you see the stats of the different spider groups of the workflow separately.
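The class-to-prefix mapping can be pictured with a small standalone sketch. It is an assumption about the mechanics, not the monitor's code: Spider here is a local stand-in for scrapy.Spider, TiktokSpider and OtherSpider are invented for the example, and monitor_stat_name is a hypothetical helper that rebuilds the prefix/stat/spider naming seen in the results above.

```python
from typing import Dict, Optional, Type


class Spider:  # stand-in for scrapy.Spider, so the sketch runs without Scrapy
    name = ""


class BaseDiscoverySpider(Spider):
    pass


class TiktokSpider(BaseDiscoverySpider):  # hypothetical discovery spider
    name = "tiktok"


class DownloaderSpider(Spider):
    name = "downloader"


class OtherSpider(Spider):  # hypothetical spider outside the target groups
    name = "other"


def monitor_stat_name(
    spidercls: Type[Spider], stat: str, targets: Dict[Type[Spider], str]
) -> Optional[str]:
    """Build the aggregated stat name, or None if the spider is not targeted."""
    for base, prefix in targets.items():
        if issubclass(spidercls, base):
            # An empty prefix is simply dropped from the name.
            return "/".join(part for part in (prefix, stat, spidercls.name) if part)
    return None


targets = {BaseDiscoverySpider: "discovery", DownloaderSpider: "vdownload"}
print(monitor_stat_name(TiktokSpider, "scraped_items", targets))
print(monitor_stat_name(DownloaderSpider, "jobs", targets))
print(monitor_stat_name(OtherSpider, "jobs", targets))
print(monitor_stat_name(OtherSpider, "jobs", {Spider: ""}))
```

The last call shows why the default {Spider: ""} captures every spider: every spider class is a subclass of Spider, and the empty prefix leaves the stat name untouched.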
As mentioned before, by default the stats collected from spiders are the jobs, the scraped items and the response counts. But you may also want to collect other stats, not only standard Scrapy ones but also any custom ones added by the spiders. A tuple attribute, target_spider_stats, is established for this purpose:
(...)
class BaseMonitor(BaseScript):
    (...)
    # stats aggregated from spiders. A tuple of stats regex prefixes.
    target_spider_stats: Tuple[str, ...] = ()
    (...)
Let's continue with the previous example by setting additional collected stats prefixes, in this case dropped_videos/length_exceeded, delivered_count/ and delivered_mb/:
from shub_workflow.utils.monitor import BaseMonitor
from myproject.spiders import BaseDiscoverySpider
from myproject.spiders.downloader import DownloaderSpider

class Monitor(BaseMonitor):
    target_spider_classes = {BaseDiscoverySpider: "discovery", DownloaderSpider: "vdownload"}
    target_spider_stats = ("dropped_videos/length_exceeded", "delivered_count/", "delivered_mb/")
    (...)
And here a sample result:
{'discovery/downloader/response_count/tiktok': 179653,
'discovery/downloader/response_count/youtube': 8951749,
'discovery/dropped_videos/length_exceeded/10_to_20/tiktok': 102,
'discovery/dropped_videos/length_exceeded/10_to_20/total': 8707,
'discovery/dropped_videos/length_exceeded/10_to_20/youtube': 8605,
'discovery/dropped_videos/length_exceeded/20_to_30/tiktok': 24,
'discovery/dropped_videos/length_exceeded/20_to_30/total': 3094,
'discovery/dropped_videos/length_exceeded/20_to_30/youtube': 3070,
'discovery/dropped_videos/length_exceeded/30_more/tiktok': 7,
'discovery/dropped_videos/length_exceeded/30_more/total': 8109,
'discovery/dropped_videos/length_exceeded/30_more/youtube': 8102,
'discovery/dropped_videos/length_exceeded/4_to_6/tiktok': 1904,
'discovery/dropped_videos/length_exceeded/4_to_6/total': 8666,
'discovery/dropped_videos/length_exceeded/4_to_6/youtube': 6762,
'discovery/dropped_videos/length_exceeded/6_to_10/tiktok': 1081,
'discovery/dropped_videos/length_exceeded/6_to_10/total': 8427,
'discovery/dropped_videos/length_exceeded/6_to_10/youtube': 7346,
'discovery/dropped_videos/length_exceeded/tiktok': 3118,
'discovery/dropped_videos/length_exceeded/total': 37003,
'discovery/dropped_videos/length_exceeded/youtube': 33885,
'discovery/jobs/tiktok': 6382,
'discovery/jobs/youtube': 11424,
'discovery/scraped_items/tiktok': 164604,
'discovery/scraped_items/total': 8461749,
'discovery/scraped_items/youtube': 8297145,
'vdownload/delivered_count/youtube': 272024,
'vdownload/delivered_mb/youtube': 5934998.122000018,
'vdownload/downloader/response_count/downloader': 322440,
'vdownload/jobs/downloader': 9,
'vdownload/scraped_items/downloader': 319999,
'vdownload/scraped_items/total': 319999}
Again we have aggregated stats grouped by discovery and vdownload, with the same default stats already collected in the previous examples. In addition, whenever any of the spiders in the target groups has stats with the prefixes declared in target_spider_stats, they are collected and aggregated as well.
Our monitor also supports aggregation of stats generated by scripts. As scripts are so diverse in function, they don't generate stats by default, but every script subclassed from the shub_workflow.script.BaseScript class does have a stats object that allows the coder to log any required stats, similar to what spiders do natively. In fact, the BaseScript stats collector class is the same one defined for spiders, configured by the project STATS_CLASS Scrapy setting.
So, assuming your scripts generate stats, the shub_workflow monitor can collect and aggregate them. The configuration attribute for this purpose is target_script_stats, defined as:
(...)
class BaseMonitor(BaseScript):
    (...)
    # - a map from script name into a tuple of 2-elem tuples (aggregating stat regex, aggregated stat prefix)
    # - the aggregating stat regex is used to match stat on target script
    # - the aggregated stat prefix is used to generate the monitor stat. The original stat name is appended to
    #   the prefix.
    # - if a group is present in the regex, its value is used as suffix of the generated stat, instead of
    #   the complete original stat name.
    target_script_stats: Dict[str, Tuple[Tuple[str, str], ...]] = {}
    (...)
Continuing with our example:
from shub_workflow.utils.monitor import BaseMonitor
from myproject.spiders import BaseDiscoverySpider
from myproject.spiders.downloader import DownloaderSpider

class Monitor(BaseMonitor):
    target_spider_classes = {BaseDiscoverySpider: "discovery", DownloaderSpider: "vdownload"}
    target_spider_stats = ("dropped_videos/length_exceeded", "delivered_count/", "delivered_mb/")
    target_script_stats = {
        "py:deliver.py": ((r"^videos/(?!total)(.+)_time_secs", "delivered_time_secs"),),
    }
    (...)
For context, let's suppose that the py:deliver.py script generates the following stats:
videos/youtube_time_secs
videos/tiktok_time_secs
videos/total_time_secs
We want to capture the first two, but not the total one (otherwise it would be wrongly aggregated as if there were another spider called total, when it is already the sum of the youtube and tiktok ones).
So, the defined target_script_stats instructs the monitor to:
- collect, in the py:deliver.py script, the stats that match the regex r"^videos/(.+)_time_secs", except videos/total_time_secs. That is why the final regex is r"^videos/(?!total)(.+)_time_secs".
- generate stats with the prefix delivered_time_secs, followed by the value of the regex group, in this case either tiktok or youtube.
The result:
{'delivered_time_secs/total': 91403985,
'delivered_time_secs/twitch': 2206250,
'delivered_time_secs/youtube': 89197735,
'discovery/downloader/response_count/tiktok': 175636,
'discovery/downloader/response_count/youtube': 10131111,
'discovery/dropped_videos/length_exceeded/10_to_20/tiktok': 162,
'discovery/dropped_videos/length_exceeded/10_to_20/total': 8205,
'discovery/dropped_videos/length_exceeded/10_to_20/youtube': 8043,
'discovery/dropped_videos/length_exceeded/20_to_30/tiktok': 19,
'discovery/dropped_videos/length_exceeded/20_to_30/total': 10019,
'discovery/dropped_videos/length_exceeded/20_to_30/youtube': 10000,
'discovery/dropped_videos/length_exceeded/30_more/tiktok': 10,
'discovery/dropped_videos/length_exceeded/30_more/total': 6291,
'discovery/dropped_videos/length_exceeded/30_more/youtube': 6281,
'discovery/dropped_videos/length_exceeded/4_to_6/tiktok': 2508,
'discovery/dropped_videos/length_exceeded/4_to_6/total': 10860,
'discovery/dropped_videos/length_exceeded/4_to_6/youtube': 8352,
'discovery/dropped_videos/length_exceeded/6_to_10/tiktok': 1501,
'discovery/dropped_videos/length_exceeded/6_to_10/total': 9886,
'discovery/dropped_videos/length_exceeded/6_to_10/youtube': 8385,
'discovery/dropped_videos/length_exceeded/tiktok': 4200,
'discovery/dropped_videos/length_exceeded/total': 45261,
'discovery/dropped_videos/length_exceeded/youtube': 41061,
'discovery/jobs/tiktok': 6201,
'discovery/jobs/youtube': 10119,
'discovery/scraped_items/tiktok': 191510,
'discovery/scraped_items/total': 9585329,
'discovery/scraped_items/youtube': 9393819,
'vdownload/delivered_count/twitch': 80348,
'vdownload/delivered_count/youtube': 1053336,
'vdownload/delivered_mb/twitch': 1360238.9330000002,
'vdownload/delivered_mb/youtube': 22628197.08300002,
'vdownload/downloader/response_count/downloader': 1400806,
'vdownload/jobs/downloader': 69,
'vdownload/scraped_items/downloader': 1385000,
'vdownload/scraped_items/total': 1385000}
Notice the new delivered_time_secs stats. The total here is not a capture of videos/total_time_secs, but the sum of the collected per-source stats.
In your project, you can add stats capture for every script you need, and you can add any number of (regex, stat prefix) pairs.
In all the collecting modes explained above, the monitor collects the final stats of spider and script jobs that finished within the selected time window. But some projects have long-lived scripts that run for days, weeks or even months, for which this kind of stats collection is not adequate. You can still collect their stats, but they will reflect only the current status of the job; the history required for building aggregated stats for specific time windows is lost. In general, whenever the time window is much smaller than the lifetime of the jobs you want to monitor, this problem is present, even for short-lived scripts and spiders. However, in the majority of these cases, collecting final stats is what we want and we don't care about their history.
But when we do need to collect historic data, and this is especially the case when there are long-running scripts that process data and other jobs continuously, the monitor can generate stats from the script logs. To configure stats collection from logs, the monitor has the attribute target_script_logs:
(...)
class BaseMonitor(BaseScript):
    (...)
    # - a map from script name into a tuple of 2-elem tuples (aggregating log regex, aggregated stat name)
    # - the aggregating log regex must match log lines in the target script job, with a group to extract a number
    #   from it. If no number group is extracted, the match alone aggregates 1.
    # - the final aggregated stat name is the specified aggregated stat name, plus a second non numeric group in the
    #   match, if exists.
    target_script_logs: Dict[str, Tuple[Tuple[str, str], ...]] = {}
    (...)
Let's see how this works with an example that continues the previous one:
from shub_workflow.utils.monitor import BaseMonitor
from myproject.spiders import BaseDiscoverySpider
from myproject.spiders.downloader import DownloaderSpider

class Monitor(BaseMonitor):
    target_spider_classes = {BaseDiscoverySpider: "discovery", DownloaderSpider: "vdownload"}
    target_spider_stats = ("dropped_videos/length_exceeded", "delivered_count/", "delivered_mb/")
    target_script_stats = {
        "py:deliver.py": ((r"^videos/(?!total)(.+)_time_secs", "delivered_time_secs"),),
    }
    target_script_logs = {
        "py:balancer.py": (
            (r"Wrote (\d+).+?([a-z]+)_\d+T\d+.jl.gz", "new_urls"),
            (r"Wrote (\d+).+?Filter_([a-z]+)_\d+T\d+.jl.gz", "new_filters_urls"),
        ),
    }
    (...)
Here target_script_logs instructs the monitor to capture logs from a script named "py:balancer.py". Two regular expressions are defined: one aggregates into the new_urls stat prefix, while the other aggregates into the new_filters_urls stat prefix. For context, the first regex is intended to match log lines like this one:
Wrote 40000 records to youtube_20240803T160512.jl.gz
and its regex groups are defined to match the number (in this case, 40000) and the stat suffix (in this case, youtube). So, 40000 will be added to a stat called new_urls/youtube.
The second regex tries to match lines like this one:
Wrote 17000 records to CHNFilter_youtube_20240803T174413.jl.gz
The two groups match 17000 and youtube, so in this case 17000 is added to a stat called new_filters_urls/youtube. Notice that the first regex also matches the lines matched by the second regex, but not the other way around. So the first stat prefix is more general and will aggregate this example line too. This of course is not a requirement; it is only the case for this tutorial example.
The groups are optional. If no number group is extracted, each match simply adds 1. If no suffix group is present, the resulting stat name is just the prefix. In addition, the number and the suffix groups, when both are present, can be in any order.
And here the sample result for the example above:
{'delivered_time_secs/total': 94604203,
'delivered_time_secs/twitch': 2315144,
'delivered_time_secs/youtube': 92289059,
'discovery/downloader/response_count/tiktok': 175811,
'discovery/downloader/response_count/youtube': 10141688,
'discovery/dropped_videos/length_exceeded/10_to_20/tiktok': 148,
'discovery/dropped_videos/length_exceeded/10_to_20/total': 7738,
'discovery/dropped_videos/length_exceeded/10_to_20/youtube': 7590,
'discovery/dropped_videos/length_exceeded/20_to_30/tiktok': 18,
'discovery/dropped_videos/length_exceeded/20_to_30/total': 9736,
'discovery/dropped_videos/length_exceeded/20_to_30/youtube': 9718,
'discovery/dropped_videos/length_exceeded/30_more/tiktok': 9,
'discovery/dropped_videos/length_exceeded/30_more/total': 4575,
'discovery/dropped_videos/length_exceeded/30_more/youtube': 4566,
'discovery/dropped_videos/length_exceeded/4_to_6/tiktok': 2461,
'discovery/dropped_videos/length_exceeded/4_to_6/total': 10287,
'discovery/dropped_videos/length_exceeded/4_to_6/youtube': 7826,
'discovery/dropped_videos/length_exceeded/6_to_10/tiktok': 1471,
'discovery/dropped_videos/length_exceeded/6_to_10/total': 9486,
'discovery/dropped_videos/length_exceeded/6_to_10/youtube': 8015,
'discovery/dropped_videos/length_exceeded/tiktok': 4107,
'discovery/dropped_videos/length_exceeded/total': 41822,
'discovery/dropped_videos/length_exceeded/youtube': 37715,
'discovery/jobs/tiktok': 6203,
'discovery/jobs/youtube': 10163,
'discovery/scraped_items/tiktok': 188387,
'discovery/scraped_items/total': 9594702,
'discovery/scraped_items/youtube': 9406315,
'vdownload/delivered_count/twitch': 82660,
'vdownload/delivered_count/youtube': 1086359,
'vdownload/delivered_mb/twitch': 1388436.563,
'vdownload/delivered_mb/youtube': 23503451.690000016,
'vdownload/downloader/response_count/downloader': 1443307,
'vdownload/jobs/downloader': 71,
'vdownload/scraped_items/downloader': 1427500,
'vdownload/scraped_items/total': 1427500,
'new_filters_urls': 600000,
'new_filters_urls/youtube': 600000,
'new_urls': 739342,
'new_urls/twitch': 139342,
'new_urls/youtube': 600000}
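The log matching rules can be tried out with a short standalone sketch. It reuses the two regexes and the two sample log lines from this section, but aggregate_logs is a hypothetical function, not the monitor's code. As simplifications, it ignores the time window filtering by log timestamp, tells the number group from the suffix group with str.isdigit(), and does not produce the bare new_urls style totals that appear in the sample result.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

log_rules = (
    (r"Wrote (\d+).+?([a-z]+)_\d+T\d+.jl.gz", "new_urls"),
    (r"Wrote (\d+).+?Filter_([a-z]+)_\d+T\d+.jl.gz", "new_filters_urls"),
)
log_lines = [
    "Wrote 40000 records to youtube_20240803T160512.jl.gz",
    "Wrote 17000 records to CHNFilter_youtube_20240803T174413.jl.gz",
]


def aggregate_logs(lines: Iterable[str], rules: Tuple[Tuple[str, str], ...]) -> Dict[str, int]:
    """Turn matching log lines into aggregated stats (hypothetical helper)."""
    stats: Dict[str, int] = defaultdict(int)
    for line in lines:
        for regex, stat_name in rules:
            match = re.search(regex, line)
            if match is None:
                continue
            amount, suffix = 1, None  # without a number group, a match counts as 1
            for group in match.groups():
                if group is None:
                    continue
                if group.isdigit():
                    amount = int(group)
                else:
                    suffix = group
            stats[f"{stat_name}/{suffix}" if suffix else stat_name] += amount
    return dict(stats)


print(aggregate_logs(log_lines, log_rules))
```

Because the first regex also matches the second line, new_urls/youtube receives both amounts, while new_filters_urls/youtube receives only the second one.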
The configuration attributes described in the previous sections cover a broad set of common use cases, but they may not cover others. For example, we may want to count the files in some filesystem (e.g. s3, gcs), or count the number of jobs with some specific argument, and so on.
The monitor admits two kinds of extensions: job hooks and check methods. Job hooks come in two flavours: spider job hooks and script job hooks.
The base monitor has the following method:
(...)
class BaseMonitor(BaseScript):
    (...)
    def spider_job_hook(self, jobdict: JobDict):
        """
        This is called for every spider job retrieved, from the spiders declared on target_spider_classes,
        so additional per job customization can be added.
        """
    (...)
which is called on every spider job that is selected by the target_spider_classes attribute and matches the other selectors, like flow id or time window. The parameter passed to this method is a JobDict object containing the spider name, spider_args, scrapystats, tags and close_reason. The default method does nothing, but you can override it.
Example override in our project monitor:
from shub_workflow.script import JobDict
from shub_workflow.utils.monitor import BaseMonitor

class Monitor(BaseMonitor):
    (...)
    def spider_job_hook(self, jobdict: JobDict):
        if jobdict["spider"] == "downloader" and "Filter_youtube_" in jobdict["spider_args"]["source"]:
            self.stats.inc_value(
                "vdownload/delivered_count/youtube/filters", jobdict["scrapystats"].get("delivered_count/youtube", 0)
            )
This example code aggregates the spider job stat delivered_count/youtube, for a specific spider with a specific pattern in a spider argument, into the monitor stat vdownload/delivered_count/youtube/filters. Notice that this is basically what the combination of target_spider_classes and target_spider_stats does: get some stat from spiders and aggregate it into a target monitor stat. However, those attributes cannot select jobs based on spider arguments, so more complex use cases like this one can be handled via this hook.
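As another illustration of per-job logic that the declarative attributes cannot express, the sketch below tallies jobs by close reason. It is standalone so it can run without shub-workflow: a Counter stands in for the monitor's stats collector, plain dicts stand in for JobDict objects (using only the spider and close_reason keys listed above), and both the function name and the jobs/close_reason/... stat names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict


def tally_close_reason(stats: Counter, jobdict: Dict[str, Any]) -> None:
    """Count jobs per close reason and spider (hypothetical hook body)."""
    reason = jobdict.get("close_reason") or "unknown"
    stats[f"jobs/close_reason/{reason}/{jobdict['spider']}"] += 1


stats: Counter = Counter()
for job in (
    {"spider": "youtube", "close_reason": "finished"},
    {"spider": "youtube", "close_reason": "cancelled"},
    {"spider": "youtube", "close_reason": "finished"},
    {"spider": "tiktok", "close_reason": None},
):
    tally_close_reason(stats, job)

print(dict(stats))
```

Inside a real spider_job_hook override, the increment would presumably go through self.stats.inc_value(), as in the example above.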
Similarly, there is an equivalent hook for scripts, called for the ones declared in target_script_stats:
(...)
class BaseMonitor(BaseScript):
    (...)
    def script_job_hook(self, jobdict: JobDict):
        """
        This is called for every script job retrieved, from the scripts declared on target_script_stats,
        so additional per job customization can be added
        """
    (...)
If you don't need to aggregate any specific stat using target_script_stats, and only need to declare a script so that it is handled by this hook, you can just set the value of the corresponding entry to an empty tuple:
from shub_workflow.script import JobDict
from shub_workflow.utils.monitor import BaseMonitor

class Monitor(BaseMonitor):
    target_script_stats = {
        "py:myscript.py": ()
    }

    def script_job_hook(self, jobdict: JobDict):
        if jobdict["spider"] == "py:myscript.py":
            (...)
If you have a single script declared in target_script_stats, it may seem redundant to check in the script hook whether the job's script name is the one you want to target. But if you later add more scripts to target_script_stats, you may forget that the script job hook exists. So, for safety, it is a good idea to check the script name anyway.
Up to now we have only reviewed alternatives for aggregating stats from scripts and spiders. But you may want to monitor content in a filesystem, available requests in a Zyte API account, or anything else. You can extend the capabilities of the monitor by adding a method whose name starts with the prefix check_ and that receives two numbers: the start limit and the end limit of the time window (in epoch seconds):
from shub_workflow.utils.monitor import BaseMonitor

class Monitor(BaseMonitor):
    (...)
    def check_<whatever>(self, start_limit, end_limit):
        (...)
    (...)
Any method that starts with check_ will automatically be run by the monitor. In fact, the base monitor already contains two such methods, check_spiders() and check_scripts(), which perform everything we have explained up to the previous section of this document.
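One way such prefix-based discovery can work is sketched below. This is a generic Python pattern, not the monitor's actual code: MiniMonitor, run_checks and check_storage are hypothetical names, and the alphabetical execution order is just a property of this sketch.

```python
from typing import List


class MiniMonitor:
    """Toy class that runs every method whose name starts with check_."""

    def __init__(self) -> None:
        self.ran: List[str] = []

    def check_spiders(self, start_limit: float, end_limit: float) -> None:
        self.ran.append("spiders")

    def check_storage(self, start_limit: float, end_limit: float) -> None:
        # Hypothetical custom check, e.g. counting files in a bucket.
        self.ran.append("storage")

    def helper(self) -> None:
        # Not prefixed with check_, so it is never called automatically.
        self.ran.append("helper")

    def run_checks(self, start_limit: float, end_limit: float) -> None:
        for name in sorted(dir(self)):
            if not name.startswith("check_"):
                continue
            method = getattr(self, name)
            if callable(method):
                method(start_limit, end_limit)


monitor = MiniMonitor()
monitor.run_checks(0, 86400)
print(monitor.ran)
```

Adding a new check then only requires defining another check_ method; no registration step is needed.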
There is a callback that will be called after all the check methods have been executed:
(...)
class BaseMonitor(BaseScript):
    (...)
    def stats_postprocessing(self, start_limit, end_limit):
        """
        Apply here any additional code for post processing stats, generate derived stats, etc.
        """
    (...)
By default, it does nothing. But you can override it to post-process the stats generated during the checks (you can access the collected stats via the self.stats attribute, as usual): printing reports, computing additional stats from the collected ones, saving all stats in a collection to keep historical records, etc. The parameters received are the two ends of the time window, as in the check methods explained above.
Within the stats_postprocessing() method you can derive many other stats from the collected ones. But computing ratios between stats is a very frequent operation, and the code to compute them is very similar each time. So our BaseMonitor class provides a shortcut that avoids repeating post-processing code:
class BaseMonitor(BaseScript):
    (...)
    # Define here ratios computations. Each 3-tuple contains:
    # - a string or regex that matches the numerator stat. You can use a regex group here for computing multiple ratios
    # - a string or regex that matches the denominator stat. You can use a regex group for matching numerator group and
    #   have a different denominator for each one. If no regex group in denominator, there will be a single
    #   denominator.
    # - the target stat. If there are numerator groups, this will be the target stat prefix.
    # Ratios computing are performed after the stats_postprocessing() method is called. It is a stat postprocessing
    # itself.
    stats_ratios: Tuple[Tuple[str, str, str], ...] = ()
    (...)
An example of usage of this feature:
from shub_workflow.utils.monitor import BaseMonitor

class Monitor(BaseMonitor):
    (...)
    stats_ratios = (
        (
            "scrapy-zyte-api/429/total",
            "scrapy-zyte-api/processed/total",
            "scrapy-zyte-api/429/total/ratio",
        ),
        (
            "spidermon/validation/.+?/errors/.+/(.+)",
            "scraped_items/(.+)",
            "spidermon/validation/errors/rate",
        ),
    )
    (...)
The first tuple instructs the monitor to compute the ratio between the stats scrapy-zyte-api/429/total (numerator) and scrapy-zyte-api/processed/total (denominator) and save the result into the new stat scrapy-zyte-api/429/total/ratio. This is an example result:
(...)
'scrapy-zyte-api/429/total': 36915,
'scrapy-zyte-api/429/total/ratio': 0.0114,
(...)
'scrapy-zyte-api/processed/total': 3231330,
(...)
Of course, the operand stats were generated previously in the monitor, probably using the attribute target_spider_stats described before.
The second tuple in the stats_ratios structure instructs the monitor to compute the ratio between validation errors and scraped items for different suffixes (denoted by the group (.+) in the stats regexes), and save the results into various stats prefixed by spidermon/validation/errors/rate, with the corresponding suffix. Example result:
(...)
'scraped_items/35photo': 3496,
'scraped_items/500px': 19610,
'scraped_items/artstation': 78913,
'scraped_items/behance': 89749,
'scraped_items/deviantart': 144816,
'scraped_items/fandom': 2200534,
'scraped_items/imgur': 279681,
'scraped_items/instagram': 30300,
'scraped_items/pexels': 43440,
'scraped_items/pinterest': 8017,
'scraped_items/total': 2880133,
(...)
'spidermon/validation/errors/rate/35photo': 0.0003,
'spidermon/validation/errors/rate/artstation': 0.0011,
'spidermon/validation/errors/rate/fandom': 0.0,
'spidermon/validation/errors/rate/total': 0.0,
(...)
'spidermon/validation/fields/errors/missing_required_field/fandom': 2,
'spidermon/validation/fields/errors/missing_required_field/images.0.url/fandom': 1,
'spidermon/validation/fields/errors/missing_required_field/images.0.url/total': 1,
'spidermon/validation/fields/errors/missing_required_field/source_url/fandom': 1,
'spidermon/validation/fields/errors/missing_required_field/source_url/total': 1,
'spidermon/validation/fields/errors/missing_required_field/total': 2,
'spidermon/validation/seeds_fields/errors/usernames/35photo': 1,
'spidermon/validation/seeds_fields/errors/usernames/artstation': 87,
'spidermon/validation/seeds_fields/errors/usernames/total': 88,
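To make the matching rules more concrete, here is a minimal, self-contained sketch of how such ratios could be computed from a stats dictionary. It is not the shub-workflow implementation. The function name compute_ratios, the use of re.fullmatch, the summing of all numerator stats that share the same group value, and the rounding to four decimals are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Tuple


def compute_ratios(
    stats: Dict[str, float],
    stats_ratios: Tuple[Tuple[str, str, str], ...],
    ndigits: int = 4,
) -> Dict[str, float]:
    """Illustrative only: build target stats from (numerator, denominator, target) tuples."""
    result: Dict[str, float] = {}
    for num_regex, den_regex, target in stats_ratios:
        # Values are summed per regex group; "" is the key used when a regex has no group.
        numerators: Dict[str, float] = defaultdict(float)
        denominators: Dict[str, float] = defaultdict(float)
        for name, value in stats.items():
            for regex, bucket in ((num_regex, numerators), (den_regex, denominators)):
                match = re.fullmatch(regex, name)
                if match:
                    key = match.group(1) if match.groups() else ""
                    bucket[key] += value
        for key, numerator in numerators.items():
            # Prefer the denominator with the same group value, else the single ungrouped one.
            denominator = denominators.get(key, denominators.get(""))
            if not denominator:
                continue  # no matching denominator, or it is zero
            target_name = f"{target}/{key}" if key else target
            result[target_name] = round(numerator / denominator, ndigits)
    return result


# A few values taken from the example results shown above.
stats = {
    "scrapy-zyte-api/429/total": 36915,
    "scrapy-zyte-api/processed/total": 3231330,
    "scraped_items/35photo": 3496,
    "scraped_items/artstation": 78913,
    "scraped_items/total": 2880133,
    "spidermon/validation/seeds_fields/errors/usernames/35photo": 1,
    "spidermon/validation/seeds_fields/errors/usernames/artstation": 87,
}
stats_ratios = (
    ("scrapy-zyte-api/429/total", "scrapy-zyte-api/processed/total", "scrapy-zyte-api/429/total/ratio"),
    ("spidermon/validation/.+?/errors/.+/(.+)", "scraped_items/(.+)", "spidermon/validation/errors/rate"),
)
ratios = compute_ratios(stats, stats_ratios)
for name, value in sorted(ratios.items()):
    print(name, value)
```

In this sketch a numerator group only produces a ratio when a denominator with the same group value (or a single ungrouped denominator) exists, which is why no rate is produced for the suffix total in this reduced data set.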
Finally, you may want to react to specific conditions on any of the stats generated in the previous steps (checks and post processing). For example, you may want to send alerts via email or via some specialized alert system, or to store specific stats somewhere. All of this could be achieved by manually adding some code at the end of your stats_postprocessing() method. But the monitor provides a shortcut approach that is configured via the attribute stats_hooks and allows a better modularization of your code:
class BaseMonitor(BaseScript):
    (...)
    # A tuple of 2-elem tuples each one with a stat regex and the name of the monitor instance method that will receive:
    # - start and end limit of the window (epoch seconds)
    # - the value of the stat
    # - one extra argument per regex group, if any.
    # Useful for adding monitor alerts or any kind of reaction according to stat value.
    stats_hooks: Tuple[Tuple[str, str], ...] = ()
    (...)
The stats hooks defined there will be executed after stats_postprocessing()
is called. An example for our tutorial might be:
from shub_workflow.utils.monitor import BaseMonitor
class Monitor(BaseMonitor):
    (...)
    stats_hooks = (
        (r"new_urls/(.+)", "new_urls_hook"),
    )
    (...)
    def new_urls_hook(self, start_limit, end_limit, value, source):
        if value == 0:
            # f-string, so that {source} is actually interpolated
            print(f"Alert! for {source}: no new items extracted")
    (...)
In our example project we are generating stats like "new_urls/tiktok" and "new_urls/youtube". The regular expression defined in the example above will match any of these, and call new_urls_hook() with three fixed parameters (start_limit, end_limit and the value of the collected stat) plus an extra parameter coming from the regex group, in this case either "youtube", "tiktok", etc. The new_urls_hook() method defined in the example checks whether the value is zero and, in that case, prints an alert message.
Printed alert messages are not good enough for most cases. You really want to be notified when an alert condition occurs. shub-workflow provides a mixin that uses the spidermon sentry helper class SendSentryMessage in order to raise notifications on the Sentry system. Once spidermon is configured appropriately (see the external doc Spidermon Sentry action), you can add alerts to the last example in the following way:
from shub_workflow.utils.monitor import BaseMonitor
from shub_workflow.contrib.sentry import SentryMixin
class Monitor(SentryMixin, BaseMonitor):
    default_subject = "My Project Monitor Notification"
    (...)
    stats_hooks = (
        (r"new_urls/(.+)", "new_urls_hook"),
    )
    (...)
    def new_urls_hook(self, start_limit, end_limit, value, source):
        if value == 0:
            # f-string, so that {source} is actually interpolated
            self.append_message(f"Alert! for {source}: no new items extracted")
    (...)
And that's all. Notice the two changes with respect to the previous version: now the monitor is also a subclass of the SentryMixin class, and instead of print(message) we use self.append_message(message). Whether the message is printed to the console or actually sent to the Sentry server depends on the value of the setting SPIDERMON_SENTRY_FAKE, as described in the linked documentation.
Note: ensure you have installed in your project the packages spidermon and sentry-sdk.
In addition, you can set the alert notification subject via the attribute default_subject or the command line option --subject (which will override the default subject).
Similarly, shub_workflow provides a mixin for enabling Slack alerts. The usage is exactly the same as in the Sentry case, but instead of using shub_workflow.contrib.sentry.SentryMixin, you will use shub_workflow.contrib.slack.SlackMixin.
You may eventually need to monitor the monitors, that is, to see how the different aggregated stats from the monitors evolve over time. You can do this manually, by filling spreadsheets with the successive data generated by periodic runs of your monitor. Or maybe you want a more automated approach, for example, using the stats hooks described above to fill a Google spreadsheet or to store certain stats in a collection. Alternatively, you may build a 'super' monitor that runs periodically, scans all monitor jobs with a specific script name, and reacts to changes in some of the stats. Using what we learned in previous sections, such a script may have the following general structure:
from typing import Dict, List
from collections import defaultdict
from shub_workflow.script import JobDict
from shub_workflow.utils.monitor import BaseMonitor
class HistoricalMonitor(BaseMonitor):
    # Scan the jobs of the monitor script itself
    target_script_stats = {
        "py:monitor.py": ()
    }
    def __init__(self):
        super().__init__()
        # stat name -> list of values collected from the successive monitor jobs
        self.historics: Dict[str, List[int]] = defaultdict(list)
    def script_job_hook(self, jobdict: JobDict):
        if jobdict["spider"] == "py:monitor.py" and "DAILY" in jobdict["tags"]:
            for stat, val in jobdict["scrapystats"].items():
                self.historics[stat].append(val)
    def stats_postprocessing(self):
        for stat, historic_values in self.historics.items():
            (...)
if __name__ == "__main__":
    import logging
    from shub_workflow.utils import get_kumo_loglevel
    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
    script = HistoricalMonitor()
    script.run()
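What goes into the loop of stats_postprocessing() depends entirely on your project. As one possible illustration, the following standalone sketch flags stats whose latest value falls well below the average of the previous ones. The function find_sudden_drops and its min_ratio parameter are hypothetical. The sketch also assumes that each list of values is ordered from oldest to newest, which depends on the order in which the monitor jobs are visited.

```python
from statistics import mean
from typing import Dict, List


def find_sudden_drops(historics: Dict[str, List[float]], min_ratio: float = 0.5) -> Dict[str, float]:
    """Return {stat: latest / baseline} for stats whose latest value dropped below
    min_ratio times the mean of all the previous values."""
    drops: Dict[str, float] = {}
    for stat, values in historics.items():
        if len(values) < 2:
            continue  # not enough history to compare against
        *previous, latest = values
        baseline = mean(previous)
        if baseline > 0 and latest < min_ratio * baseline:
            drops[stat] = latest / baseline
    return drops


# Made-up values, only for demonstrating the function.
historics = {
    "scraped_items/total": [1000, 1100, 900, 300],
    "new_urls/youtube": [50, 55, 52, 51],
    "new_urls/tiktok": [5],
}
drops = find_sudden_drops(historics)
print(drops)
```

Inside a monitor, a function like this could be called from stats_postprocessing() with self.historics, and each flagged stat reported with the alert mechanism of your choice.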
For simple workflows, like the ones that consist of just a single crawlmanager, you don't need to set up a monitor. Crawlmanagers also generate aggregated stats for the scheduled spiders. By default they aggregate the same spider stats aggregated by the monitor, and they also support the target_spider_stats attribute, which works in the same way (in fact, this functionality comes from a common base class). See Additional spider stats for details.
As a crawlmanager typically schedules jobs for a single spider, it only generates the total stats in order to avoid redundancy. If your crawlmanager schedules more than one spider and you want it to show aggregated stats per spider as well, set the attribute stats_only_total = False. By default, this attribute is True for the crawlmanager and False for the monitor. Conversely, if you want a monitor to print only totals, set its attribute stats_only_total = True.