Processing Chains - internetarchive/heritrix3 GitHub Wiki
At the job level, a Heritrix crawl job has three main pipelines, known as Processor Chains (sequential application of swappable Processor modules -- see Processor Settings), with the Frontier acting as a buffer between the first two:
- The Candidates Chain:
- This processes incoming Crawl URIs, deciding whether to keep them (according to the Scope) and priming them to be deposited in the Frontier.
- See Candidate Chain Processors
- The Frontier:
- Crawl URIs accepted into this crawl are stored here in priority order, in a set of distinct queues.
- Usually, there is one queue per 'authority' (e.g. example.com:80), and the queue management ensures the desired crawl delay is honoured for each queue.
- See Frontier
- The Fetch Chain:
- As Crawl URIs are emitted by the Frontier, the fetch chain processes each one and decides what to do with it, how to download it, etc.
- This chain also performs operations like link extraction.
- See Fetch Chain Processors
- The Disposition Chain:
- Once the Fetch Chain has finished, any required post-processing is handled here.
- For example, this is where the downloaded resources are written into WARC files.
- See Disposition Chain Processors
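In the crawl configuration, each of these chains is wired up as a Spring bean. The sketch below is illustrative only: the bean ids and classes follow the defaults shipped in crawler-beans.cxml, but the processor lists are abbreviated to a few representative entries, and a real job defines considerably more.

```xml
<!-- Illustrative, abbreviated sketch of the three chain beans. -->
<bean id="candidateProcessors" class="org.archive.modules.CandidateChain">
  <property name="processors">
    <list>
      <ref bean="candidateScoper"/>  <!-- applies the Scope to each candidate -->
      <ref bean="preparer"/>         <!-- primes accepted URIs for the Frontier -->
    </list>
  </property>
</bean>

<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <ref bean="fetchHttp"/>        <!-- downloads the URI -->
      <ref bean="extractorHtml"/>    <!-- link extraction -->
      <!-- ... other fetch/extract processors ... -->
    </list>
  </property>
</bean>

<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
    <list>
      <ref bean="warcWriter"/>       <!-- writes downloaded resources to WARC files -->
      <ref bean="candidates"/>       <!-- feeds extracted links back to the Candidates Chain -->
      <ref bean="disposition"/>      <!-- politeness and crawl accounting -->
    </list>
  </property>
</bean>
```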
Each URI taken off the Frontier queue runs through the processing chains. URIs are always processed in the order shown in the diagram below, unless a particular processor throws a fatal error or decides to stop the processing of the current URI.
Each processing chain is made up of zero or more individual processors. For example, the Fetch Chain might comprise the extractorCss and extractorJs processors. Within a chain, the processors run in the order in which they are listed in the crawler-beans.cxml file.
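As a focused (and abbreviated) illustration of that ordering rule, in the fragment below extractorCss is listed before extractorJs, so CSS extraction runs first for every URI; swapping the two ref lines would swap the execution order. The bean id and class follow the shipped defaults.

```xml
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- ... earlier processors ... -->
      <ref bean="extractorCss"/>  <!-- runs before extractorJs -->
      <ref bean="extractorJs"/>
      <!-- ... later processors ... -->
    </list>
  </property>
</bean>
```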

![Heritrix processor chains](HeritrixProcessorChains.png)