Deduping (Duplication Reduction)
Starting in release 1.12.0, a number of Processors can cooperate to carry forward URI content history between crawls (see org.archive.crawler.processor.recrawl package JavaDocs). This reduces the amount of duplicate material downloaded or stored in later crawls.
Heritrix 1.x does not support running the same crawl more than once, so one crawl must be configured to store duplication reduction data and a second crawl must be configured to load it. For example, this excerpt from testing for HER-1627:
First crawl (stores duplication reduction data):
- add FetchHistory and PersistLog processors after FetchHTTP:
  org.archive.crawler.processor.recrawl.FetchHistoryProcessor
  org.archive.crawler.processor.recrawl.PersistLogProcessor
Second crawl (loads duplication reduction data):
- after PreconditionEnforcer, before FetchDNS:
  org.archive.crawler.processor.recrawl.PersistLoadProcessor
- after FetchHTTP:
  org.archive.crawler.processor.recrawl.FetchHistoryProcessor
- preload-source:
  ${HERITRIX_HOME}/jobs/${JOB}/logs/persistlog.txtser.gz
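The store-then-load model can be sketched in a few lines of Python. This is a conceptual illustration only, not Heritrix code: the function names, the JSON-lines log format, the `history.jsonl.gz` filename, and the use of SHA-1 as the content digest are all assumptions made for the example. It does not reproduce the real `persistlog.txtser.gz` format.

```python
import gzip
import hashlib
import json
import os
import tempfile


def digest(content: bytes) -> str:
    """Content digest used as the duplicate-detection key (SHA-1 is an assumption)."""
    return "sha1:" + hashlib.sha1(content).hexdigest()


def storing_crawl(fetched: dict, log_path: str) -> None:
    """First crawl: write one history record per URI to a gzipped log.

    The JSON-lines layout is hypothetical and chosen only for readability.
    """
    with gzip.open(log_path, "wt", encoding="utf-8") as log:
        for uri, content in fetched.items():
            log.write(json.dumps({"uri": uri, "digest": digest(content)}) + "\n")


def loading_crawl(fetched: dict, log_path: str) -> dict:
    """Second crawl: preload the earlier history, then classify each fetch."""
    history = {}
    with gzip.open(log_path, "rt", encoding="utf-8") as log:
        for line in log:
            entry = json.loads(line)
            history[entry["uri"]] = entry["digest"]

    result = {}
    for uri, content in fetched.items():
        # Same URI and same digest as last time: the content is a duplicate.
        unchanged = history.get(uri) == digest(content)
        result[uri] = "duplicate" if unchanged else "new"
    return result


def demo() -> dict:
    """Run a storing crawl followed by a loading crawl over toy data."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "history.jsonl.gz")
        storing_crawl(
            {"http://example.com/": b"hello", "http://example.com/a": b"v1"},
            log_path,
        )
        return loading_crawl(
            {
                "http://example.com/": b"hello",     # unchanged since first crawl
                "http://example.com/a": b"v2",       # content changed
                "http://example.com/new": b"fresh",  # not seen before
            },
            log_path,
        )
```

In this sketch, the log path handed to `loading_crawl` stands in for the preload-source setting, and the digest comparison stands in for the history check that lets the later crawl avoid storing unchanged material.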
Heritrix 3.x allows running the same crawl repeatedly, but still requires different configurations for the crawl run that stores deduplication data and the crawl run that loads it, as described in Duplication Reduction Processors. The model is the same as for Heritrix 1.x, except that configuration is done in the Spring-based crawler beans CXML file (crawler-beans.cxml).