Fetch Chain Processors - internetarchive/heritrix3 GitHub Wiki

Processor Name	Description	Class Name
preparer	This processor prepares ACCEPTed URIs for enqueuing in the Frontier. It is run again to recheck the scope of URIs before fetching begins.
preconditions	This processor verifies or triggers the fetching of prerequisite URIs.
fetchDns	This processor fetches DNS URIs.
fetchHttp	This processor fetches HTTP URIs. As of Heritrix 3.1, the crawler properly decodes 'chunked' Transfer-Encoding, even when it is encountered where it should not be used, as in a response to an HTTP/1.0 request. The fetchHttp processor also accepts the parameter 'useHTTP11'; if true, Heritrix reports its requests as 'HTTP/1.1', which allows sites to use the 'chunked' Transfer-Encoding. (The default for this parameter is false for now, and Heritrix still does not reuse a persistent connection for more than one request to a site.) fetchHttp also accepts the parameter 'acceptCompression'; if true, Heritrix requests include an "Accept-Encoding: gzip,deflate" header, which offers to receive compressed responses. (The default for this parameter is false for now.)
extractorHttp	This processor extracts outlinks from HTTP headers. As of Heritrix 3.1, the extractorHttp processor treats any URI on a hostname as implying that the '/favicon.ico' from the same host should be fetched. Also as of Heritrix 3.1, the extractorHttp bean has an "inferRootPage" property. If this property is "true", Heritrix infers the '/' root page from any other URI on the same hostname. The default is "false", which preserves the pre-3.1 behavior of fetching the root page only if it is a seed or is otherwise discovered and in scope. Discovery via these heuristics is assigned a new 'I' (inferred) hop type, which is treated the same as an 'E' (embed) in scoping and transclusion decisions.	org.archive.modules.extractor.ExtractorHTTP
extractorHtml	This processor extracts outlinks from HTML content.	org.archive.modules.extractor.ExtractorHTML
extractorCss	This processor extracts outlinks from CSS content.	org.archive.modules.extractor.ExtractorCSS
extractorJs	This processor extracts outlinks from JavaScript content.	org.archive.modules.extractor.ExtractorJS
extractorSwf	This processor extracts outlinks from Flash content.	org.archive.modules.extractor.ExtractorSWF
extractorPdf	This processor extracts outlinks from PDF content.	org.archive.modules.extractor.ExtractorPDF
extractorXml	This processor extracts outlinks from XML content.	org.archive.modules.extractor.ExtractorXML
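
The 'useHTTP11', 'acceptCompression', and "inferRootPage" settings described in the table are ordinary bean properties, so they could be set in a job's crawler-beans.cxml roughly as sketched below. This is a minimal illustration, not a complete bean definition. The property names come from the table, and the extractorHttp class name comes from its Class Name column. The fetchHttp class name (org.archive.modules.fetcher.FetchHTTP) is an assumption, since the table does not list it, and the "true" values are chosen only to show the non-default case.

```xml
<!-- Assumed class name: the table above gives no class for fetchHttp -->
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <!-- Report requests as HTTP/1.1 so that sites may answer with chunked Transfer-Encoding -->
  <property name="useHTTP11" value="true"/>
  <!-- Send "Accept-Encoding: gzip,deflate" to offer to receive compressed responses -->
  <property name="acceptCompression" value="true"/>
</bean>

<bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP">
  <!-- Infer the '/' root page from any other URI on the same hostname -->
  <property name="inferRootPage" value="true"/>
</bean>
```

Omitting a property leaves it at its default, which the table gives as false for all three.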

Most extractor processors are preconfigured in a job's crawler-beans.cxml configuration file under the "fetchProcessors" bean. To add a new extractor, such as an XML/RSS extractor, define the bean and then link it to the "fetchProcessors" bean. An example for the extractorXml bean is below.

Define the bean for the XML Extractor

<bean id="extractorXml" class="org.archive.modules.extractor.ExtractorXML"></bean>

Link the "extractorXml" bean to the "fetchProcessors" bean

<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- re-check scope, if so enabled... -->
      <ref bean="preselector"/>
      <!--
        ...then verify or trigger prerequisite URIs fetched, allow crawling...
      -->
      <ref bean="preconditions"/>
      <!-- ...fetch if DNS URI... -->
      <ref bean="fetchDns"/>
      <!-- <ref bean="fetchWhois"/> -->
      <!-- ...fetch if HTTP URI... -->
      <ref bean="fetchHttp"/>
      <!-- ...extract outlinks from HTTP headers... -->
      <ref bean="extractorHttp"/>
      <!-- ...extract outlinks from HTML content... -->
      <ref bean="extractorHtml"/>
      <!-- ************ ...extract outlinks from XML/RSS content.. ********** -->
      <ref bean="extractorXml"/>
      <!-- ...extract outlinks from CSS content... -->
      <ref bean="extractorCss"/>
      <!-- ...extract outlinks from Javascript content... -->
      <ref bean="extractorJs"/>
      <!-- ...extract outlinks from Flash content... -->
      <ref bean="extractorSwf"/>
    </list>
  </property>
</bean>
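
The same two steps should carry over to other extractors in the table that the sample chain does not reference, such as extractorPdf. The fragment below is a hedged sketch. It takes the class name from the table and assumes the bean needs no further properties, which may not hold in practice. Where the reference sits among the other extractors is an illustrative choice.

```xml
<!-- Step 1: define the bean; the class name is taken from the table above -->
<bean id="extractorPdf" class="org.archive.modules.extractor.ExtractorPDF"/>

<!-- Step 2: inside the <list> of the existing "fetchProcessors" bean, add this reference -->
<ref bean="extractorPdf"/>
```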
