Crawl rate considerations

Why isn't Heritrix crawling as fast as I expected?

We are often asked why Heritrix is crawling more slowly than expected. The answer usually comes down to the following considerations:

Politeness or resource optimization?

The most important factor is whether you are crawling a small number of sites or a large number of sites (and therefore many independent queues). In the former case, your politeness policy and any coordination with site maintainers are the primary concern. In the latter case, resource optimizations (such as using more RAM or a different disk layout) may help.
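To make the distinction concrete, here is a rough back-of-the-envelope sketch. It is not Heritrix code and does not model Heritrix's actual scheduler. It assumes a simplified crawler in which each queue allows one fetch at a time followed by a fixed politeness delay, and in which a fixed pool of worker threads caps total concurrency. The function name, parameters, and numbers are all hypothetical.

```python
def max_fetch_rate(num_queues, fetch_seconds, delay_seconds, max_threads):
    """Estimate an upper bound on fetches per second.

    Simplified model (an assumption, not Heritrix's real behaviour):
    - each queue serves one URI at a time, then waits delay_seconds;
    - at most max_threads fetches run concurrently.
    """
    if fetch_seconds <= 0:
        raise ValueError("fetch_seconds must be positive")
    if delay_seconds < 0:
        raise ValueError("delay_seconds must not be negative")

    # Politeness ceiling: every queue completes one fetch
    # per (fetch time + politeness delay).
    politeness_bound = num_queues / (fetch_seconds + delay_seconds)

    # Resource ceiling: every thread completes one fetch per fetch time.
    resource_bound = max_threads / fetch_seconds

    # The crawl cannot go faster than the tighter of the two ceilings.
    return min(politeness_bound, resource_bound)


if __name__ == "__main__":
    # Hypothetical numbers: 0.5 s per fetch, 2.5 s delay, 50 threads.
    few_sites = max_fetch_rate(5, 0.5, 2.5, 50)
    many_sites = max_fetch_rate(50_000, 0.5, 2.5, 50)
    print(f"5 queues:      {few_sites:.2f} URIs/s")
    print(f"50,000 queues: {many_sites:.2f} URIs/s")
```

With these made-up numbers, the five-queue crawl is capped by politeness at under two URIs per second, and adding threads or RAM would not change that. The 50,000-queue crawl is capped by the thread pool instead, which is the situation where resource tuning can pay off.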