# Responsible Crawling

Responsible crawling means following the laws and established conventions of web crawling in order to minimize the costs a crawl imposes on the sites it collects.

Key practices include:

* respecting `robots.txt`, except when you have been given explicit permission to do otherwise (see the first configuration sketch after this list)
* providing contact information in your User-Agent string, and responding promptly to anyone who contacts you through it (also covered in the first sketch)
* using politeness-delay and other settings that affect how frequently a single site is hit, so that most of the site's serving capacity remains available to other visitors (see the second sketch below)
* regularly monitoring crawler logs for evidence of unproductive or endless paths ("crawler traps"), and actively adjusting the crawl to stop such activity when it is observed (see the third sketch below)
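
For the first two points, the robots policy and operator contact information are normally set on the crawl job's `metadata` bean in `crawler-beans.cxml`. A minimal sketch, assuming the stock Heritrix 3 property names (`robotsPolicyName`, `operatorContactUrl`, `userAgentTemplate`); the contact URL shown is a placeholder for a page you actually operate:

```xml
<!-- Sketch: metadata bean in crawler-beans.cxml; property names assume Heritrix 3 defaults -->
<bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
  <!-- "obey" honors robots.txt; change it only with the site owner's explicit permission -->
  <property name="robotsPolicyName" value="obey"/>
  <!-- a page (or mailto: address) where site operators can reach you; hypothetical URL -->
  <property name="operatorContactUrl" value="http://example.com/crawl-contact"/>
  <!-- the contact URL is substituted into the User-Agent sent with every request -->
  <property name="userAgentTemplate"
            value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
</bean>
```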
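
The politeness settings referred to in the third point usually live on the `disposition` bean (`DispositionProcessor`). A sketch showing the stock defaults, assuming current property names: with a `delayFactor` of 5, a page that took 800 ms to fetch keeps the crawler away from that host for about 4 seconds, and the min/max bounds clamp the wait for very fast or very slow responses.

```xml
<!-- Sketch: politeness settings on the disposition bean; values shown are the stock defaults -->
<bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">
  <!-- wait delayFactor x (duration of the last fetch) before hitting the same host again -->
  <property name="delayFactor" value="5.0"/>
  <!-- but never less than 3 s, and never more than 30 s -->
  <property name="minDelayMs" value="3000"/>
  <property name="maxDelayMs" value="30000"/>
  <!-- honor a site's robots.txt Crawl-Delay directive up to this many seconds -->
  <property name="respectCrawlDelayUpToSeconds" value="300"/>
</bean>
```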
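
For the last point, a host whose entries in `crawl.log` keep growing far beyond the amount of real content it could plausibly hold is a common sign of a trap. Once a trap is identified, the usual remedy is to tighten scope with reject rules. A sketch of additions to the scope's rule list, using two stock decide rules plus a regex rule for a hypothetical endless-calendar trap; verify the exact property names against your version's default configuration:

```xml
<!-- Sketch: scope additions that cut off common crawler traps -->
<!-- reject URIs whose path repeats the same segment consecutively, e.g. /a/a/a/ -->
<bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
  <property name="maxRepetitions" value="2"/>
</bean>
<!-- reject URIs with implausibly deep paths -->
<bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
  <property name="maxPathDepth" value="20"/>
</bean>
<!-- reject a specific observed trap; this pattern is purely illustrative -->
<bean class="org.archive.modules.deciderules.MatchesRegexDecideRule">
  <property name="decision" value="REJECT"/>
  <property name="regex" value=".*/calendar.*[?&amp;]date=.*"/>
</bean>
```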