# Responsible Crawling

Responsible crawling means following the laws and established conventions of web crawling in order to minimize the costs a crawl imposes on the sites it collects.

Key practices include:

* respecting `robots.txt`, except when you have been given explicit permission to do otherwise (see the first configuration sketch after this list)
* providing contact information in your User-Agent string, and responding promptly to anyone who contacts you through it (also covered in the first sketch)
* using politeness-delay and other settings that affect how frequently a single site is hit, so that most of the site's serving capacity remains available to other visitors (see the second sketch below)
* regularly monitoring crawler logs for evidence of unproductive or endless paths ("crawler traps"), and actively adjusting the crawl to stop such activity when it is observed (see the third sketch below)
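
For the first two points, the robots policy and operator contact information are normally set on the crawl job's `metadata` bean in `crawler-beans.cxml`. A minimal sketch, assuming the stock Heritrix 3 property names (`robotsPolicyName`, `operatorContactUrl`, `userAgentTemplate`); the contact URL shown is a placeholder for a page you actually operate:

```xml
<!-- Sketch: metadata bean in crawler-beans.cxml; property names assume Heritrix 3 defaults -->
<bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
  <!-- "obey" honors robots.txt; change it only with the site owner's explicit permission -->
  <property name="robotsPolicyName" value="obey"/>
  <!-- a page (or mailto: address) where site operators can reach you; hypothetical URL -->
  <property name="operatorContactUrl" value="http://example.com/crawl-contact"/>
  <!-- the contact URL is substituted into the User-Agent sent with every request -->
  <property name="userAgentTemplate"
            value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
</bean>
```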
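
The politeness settings referred to in the third point usually live on the `disposition` bean (`DispositionProcessor`). A sketch showing the stock defaults, assuming current property names: with a `delayFactor` of 5, a page that took 800 ms to fetch keeps the crawler away from that host for about 4 seconds, and the min/max bounds clamp the wait for very fast or very slow responses.

```xml
<!-- Sketch: politeness settings on the disposition bean; values shown are the stock defaults -->
<bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">
  <!-- wait delayFactor x (duration of the last fetch) before hitting the same host again -->
  <property name="delayFactor" value="5.0"/>
  <!-- but never less than 3 s, and never more than 30 s -->
  <property name="minDelayMs" value="3000"/>
  <property name="maxDelayMs" value="30000"/>
  <!-- honor a site's robots.txt Crawl-Delay directive up to this many seconds -->
  <property name="respectCrawlDelayUpToSeconds" value="300"/>
</bean>
```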
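
For the last point, a host whose entries in `crawl.log` keep growing far beyond the amount of real content it could plausibly hold is a common sign of a trap. Once a trap is identified, the usual remedy is to tighten scope with reject rules. A sketch of additions to the scope's rule list, using two stock decide rules plus a regex rule for a hypothetical endless-calendar trap; verify the exact property names against your version's default configuration:

```xml
<!-- Sketch: scope additions that cut off common crawler traps -->
<!-- reject URIs whose path repeats the same segment consecutively, e.g. /a/a/a/ -->
<bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
  <property name="maxRepetitions" value="2"/>
</bean>
<!-- reject URIs with implausibly deep paths -->
<bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
  <property name="maxPathDepth" value="20"/>
</bean>
<!-- reject a specific observed trap; this pattern is purely illustrative -->
<bean class="org.archive.modules.deciderules.MatchesRegexDecideRule">
  <property name="decision" value="REJECT"/>
  <property name="regex" value=".*/calendar.*[?&amp;]date=.*"/>
</bean>
```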