Responsible Crawling - internetarchive/heritrix3 GitHub Wiki
Responsible crawling means following the laws and established conventions of web crawling, so as to minimize the costs that crawling imposes on the sites being collected.
Key practices include:
- respecting robots.txt except when you've been given explicit permission to do otherwise
- providing contact information in your User-Agent, and responding promptly to all contacts
- using politeness-delay and other settings that affect the frequency of hits to a single site, to ensure most of a site's serving capacity remains available for other visitors
- regularly monitoring crawler logs for evidence of unproductive/endless paths (traps), and actively adjusting the crawler to stop such activity when observed
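In Heritrix 3, several of these conventions correspond to settings in a job's `crawler-beans.cxml`. The fragment below is a sketch using property names from a typical stock profile (verify them against your Heritrix version): the `metadata` bean puts operator contact information into the User-Agent, and the `disposition` bean's delay properties govern how frequently a single host is hit. The URL values are placeholders.

```xml
<!-- Sketch of politeness-related settings in crawler-beans.cxml;
     property names follow a stock Heritrix 3 profile. -->
<bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
  <!-- A reachable page describing the crawl and how to contact the operator -->
  <property name="operatorContactUrl" value="https://example.org/crawl-info"/>
  <!-- Template that embeds the contact URL in every request's User-Agent -->
  <property name="userAgentTemplate"
            value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
</bean>

<bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">
  <!-- Wait delayFactor times the previous fetch duration before revisiting a host,
       bounded below and above by minDelayMs / maxDelayMs -->
  <property name="delayFactor" value="5.0"/>
  <property name="minDelayMs" value="3000"/>
  <property name="maxDelayMs" value="30000"/>
  <!-- Honor a robots.txt Crawl-delay directive up to this many seconds -->
  <property name="respectCrawlDelayUpToSeconds" value="300"/>
</bean>
```

Scaling the per-host delay with the previous fetch duration means a slow, struggling server is automatically revisited less often than a fast one.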
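Outside of Heritrix itself, the robots.txt side of these practices can be illustrated with Python's standard-library `urllib.robotparser`: check `can_fetch` before requesting a URL, and honor any `Crawl-delay` between requests. The user agent and robots.txt content below are hypothetical.

```python
# Illustrative sketch (not Heritrix code): consulting robots.txt rules and
# Crawl-delay with Python's standard-library urllib.robotparser.
import urllib.robotparser

# Hypothetical user agent carrying a contact URL, per the practices above
USER_AGENT = "MyArchiveBot/1.0 (+https://example.org/contact)"

robots = urllib.robotparser.RobotFileParser()
# robots.set_url(...) + robots.read() would fetch over the network;
# here we parse a sample robots.txt inline instead.
robots.parse("""\
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines())

def polite_fetch_allowed(url):
    """Return True only if robots.txt permits this URL for our user agent."""
    return robots.can_fetch(USER_AGENT, url)

# Sleep at least this long between requests to the same host
delay = robots.crawl_delay(USER_AGENT) or 1

print(polite_fetch_allowed("https://example.org/private/page"))  # False
print(polite_fetch_allowed("https://example.org/public/page"))   # True
print(delay)  # 2
```

A real crawler would call `robots.read()` per host, cache the parsed rules, and re-fetch them periodically rather than on every request.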