This page describes the service prerequisites, architecture, data model and other aspects.
Crawler Prerequisites
The crawler must satisfy the following requirements:
- Respect HTTP cache information such as If-Modified-Since, Last-Modified, etc.
- Identify your crawler in the User-Agent HTTP header
- Respect robots.txt
- Add a delay of at least one second between consecutive requests
- Obey any crawling speed limitations (Crawl-Delay)
If you do not follow these simple rules, you might end up being blocked as a bad crawler. Reading the HTTP/1.1 specification (RFC 2616) can also be useful.
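As an illustration, a polite fetch that follows these rules might look roughly like the sketch below (Python; the user-agent string and example URLs are placeholders, and the helper is not part of the service itself):

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser

# Hypothetical crawler identifier; the real service would use its own User-Agent string.
USER_AGENT = "RequiemCrawler/1.0 (+https://github.com/xtrmstep/Requiem.CrawlerService)"

def polite_fetch(url, robots, last_modified=None):
    """Fetch a page while honouring robots.txt, Crawl-Delay and HTTP cache headers."""
    if not robots.can_fetch(USER_AGENT, url):
        return None                                    # disallowed by robots.txt

    headers = {"User-Agent": USER_AGENT}
    if last_modified:
        headers["If-Modified-Since"] = last_modified   # conditional request

    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            body = None                                # not modified since the cached copy
        else:
            raise
    finally:
        # Respect Crawl-Delay if declared, otherwise wait at least one second.
        time.sleep(robots.crawl_delay(USER_AGENT) or 1)
    return body

# Usage: load robots.txt once per host, then reuse it for every URL on that host.
robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()
page = polite_fetch("https://example.com/", robots)
```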
Crawler Service Architecture
The service is intended only for downloading web pages for later processing. The process overview:
- for each URL taken from the queue (one by one), obtain the crawler settings for its host (if not yet obtained, load robots.txt or fall back to defaults); from then on, always use these settings for this host
- download the page at the given URL and store it for the URL recognition routine
- recognize outgoing URLs and store them in the queue
- repeat
Each URL is added together with the time when it should be processed, and the queue is sorted according to that time.
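For illustration, such a time-ordered queue could be sketched as follows (the `UrlFrontier` name and its methods are assumptions for this example, not the service's actual types):

```python
import heapq
import time

class UrlFrontier:
    """Queue of URLs ordered by the time at which each may be requested (illustrative sketch)."""

    def __init__(self):
        self._heap = []                  # (allowed_at, url) tuples

    def add(self, url, allowed_at):
        heapq.heappush(self._heap, (allowed_at, url))

    def pop_ready(self):
        """Return the next URL whose allowed time has passed, or None if nothing is ready yet."""
        if self._heap and self._heap[0][0] <= time.time():
            return heapq.heappop(self._heap)[1]
        return None

# Usage: schedule a URL one second from now; pop_ready() returns None until that time passes.
frontier = UrlFrontier()
frontier.add("https://example.com/page", time.time() + 1)
```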
Domain Model
This is the domain model of the Crawler.

Blue blocks represent information used and produced by the Crawler. The URL item and the domain are parts of the URL Frontier. Crawler Data Blocks are produced from a parsed URL using the rules for its URL domain; rules can be specified for each domain separately. Later, the crawled data blocks will be used for analysis to find photos and videos.
Orange blocks represent the crawler engine. The runner picks out a URL item and, using the domain settings, schedules the parser to download and process the web page.
Green blocks represent the crawler settings.
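As a rough illustration of these entities, the domain model could be sketched with simple data classes (all field names here are assumptions inferred from the description above, not the service's actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class CrawlerSettings:
    """Per-domain settings derived from robots.txt or defaults (green blocks)."""
    user_agent: str
    crawl_delay: float = 1.0
    disallowed_paths: list = field(default_factory=list)

@dataclass
class Domain:
    """A host known to the URL Frontier, with its crawler settings."""
    host: str
    settings: Optional[CrawlerSettings] = None

@dataclass
class UrlItem:
    """A single entry in the URL Frontier (blue blocks)."""
    url: str
    allowed_at: datetime                          # when the resource may be requested
    in_process_since: Optional[datetime] = None   # set when the runner picks the item up

@dataclass
class CrawledDataBlock:
    """Data produced from a parsed page using the rules of its domain."""
    source_url: str
    content: bytes
    cache_headers: dict = field(default_factory=dict)
```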
Crawler Simple Algorithm
- Pick out a URL from the URL Frontier.
- The URL Frontier is sorted by the time when each resource is allowed to be requested.
- If the URL cannot be requested yet, just return and wait for some delay.
- Mark it as “in-process” and record the time when processing started.
- Obtain the crawler settings for the URL (a stored copy, robots.txt from the resource, or the defaults). This information will be used for new URLs added to the URL Frontier.
- Download a page from the URL
- Extract cache information from headers for later invalidation.
- Store the downloaded page with its cache info for the analytical system.
- Extract new URLs and add them to the URL Frontier with their allowed request times (kept in order).
- Remove the URL from the URL Frontier and stop.
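Put together, one iteration of this algorithm might look roughly like the sketch below. It reuses the hypothetical `UrlFrontier` from the architecture section; the storage dictionary, user-agent string, and link-extraction regex are illustrative stand-ins rather than the service's actual implementation:

```python
import re
import time
import urllib.parse
import urllib.request
import urllib.robotparser

USER_AGENT = "RequiemCrawler/1.0"   # placeholder identifier

def crawl_step(frontier, settings_cache, storage):
    """One iteration of the simple algorithm: pick a URL, fetch it, schedule its out-links."""
    # 1. Pick out a URL whose allowed time has passed; otherwise back off and retry later.
    url = frontier.pop_ready()
    if url is None:
        time.sleep(1)
        return

    started_at = time.time()        # 2. mark the item as "in-process"

    # 3. Obtain crawler settings for the host (stored copy, robots.txt, or defaults).
    host = urllib.parse.urlsplit(url).netloc
    robots = settings_cache.get(host)
    if robots is None:
        robots = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
        try:
            robots.read()
        except OSError:
            robots.parse([])        # robots.txt unreachable: treat it as empty
        settings_cache[host] = robots

    if not robots.can_fetch(USER_AGENT, url):
        return

    # 4. Download the page and extract cache information from the headers.
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read()
        cache_info = {name: response.headers[name]
                      for name in ("Last-Modified", "ETag", "Expires")
                      if name in response.headers}

    # 5. Store the downloaded page with its cache info for the analytical system.
    storage[url] = (body, cache_info)

    # 6. Extract new URLs and schedule them after the crawl delay.
    delay = robots.crawl_delay(USER_AGENT) or 1
    for match in re.finditer(rb'href="(https?://[^"]+)"', body):
        frontier.add(match.group(1).decode(), started_at + delay)
```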
See also