Processing a list of URLs

There is a seed list of URLs in the URL Frontier. To process them, a data flow approach is used. The TPL Dataflow library provides the basic routines to implement it.
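For illustration, here is a minimal sketch of those routines (posting items, linking blocks, completing the pipeline); the names below are examples, not taken from the service code:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class DataflowBasics
{
    static async Task Main()
    {
        // A TransformBlock produces an output item for every input it receives.
        var toUpper = new TransformBlock<string, string>(s => s.ToUpperInvariant());

        // An ActionBlock is a terminal block that only consumes items.
        var print = new ActionBlock<string>(s => Console.WriteLine(s));

        // LinkTo wires the blocks into a pipeline; PropagateCompletion lets
        // Complete() flow from the first block down to the last one.
        toUpper.LinkTo(print, new DataflowLinkOptions { PropagateCompletion = true });

        toUpper.Post("http://example.com");
        toUpper.Complete();
        await print.Completion; // wait until the whole pipeline has drained
    }
}
```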

The data flow pipeline consists of the following blocks (a sketch of it in code follows the list):

1. download the content from the URL
2. pick the rules to parse data blocks for the given URL
3. post each rule together with the content to parse (links, photos, videos, etc.)
4. add the extracted links to the frontier if they are new (using the URL host crawl rules)
5. add the other parsed blocks to the DB for later analysis
6. wait until all parsing blocks are completed (or cancelled due to an error) and release the processed URL with an updated available date/time.
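A possible shape of this pipeline with TPL Dataflow is sketched below. The block layout, the `Page` record, and helpers such as `PickRules` are assumptions made for illustration, not the actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical payload passed between the blocks.
record Page(string Url, string Html);

class CrawlPipelineSketch
{
    static readonly HttpClient Http = new HttpClient();

    static async Task Main()
    {
        var link = new DataflowLinkOptions { PropagateCompletion = true };

        // 1. download the content from the URL
        var download = new TransformBlock<string, Page>(
            async url => new Page(url, await Http.GetStringAsync(url)));

        // 2-3. pick the rules for the URL and fan out one work item per rule
        var applyRules = new TransformManyBlock<Page, (string Rule, Page Page)>(
            page => PickRules(page.Url).Select(rule => (Rule: rule, Page: page)));

        // 4-5. parse with the rule: new links would go to the frontier, other blocks to the DB
        var parse = new ActionBlock<(string Rule, Page Page)>(
            item => Console.WriteLine($"parsing {item.Page.Url} with rule '{item.Rule}'"));

        download.LinkTo(applyRules, link);
        applyRules.LinkTo(parse, link);

        // post URLs taken (and locked) from the frontier, then let the pipeline drain
        download.Post("http://example.com");
        download.Complete();

        // 6. all parsing blocks are completed; the URL can now be released
        await parse.Completion;
    }

    // Assumed helper: returns the parsing rules configured for the URL's host.
    static IEnumerable<string> PickRules(string url) => new[] { "links", "photo", "video" };
}
```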

Before posting to the pipeline, URLs are obtained from the frontier and marked as locked (in progress).
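A minimal sketch of that claim/release contract, assuming a hypothetical `IUrlFrontier` abstraction (the real names, signatures, and storage may differ):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical frontier abstraction, for illustration only.
interface IUrlFrontier
{
    // Atomically picks URLs that are not locked and whose available date/time has passed,
    // marks them as locked (in progress), and returns them for posting to the pipeline.
    IReadOnlyList<string> TakeAndLock(int count);

    // Step 6 of the pipeline: releases the URL and moves its available date/time forward.
    void Release(string url, DateTimeOffset nextAvailableAt);
}
```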

Crawler processes are not run simultaneously. The Crawler Scheduler starts each process with some delay, and each process picks a separate chunk of URLs (currently only one). This helps to avoid the situation where two processes work on the same URL.
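One way this staggering could be expressed, assuming a hypothetical scheduler loop (the names and the delay mechanism are illustrative, not the actual scheduler code):

```csharp
using System;
using System.Threading.Tasks;

class CrawlerSchedulerSketch
{
    // Runs crawler passes one after another with a delay between them, so that
    // two passes never claim URLs from the frontier at the same time.
    static async Task RunAsync(Func<Task> crawlOnce, TimeSpan delayBetweenRuns)
    {
        while (true)
        {
            await crawlOnce();                  // each pass picks its own chunk of URLs (currently one)
            await Task.Delay(delayBetweenRuns); // stagger the next pass
        }
    }
}
```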