Why Doesn't Wgit Crawl In Parallel? - michaeltelford/wgit GitHub Wiki

Many web crawlers boast parallelism out of the box, sending multiple HTTP requests at once. Wgit doesn't work like this. By default, everything in Wgit is performed in sequence, mainly for simplicity and more predictable results.

However, since Wgit is a library, you can call its functionality inside parallel constructs such as threads. To do so, you need to know the URLs you want to crawl ahead of time, e.g.

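require 'wgit'

crawler = Wgit::Crawler.new
urls = [...] # Wgit::Url instances, known ahead of time, before we start crawling

# Crawl each URL in its own thread, then wait for all threads to finish.
urls.map do |url|
  Thread.new { crawler.crawl(url) }
end.each(&:join)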

Wgit is thread-safe for this very reason. More examples can be found in this wiki article.

Wgit's Crawler#crawl_site and Indexer#index_site methods crawl all internal links within the site's host in sequence, in the same order that they're found and parsed from the HTML. This is deliberate, to ensure that crawls are easy to understand and track. It is also because benchmarking of parallelism using the async gem found the speed increase to be modest to non-existent.
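
The ordering described above can be illustrated with a minimal sketch. It is not Wgit's implementation: the SITE hash is a hypothetical in-memory stand-in for fetching a page and parsing its internal links, and crawl_in_sequence is an illustrative name. It shows why a sequential crawl visits pages in a repeatable order, and why a site crawl can't know its URLs ahead of time: a page's links are only discovered once that page has been crawled.

```ruby
require 'set'

# Hypothetical site: each path maps to the internal links found in its HTML.
SITE = {
  '/'        => ['/about', '/blog'],
  '/about'   => ['/', '/contact'],
  '/blog'    => ['/blog/1', '/about'],
  '/blog/1'  => ['/blog'],
  '/contact' => []
}.freeze

# Visits every reachable page once, in the order its link was first found.
def crawl_in_sequence(start, site)
  visited = Set.new([start])
  queue   = [start] # FIFO: links are crawled in discovery order
  order   = []

  until queue.empty?
    path = queue.shift
    order << path # "crawl" the page

    site.fetch(path, []).each do |link|
      # Set#add? returns nil when the link has already been seen.
      queue << link if visited.add?(link)
    end
  end

  order
end

p crawl_in_sequence('/', SITE)
# => ["/", "/about", "/blog", "/contact", "/blog/1"]
```

Running the sketch twice gives the same order both times, which is the predictability that a sequential crawl trades parallelism for.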

Wgit was benchmarked with the async gem at various levels, including:

  • Wgit::Crawler#crawl_site
  • Wgit::Crawler#crawl_urls
  • Wgit::Indexer#index_site
  • Wgit::Indexer#index_www

In all experiments, parallelism had minimal positive impact on crawl performance. The added downside is that crawling in parallel makes the results less deterministic overall. The price to pay is in no way worth the gain (since there is effectively no gain).

The main reasons why a performance improvement wasn't noticed are:

  • Servers typically rate limit requests, especially from crawlers, so crawling in parallel wasn't faster overall.
  • Bottlenecks elsewhere, which aren't addressed by crawling in parallel.
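
The rate limiting point can be made concrete with a back-of-the-envelope model. This is only a sketch: the method name and every figure below are illustrative assumptions, not measurements from the Wgit benchmarks. It assumes the crawl can finish no sooner than the slower of two bounds, the time needed to issue the requests across the workers and the time the server's rate limit allows.

```ruby
# Rough lower bound (in seconds) on crawling `pages` pages.
# All parameters are illustrative assumptions, not Wgit benchmark figures:
#   latency    - seconds taken by a single request
#   workers    - number of requests in flight at once
#   rate_limit - requests per second the server will serve to one client
def estimated_crawl_seconds(pages:, latency:, workers:, rate_limit:)
  network_bound = pages * latency / workers.to_f
  server_bound  = pages / rate_limit.to_f

  [network_bound, server_bound].max
end

[1, 2, 4, 16].each do |workers|
  secs = estimated_crawl_seconds(
    pages: 100, latency: 0.25, workers: workers, rate_limit: 5
  )
  puts "#{workers} worker(s): ~#{secs.round(1)}s"
end
```

With these assumed numbers, one worker needs about 25 seconds, while 2, 4 and 16 workers all need about 20 seconds. Once the server's limit is reached, adding workers changes nothing.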

What does make a difference to the overall crawl performance of a site is:

  • DNS lookup caching
  • TCP/TLS connection re-use (avoiding new handshakes that require additional round trips to the server)

Both of these are already used by Wgit's network requests, via libcurl under the hood. Overall, these optimisations are much more valuable than parallel crawling, and have no downsides to boot.
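
Connection re-use can be observed with a small experiment. The sketch below does not use Wgit or libcurl. It swaps in Ruby's standard Net::HTTP client and a throwaway local server (all method names here are hypothetical) that counts how many TCP connections it accepts. Over a real network with TLS, each extra connection would also cost its handshake round trips, which this local setup doesn't measure.

```ruby
require 'net/http'
require 'socket'

# Answers every request on the socket until the client closes it (keep-alive).
def serve(socket)
  while socket.gets # request line; nil once the client hangs up
    loop do         # skip the request headers, up to the blank line
      header = socket.gets
      break if header.nil? || header.strip.empty?
    end
    socket.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
  end
ensure
  socket.close
end

# Starts a tiny local HTTP/1.1 server and returns [port, counter], where
# counter[:connections] is the number of TCP connections accepted so far.
def start_counting_server
  server  = TCPServer.new('127.0.0.1', 0) # port 0: the OS picks a free port
  counter = { connections: 0 }

  Thread.new do
    loop do
      client = server.accept
      counter[:connections] += 1
      Thread.new(client) { |socket| serve(socket) }
    end
  end

  [server.addr[1], counter]
end

# Sends `count` GET requests, opening a new connection for each one.
# The trailing nil tells Net::HTTP not to use a proxy from the environment.
def fetch_with_new_connections(port, count)
  count.times do
    Net::HTTP.start('127.0.0.1', port, nil) { |http| http.get('/') }
  end
end

# Sends `count` GET requests over a single kept-alive connection.
def fetch_with_reused_connection(port, count)
  Net::HTTP.start('127.0.0.1', port, nil) do |http|
    count.times { http.get('/') }
  end
end

port, counter = start_counting_server

fetch_with_new_connections(port, 3)
after_new = counter[:connections]
puts "New connection per request: #{after_new} connection(s)"

fetch_with_reused_connection(port, 3)
puts "Re-used connection: #{counter[:connections] - after_new} connection(s)"
```

On a typical setup this should report three connections for the first approach and one for the second, for the same three requests.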

Therefore, Wgit has no plans to crawl in parallel. It's simply great as is :-)