How To Crawl In Parallel - michaeltelford/wgit GitHub Wiki

If we know the URLs we want to crawl ahead of time, we can crawl them in parallel using Ruby's built-in Thread class. For example:

main.rb

require 'wgit'
require 'wgit/core_ext'

urls = %w[
  https://daveceddia.com/tutorial-trap/
  https://daveceddia.com/how-i-learn-things/
].to_urls

crawler = Wgit::Crawler.new
handler = lambda { |doc| puts "#{doc.title} - #{doc.description}" }

threads = urls.map do |url|
  Thread.new { crawler.crawl url, &handler }
end

threads.each(&:join)

Run the script with:

$ ruby main.rb 
How I Learn New Things - Someone asked recently what my learning strategy was… how do I learn new things?
The Tutorial Trap - Sometimes it's better to venture out on your own.

Notice how we create a single handler lambda that gets passed to each thread to process its crawled URL/document. We then call join on the array of threads, waiting for them all to finish.
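If you want to collect the crawled documents rather than just print them, share a thread-safe collection between the threads. Here's a minimal sketch of the same fan-out/join pattern - note the network call is stubbed out (`fake_crawl` is not part of Wgit) so it runs without any gems or connectivity:

```ruby
# Stand-in for crawler.crawl, so this sketch runs without the network.
fake_crawl = ->(url) { "<title>#{url}</title>" }

urls = %w[https://example.com/a https://example.com/b]

results = Queue.new # Queue is thread-safe, unlike a plain Array

threads = urls.map do |url|
  Thread.new { results << fake_crawl.call(url) }
end
threads.each(&:join) # wait for every thread to finish

docs = Array.new(results.size) { results.pop }
puts docs.sort
```

With a real `Wgit::Crawler` you'd push each crawled `Wgit::Document` onto the queue instead; the Queue avoids two threads mutating a plain Array at the same time.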

We can also use threads when crawling a full site. For example, using the same handler as before:

threads = []

# url is the site's root URL e.g. 'https://daveceddia.com/'.to_url
crawler.crawl_site url do |doc|
  threads << Thread.new { handler.call doc }
end

threads.each(&:join)

This won't fetch each page in parallel, but it will process each page in parallel, speeding up the overall execution. This is particularly effective when you're doing a lot of processing per page.

This is how the broken_link_finder gem uses Wgit under the hood - each crawled page on a site is passed to a thread which checks that document's links, returning those which are broken. The resulting speed increase on a large site is massive.
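A simplified sketch of that per-page pattern - here `check_links` and the page hashes are purely illustrative stand-ins (broken_link_finder's real checks are more involved), and a Mutex guards the shared results array:

```ruby
# Toy "broken link" check: treat any non-HTTPS link as broken. This is an
# illustrative stand-in, NOT broken_link_finder's actual logic.
check_links = ->(page) { page[:links].reject { |link| link.start_with?('https://') } }

pages = [ # stand-ins for crawled Wgit::Documents
  { url: 'https://example.com/a', links: ['https://ok.com', 'ftp://old.example'] },
  { url: 'https://example.com/b', links: ['https://ok.com/2'] }
]

broken = []
mutex  = Mutex.new # serialises writes to the shared broken array

threads = pages.map do |page|
  Thread.new do
    bad = check_links.call(page)
    mutex.synchronize { broken.concat(bad) }
  end
end
threads.each(&:join)

p broken # only the non-HTTPS link fails the toy check
```

Each page's (potentially slow) check runs concurrently; only the brief append to `broken` is serialised by the mutex.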

An alternative method for crawling in parallel (with less boilerplate) is to use the parallel gem.

Here's a crawl example similar to the one above, but with an added benchmark showing that the total crawl time is roughly equal to that of the slowest response:

require 'wgit'
require 'wgit/core_ext'
require 'parallel'
require 'benchmark'

NUM_THREADS = 2

Wgit.logger.level = Logger::DEBUG

urls = %w[
  https://daveceddia.com/tutorial-trap/
  https://daveceddia.com/how-i-learn-things/
].to_urls

crawler = Wgit::Crawler.new
docs = []

time = Benchmark.measure do
  docs = Parallel.map(urls, in_threads: NUM_THREADS) do |url|
    crawler.crawl(url)
  end
end

puts "Crawled #{docs.size} documents (in #{time.real.round(2)} seconds):"
docs.each { |doc| puts " - #{doc.url} (#{doc.size} bytes)" }

Which outputs:

$ ruby main.rb
[wgit] [http] Request: https://daveceddia.com/tutorial-trap/
[wgit] [http] Request: https://daveceddia.com/how-i-learn-things/
[wgit] [http] Response: 200 (10819 bytes in 0.189 seconds)
[wgit] [http] Response: 200 (12078 bytes in 0.2 seconds)
Crawled 2 documents (in 0.23 seconds):
 - https://daveceddia.com/tutorial-trap/ (10818 bytes)
 - https://daveceddia.com/how-i-learn-things/ (12077 bytes)
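That "total time is roughly the slowest response" behaviour isn't specific to the parallel gem - it falls out of running blocking calls in threads. A stubbed sketch, where sleep stands in for each request's network latency:

```ruby
require 'benchmark'

latencies = [0.1, 0.3] # simulated response times in seconds

time = Benchmark.measure do
  latencies.map { |l| Thread.new { sleep l } }.each(&:join)
end

# Wall time is close to the slowest "response" (0.3s), not the sum (0.4s),
# because sleep (like blocked network IO) releases Ruby's GVL, letting the
# threads wait concurrently.
puts "#{time.real.round(2)} seconds"
```

The same reasoning explains the benchmark output above: both HTTP requests are in flight at once, so the total is bounded by the slower 0.2 second response plus overhead.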