How To Crawl In Parallel - michaeltelford/wgit GitHub Wiki
The Wgit library crawls URLs in sequence, not in parallel. If we know ahead of time which individual URLs we want to crawl, we can wrap each crawl in a thread (or other parallel construct). Wgit's API is thread safe, so this can be done safely.
This article demonstrates parallelism with Wgit through the use of Ruby's Thread class and the parallel gem.
If you're curious as to why parallelism isn't built into Wgit, check out this Wiki article: Why Doesn't Wgit Crawl In Parallel?
Ruby's Thread Class
We can leverage Ruby's built-in Thread class, e.g.
main.rb
require 'wgit'
require 'wgit/core_ext'

urls = %w[
  https://daveceddia.com/tutorial-trap/
  https://daveceddia.com/how-i-learn-things/
].to_urls

crawler = Wgit::Crawler.new
handler = lambda { |doc| puts "#{doc.title} - #{doc.description}" }

# Start one thread per URL; each thread crawls its URL and runs the handler.
threads = urls.map do |url|
  Thread.new { crawler.crawl url, &handler }
end

# Wait for every thread to finish before the script exits.
threads.each(&:join)
Run the script with:
$ ruby main.rb
How I Learn New Things - Someone asked recently what my learning strategy was… how do I learn new things?
The Tutorial Trap - Sometimes it's better to venture out on your own.
Notice how we create a single handler that gets passed to each thread to handle its crawled document. We then call join on each thread to wait for them all to finish. Because each thread prints as soon as its crawl completes, the output order may differ from the order of the URLs.
We can also use threads when crawling a site, not to crawl in parallel, but to hand each crawled document off to a thread for parallel processing. For example, using the same crawler and handler as before (where url is the URL of the site to crawl):
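If the results are needed back in the main thread rather than printed from inside each thread, Ruby's Thread#value can help: it joins the thread and returns the value of the thread's block. The following is a minimal sketch rather than Wgit code. `fetch_title` and `fetch_all` are hypothetical names, and `fetch_title` is only a stand-in for a real crawl so that the example runs without network access.

```ruby
# Hypothetical stand-in for a crawl: sleeps briefly to simulate network
# latency, then returns a made-up "title" derived from the URL.
def fetch_title(url)
  sleep 0.05
  "Title for #{url}"
end

# Runs one thread per URL and returns the results in the same order as urls.
def fetch_all(urls)
  threads = urls.map { |url| Thread.new { fetch_title(url) } }
  # Thread#value joins the thread and returns the value of its block.
  threads.map(&:value)
end

urls = %w[https://example.com/a https://example.com/b]
fetch_all(urls).each { |title| puts title }
```

Since the results are read from the threads in the order they were created, the returned array follows the order of the input URLs regardless of which thread finishes first.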
threads = []

# crawl_site will crawl each parsed internal link in sequence (in the order they are found)
crawler.crawl_site url do |doc|
  threads << Thread.new { handler.call doc }
end

threads.each(&:join)
This won't crawl each page in parallel but it will handle each page in parallel, speeding up the overall execution. This is particularly effective when you're doing a lot of processing per page.
This is how the broken_link_finder gem uses Wgit under the hood - each crawled page on a site is passed to a thread which checks that document's links, returning those which are broken. The resulting speed increase on a large site is massive.
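Creating one thread per page can produce a large number of threads on a big site. One possible alternative, sketched below under stated assumptions, is a fixed pool of worker threads fed by Ruby's thread-safe Queue. This is not how Wgit or broken_link_finder is implemented. `process_in_pool` is a hypothetical helper, and plain strings stand in for crawled documents so the example runs without network access.

```ruby
# Processes items with a fixed number of worker threads instead of one
# thread per item. Returns the block's results in completion order,
# which is not necessarily the input order.
def process_in_pool(items, num_workers: 4, &block)
  queue   = Queue.new # Work to be done.
  results = Queue.new # Queue is thread safe, so workers can push freely.

  workers = Array.new(num_workers) do
    Thread.new do
      # Queue#pop blocks until an item arrives; :done tells a worker to stop.
      while (item = queue.pop) != :done
        results << block.call(item)
      end
    end
  end

  items.each { |item| queue << item }  # Hand the work to the pool.
  num_workers.times { queue << :done } # One stop signal per worker.
  workers.each(&:join)

  # Drain the results queue into a plain array.
  Array.new(results.size) { results.pop }
end

# Strings stand in for crawled documents here.
pages   = %w[page-1 page-2 page-3 page-4 page-5]
lengths = process_in_pool(pages, num_workers: 2) { |page| page.length }
puts "Processed #{lengths.size} pages"
```

To combine this with a site crawl, the workers would be started first, each document pushed onto the queue from inside the crawl_site block, and the stop signals sent once the crawl returns.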
Parallel Gem
An alternative method for crawling in parallel (with less boilerplate than the Thread class) is to use the parallel gem.
Here's a crawl example similar to the one above, with an added benchmark showing that the total crawl time is roughly equal to that of the slowest response:
require 'wgit'
require 'wgit/core_ext'
require 'parallel'
require 'benchmark'

NUM_THREADS = 2

# Log each HTTP request and response so the timings are visible.
Wgit.logger.level = Logger::DEBUG

urls = %w[
  https://daveceddia.com/tutorial-trap/
  https://daveceddia.com/how-i-learn-things/
].to_urls

crawler = Wgit::Crawler.new
docs = []

# Measure the total time taken to crawl all URLs in parallel.
time = Benchmark.measure do
  docs = Parallel.map(urls, in_threads: NUM_THREADS) do |url|
    crawler.crawl(url)
  end
end

puts "Crawled #{docs.size} documents (in #{time.real.round(2)} seconds):"
docs.each { |doc| puts "- #{doc.title} - #{doc.description}" }
Which outputs:
$ ruby main.rb
[wgit] [http] Request: https://daveceddia.com/tutorial-trap/
[wgit] [http] Request: https://daveceddia.com/how-i-learn-things/
[wgit] [http] Response: 200 (10819 bytes in 0.103 seconds)
[wgit] [http] Response: 200 (12078 bytes in 0.103 seconds)
Crawled 2 documents (in 0.14 seconds):
- The Tutorial Trap - Sometimes it's better to venture out on your own.
- How I Learn New Things - Someone asked recently what my learning strategy was… how do I learn new things?
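The timing behaviour shown above, where the total time is close to the slowest single response, can be reproduced without any network access. The sketch below uses only Ruby's standard library. `simulated_crawl`, `sequential_time` and `parallel_time` are hypothetical names, and `sleep` stands in for the time spent waiting on an HTTP response. In CRuby, a sleeping thread (like one waiting on network I/O) lets other threads run, which is why threads help for this kind of work.

```ruby
require 'benchmark'

# Hypothetical stand-in for a crawl: sleeps for the given number of seconds.
def simulated_crawl(delay)
  sleep delay
  delay
end

# Wall-clock seconds taken to run the simulated crawls one after another.
def sequential_time(delays)
  Benchmark.realtime { delays.each { |delay| simulated_crawl(delay) } }
end

# Wall-clock seconds taken to run the simulated crawls in one thread each.
def parallel_time(delays)
  Benchmark.realtime do
    delays.map { |delay| Thread.new { simulated_crawl(delay) } }.each(&:join)
  end
end

delays = [0.1, 0.2, 0.3]
puts "Sequential: #{sequential_time(delays).round(2)}s" # roughly the sum of the delays
puts "Parallel:   #{parallel_time(delays).round(2)}s"   # roughly the longest delay
```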