
Distributed web crawler

Requirements:

  1. Crawl and index only HTML pages on the web
  2. Minimize the latency of crawling the web
  3. Be polite (respect robots.txt and per-host rate limits; see the sketch below)
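
For requirement 3, a minimal sketch of what "be polite" usually means in practice: obey robots.txt and space out requests to the same host. The class name, the 1-second default delay, and the user agent string are illustrative assumptions, not part of the original design.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PolitenessPolicy:
    """Per-host politeness: obey robots.txt and space out requests to one host."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay   # assumed 1 s gap between hits to a host
        self.robots = {}                     # host -> RobotFileParser
        self.last_fetch = {}                 # host -> timestamp of last request

    def allowed(self, url, user_agent="example-crawler"):
        parts = urlparse(url)
        host = parts.netloc
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme}://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                pass  # robots.txt unreachable: can_fetch() stays conservative (False)
            self.robots[host] = rp
        return self.robots[host].can_fetch(user_agent, url)

    def wait_turn(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_fetch.get(host, 0.0)
        if elapsed < self.default_delay:
            time.sleep(self.default_delay - elapsed)
        self.last_fetch[host] = time.time()
```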

NFR:

  1. Total web pages = 1B
  2. Each page is 10 KB on average
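
A quick back-of-envelope check from these numbers: 1B pages × 10 KB per page ≈ 10 TB of raw HTML for a full crawl, before compression, metadata, the URL frontier, or any index built on top of it.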

API: There is no API for this; it is a purely back-end application.

HLD: Before we optimize this for a distributed environment, let's first build it for a single data center.

[High-level design diagram]
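
Since the diagram is not reproduced here, a minimal sketch of the single-data-center loop may help. It assumes a `requests`-based fetcher, a BeautifulSoup link extractor, and a hypothetical `store_page` hook for the content store; these names are illustrative, not part of the original design.

```python
import hashlib
from collections import deque
from urllib.parse import urljoin

import requests                      # assumed HTTP client
from bs4 import BeautifulSoup        # assumed HTML parser

def crawl(seed_urls, max_pages=1000):
    frontier = deque(seed_urls)      # Crawl-Jobs: URLs waiting to be fetched
    seen_urls = set(seed_urls)       # Seen-URLs: URLs already scheduled/crawled
    seen_content = set()             # Seen-Content: fingerprints of page bodies

    while frontier and len(seen_content) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue                 # requirement: HTML pages only

        fingerprint = hashlib.sha256(resp.content).hexdigest()
        if fingerprint in seen_content:
            continue                 # duplicate content, skip indexing
        seen_content.add(fingerprint)
        store_page(url, resp.text)   # hypothetical storage/indexing hook

        for link in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if next_url not in seen_urls:
                seen_urls.add(next_url)
                frontier.append(next_url)

def store_page(url, html):
    """Placeholder for the content store / indexer."""
    pass
```

The same three structures (the frontier, the seen-URL set, and the content fingerprints) reappear below as the state that has to be shared once the crawler is distributed.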

Some of the options for distributed computing are:

  1. Run everything from one data center with shared storage, and shard the crawlers by URL

    Pros: this is simple

    Cons: crawlers sitting far from the target web servers will see higher download latency

  2. Geographically distribute the crawlers but keep the storage in one data center

    Pros: relatively simple, and we still get the benefit of reduced crawling latency

    Cons: there may still be latency when storing the content and other data

  3. Distribute both the crawlers and their supporting storage across various data centers

    Pros: this reduces the latency of downloading the data

    Cons: increased coordination overhead between data centers

Based on this, option 3 looks the most appropriate.
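
One common way to split the work across data centers (an assumption here, not something the original spells out) is to hash each URL's host to an owning data center, so every URL has exactly one responsible crawler and newly discovered links can be routed to their owner.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical region identifiers; a real deployment would pick regions
# based on geography and latency rather than a bare hash.
DATA_CENTERS = ["us-east", "eu-west", "ap-south"]

def owning_data_center(url):
    """Deterministically map a URL's host to one data center.

    Hashing the host (not the full URL) keeps all pages of a site with the
    same crawler, which also makes per-host politeness easier to enforce.
    """
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return DATA_CENTERS[int.from_bytes(digest[:8], "big") % len(DATA_CENTERS)]

# A crawler that discovers a link it does not own forwards it to the
# owning data center's Crawl-Jobs queue instead of fetching it itself.
print(owning_data_center("https://example.com/some/page"))  # prints one region id
```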

For this to work as a web crawler distributed across the globe, we have to share the following between data centers (a sketch of one possible implementation follows the list):

  1. Crawl-Jobs: the list (frontier) of URLs still to be crawled.
  2. Seen-URLs: the set of URLs that have already been crawled.
  3. Seen-Content: the set of fingerprints (hash signatures/checksums) of pages that have already been crawled, used to detect duplicate content.
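
As a sketch of how these three structures could be shared, assume a Redis instance reachable from every data center; the key names, the host name, and the choice of Redis itself are all assumptions for illustration.

```python
import hashlib

import redis  # assumed shared key-value store

r = redis.Redis(host="shared-state.internal", port=6379)

def schedule_url(url):
    """Add a URL to Crawl-Jobs unless it is already in Seen-URLs."""
    # SADD returns 1 only when the member is newly added, so this doubles
    # as the Seen-URLs check and keeps the whole operation atomic.
    if r.sadd("seen-urls", url) == 1:
        r.rpush("crawl-jobs", url)

def next_url():
    """Pop the next URL to crawl from the shared Crawl-Jobs list."""
    item = r.lpop("crawl-jobs")
    return item.decode("utf-8") if item else None

def is_duplicate_content(html_bytes):
    """Record the page fingerprint in Seen-Content; return True if seen before."""
    fingerprint = hashlib.sha256(html_bytes).hexdigest()
    return r.sadd("seen-content", fingerprint) == 0
```

At 1B-page scale, Seen-URLs and Seen-Content would likely be backed by something more compact or partitioned (for example a Bloom filter or a sharded key-value store), but the operations stay the same.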

Reference: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.9637&rep=rep1&type=pdf