9: DESIGN A WEB CRAWLER

A crawler is used for many purposes:

  • Search engine indexing: This is the most common use case. A crawler collects web pages to create a local index for search engines. For example, Googlebot is the web crawler behind the Google search engine.
  • Web archiving: This is the process of collecting information from the web to preserve data for future uses. For instance, many national libraries run crawlers to archive web sites. Notable examples are the US Library of Congress [1] and the EU web archive [2].
  • Web mining: The explosive growth of the web presents an unprecedented opportunity for data mining. Web mining helps to discover useful knowledge from the internet. For example, top financial firms use crawlers to download shareholder meetings and annual reports to learn key company initiatives.
  • Web monitoring: Crawlers help to monitor copyright and trademark infringements over the Internet. For example, Digimarc [3] utilizes crawlers to discover and report pirated works.

Step 1 - Understand the problem and establish design scope

Designing a vastly scalable web crawler is an extremely complex task.

Questions to ask:

  • What is the main purpose of the crawler?
  • How many web pages does the web crawler collect per month?
  • What content types are included?
  • How long do we need to store HTML pages crawled from the web?

Back of the envelope estimation

  • Assume 1 billion web pages are downloaded every month.
  • QPS: 1,000,000,000 / 30 days / 24 hours / 3600 seconds = ~400 pages per second.
  • Peak QPS = 2 × QPS = ~800 pages per second.
  • Assume the average web page size is 500 KB.
  • 1 billion pages × 500 KB = 500 TB of storage per month. If you are unclear about digital storage units, go through the “Power of 2” section in Chapter 2 again.
  • Assuming data are stored for five years, 500 TB * 12 months * 5 years = 30 PB. A 30 PB storage is needed to store five-year content.
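
A quick arithmetic check of the numbers above (all inputs are the assumptions stated in this estimate):

```python
# Back-of-the-envelope numbers, using the assumptions above.
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600

qps = pages_per_month / seconds_per_month        # ~386, rounded up to ~400 pages per second
peak_qps = 2 * qps                               # ~800 pages per second

avg_page_size_kb = 500
monthly_storage_tb = pages_per_month * avg_page_size_kb / 10**9   # 500 TB per month
five_year_storage_pb = monthly_storage_tb * 12 * 5 / 1000         # 30 PB over five years
```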

Step 2 - Propose high-level design and get buy-in

Seed URLs

A web crawler uses seed URLs as a starting point for the crawl process.

URL Frontier

The component that stores URLs to be downloaded is called the URL Frontier. It can be thought of as a first-in-first-out (FIFO) queue.

DNS Resolver

To download a web page, a URL must be translated into an IP address. The HTML Downloader calls the DNS Resolver to get the corresponding IP address for the URL. For instance, URL www.wikipedia.org is converted to IP address 198.35.26.96 as of 3/5/2019.

Content Parser

After a web page is downloaded, it must be parsed and validated because malformed web pages could cause problems and waste storage space. Implementing a content parser inside a crawl server would slow down the crawling process, so the content parser is a separate component.

Content Seen?

This component eliminates pages with duplicate content. Comparing two pages character by character is slow, so an efficient way to accomplish this task is to compare the hash values of the two web pages.
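
A minimal sketch of hash-based duplicate detection, assuming an in-memory set of digests (the function and variable names are illustrative):

```python
import hashlib

seen_digests = set()

def content_seen(html: str) -> bool:
    """Return True if this page's content has been seen before."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```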

Content Storage

  • Most of the content is stored on disk because the data set is too big to fit in memory.
  • Popular content is kept in memory to reduce latency.

URL Extractor

URL Filter

URL Seen?

A Bloom filter or a hash table is a common technique to implement the “URL Seen?” component.
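
A minimal Bloom filter sketch for the “URL Seen?” check; the bit-array size and number of hash functions are illustrative assumptions, and a production filter would size them from the expected URL count and target false-positive rate:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url: str):
        # Derive several bit positions from the URL.
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{url}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, url: str):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))
```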

URL Storage

Step 3 - Design deep dive

Important building components

DFS vs BFS

  • DFS is usually not a good choice because the depth of DFS can be very deep.
  • BFS is commonly used by web crawlers and is implemented with a first-in-first-out (FIFO) queue (a sketch follows this list).
  • Standard BFS has two problems:
    • Most links from the same web page link back to the same host, so the crawler stays busy processing URLs from the same host and the host servers are flooded with requests. This is considered “impolite”.
    • Standard BFS does not take the priority of a URL into consideration.
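
A minimal BFS crawl loop built on a FIFO queue, to make the traversal order concrete; fetch_page and extract_urls are hypothetical stand-ins for the HTML Downloader and URL Extractor:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_page, extract_urls, max_pages=1000):
    """Breadth-first crawl: URLs are processed in the order they are discovered."""
    frontier = deque(seed_urls)   # FIFO queue of URLs to download
    visited = set(seed_urls)
    pages = 0

    while frontier and pages < max_pages:
        url = frontier.popleft()
        html = fetch_page(url)    # download the page
        pages += 1
        for link in extract_urls(html):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
```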

URL frontier

A URL frontier is a data structure that stores URLs to be downloaded. The URL frontier is an important component to ensure politeness, URL prioritization, and freshness.

Politeness

Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Too many requests are considered “impolite”, and may even be treated as a denial-of-service (DoS) attack. Solution:

  • Download one page at a time from the same host. This is implemented by maintaining a mapping from website hostnames to download (worker) threads; each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue (a sketch follows).
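
A rough sketch of the hostname-to-queue mapping, assuming a fixed pool of per-host FIFO queues; routing by hashing the hostname is an assumption, the point being that all URLs of one host end up in one queue served by one worker:

```python
from collections import deque
from urllib.parse import urlparse

NUM_QUEUES = 8  # one FIFO queue per downloader (worker) thread
queues = [deque() for _ in range(NUM_QUEUES)]

def enqueue(url: str):
    """Route a URL so that all URLs of the same host land in the same queue."""
    host = urlparse(url).hostname or ""
    queues[hash(host) % NUM_QUEUES].append(url)   # stable within one process

def next_url(worker_id: int):
    """Each worker only reads from its own queue, one page at a time per host."""
    q = queues[worker_id]
    return q.popleft() if q else None
```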

Priority

We prioritize URLs based on usefulness, which can be measured by PageRank [10], website traffic, update frequency, etc.
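
A rough sketch of a prioritizer feeding a few front queues, where higher-priority queues are drained more often; the score thresholds and queue weights are illustrative assumptions:

```python
import random
from collections import deque

# Front queues: index 0 is the highest priority.
front_queues = [deque(), deque(), deque()]
weights = [0.6, 0.3, 0.1]   # higher-priority queues are picked more often

def prioritize(url: str, score: float):
    """Place a URL into a front queue based on its usefulness score (0..1)."""
    if score > 0.8:
        front_queues[0].append(url)
    elif score > 0.4:
        front_queues[1].append(url)
    else:
        front_queues[2].append(url)

def pick_next():
    """Biased random selection so important URLs are crawled first, on average."""
    candidates = [i for i, q in enumerate(front_queues) if q]
    if not candidates:
        return None
    chosen = random.choices(candidates, weights=[weights[i] for i in candidates], k=1)[0]
    return front_queues[chosen].popleft()
```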

Combine them together

The frontier design combines the two: front queues manage prioritization, and back queues manage politeness.

Freshness

A few strategies to optimize freshness:

  • Recrawl based on web pages’ update history (a toy sketch follows this list).
  • Prioritize URLs and recrawl important pages first and more frequently.
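
A toy sketch of recrawl scheduling driven by update history; the interval-halving/doubling rule and the bounds are illustrative assumptions:

```python
import time

recrawl_interval = {}   # url -> seconds between recrawls
next_crawl_at = {}      # url -> timestamp of the next scheduled crawl

def schedule(url: str, changed: bool, base: int = 86400):
    """Recrawl pages that change often more frequently, stable pages less often."""
    interval = recrawl_interval.get(url, base)
    interval = max(3600, interval // 2) if changed else min(30 * base, interval * 2)
    recrawl_interval[url] = interval
    next_crawl_at[url] = time.time() + interval
```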

Storage for URL Frontier

  • The majority of URLs are stored on disk.
  • Buffers in memory hold URLs for enqueue/dequeue operations and are periodically flushed to disk (a simplified sketch follows this list).
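
A simplified sketch of the hybrid memory/disk frontier, assuming an append-only file on disk and a fixed flush threshold (reading URLs back from disk for dequeue is omitted for brevity):

```python
from collections import deque

class HybridFrontier:
    """Keep a small enqueue buffer in memory and spill to an append-only file on disk."""

    def __init__(self, path: str = "frontier.txt", buffer_limit: int = 10_000):
        self.path = path
        self.buffer_limit = buffer_limit
        self.buffer = deque()

    def enqueue(self, url: str):
        self.buffer.append(url)
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        with open(self.path, "a", encoding="utf-8") as f:
            while self.buffer:
                f.write(self.buffer.popleft() + "\n")
```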

HTML Downloader

Robots.txt

Robots.txt, the Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies which pages crawlers are allowed to download. Before attempting to crawl a website, a crawler should check the corresponding robots.txt first and follow its rules.

  • To avoid repeatedly downloading robots.txt, the file is fetched periodically and kept in a cache.
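
A small sketch using Python's standard-library robots.txt parser with an in-memory cache of one parser per host (the cache structure and user agent are assumptions):

```python
from urllib.parse import urlparse
from urllib import robotparser

_robots_cache = {}   # host -> RobotFileParser

def allowed(url: str, user_agent: str = "MyCrawler") -> bool:
    """Check robots.txt before downloading, caching one parser per host."""
    host = urlparse(url).netloc
    rp = _robots_cache.get(host)
    if rp is None:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        rp.read()                 # fetches and parses robots.txt
        _robots_cache[host] = rp
    return rp.can_fetch(user_agent, url)
```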

Performance optimization

  1. Distributed crawl
  • Crawl jobs are distributed across multiple servers, and each server runs multiple threads.
  • The URL space is partitioned into smaller pieces, so each downloader is responsible for a subset of the URLs.

  2. Cache DNS Resolver: DNS requests might take time due to the synchronous nature of many DNS interfaces, so resolved hostname-to-IP mappings are kept in a cache and refreshed periodically (a sketch follows this list).

  3. Locality: Distribute crawl servers geographically; when crawl servers are closer to website hosts, download times are shorter.

  4. Short timeout: Some servers respond slowly or not at all, so a maximal wait time is specified.
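
A minimal sketch of the DNS cache and the short timeout using only standard-library calls; the TTL and timeout values are illustrative assumptions:

```python
import socket
import time

_dns_cache = {}   # hostname -> (ip_address, expiry_timestamp)
DNS_TTL = 300     # assumed cache lifetime in seconds

def resolve(hostname: str) -> str:
    """Resolve a hostname, reusing a cached IP until it expires."""
    ip, expires = _dns_cache.get(hostname, (None, 0))
    if ip is None or time.time() > expires:
        ip = socket.gethostbyname(hostname)          # synchronous DNS lookup
        _dns_cache[hostname] = (ip, time.time() + DNS_TTL)
    return ip

# Short timeout: give up on unresponsive servers instead of waiting forever.
socket.setdefaulttimeout(5)
```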

Robustness

  • Consistent hashing: distributes load among downloader servers (a sketch follows this list).
  • Save crawl states and data
  • Exception handling
  • Data validation
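
A compact consistent-hashing sketch for assigning hosts to downloader servers; the replica count and MD5 hashing are illustrative choices:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map hosts to downloader servers; adding or removing a server moves few keys."""

    def __init__(self, servers, replicas: int = 100):
        self.replicas = replicas
        self.ring = []          # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server: str):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def get(self, key: str) -> str:
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

# Example: route hosts to downloader servers.
# ring = ConsistentHashRing(["downloader-1", "downloader-2", "downloader-3"])
# ring.get("example.com")  -> one of the downloader servers
```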

Extensibility

Detect and avoid problematic content

  1. Redundant content: Hashes or checksums help to detect duplication.

  2. Spider traps: A spider trap is a web page that causes a crawler to fall into an infinite loop, e.g. www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/...

  • It is hard to design an algorithm that detects every spider trap, but such traps can be spotted manually, for example by noticing unusually long URLs and setting a maximal URL length (a sketch follows this list).

  3. Data noise: Contents with little or no value (e.g. advertisements, code snippets, spam URLs) are excluded.
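
A tiny guard against URL-based spider traps, assuming limits on URL length and path depth (the thresholds are illustrative):

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048   # assumed cap; extremely long URLs are suspicious
MAX_PATH_DEPTH = 15     # assumed cap on path segments like /foo/bar/foo/bar/...

def looks_like_trap(url: str) -> bool:
    """Heuristic check; manual review is still needed for traps this misses."""
    if len(url) > MAX_URL_LENGTH:
        return True
    depth = len([seg for seg in urlparse(url).path.split("/") if seg])
    return depth > MAX_PATH_DEPTH
```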

Step 4 - Wrap up

Additional talking points:

  • Server-side rendering: Many websites generate links dynamically with scripts; downloading and parsing the raw page directly misses these generated links, so server-side rendering is applied before parsing.
  • Filter out unwanted pages: An anti-spam component helps to filter out low-quality and spam pages.
  • Database replication and sharding
  • Horizontal scaling: The key is to keep servers stateless.
  • Availability, consistency, and reliability
  • Analytics