13: DESIGN A SEARCH AUTOCOMPLETE SYSTEM - swchen1234/systemDesign GitHub Wiki

Step 1 - Understand the problem and establish design scope

提问:

  • query only at the beginning or middle as well
  • how many suggestions return?
  • spell check?
  • language
  • capitalization and special characters?
  • DAU

Back of the envelope estimation

  • Assume 10 million daily active users (DAU).
  • An average person performs 10 searches per day.
  • 20 bytes of data per query string
  • 20 requests are sent for each search on average
  • ~24,000 query per second (QPS) = 10,000,000 users * 10 queries / day * 20 characters / 24 hours / 3600 seconds. Peak QPS = QPS * 2 = ~48,000
  • Assume 20% of the daily queries are new. 10 million * 10 queries / day * 20 byte per query * 20% = 0.4 GB. This means 0.4GB of new data is added to storage daily.

Step 2 - Propose high-level design and get buy-in

Data gathering service

Assume we have a frequency table

  • Query: it stores the query string.
  • Frequency: it represents the number of times a query has been searched.that stores the query string and its frequency

Query service

通过如下sql得到,在data少的情况下是可行的,但不适合数据多的情况。

Step 3 - Design deep dive

Trie data structure

构造trie, 且在每个叶节点存放该单词的freq,改进:

  • Limit the max length of a prefix 这样find prefix的时间就从O(p)减少成了O(small constant)

  • Cache top search queries at each node 极大的减少时间,但会增加space

改进后的时间复杂度:

  1. Find the prefix node. Time complexity: O(1)
  2. Return top k. Since top k queries are cached, the time complexity for this step is O(1).

Data gathering service

  • 实时更新数据不可行,而且top results也许很不经常变化. 我们根据 where data comes from and how data is used来进行设计。
  • 尽管不同系统不同,但共同点是:data used to build the trie is usually from analytics or logging services.

Analytics Logs

  • append-only, not indexed.

Aggregators

  • Analytics logs的数据太大,我们需要aggregate.
  • 根据应用场景决定aggregate的频率

Aggregated Data

Workers

Workers are a set of servers that perform asynchronous jobs at regular intervals. They build the trie data structure and store it in Trie DB.

Trie Cache

Trie Cache is a distributed cache system that keeps trie in memory for fast read. It takes a weekly snapshot of the DB.

Trie DB

Trie DB is the persistent storage. Two options are available to store the data:

  1. Document Store: 因为trie每周新建一颗, 我们可以定期take snapshot of the doc, serialize it and store it.
  2. Key-value store: A trie can be represented in a hash table by setting
  • hash key: prefix in the trie
  • hash value: data on each trie node

Query service

速度优化:

  • AJAX request. For web applications, browsers usually send AJAX requests to fetch autocomplete results. The main benefit of AJAX is that sending/receiving a request/response does not refresh the whole web page.
  • Browser caching. autocomplete suggestions can be saved in browser cache to allow subsequent requests to get results from the cache directly. Google caches the results in the browser for 1 hour. Data sampling:only 1 out of every N requests is logged by the system.

Trie operations

  • Create Trie is created by workers using aggregated data. The source of data is from Analytics Log/DB.
  • Update Option 1: Update the trie weekly Option 2: Update individual trie node directly => 慢!但对small trie可行,注意更新一个node时,其向上一直到根结点的node全部都要更新。
  • Delete We add a filter layer (Figure 13-14) in front of the Trie Cache to filter out unwanted suggestions.

Scale the storage

如果仅依照字母顺序sharding, 会有data uneven distribution问题 =》 分析历史频率分布,由shard map manager 维持一个lookup database以决定存储位置。

Step 4 - Wrap up

  • multiple languages? 使用Unicode characters in trie nodes
  • What if top search queries in one country are different from others? 不同国家用不同trie
  • How can we support the trending (real-time) search queries?
    • Reduce the working data set by sharding.
    • Change the ranking model and assign more weight to recent search queries.
    • Data may come as streams, so we do not have access to all the data at once. Streaming data means data is generated continuously.