13: DESIGN A SEARCH AUTOCOMPLETE SYSTEM - swchen1234/systemDesign GitHub Wiki

Step 1 - Understand the problem and establish design scope

Clarifying questions:

Assume 10 million daily active users (DAU).
An average person performs 10 searches per day.
Each query string is 20 bytes of data.
20 requests are sent for each search on average.
~24,000 queries per second (QPS) = 10,000,000 users * 10 queries / day * 20 characters / 24 hours / 3600 seconds. Peak QPS = QPS * 2 = ~48,000.
Assume 20% of the daily queries are new. 10 million * 10 queries / day * 20 bytes per query * 20% = 0.4 GB. This means 0.4 GB of new data is added to storage daily.
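
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate; the constant and function names are made up for illustration, and 1 GB is taken as 10^9 bytes.

```python
# Back-of-the-envelope numbers taken from the assumptions listed above.
DAU = 10_000_000
SEARCHES_PER_USER_PER_DAY = 10
REQUESTS_PER_SEARCH = 20      # one request per typed character of a 20-character query
BYTES_PER_QUERY = 20
NEW_QUERY_PERCENT = 20
SECONDS_PER_DAY = 24 * 3600


def average_qps():
    """Autocomplete requests per second, averaged over a day."""
    return DAU * SEARCHES_PER_USER_PER_DAY * REQUESTS_PER_SEARCH / SECONDS_PER_DAY


def peak_qps():
    """Peak load, assumed to be twice the average."""
    return 2 * average_qps()


def new_bytes_per_day():
    """Bytes of previously unseen query strings added per day."""
    return DAU * SEARCHES_PER_USER_PER_DAY * BYTES_PER_QUERY * NEW_QUERY_PERCENT // 100
```

The exact average works out to about 23,148 QPS and the peak to about 46,296 QPS, which the notes round up to ~24,000 and ~48,000. The daily growth is 400,000,000 bytes, i.e. 0.4 GB.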

Assume we have a frequency table that stores each query string and its frequency:

Query: the query string.
Frequency: the number of times the query has been searched.

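The top suggestions can be obtained with a SQL query over this table. This is workable when the data set is small, but it does not scale to large data sets.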
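
The exact SQL is not reproduced in these notes. A minimal sketch of what such a query might look like, run through Python's built-in `sqlite3` module, is shown here. The table name `frequency_table`, the column names, and the helper `top_k_sql` are assumptions made for illustration.

```python
import sqlite3


def make_frequency_table(rows):
    """Create an in-memory table of (query, frequency) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE frequency_table (query TEXT PRIMARY KEY, frequency INTEGER)"
    )
    conn.executemany("INSERT INTO frequency_table VALUES (?, ?)", rows)
    return conn


def top_k_sql(conn, prefix, k=5):
    """Return the k most frequent queries that start with `prefix`."""
    cursor = conn.execute(
        "SELECT query, frequency FROM frequency_table "
        "WHERE query LIKE ? ORDER BY frequency DESC LIMIT ?",
        (prefix + "%", k),  # note: % or _ inside the prefix is not escaped here
    )
    return cursor.fetchall()
```

Every keystroke triggers a prefix scan plus a sort, which is why this approach becomes a bottleneck as the table grows.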

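Build a trie and store each word's frequency in the node where that word ends. Improvements:

Time complexity after the improvements:

Find the prefix node. Time complexity: O(1), assuming the prefix length is capped at a small constant.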
Return top k. Since top k queries are cached, the time complexity for this step is O(1).
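
A minimal sketch of a trie that caches the top k queries on every node, so a lookup only walks the prefix and reads the cached list. The class and method names are illustrative, and the sketch assumes each query is inserted exactly once during a batch build. The walk costs O(len(prefix)), which is constant only if prefix length is capped.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)   # char -> TrieNode
    frequency: int = 0                             # set where a complete query ends
    top_k: list = field(default_factory=list)      # cached (query, frequency) pairs


class AutocompleteTrie:
    def __init__(self, k=5):
        self.k = k
        self.root = TrieNode()

    def insert(self, query, frequency):
        """Add a query once and refresh the cache on every node along its path."""
        node = self.root
        path = [node]
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            path.append(node)
        node.frequency = frequency
        for ancestor in path:
            ancestor.top_k.append((query, frequency))
            # Highest frequency first; ties broken alphabetically.
            ancestor.top_k.sort(key=lambda entry: (-entry[1], entry[0]))
            del ancestor.top_k[self.k:]

    def suggest(self, prefix):
        """Walk to the prefix node and return its cached top-k queries."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [query for query, _ in node.top_k]
```

The trade-off is memory: every node stores up to k entries in exchange for not traversing its subtree at query time.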

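Updating the data in real time is not practical, and the top results may change only rarely. We design the system around where the data comes from and how the data is used.
Systems differ, but they have one thing in common: the data used to build the trie usually comes from analytics or logging services.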
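
Because the trie is built from logged data rather than updated on every request, the raw log first has to be rolled up into per-period counts. A minimal sketch, assuming log entries are `(query, ISO-8601 timestamp)` pairs and that counts are grouped by ISO week; the entry format and the function name are assumptions, not something these notes specify.

```python
from collections import Counter
from datetime import datetime


def aggregate_weekly(log_entries):
    """Count searches per (ISO year, ISO week, query)."""
    counts = Counter()
    for query, timestamp in log_entries:
        iso = datetime.fromisoformat(timestamp).isocalendar()
        year, week = iso[0], iso[1]
        counts[(year, week, query)] += 1
    return counts
```

A real-time product might aggregate over much shorter windows; weekly is used here only because it matches a weekly rebuild.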

Workers are a set of servers that perform asynchronous jobs at regular intervals. They build the trie data structure and store it in Trie DB.

Trie Cache is a distributed cache system that keeps the trie in memory for fast reads. It takes a weekly snapshot of the DB.

Trie DB is the persistent storage. Two options are available to store the data:

Document Store: Because a new trie is built every week, we can periodically take a snapshot of it, serialize it, and store it.
Key-value store: A trie can be represented in a hash table by setting
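
The key-value option can be sketched as follows. The sentence above is cut off in these notes, so the mapping used here is an assumption: each prefix becomes a key, and the data that would sit on that prefix's trie node (its top k queries) becomes the value. The function name `trie_to_kv` is hypothetical.

```python
def trie_to_kv(frequencies, k=5):
    """Flatten a {query: frequency} table into {prefix: top-k [(query, frequency)]}."""
    table = {}
    for query, freq in frequencies.items():
        # Every non-empty prefix of the query corresponds to one trie node.
        for end in range(1, len(query) + 1):
            table.setdefault(query[:end], []).append((query, freq))
    for prefix, entries in table.items():
        # Highest frequency first; ties broken alphabetically.
        entries.sort(key=lambda entry: (-entry[1], entry[0]))
        table[prefix] = entries[:k]
    return table
```

With this layout a lookup is a single `get` on the prefix, which maps naturally onto a distributed key-value store.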

Speed optimizations:

AJAX request. For web applications, browsers usually send AJAX requests to fetch autocomplete results. The main benefit of AJAX is that sending/receiving a request/response does not refresh the whole web page.
Browser caching. Autocomplete suggestions can be saved in the browser cache so that subsequent requests get results from the cache directly. Google caches the results in the browser for 1 hour.
Data sampling. Only 1 out of every N requests is logged by the system.
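
Data sampling can be sketched with a simple counter that keeps every Nth request. The class name `SampledLogger` is made up for illustration, and a real system might sample randomly instead of deterministically.

```python
class SampledLogger:
    """Log only 1 out of every n requests."""

    def __init__(self, n):
        self.n = n
        self.seen = 0
        self.logged = []

    def record(self, query):
        """Return True if this request was logged, False if it was skipped."""
        self.seen += 1
        if self.seen % self.n == 0:
            self.logged.append(query)
            return True
        return False
```

Frequencies computed from a sampled log are roughly 1/N of the true values, which does not matter for ranking as long as every query is sampled at the same rate.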

Create: The trie is created by workers using aggregated data. The data comes from the Analytics Log/DB.
Update: Option 1: update the trie weekly. Option 2: update individual trie nodes directly. Option 2 is slow, but workable for a small trie. Note that when a node is updated, all of its ancestors up to the root must be updated as well, since each ancestor caches the top queries of its subtree.
Delete: We add a filter layer (Figure 13-14) in front of the Trie Cache to filter out unwanted suggestions.
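
A minimal sketch of such a filter layer, assuming the unwanted suggestions are identified by a set of blocked words. The notes do not say how the filter rules are defined, so the word-level matching and the function name are assumptions.

```python
def filter_suggestions(suggestions, blocked_words):
    """Drop any suggestion that contains a blocked word (case-insensitive)."""
    blocked = {word.lower() for word in blocked_words}
    kept = []
    for suggestion in suggestions:
        words = set(suggestion.lower().split())
        if words.isdisjoint(blocked):
            kept.append(suggestion)
    return kept
```

Filtering at read time hides a suggestion immediately; the entry can then be physically removed from the stored trie on the next rebuild.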

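Sharding purely by alphabetical order leads to uneven data distribution. Instead, analyze the historical frequency distribution and have a shard map manager maintain a lookup database that decides where each piece of data is stored.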
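
One way to build such a lookup table is sketched here. It uses a simple greedy balancing heuristic (heaviest first character goes to the currently lightest shard), which is a stand-in chosen for illustration and not necessarily what a real shard map manager does. The function names and the fallback shard are assumptions.

```python
def build_shard_map(char_counts, num_shards):
    """Assign first characters to shards so historical load is roughly even."""
    loads = [0] * num_shards
    shard_map = {}
    # Place the heaviest characters first; ties broken alphabetically.
    ordered = sorted(char_counts.items(), key=lambda item: (-item[1], item[0]))
    for ch, count in ordered:
        target = loads.index(min(loads))  # currently lightest shard
        shard_map[ch] = target
        loads[target] += count
    return shard_map, loads


def shard_for(query, shard_map, default_shard=0):
    """Look up the shard for a query by its first character."""
    return shard_map.get(query[:1], default_shard)
```

With historical counts where `s` is as popular as `u` through `z` combined, this gives `s` a shard of its own and groups the rest together.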

How do we support multiple languages? Store Unicode characters in the trie nodes.
What if top search queries in one country are different from others? Build a separate trie for each country.
How can we support trending (real-time) search queries?
- Reduce the working data set by sharding.
- Change the ranking model and assign more weight to recent search queries.
- Data may come as streams, so we do not have access to all the data at once. Streaming data means data is generated continuously.
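
The second idea, weighting recent searches more heavily, could be sketched with exponential decay. The decay formula and the 7-day half-life are illustrative assumptions; the notes do not prescribe a specific ranking model.

```python
def recency_score(daily_counts, half_life_days=7.0):
    """Score a query from {age_in_days: search_count}, halving weight every half-life."""
    return sum(
        count * 0.5 ** (age / half_life_days)
        for age, count in daily_counts.items()
    )


def rank_by_recency(history, half_life_days=7.0):
    """Order queries by decayed score, highest first."""
    return sorted(
        history,
        key=lambda query: recency_score(history[query], half_life_days),
        reverse=True,
    )
```

For example, a query searched 10 times two weeks ago scores 2.5, while one searched 4 times today scores 4.0, so the newer query ranks first despite having fewer total searches.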