13: DESIGN A SEARCH AUTOCOMPLETE SYSTEM - swchen1234/systemDesign GitHub Wiki
Step 1 - Understand the problem and establish design scope
提问:
- query only at the beginning or middle as well
- how many suggestions return?
- spell check?
- language
- capitalization and special characters?
- DAU
Back of the envelope estimation
- Assume 10 million daily active users (DAU).
- An average person performs 10 searches per day.
- 20 bytes of data per query string
- 20 requests are sent for each search on average
- ~24,000 query per second (QPS) = 10,000,000 users * 10 queries / day * 20 characters / 24 hours / 3600 seconds. Peak QPS = QPS * 2 = ~48,000
- Assume 20% of the daily queries are new. 10 million * 10 queries / day * 20 byte per query * 20% = 0.4 GB. This means 0.4GB of new data is added to storage daily.
Step 2 - Propose high-level design and get buy-in
Data gathering service
Assume we have a frequency table
- Query: it stores the query string.
- Frequency: it represents the number of times a query has been searched.that stores the query string and its frequency
Query service
通过如下sql得到,在data少的情况下是可行的,但不适合数据多的情况。
Step 3 - Design deep dive
Trie data structure
构造trie, 且在每个叶节点存放该单词的freq,改进:
-
Limit the max length of a prefix 这样find prefix的时间就从O(p)减少成了O(small constant)
-
Cache top search queries at each node 极大的减少时间,但会增加space
改进后的时间复杂度:
- Find the prefix node. Time complexity: O(1)
- Return top k. Since top k queries are cached, the time complexity for this step is O(1).
Data gathering service
- 实时更新数据不可行,而且top results也许很不经常变化. 我们根据 where data comes from and how data is used来进行设计。
- 尽管不同系统不同,但共同点是:data used to build the trie is usually from analytics or logging services.
Analytics Logs
- append-only, not indexed.
Aggregators
- Analytics logs的数据太大,我们需要aggregate.
- 根据应用场景决定aggregate的频率
Aggregated Data
Workers
Workers are a set of servers that perform asynchronous jobs at regular intervals. They build the trie data structure and store it in Trie DB.
Trie Cache
Trie Cache is a distributed cache system that keeps trie in memory for fast read. It takes a weekly snapshot of the DB.
Trie DB
Trie DB is the persistent storage. Two options are available to store the data:
- Document Store: 因为trie每周新建一颗, 我们可以定期take snapshot of the doc, serialize it and store it.
- Key-value store: A trie can be represented in a hash table by setting
- hash key: prefix in the trie
- hash value: data on each trie node
Query service
速度优化:
- AJAX request. For web applications, browsers usually send AJAX requests to fetch autocomplete results. The main benefit of AJAX is that sending/receiving a request/response does not refresh the whole web page.
- Browser caching. autocomplete suggestions can be saved in browser cache to allow subsequent requests to get results from the cache directly. Google caches the results in the browser for 1 hour. Data sampling:only 1 out of every N requests is logged by the system.
Trie operations
- Create Trie is created by workers using aggregated data. The source of data is from Analytics Log/DB.
- Update Option 1: Update the trie weekly Option 2: Update individual trie node directly => 慢!但对small trie可行,注意更新一个node时,其向上一直到根结点的node全部都要更新。
- Delete We add a filter layer (Figure 13-14) in front of the Trie Cache to filter out unwanted suggestions.
Scale the storage
如果仅依照字母顺序sharding, 会有data uneven distribution问题 =》 分析历史频率分布,由shard map manager 维持一个lookup database以决定存储位置。
Step 4 - Wrap up
- multiple languages? 使用Unicode characters in trie nodes
- What if top search queries in one country are different from others? 不同国家用不同trie
- How can we support the trending (real-time) search queries?
- Reduce the working data set by sharding.
- Change the ranking model and assign more weight to recent search queries.
- Data may come as streams, so we do not have access to all the data at once. Streaming data means data is generated continuously.