Typeahead - rFronteddu/general_wiki GitHub Wiki
System Requirements
Functional requirements:
- As the user types in their query, our service should suggest top 10 terms starting with whatever the user has typed.
Non functional requirements:
- The suggestions should appear in real-time, in less than 200ms.
Basic System Design and Algorithm
We have a large number of strings that users must be able to search by any prefix; the service suggests terms that match whatever the user has typed so far. To serve queries quickly, we should keep the index in memory in a highly efficient data structure.
One appropriate approach is the Trie (pronounced "try"), a tree-like data structure that stores phrases character by character, one character per node. Note that we can merge chains of nodes that have only one child (as in a radix trie) to save storage space.
For simplicity, assume the data is case-insensitive.
How to find top suggestions?
One solution is to store the count of searches that terminated at each node. For example, if users have searched for CAPTAIN 100 times and CAPTION 500 times, we store each count at the node for the last character of the phrase. Now if a user types CAP, we know the most-searched word under the prefix CAP is CAPTION: given a prefix, we can traverse the sub-tree under it to find the top suggestions.
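As a minimal sketch of this idea (class and method names are illustrative, not from the original), a trie that stores per-phrase search counts and answers a prefix query by scanning the sub-tree:

```python
from collections import defaultdict

class TrieNode:
    """Trie node; count > 0 marks the end of a searched phrase."""
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.count = 0  # times this exact phrase was searched

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase, count=1):
        node = self.root
        for ch in phrase:
            node = node.children[ch]
        node.count += count

    def top_suggestions(self, prefix, k=10):
        """Walk to the prefix node, then scan its sub-tree
        for the k most-searched phrases."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        def dfs(n, phrase):
            if n.count > 0:
                results.append((phrase, n.count))
            for ch, child in n.children.items():
                dfs(child, phrase + ch)
        dfs(node, prefix)
        results.sort(key=lambda t: -t[1])
        return [phrase for phrase, _ in results[:k]]

t = Trie()
t.insert("captain", 100)
t.insert("caption", 500)
print(t.top_suggestions("cap"))  # ['caption', 'captain']
```

The full-subtree scan here is exactly what the next question shows to be too slow at scale.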
Given a prefix, how much time does it take to traverse its sub-tree?
Given the amount of data we need to index, we should expect a huge tree. Even traversing a single sub-tree could take a long time, so we need to split the tree.
Can we store top suggestions with each node?
We can, at the cost of a lot of extra storage: store the top 10 suggestions at each node, ready to return to the user. To save space, we can store only references to the terminal nodes rather than entire phrases, and reconstruct a phrase by traversing back through parent references. We also need to store a frequency with each reference to keep track of the top suggestions.
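A sketch of the reference idea, assuming each node keeps a parent back-reference; `phrase_of` rebuilds a phrase from a terminal-node reference (all names here are hypothetical):

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char = char
        self.parent = parent     # back-reference used to rebuild the phrase
        self.children = {}
        self.count = 0           # searches terminating at this node
        self.top = []            # up to 10 (terminal_node, freq) references

def phrase_of(node):
    """Recover the full phrase from a terminal-node reference
    by walking parent references back to the root."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))

# hypothetical fragment: root -> 'c' -> 'a' -> 'p', "cap" searched 7 times
root = Node()
c = root.children["c"] = Node("c", root)
a = c.children["a"] = Node("a", c)
p = a.children["p"] = Node("p", a)
p.count = 7
c.top = [(p, 7)]   # "cap" stored by reference, not as a string
print(phrase_of(p))  # cap
```

Serving a prefix query is then just `[phrase_of(n) for n, _ in prefix_node.top]`, with no sub-tree traversal at query time.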
How to build this trie
We can build it bottom up. Each parent node can recursively call all the child nodes to calculate their top suggestions and their counts. Parent nodes will combine top suggestions from all of their children to determine their top suggestions.
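The bottom-up build can be sketched as a post-order pass that merges children's top lists (names and the `K = 10` cutoff are illustrative):

```python
import heapq

K = 10

class Node:
    def __init__(self):
        self.children = {}
        self.count = 0   # searches terminating at this node
        self.top = []    # up to K (freq, phrase) pairs after compute_top

def insert(root, phrase, count):
    node = root
    for ch in phrase:
        node = node.children.setdefault(ch, Node())
    node.count = count
    node.phrase = phrase

def compute_top(node):
    """Post-order pass: each node's top list is its own terminal
    count merged with the top lists of all its children."""
    candidates = [(node.count, node.phrase)] if node.count > 0 else []
    for child in node.children.values():
        compute_top(child)
        candidates.extend(child.top)
    node.top = heapq.nlargest(K, candidates)

root = Node()
insert(root, "captain", 100)
insert(root, "caption", 500)
compute_top(root)
print(root.top)  # [(500, 'caption'), (100, 'captain')]
```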
How to update the trie:
Assume 5 billion searches per day, approximately 60K queries per second. Updating the trie for every query would be too resource-intensive, so one solution is to update the trie periodically.
- As new queries come in, we can log them and track their frequencies.
- We can have a Map-Reduce (MR) set-up that processes all the logged data periodically, for example every hour.
- These MR jobs calculate the frequencies of all terms searched in the past hour; we then use these values to update the current snapshot of the trie.
- We can do this offline to not block read queries. We have two options:
- Make a copy of the trie on each server and update it offline. Once done, we can switch to the new copy and discard the old one.
- We can have a master-slave configuration for each trie server. We can update the slave while the master is serving traffic. Once the update is complete, we can make the slave the new master. We can later update the old master too which can then start serving traffic again.
How to update frequencies of typeahead suggestions?
- Since we store frequencies of suggestions with each node, we need to update them too.
- We can update only differences in frequencies rather than recounting all search terms from scratch.
- If we keep counts of all terms searched over the last 10 days, we can use an Exponential Moving Average (EMA) for each term, which gives more weight to the latest data.
- After inserting a new term in the trie, we go to the terminal node of the phrase and increase its frequency.
- Since we store the top 10 queries in each node, it is possible that this search term has jumped into the top 10 of other nodes, whose lists we then need to update. To do this, we traverse from the terminal node's parents up to the root. At each ancestor we check whether the current query is already in its top 10: if so, we update its frequency; otherwise, we check whether the query's frequency is high enough to replace the lowest-ranked item, and if so, we evict the term with the lowest frequency.
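The EMA update and the upward propagation of a changed frequency might look like this (a sketch; the node layout and all names are assumptions):

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.top = []    # (terminal_node, freq) pairs, highest frequency first

def update_ema(prev_ema, new_count, alpha=0.3):
    """EMA update: alpha weights the latest period's count more heavily."""
    return alpha * new_count + (1 - alpha) * prev_ema

def propagate(terminal, freq, k=10):
    """Walk from the terminal node's parent up to the root,
    refreshing each ancestor's top-k list."""
    node = terminal.parent
    while node is not None:
        entry = next((e for e in node.top if e[0] is terminal), None)
        if entry is not None:
            node.top.remove(entry)               # already ranked: refresh frequency
            node.top.append((terminal, freq))
        elif len(node.top) < k:
            node.top.append((terminal, freq))
        else:
            lowest = min(node.top, key=lambda e: e[1])
            if freq > lowest[1]:                 # bump out the lowest-ranked term
                node.top.remove(lowest)
                node.top.append((terminal, freq))
        node.top.sort(key=lambda e: -e[1])
        node = node.parent

# hypothetical chain root -> a -> b, where b is a terminal node
root = Node()
a = Node(parent=root)
b = Node(parent=a)
propagate(b, update_ema(0.0, 50, alpha=1.0))  # first observation, full weight
```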
How to remove a term from the trie?
To remove a term from the trie, we can do a full removal when the update cycle happens and, in the meantime, implement a filter layer on each server that removes such items before sending suggestions to users.
Other ranking criteria for suggestions?
We could also use factors such as freshness, user location, language, demographics, and personal history.
Permanent Storage of the Trie
Store the trie
- We can periodically take a snapshot of the trie and store it in a file. We can use these snapshots to rebuild a trie if the server goes down. We can start at the root node and save the trie level-by-level, storing for each node what character it contains and how many children it has.
- We can recalculate top terms while we are rebuilding the trie. Each node can calculate its top suggestions and pass it to the parent. Each parent will merge results with its children to figure out its top suggestions.
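A sketch of such a snapshot, assuming each record stores the character, the terminal count, and the number of children, written and replayed in breadth-first order (all names are illustrative):

```python
from collections import deque

class Node:
    def __init__(self, char="", count=0):
        self.char = char
        self.count = count      # searches terminating at this node
        self.children = {}

def serialize(root):
    """Breadth-first snapshot: one (char, count, num_children) record per node."""
    records, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        records.append((node.char, node.count, len(node.children)))
        queue.extend(node.children.values())
    return records

def deserialize(records):
    """Rebuild the trie by replaying records in the same breadth-first order."""
    it = iter(records)
    char, count, n = next(it)
    root = Node(char, count)
    queue = deque([(root, n)] if n else [])   # nodes still expecting children
    for char, count, n in it:
        parent, remaining = queue[0]
        child = Node(char, count)
        parent.children[char] = child
        if remaining == 1:
            queue.popleft()
        else:
            queue[0] = (parent, remaining - 1)
        if n:
            queue.append((child, n))
    return root

# small trie for "cap" (3 searches) and "cat" (5 searches)
root = Node()
for phrase, cnt in [("cap", 3), ("cat", 5)]:
    node = root
    for ch in phrase:
        node = node.children.setdefault(ch, Node(ch))
    node.count = cnt
```

Note that the top-suggestion lists are deliberately left out of the snapshot; as described above, they are recalculated during the rebuild.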
Scale Estimation
If we are building a service at the same scale as Google, we can expect 5 billion searches a day, approximately 60K queries per second.
Many of these will be duplicates; assume only 20% of the search terms are unique. If we only want to index the top 50% of search terms, we can discard many of the less frequently searched queries. Assume we end up with 100 million unique terms for which we want to build the index.
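A quick back-of-the-envelope check of the QPS figure:

```python
searches_per_day = 5_000_000_000
qps = searches_per_day / 86_400   # 86,400 seconds in a day
print(round(qps))                 # ~57,870, i.e. roughly 60K QPS
```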
Storage Estimation
If on avg, each query consists of 3 words and if the average length of a word is 5 characters, this will give us 15 characters of average query size. Assuming 2 bytes per character, we will need 30 bytes for an average query => 100 million * 30 bytes = 3GB
This data will grow every day, and we should also remove terms that are no longer searched. If we assume 2% new queries per day and maintain the index for the last year, total storage is approximately: 3GB + (0.02 * 3GB * 365 days) ≈ 25GB
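The storage arithmetic, spelled out:

```python
unique_terms = 100_000_000
bytes_per_query = 3 * 5 * 2                     # 3 words x 5 chars x 2 bytes = 30
base_gb = unique_terms * bytes_per_query / 1e9  # 3.0 GB for the initial index
total_gb = base_gb + 0.02 * base_gb * 365       # 2% new queries/day over a year
print(base_gb, total_gb)                        # 3.0 GB base, ~25 GB total
```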
Data partition
Although our index can fit on one server, we can split it to improve efficiency and lower latencies.
- Range based: Storing phrases in separate partitions based on their first letter would quickly become unbalanced.
- Partition based on server capacity: partitioning the trie based on server memory (breaking the trie wherever it no longer fits) can still lead to hotspots.
- Hash of a term: each search term is passed to a hash function, which generates a server number, and we store the term on that server. This distributes terms more evenly and reduces hotspots. To find typeahead suggestions for a prefix, however, we have to ask all the servers and then aggregate the results.
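A sketch of hash-based partitioning with scatter-gather reads (the per-server `top_suggestions` API, the MD5-based hash, and the cluster size are all assumptions):

```python
import hashlib
import heapq

NUM_SERVERS = 4  # hypothetical cluster size

def server_for(term):
    """Stable hash of the term picks the owning partition."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SERVERS

def suggest(prefix, servers, k=10):
    """Scatter-gather: any partition may own terms with this prefix,
    so query all of them and merge the (term, freq) results by frequency."""
    merged = []
    for s in servers:
        merged.extend(s.top_suggestions(prefix))  # assumed per-server API
    return [t for t, _ in heapq.nlargest(k, merged, key=lambda e: e[1])]
```

A stable hash is what lets both writes and reads agree on which server owns a term without any lookup table.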
Cache
Caching the top searched terms will be extremely helpful: a small fraction of queries is responsible for most of the traffic. We can cache the most frequently searched terms and their typeahead suggestions.
We can also build a simple ML model to try and predict the engagement on each suggestion based on simple counting, personalization, or trending data, and cache these terms.
Replication and LB
We should have replicas of our trie servers, both for load balancing and for fault tolerance. We also need a load balancer that keeps track of the data-partitioning scheme and redirects traffic based on the prefix.
Fault Tolerance
If a trie server goes down, we can use a master-slave configuration so that the slave takes over on failover. Any server that comes back up can rebuild its trie from the last snapshot.
Typeahead Client
We can perform some optimizations on the client to improve the user's experience:
- Only hit the server if the user has not pressed a key for 50ms (debouncing).
- Cancel in-flight requests if the user is still typing.
- Wait for a couple of characters before sending the first request.
- The client can pre-fetch some data from the server to save future requests.
- The client can pre-establish a connection with the server so the first suggestions arrive faster once the user starts typing.
- The server can push some of its cache to CDNs for efficiency.
Personalization
Users can receive suggestions based on their search history, location, language, etc. We can store each user's personal history separately on the server and also cache it on the client. The server can merge these terms into the final set before sending it to the user.