Messenger Clone
Requirements and goals
Functional requirements
- one-on-one conversations between users
- keep track of online/offline status of users
- store chat history
Non-functional requirements:
- Real-time chat with minimal latency
- Highly consistent: users should see the same chat history on all devices
- Highly available, though availability can be sacrificed for consistency
Extra
- Support group chat
- Notify users of new messages when they are offline
Capacity estimation and constraints
- Assume 500M daily active users
- Assume on avg, each user sends 40 messages daily
- => 20B messages per day.
Storage estimation
- Assume each msg is 100 bytes; to store all the messages we need 20B * 100 bytes = 2 TB/day
- To store five years we need 2 TB * 365 days * 5 years ~ 3.6 PB
- We also need to store user info and message metadata (ID, timestamp). The calculation above also ignores data compression and replication.
Bandwidth Estimation
- With 2 TB of new data every day, we have roughly 25 MB/s of incoming data (2 TB / 86,400 sec ≈ 23 MB/s, rounded up)
- Each message has to go to at least one other user, so we will have roughly 25 MB/s of outgoing data too
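These back-of-the-envelope figures can be sanity-checked with a few lines of arithmetic (the constants mirror the assumptions above):

```python
# Sanity check of the capacity, storage, and bandwidth estimates above.
DAILY_ACTIVE_USERS = 500_000_000
MESSAGES_PER_USER_PER_DAY = 40
BYTES_PER_MESSAGE = 100
SECONDS_PER_DAY = 86_400

messages_per_day = DAILY_ACTIVE_USERS * MESSAGES_PER_USER_PER_DAY
storage_per_day = messages_per_day * BYTES_PER_MESSAGE          # bytes
storage_five_years = storage_per_day * 365 * 5                  # bytes
bandwidth_bytes_per_sec = storage_per_day / SECONDS_PER_DAY

print(f"{messages_per_day:,} messages/day")             # 20,000,000,000
print(f"{storage_per_day / 1e12:.1f} TB/day")           # 2.0 TB/day
print(f"{storage_five_years / 1e15:.2f} PB / 5 years")  # 3.65 PB (quoted as ~3.6 PB)
print(f"{bandwidth_bytes_per_sec / 1e6:.0f} MB/s")      # ~23 MB/s each way
```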
High Level Design
We will have a chat server that orchestrates communication between users. When a user wants to send a message, they connect to the chat server and send the message to it. The server then passes the message on to the other user and stores it in a DB. The server sends acknowledgments to the sender when a message is received by the server and again when it is delivered to the destination. Clients send acknowledgments to the server each time they receive a message.
Detailed Component Design
Message Handling
We need to decide how to send and receive data efficiently.
To send a message, a user connects to a server and posts the message for the other user. To get messages from the server, a client can either poll periodically or keep an open connection that the server uses to push new messages. Polling does not scale: it generates traffic even when nothing is being exchanged between users, and we would have to poll frequently to keep latency low. We therefore prefer a push model in which the server maintains a connection with each client.
How to maintain an open connection with the server
We can use long polling or WebSocket. WebSocket is faster and uses fewer resources, so we prefer it in this case.
How do we keep track of open connections so we can route messages efficiently?
The server maintains a hash table where the key is the UserID and the value is the connection object. Whenever the server receives a message for a user, it looks that user up in the table to find the connection and sends the message over it.
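A minimal sketch of that lookup table, assuming a hypothetical `Connection` object in place of the real WebSocket handle:

```python
# Sketch of the per-server connection table: UserID -> connection object.
# `Connection` is a hypothetical stand-in for a real WebSocket handle.
class Connection:
    def __init__(self):
        self.outbox = []

    def send(self, message):
        self.outbox.append(message)


class ChatServer:
    def __init__(self):
        self.connections = {}  # UserID -> Connection

    def register(self, user_id, conn):
        self.connections[user_id] = conn

    def deliver(self, user_id, message):
        """Look up the receiver's connection and push the message.
        Returns False when the user has no open connection (offline)."""
        conn = self.connections.get(user_id)
        if conn is None:
            return False
        conn.send(message)
        return True
```

The `deliver` return value is what lets the server fall back to the offline-user handling described below.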
What to do when a server receives a message for a user that is offline?
If the receiver has disconnected, the server can notify the sender of the delivery failure. We can either embed retry logic in the client, so users don't have to retype messages, or store the message on the server and retry sending once the receiver reconnects.
How many chat servers do we need?
If we plan for 500M connections at any time, and assume a modern server can handle 50K concurrent connections, we need at least 10K servers.
How do we know which server holds the connection to which user?
We can use a software load balancer in front of the chat servers that maps each UserID to a server and redirects the request there.
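A simple version of that mapping, sketched with plain hash-mod (`server_for_user` is a hypothetical helper; a consistent-hashing ring would be the usual production choice so that adding servers doesn't remap every user):

```python
import hashlib

def server_for_user(user_id: str, num_servers: int) -> int:
    """Map a UserID to a chat-server index.

    Uses a stable hash (not Python's built-in hash(), which is
    randomized per process), so every load balancer instance agrees
    on the same mapping.
    """
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_servers
```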
How does a server process a "deliver message" request?
Upon receiving a message:
- The server stores the message in the DB first to avoid loss (we don't need to wait for this write to complete)
- It sends the message to the receiver (finding the server that holds the receiver's connection and passing the message to it)
- It acknowledges the message to the sender
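The three steps above can be sketched as one handler; `store_async`, `forward`, and `ack` are hypothetical hooks standing in for the DB write, the server-to-server hop, and the sender acknowledgment:

```python
def handle_send(message, store_async, forward, ack):
    """Process a "deliver message" request:
    1. hand the message to durable storage (fire-and-forget: we do
       not block on the write completing),
    2. forward it toward the server holding the receiver's connection,
    3. acknowledge receipt back to the sender.
    """
    store_async(message)                # kick off the DB write
    forward(message)                    # route to the receiver's server
    ack(message["sender"], "received")  # tell the sender we have it
```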
How do we maintain sequencing of messages?
We will keep a sequence number for every message for each client. Different users may see a different sequence for the same conversation, but the experience will be consistent across all of a user's devices.
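One way to implement this, assuming the server assigns per-user monotonically increasing sequence numbers (a sketch, not the only possible scheme):

```python
from collections import defaultdict
import itertools

class Sequencer:
    """Per-client monotonically increasing sequence numbers.

    Two users in the same conversation may see different numbers for
    the same message, but all of one user's devices replay messages
    in a single consistent order.
    """
    def __init__(self):
        # Each user gets an independent counter starting at 1.
        self._counters = defaultdict(lambda: itertools.count(1))

    def next_for(self, user_id):
        return next(self._counters[user_id])
```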
Storing and retrieving messages from the db
Whenever the server receives a new message, it stores it in the DB
- We need a DB that supports a huge rate of small message inserts and fast sequential (range) reads of messages. A wide-column store like HBase fits: it can store multiple values against one key in multiple columns.
How can clients efficiently fetch data from server?
Clients can paginate while fetching data; smaller screens can request fewer messages per page.
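Cursor-based pagination on the sequence numbers is one simple approach (a sketch; `before_seq` and the message shape are assumptions):

```python
def fetch_page(messages, before_seq=None, page_size=20):
    """Return up to `page_size` messages older than `before_seq`,
    newest first. `messages` is assumed sorted by ascending seq;
    `before_seq=None` means "start from the latest"."""
    if before_seq is not None:
        messages = [m for m in messages if m["seq"] < before_seq]
    return list(reversed(messages))[:page_size]
```

The client passes the smallest `seq` it has already shown as the cursor for the next page, which avoids the skipped/duplicated rows that offset-based paging suffers when new messages arrive mid-scroll.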
Managing user's status
To track online/offline status and notify relevant users when something changes, we can do the following:
- When a client starts the app, it pulls the current status of all users in its friends list
- Whenever a user sends a message to another user who is offline, we send a failure to the sender and update the receiver's status on the sender's client
- When a user comes online, the server can broadcast that status with a delay of a few seconds, in case the user goes offline again immediately
- Clients can pull the status from the server for the users currently shown in their viewport. This should not be a frequent operation: the server broadcasts online transitions, and we can live with a stale offline status for a while
- Whenever a client starts a new chat, we can pull the status of the other user at that time
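A rough sketch of such a presence tracker; the broadcast delay is modeled by the caller deciding when to call `flush()` (a real server would use a timer), and `notify` is a hypothetical callback toward interested friends:

```python
class PresenceTracker:
    """Tracks online/offline status and batches broadcasts.

    Delaying the broadcast (here: until flush() is called) absorbs
    quick online/offline flaps, so only the final status goes out.
    """
    def __init__(self, notify):
        self.status = {}     # user_id -> "online" / "offline"
        self.pending = []    # status changes not yet broadcast
        self.notify = notify # callback(user_id, status)

    def set_status(self, user_id, status):
        self.status[user_id] = status
        self.pending.append(user_id)

    def flush(self):
        # Deduplicate, preserving order: each user is broadcast once,
        # with whatever their status is *now*.
        for user_id in dict.fromkeys(self.pending):
            self.notify(user_id, self.status[user_id])
        self.pending.clear()
```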
Design Summary
- Clients will open a connection to the chat server to send a message; the server will then pass it to the requested user.
- All the active users will keep a connection open with the server to receive messages.
- Whenever a new message arrives the chat server will push it to the receiving user over a WebSocket.
- Messages can be stored in HBase, to support quick small updates, and range based searches.
- Servers can broadcast the online status of a user to other relevant users.
- Clients can pull status updates for users who are visible in their viewport on a less frequent basis.
Partitioning
Since we need to store a huge amount of data, we need a partitioning scheme.
Partitioning on UserID
We can partition based on the hash of the UserID so that all of a user's messages live on the same DB shard. If one shard is 4 TB, we will need 3.6 PB / 4 TB ~ 900 shards for five years of data.
If we keep 1K shards, we find the shard with hash(UserID) % 1000 and then store/retrieve the data there. This keeps queries very fast.
Since we may need to store an effectively unlimited history of messages, we can start with a large number of logical partitions mapped to fewer physical servers; as our storage needs grow, we add physical servers and redistribute the logical partitions among them.
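A sketch of that logical-to-physical mapping (`partition_map` is a hypothetical lookup table that operators would rewrite when adding physical servers; user data never changes logical partition):

```python
import hashlib

LOGICAL_PARTITIONS = 1000

def logical_partition(user_id: str) -> int:
    """Stable hash of the UserID into a fixed logical partition."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % LOGICAL_PARTITIONS

def physical_server(user_id: str, partition_map: list) -> int:
    """partition_map[i] = physical server currently hosting logical
    partition i. Growing capacity only means rewriting this map and
    moving whole partitions, never re-hashing individual users."""
    return partition_map[logical_partition(user_id)]
```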
Partitioning on MessageID:
Storing a user's messages across separate shards keyed by MessageID would make fetching a range of messages very slow, so we rule this scheme out.
Cache
We can cache a few recent messages (say last 15) in a few recent conversations that are visible in a user's viewport.
LB
We will need a load balancer in front of the chat servers that maps each UserID to the server holding that user's connection and directs the request there. Similarly, we need a load balancer for the cache servers.
Fault tolerance and replication
What happens when a chat server fails?
Its clients lose their connections and will need to keep retrying until they reconnect to a server.
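Client retry logic is usually exponential backoff with jitter, so thousands of dropped clients don't all hammer the replacement server at the same instant; a sketch with illustrative parameters:

```python
import random

def backoff_delays(max_attempts=6, base=1.0, cap=30.0):
    """Reconnect delays: 1s, 2s, 4s, ... capped at `cap`, each scaled
    by a random jitter factor in [0.5, 1.0] to spread clients out."""
    delays = []
    for attempt in range(max_attempts):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay * random.uniform(0.5, 1.0))
    return delays
```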
Multiple copies of user messages?
We can either keep a backup of each message or use techniques like Reed-Solomon encoding to distribute and replicate the data.
Extended requirements
Group chat
We can have a group-chat object identified by a GroupChatID that contains a list of users. When a message is sent to the group, the load balancer sends it to the server managing that group, which then sends it to each member. We can store group chats in a separate table partitioned on GroupChatID.
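The fan-out can be sketched as follows, reusing a hypothetical per-user `deliver` hook from the one-on-one path:

```python
class GroupChat:
    """Group-chat object: GroupChatID plus a member list. Delivery
    reuses the one-on-one path once per member (except the sender)."""
    def __init__(self, group_id, members):
        self.group_id = group_id
        self.members = set(members)

    def fan_out(self, sender, message, deliver):
        """`deliver(user_id, message)` is the per-user delivery hook
        (the same lookup-connection-and-push path as one-on-one chat)."""
        for member in self.members:
            if member != sender:
                deliver(member, message)
```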
Push Notification
Users can opt in from their device to receive notifications whenever there is a new message or event. To support this, we need a notification server that takes messages addressed to offline users and hands them to the platform's push mechanism.