Messenger Clone
Requirements and goals
Functional requirements
- one-on-one conversations between users
- keep track of online/offline status of users
- store chat history
Non-functional requirements:
- Real-time chat with minimal latency
- Highly consistent: users should see the same chat history on all devices
- Highly available, though availability can be sacrificed for consistency
Extra
- Support group chat
- Notify users of new messages when they are offline
Capacity estimation and constraints
- Assume 500M daily active users
- Assume on avg, each user sends 40 messages daily
- => 20B messages per day.
Storage estimation
- Assume each msg is 100 bytes; to store all the messages we need 20B * 100 bytes = 2 TB/day
- To store five years we need 2 TB * 365 days * 5 years ~ 3.6 PB
- We also need to store user info and message metadata (ID, timestamp). The calculation above also ignores data compression and replication.
Bandwidth Estimation
- With 2 TB of new data every day, we have roughly 25 MB/s of incoming data (2 TB / 86,400 sec ≈ 23 MB/s, rounded up)
- Each message has to go to at least one other user, so we will have roughly 25 MB/s of outgoing data too
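These back-of-the-envelope figures can be sanity-checked with a few lines of arithmetic (the constants mirror the assumptions above):

```python
# Sanity check of the capacity, storage, and bandwidth estimates above.
DAILY_ACTIVE_USERS = 500_000_000
MESSAGES_PER_USER_PER_DAY = 40
BYTES_PER_MESSAGE = 100
SECONDS_PER_DAY = 86_400

messages_per_day = DAILY_ACTIVE_USERS * MESSAGES_PER_USER_PER_DAY
storage_per_day = messages_per_day * BYTES_PER_MESSAGE          # bytes
storage_five_years = storage_per_day * 365 * 5                  # bytes
bandwidth_bytes_per_sec = storage_per_day / SECONDS_PER_DAY

print(f"{messages_per_day:,} messages/day")             # 20,000,000,000
print(f"{storage_per_day / 1e12:.1f} TB/day")           # 2.0 TB/day
print(f"{storage_five_years / 1e15:.2f} PB / 5 years")  # 3.65 PB (quoted as ~3.6 PB)
print(f"{bandwidth_bytes_per_sec / 1e6:.0f} MB/s")      # ~23 MB/s each way
```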
High Level Design
We will have a chat server that orchestrates communication between users. When a user wants to send a message, they connect to the chat server and send the message to it. The server then passes the message on to the other user and stores it in a DB. The server sends acknowledgments to the sender when a message is received by the server and again when it is delivered to the destination. Clients send acknowledgments to the server each time they receive a message.
Detailed Component Design
Message Handling
We need to decide how to send and receive data efficiently.
To send a message, a user connects to a server and posts the message for the other user. To get messages from the server, a client can either poll periodically or keep an open connection that the server uses to push new messages. Polling does not scale: it generates traffic even when nothing is being exchanged between users, and we would have to poll frequently to keep latency low. We therefore prefer a push model in which the server maintains a connection with each client.
How to maintain an open connection with the server
We can use long polling or WebSocket. WebSocket is faster and uses fewer resources, so we prefer it in this case.
How do we keep track of open connections so we can route messages efficiently?
The server maintains a hash table where the key is the UserID and the value is the connection object. Whenever the server receives a message for a user, it looks that user up in the table to find the connection and sends the message over it.
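A minimal sketch of that lookup table, assuming a hypothetical `Connection` object in place of the real WebSocket handle:

```python
# Sketch of the per-server connection table: UserID -> connection object.
# `Connection` is a hypothetical stand-in for a real WebSocket handle.
class Connection:
    def __init__(self):
        self.outbox = []

    def send(self, message):
        self.outbox.append(message)


class ChatServer:
    def __init__(self):
        self.connections = {}  # UserID -> Connection

    def register(self, user_id, conn):
        self.connections[user_id] = conn

    def deliver(self, user_id, message):
        """Look up the receiver's connection and push the message.
        Returns False when the user has no open connection (offline)."""
        conn = self.connections.get(user_id)
        if conn is None:
            return False
        conn.send(message)
        return True
```

The `deliver` return value is what lets the server fall back to the offline-user handling described below.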
What to do when a server receives a message for a user that is offline?
If the receiver has disconnected, the server can notify the sender of the delivery failure. We can either embed retry logic in the client, so users don't have to retype messages, or store the message on the server and retry sending once the receiver reconnects.
How many chat servers do we need?
If we plan for 500M connections at any time, and assume a modern server can handle 50K concurrent connections, we need at least 10K servers.
How do we know which server holds the connection to which user?
We can use a software load balancer in front of the chat servers that maps each UserID to a server and redirects the request there.
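A simple version of that mapping, sketched with plain hash-mod (`server_for_user` is a hypothetical helper; a consistent-hashing ring would be the usual production choice so that adding servers doesn't remap every user):

```python
import hashlib

def server_for_user(user_id: str, num_servers: int) -> int:
    """Map a UserID to a chat-server index.

    Uses a stable hash (not Python's built-in hash(), which is
    randomized per process), so every load balancer instance agrees
    on the same mapping.
    """
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_servers
```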
How does a server process a "deliver message" request?
Upon receiving a message:
- The server stores the message in the DB first to avoid loss (we don't need to wait for this write to complete)
- It sends the message to the receiver (finding the server that holds the receiver's connection and passing the message to it)
- It acknowledges the message to the sender
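The three steps above can be sketched as one handler; `store_async`, `forward`, and `ack` are hypothetical hooks standing in for the DB write, the server-to-server hop, and the sender acknowledgment:

```python
def handle_send(message, store_async, forward, ack):
    """Process a "deliver message" request:
    1. hand the message to durable storage (fire-and-forget: we do
       not block on the write completing),
    2. forward it toward the server holding the receiver's connection,
    3. acknowledge receipt back to the sender.
    """
    store_async(message)                # kick off the DB write
    forward(message)                    # route to the receiver's server
    ack(message["sender"], "received")  # tell the sender we have it
```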
How do we maintain sequencing of messages?
We will keep a sequence number for every message for each client. Different users may see a different sequence for the same conversation, but the experience will be consistent across all of a user's devices.
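One way to implement this, assuming the server assigns per-user monotonically increasing sequence numbers (a sketch, not the only possible scheme):

```python
from collections import defaultdict
import itertools

class Sequencer:
    """Per-client monotonically increasing sequence numbers.

    Two users in the same conversation may see different numbers for
    the same message, but all of one user's devices replay messages
    in a single consistent order.
    """
    def __init__(self):
        # Each user gets an independent counter starting at 1.
        self._counters = defaultdict(lambda: itertools.count(1))

    def next_for(self, user_id):
        return next(self._counters[user_id])
```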
Storing and retrieving messages from the db
Whenever the server receives a new message, it stores it in the DB
- We need a DB that supports a huge rate of small message inserts and fast sequential (range) reads of messages. A wide-column store like HBase fits: it can store multiple values against one key in multiple columns.
How can clients efficiently fetch data from server?
Clients can paginate while fetching data; smaller screens can request fewer messages per page.
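Cursor-based pagination on the sequence numbers is one simple approach (a sketch; `before_seq` and the message shape are assumptions):

```python
def fetch_page(messages, before_seq=None, page_size=20):
    """Return up to `page_size` messages older than `before_seq`,
    newest first. `messages` is assumed sorted by ascending seq;
    `before_seq=None` means "start from the latest"."""
    if before_seq is not None:
        messages = [m for m in messages if m["seq"] < before_seq]
    return list(reversed(messages))[:page_size]
```

The client passes the smallest `seq` it has already shown as the cursor for the next page, which avoids the skipped/duplicated rows that offset-based paging suffers when new messages arrive mid-scroll.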
Managing user's status
To track online/offline status and notify relevant users when something changes, we can do the following:
- When a client starts the app, it pulls the current status of all users in its friends list
- Whenever a user sends a message to another user who is offline, we send a failure to the sender and update the receiver's status on the sender's client
- When a user comes online, the server can broadcast that status with a delay of a few seconds, in case the user goes offline again immediately
- Clients can pull the status from the server for the users currently shown in their viewport. This should not be a frequent operation: the server broadcasts online transitions, and we can live with a stale offline status for a while
- Whenever a client starts a new chat, we can pull the status of the other user at that time
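A rough sketch of such a presence tracker; the broadcast delay is modeled by the caller deciding when to call `flush()` (a real server would use a timer), and `notify` is a hypothetical callback toward interested friends:

```python
class PresenceTracker:
    """Tracks online/offline status and batches broadcasts.

    Delaying the broadcast (here: until flush() is called) absorbs
    quick online/offline flaps, so only the final status goes out.
    """
    def __init__(self, notify):
        self.status = {}     # user_id -> "online" / "offline"
        self.pending = []    # status changes not yet broadcast
        self.notify = notify # callback(user_id, status)

    def set_status(self, user_id, status):
        self.status[user_id] = status
        self.pending.append(user_id)

    def flush(self):
        # Deduplicate, preserving order: each user is broadcast once,
        # with whatever their status is *now*.
        for user_id in dict.fromkeys(self.pending):
            self.notify(user_id, self.status[user_id])
        self.pending.clear()
```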
Design Summary
- Clients will open a connection to the chat server to send a message; the server will then pass it to the requested user.
- All the active users will keep a connection open with the server to receive messages.
- Whenever a new message arrives the chat server will push it to the receiving user over a WebSocket.
- Messages can be stored in HBase, to support quick small updates, and range based searches.
- Servers can broadcast the online status of a user to other relevant users.
- Clients can pull status updates for users who are visible in their viewport on a less frequent basis.
Partitioning
Since we need to store a huge amount of data, we need a partitioning scheme.
Partitioning on UserID
We can partition based on the hash of the UserID so that all of a user's messages live on the same DB shard. If one shard is 4 TB, we will need 3.6 PB / 4 TB ~ 900 shards for five years of data.
If we keep 1K shards, we find the shard with hash(UserID) % 1000 and then store/retrieve the data there. This keeps queries very fast.
Since we may need to store an effectively unlimited history of messages, we can start with a large number of logical partitions mapped to fewer physical servers; as our storage needs grow, we add physical servers and redistribute the logical partitions among them.
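A sketch of that logical-to-physical mapping (`partition_map` is a hypothetical lookup table that operators would rewrite when adding physical servers; user data never changes logical partition):

```python
import hashlib

LOGICAL_PARTITIONS = 1000

def logical_partition(user_id: str) -> int:
    """Stable hash of the UserID into a fixed logical partition."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % LOGICAL_PARTITIONS

def physical_server(user_id: str, partition_map: list) -> int:
    """partition_map[i] = physical server currently hosting logical
    partition i. Growing capacity only means rewriting this map and
    moving whole partitions, never re-hashing individual users."""
    return partition_map[logical_partition(user_id)]
```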
Partitioning on MessageID:
Storing a user's messages across separate shards keyed by MessageID would make fetching a range of messages very slow, so we rule this scheme out.
Cache
We can cache a few recent messages (say last 15) in a few recent conversations that are visible in a user's viewport.
LB
We will need a load balancer in front of the chat servers that maps each UserID to the server holding that user's connection and directs the request there. Similarly, we need a load balancer for the cache servers.
Fault tolerance and replication
What happens when a chat server fails?
Its clients lose their connections and will need to keep retrying until they reconnect to a server.
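Client retry logic is usually exponential backoff with jitter, so thousands of dropped clients don't all hammer the replacement server at the same instant; a sketch with illustrative parameters:

```python
import random

def backoff_delays(max_attempts=6, base=1.0, cap=30.0):
    """Reconnect delays: 1s, 2s, 4s, ... capped at `cap`, each scaled
    by a random jitter factor in [0.5, 1.0] to spread clients out."""
    delays = []
    for attempt in range(max_attempts):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay * random.uniform(0.5, 1.0))
    return delays
```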
Multiple copies of user messages?
We can either keep a backup of each message or use techniques like Reed-Solomon encoding to distribute and replicate the data.
Extended requirements
Group chat
We can have a group-chat object identified by a GroupChatID that contains a list of users. When a message is sent to the group, the load balancer sends it to the server managing that group, which then sends it to each member. We can store group chats in a separate table partitioned on GroupChatID.
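The fan-out can be sketched as follows, reusing a hypothetical per-user `deliver` hook from the one-on-one path:

```python
class GroupChat:
    """Group-chat object: GroupChatID plus a member list. Delivery
    reuses the one-on-one path once per member (except the sender)."""
    def __init__(self, group_id, members):
        self.group_id = group_id
        self.members = set(members)

    def fan_out(self, sender, message, deliver):
        """`deliver(user_id, message)` is the per-user delivery hook
        (the same lookup-connection-and-push path as one-on-one chat)."""
        for member in self.members:
            if member != sender:
                deliver(member, message)
```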
Push Notification
Users can opt in from their device to receive notifications whenever there is a new message or event. To support this, we need a notification server that takes messages addressed to offline users and hands them to the platform's push mechanism.