12: DESIGN A CHAT SYSTEM - swchen1234/systemDesign GitHub Wiki

Step 1 - Understand the problem and establish design scope

Questions to ask:

  • 1 on 1 or group based?
  • mobile or web?
  • scale? DAU supported?
  • group member limit for group chat?
  • important features?
  • message size limit?
  • e2e encryption required?
  • chat storage: how long should chat history be kept?

Step 2 - Propose high-level design and get buy-in

In the design below, we focus on a chat app like Facebook Messenger, with an emphasis on the following features:

  • A one-on-one chat with low delivery latency
  • Small group chat (max of 100 people)
  • Online presence
  • Multiple device support. The same account can be logged in on multiple devices at the same time.
  • Push notifications
  • 50 million DAU

Relationship between the chat service and the sender/receiver

sender -> service

When the sender sends a message to the receiver via the chat service, it uses the time-tested HTTP protocol, the most common web protocol. The client opens an HTTP connection with the chat service and sends the message, asking the service to deliver it to the receiver. The keep-alive header makes this efficient: it lets the client maintain a persistent connection with the chat service, which reduces the number of TCP handshakes. HTTP is a fine option on the sender side, and many popular chat applications such as Facebook [1] initially used HTTP to send messages.

service -> receiver

The receiver side is a bit more complicated. Since HTTP is client-initiated, it is not trivial to send messages from the server. Common approaches:

  1. Polling: the client periodically asks the server whether messages are available. Depending on the polling frequency, this can be costly, and most requests return nothing.

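  2. Long polling: the client holds the connection open until new messages are actually available or a timeout threshold is reached. Once the client receives new messages, it immediately sends another request, restarting the process. Cons:

  • The server has no good way to tell whether a client has disconnected.
  • It is inefficient: a user who rarely chats still makes a new connection after every timeout.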
  • Sender and receiver may not connect to the same chat server. HTTP based servers are usually stateless. If you use round robin for load balancing, the server that receives the message might not have a long-polling connection with the client who receives the message.
  3. WebSocket: the most common approach
  • The connection is initiated by the client.
  • It is bi-directional and persistent.
  • It starts its life as an HTTP connection and can be “upgraded” via a well-defined handshake to a WebSocket connection.
  • Through this persistent connection, a server can send updates to a client.
  • The sender side can also use WebSocket.
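The difference between polling and long polling can be sketched with an in-memory mailbox. This is a minimal illustration rather than a real HTTP server: `Mailbox`, `poll`, and `long_poll` are hypothetical names, and a `queue.Queue` stands in for the server-side buffer of undelivered messages.

```python
import queue
import threading


class Mailbox:
    """Hypothetical in-memory stand-in for one receiver's pending messages."""

    def __init__(self):
        self._pending = queue.Queue()

    def deliver(self, message):
        # Called when a sender's message reaches the chat service.
        self._pending.put(message)

    def poll(self):
        # Short polling: answer immediately, usually with nothing.
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None

    def long_poll(self, timeout_seconds):
        # Long polling: hold the request open until a message arrives
        # or the timeout threshold is reached.
        try:
            return self._pending.get(timeout=timeout_seconds)
        except queue.Empty:
            return None


mailbox = Mailbox()
empty_reply = mailbox.poll()  # nothing is queued yet

# Deliver a message shortly after the long-poll request starts waiting.
timer = threading.Timer(0.05, mailbox.deliver, args=("hello",))
timer.start()
reply = mailbox.long_poll(timeout_seconds=2.0)
timer.join()

timed_out = mailbox.long_poll(timeout_seconds=0.05)  # no message arrives
```

In this sketch the short poll returns empty-handed, while the long poll returns as soon as the message is delivered. A WebSocket connection removes the need to re-issue the request at all, because the server can push over the already-open connection.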

High-level design

Most features (sign up, login, user profile, etc) of a chat application could use the traditional request/response method over HTTP.

The chat system is broken down into three major categories:

  • Stateless services
  • Stateful services
  • Third-party integration. The most important one is push notification.

Scalability

Even at the scale we design for, it is in theory possible to fit all user connections in one modern cloud server. However, a single-server design is a poor final answer because it is a single point of failure; it is acceptable only as a starting point. In the scaled-out design, each component has its own role:

  • Chat servers facilitate message sending/receiving.
  • Presence servers manage online/offline status.
  • API servers handle everything including user login, signup, change profile, etc.
  • Notification servers send push notifications.
  • The key-value store is used to store chat history. When an offline user comes online, she will see all her previous chat history.

Storage

Data falls into two categories:

  1. Generic data, such as user profiles, settings, and friends lists. This is usually stored in a relational database.
  2. Chat history data
  • The volume is very large.
  • Only recent data is accessed frequently.
  • Some features need random access to data, e.g. search and jump to message.
  • The read-to-write ratio is almost 1:1.

Given these access patterns, a key-value store is a good fit: it scales horizontally with ease, offers low latency, and is widely used by chat applications.
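The two read patterns above (recent messages and jump to message) can be sketched against a toy key-value layout. This is only an illustration under stated assumptions: `ChatHistoryStore` and its methods are hypothetical, plain dictionaries stand in for the key-value store, and message IDs are assumed to increase with time.

```python
import bisect
from collections import defaultdict


class ChatHistoryStore:
    """Toy key-value layout: (channel_id, message_id) -> message text."""

    def __init__(self):
        self._ids = defaultdict(list)  # channel_id -> sorted message IDs
        self._bodies = {}              # (channel_id, message_id) -> text

    def put(self, channel_id, message_id, text):
        key = (channel_id, message_id)
        if key not in self._bodies:
            bisect.insort(self._ids[channel_id], message_id)
        self._bodies[key] = text

    def recent(self, channel_id, limit):
        # The common case: only the newest messages are read.
        if limit <= 0:
            return []
        ids = self._ids[channel_id][-limit:]
        return [self._bodies[(channel_id, i)] for i in ids]

    def jump_to(self, channel_id, message_id, window=1):
        # Random access: the target message plus `window` neighbours per side.
        ids = self._ids[channel_id]
        pos = bisect.bisect_left(ids, message_id)
        start = max(0, pos - window)
        return [self._bodies[(channel_id, i)] for i in ids[start:pos + window + 1]]


store = ChatHistoryStore()
for n in range(1, 6):
    store.put("channel-1", n, f"msg{n}")

latest = store.recent("channel-1", limit=2)
around_three = store.jump_to("channel-1", message_id=3)
```

Here `latest` holds the two newest messages and `around_three` holds message 3 with one neighbour on each side.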

Data models

  • Message table for 1 on 1 chat
  • Message table for group chat
  • Message ID
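  • Must be unique.
  • Must be sortable by time. Two options: 1) use a global 64-bit sequence number generator (e.g. Snowflake); 2) use a local sequence number generator. Local means IDs are unique only within a group, which is sufficient because message order only needs to be maintained within a single channel.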
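Both ID options can be sketched in a few lines. This is a simplified illustration, not a production generator: the class names are hypothetical, the bit split follows the commonly described Snowflake layout (41-bit timestamp, 10-bit machine ID, 12-bit sequence), and the clock is injected so the behaviour is deterministic.

```python
import itertools
import threading
from collections import defaultdict


class SnowflakeLikeGenerator:
    """Global IDs laid out as: timestamp (41 bits) | machine (10) | sequence (12)."""

    MACHINE_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, machine_id, clock_ms, epoch_ms=0):
        self._machine_id = machine_id
        self._clock_ms = clock_ms  # injected clock (assumption, for testability)
        self._epoch_ms = epoch_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock_ms() - self._epoch_ms
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence += 1
                if self._sequence >= 1 << self.SEQUENCE_BITS:
                    # A real generator would wait for the next millisecond.
                    raise RuntimeError("sequence exhausted for this millisecond")
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                (now << (self.MACHINE_BITS + self.SEQUENCE_BITS))
                | (self._machine_id << self.SEQUENCE_BITS)
                | self._sequence
            )


class LocalSequenceGenerator:
    """Per-channel counters: IDs are unique and ordered only within a channel."""

    def __init__(self):
        self._counters = defaultdict(lambda: itertools.count(1))

    def next_id(self, channel_id):
        return next(self._counters[channel_id])


fake_times = iter([1000, 1000, 1001])
global_gen = SnowflakeLikeGenerator(machine_id=7, clock_ms=lambda: next(fake_times))
global_ids = [global_gen.next_id() for _ in range(3)]

local_gen = LocalSequenceGenerator()
group_a_ids = [local_gen.next_id("group-a") for _ in range(3)]
group_b_first = local_gen.next_id("group-b")
```

The local generator is much simpler, at the cost of IDs that repeat across channels; both `group-a` and `group-b` start from 1.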

Step 3 - Design Deep Dive

Service discovery

The primary role of service discovery is to recommend the best chat server for a client based on criteria such as geographical location and server capacity. Apache Zookeeper is widely used for this.

  1. After the backend authenticates the user, service discovery finds the best chat server for User A. In this example, server 2 is chosen and the server info is returned to User A.
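The selection step can be sketched as a ranking over candidate servers. This does not use the Zookeeper API; `ChatServer` and `pick_chat_server` are hypothetical names, and the ranking rule (same region first, then lowest load) is an assumption chosen to reflect the two criteria mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatServer:
    name: str
    region: str
    capacity: int      # maximum connections the server accepts
    connections: int   # connections currently open

    @property
    def load(self):
        return self.connections / self.capacity


def pick_chat_server(servers, client_region):
    """Prefer servers in the client's region, then the least loaded one."""
    candidates = [s for s in servers if s.connections < s.capacity]
    if not candidates:
        raise RuntimeError("no chat server has spare capacity")
    # Sort key: same-region servers first (False < True), then lowest load.
    return min(candidates, key=lambda s: (s.region != client_region, s.load))


servers = [
    ChatServer("server-1", "us-east", capacity=1000, connections=900),
    ChatServer("server-2", "us-east", capacity=1000, connections=300),
    ChatServer("server-3", "eu-west", capacity=1000, connections=100),
    ChatServer("server-4", "ap-south", capacity=1000, connections=1000),
]
chosen = pick_chat_server(servers, client_region="us-east")
fallback = pick_chat_server(servers, client_region="ap-south")
```

With this sample data a `us-east` client is sent to `server-2`, while an `ap-south` client falls back to the least loaded server elsewhere because `server-4` is full.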

Message flows

1 on 1 chat flow

Message synchronization across multiple devices

Small group chat flow

On the recipient side, a recipient can receive messages from multiple users. Each recipient has an inbox (message sync queue) which contains messages from different senders.
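The per-recipient inbox can be sketched as a fan-out on write: each group message is copied into the message sync queue of every other member. `GroupChat` and its methods are hypothetical names, and in-memory deques stand in for the real queues.

```python
from collections import defaultdict, deque


class GroupChat:
    """Copies each group message into every other member's inbox."""

    def __init__(self):
        self.members = defaultdict(set)    # group_id -> set of user IDs
        self.inboxes = defaultdict(deque)  # user_id -> message sync queue

    def join(self, group_id, user_id):
        self.members[group_id].add(user_id)

    def send(self, group_id, sender_id, text):
        for member in sorted(self.members[group_id]):
            if member != sender_id:  # the sender does not need a copy
                self.inboxes[member].append((group_id, sender_id, text))


chat = GroupChat()
for user in ("user_a", "user_b", "user_c"):
    chat.join("group-1", user)

chat.send("group-1", "user_a", "hi")
chat.send("group-1", "user_b", "hello")

inbox_c = list(chat.inboxes["user_c"])
inbox_a = list(chat.inboxes["user_a"])
```

Each send costs one write per member, which is why this layout suits small groups: a client only has to read its own inbox to catch up, but the write cost grows with group size.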

Online presence

Presence servers are responsible for managing online status and communicating with clients through WebSocket.

User login

After a WebSocket connection is built between the client and the real-time service, user A’s online status and last_active_at timestamp are saved in the KV store.
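The login write can be sketched with a dictionary standing in for the KV store. The function names and the `presence:<user_id>` key format are hypothetical, and the logout handler is an assumption about how the status would be flipped when the connection closes.

```python
import time

kv_store = {}  # a plain dict standing in for the KV store


def on_websocket_connected(user_id, now=None):
    """Record that the user is online once the WebSocket connection is built."""
    now = time.time() if now is None else now
    kv_store[f"presence:{user_id}"] = {"status": "online", "last_active_at": now}


def on_websocket_closed(user_id, now=None):
    """Assumed logout handling: mark the user offline and keep a timestamp."""
    now = time.time() if now is None else now
    kv_store[f"presence:{user_id}"] = {"status": "offline", "last_active_at": now}


on_websocket_connected("user_a", now=1_700_000_000.0)
after_login = dict(kv_store["presence:user_a"])

on_websocket_closed("user_a", now=1_700_000_600.0)
after_logout = dict(kv_store["presence:user_a"])
```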

User logout

Again, this approach only works for small groups.

Step 4 - Wrap up

  • Support for media files (e.g. photos and videos): compression, cloud storage, and thumbnails.
  • End-to-end encryption. Only the sender and the recipient can read messages.
  • Caching messages on the client side reduces data transfer between the client and server.
  • Improving load time. Slack built a geographically distributed network to cache users’ data.
  • Error handling
  • Chat server errors: if a chat server goes offline, service discovery (Zookeeper) will provide a new chat server for clients to establish new connections with.
  • Message resend mechanism: retry and queueing.
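Retry and queueing can be sketched as a queue that re-enqueues failed sends up to a bounded number of attempts. `ResendQueue` and `flaky_send` are hypothetical names, and the limit of three attempts is an assumption; a real client would typically also wait between attempts.

```python
from collections import deque


class ResendQueue:
    """Queues outgoing messages and retries failed sends a bounded number of times."""

    def __init__(self, send, max_attempts=3):
        self._send = send          # callable(message); raises ConnectionError on failure
        self._max_attempts = max_attempts
        self._pending = deque()    # (message, attempts made so far)
        self.failed = []           # messages that exhausted their attempts

    def submit(self, message):
        self._pending.append((message, 0))

    def pending_count(self):
        return len(self._pending)

    def flush(self):
        """Make one delivery attempt for each message queued at call time."""
        for _ in range(len(self._pending)):
            message, attempts = self._pending.popleft()
            try:
                self._send(message)
            except ConnectionError:
                attempts += 1
                if attempts < self._max_attempts:
                    self._pending.append((message, attempts))  # retry later
                else:
                    self.failed.append(message)


delivered = []
failures_left = {"m1": 2, "m2": 0, "m3": 5}  # how many times each send fails


def flaky_send(message):
    if failures_left[message] > 0:
        failures_left[message] -= 1
        raise ConnectionError("chat server unreachable")
    delivered.append(message)


resend_queue = ResendQueue(flaky_send, max_attempts=3)
for message in ("m1", "m2", "m3"):
    resend_queue.submit(message)
for _ in range(3):
    resend_queue.flush()
```

In this run `m2` is delivered on the first pass, `m1` succeeds on its third attempt, and `m3` is moved to `failed` after exhausting its attempts.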