Instant Messaging Platform - ashtishad/system-design GitHub Wiki

1. Requirements

Functional Requirements:

Users can send/receive 1:1 messages in real time.
Users can create/join group chats (up to 1K members).
Users can view message history.

Non-Functional Requirements:

Real-Time Delivery: <1s latency for messages.
Scalability: Handle surges (e.g., 10% peak events like New Year’s).
Availability: 99.99% uptime.
Durability: No message loss over 5 years.
Capacity Estimation (5 years):
- DAU: 500M users.
- Messages: 100B/day * 365 * 5 = 182.5T messages.
- Storage:
  - Message: 1KB (text + metadata) * 182.5T = 182.5PB raw, ~1.8EB with replication/indexes.
  - Users: 500M * 1KB = 500GB.
- QPS:
  - Avg: 100B/day ÷ 86,400s ≈ 1.16M QPS.
  - Peak (10% events): 50M users * 20 msg/user ÷ 2,400s (busy hour) ≈ 416K QPS.

2. Core Entities

User: {id, username, phone, status}
Message: {id, sender_id, receiver_id/group_id, content, timestamp, status (sent/delivered/read)}
Group: {id, name, member_ids[]}

3. APIs

POST /messages/send: {receiver_id/group_id, content}, sends message.
GET /messages/{user_id}?since={timestamp}: Returns message history.
POST /groups/create: {name, member_ids[]}, creates group.

4. High-Level Design

Client: Mobile/web app.
API Gateway: Routes, auth, rate-limiting.
Microservices:
- Chat Service: Manages 1:1 and group messaging.
- Group Service: Handles group metadata.
- Storage Service: Persists messages/users.
Data Stores:
- Cassandra: Messages/users (wide-column, high write throughput).
- Redis: Real-time delivery (pub/sub), caching.
Flow:
- Send → Chat Service → Redis pub/sub → Cassandra.
- History → Storage Service → Cassandra.
- Group → Group Service → Cassandra.

Why Cassandra?

Cassandra is chosen for its high write throughput (millions/sec), horizontal scalability, and tunable consistency. Messaging platforms need fast, durable writes (1.16M QPS avg) and can tolerate eventual consistency for reads (e.g., history). Its partition tolerance suits global distribution, critical for 500M users.

5. Deep Dives

1. Real-Time Message Delivery

Problem: Deliver messages in <1s at 416K QPS peak.
Solution:
- Redis Pub/Sub: Sender publishes to channel:user_id, receivers subscribe.
- Fallback: WebSockets for persistent connections.
Industry Example (WhatsApp): WhatsApp uses XMPP (modified) with persistent TCP connections. Messages route via a gateway to the recipient’s device if online; otherwise, queued. We use Redis for simplicity/scalability, WebSockets for reliability.
Tech Details:
- Redis: 100K ops/s per node, scales to 416K QPS with 5 nodes.
- WebSockets: 1M connections/server, heartbeat every 30s.
Tradeoffs: Redis is fast but volatile (mitigated by Cassandra persistence); WebSockets are reliable but resource-intensive.

2. Group Chat Scalability

Problem: Support 1K-member groups at scale.
Solution:
- Fan-Out on Write: Write message to each member’s inbox in Cassandra.
- Redis Cache: Store active group metadata (member list).
Industry Example (Telegram): Telegram uses a “supergroup” model, sharding large groups across servers. Messages fan out to inboxes with MTProto. We adopt fan-out for simplicity, caching for speed.
Tech Details:
- Cassandra: Partition by user_id, 416K writes/s split across 100 nodes (~4K/node).
- Redis: 1K members * 100B groups = 100GB cache.
Tradeoffs: Fan-out scales writes linearly (costly for big groups); read-based fan-out reduces writes but delays delivery.

3. Durability & Storage

Problem: Store 1.8EB over 5 years.
Solution:
- Cassandra: Multi-DC replication (3x), TTL=5 years for old messages.
- Cold Storage: Move >1-year-old data to S3.
Tech Details:
- Cassandra: 182.5PB ÷ 100 nodes = 1.8PB/node, 3x replication = 5.4PB/node.
- S3: 100PB, cheaper ($0.023/GB vs. $0.10/GB for Cassandra).
Tradeoffs: Cassandra ensures fast access but is expensive; S3 saves cost but slows retrieval.

4. Scalability for Surges

Problem: 416K QPS during peaks.
Solution:
- Load Balancers: Auto-scale Chat Service (CPU > 70%).
- Sharding: Cassandra by user_id, Redis by channel_id.
Tech Details:
- 416K QPS ÷ 100 shards = 4.16K QPS/shard, fits single node.
- Auto-scaling: Add 1 node/10K QPS, ~50 nodes at peak.

5. High Availability

Problem: 99.99% uptime.
Solution:
- Multi-Region: Cassandra (3 DCs), Redis sentinel.
- Fallback: Queue undelivered messages in Kafka.
Tech Details:
- Cassandra: Quorum reads/writes, 5ms latency.
- Kafka: 1M msg/s, 1-day retention.

Summary of Solutions and Industry Practices

Authentication: PostgreSQL + JWT + Redis – WhatsApp’s secure login.
Real-time Delivery: WebSockets + Kafka – Slack’s instant messaging.
Persistence: Cassandra + Sharding – WhatsApp’s scalable storage.
Group Messaging: PostgreSQL + Kafka + Redis – Telegram’s group efficiency.
Notifications: FCM + SQS + Retry – Slack’s reliable alerts.
Security: Signal Protocol + GDPR – WhatsApp’s privacy focus.