Food Delivery - ashtishad/system-design GitHub Wiki

1. Requirements

Functional Requirements:

Users can browse restaurants and menus.
Users can order food (place order, confirm payment).
Users can track delivery in real time.

Non-Functional Requirements (Brief):

Consistency: No duplicate orders or payments.
Scalability: Handle 10M DAU, with peak surges driven by 10% of users ordering in a short window (e.g., weekends).
Low Latency: Order <500ms, tracking <1s.
Durability: No data loss over 5 years.

Capacity Estimation (5 years):
- DAU: 10M users.
- Orders: 5M/day * 365 * 5 = 9.125B orders.
- Restaurants: 100K * 50 menu items = 5M items.
- Storage:
  - Orders: 9.125B * 1KB = 9.125TB.
  - Restaurants/Menus: 5M * 500B = 2.5GB.
  - Total: ~10TB raw, ~100TB with replication/indexes (a ~10x overhead factor).
- QPS: Avg: 5M/day ÷ 86,400s ≈ 58 QPS; Peak: 10% of DAU (1M users) * 5 orders each within a 40-minute window (2,400s) ≈ 2,083 QPS.
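The estimates above can be reproduced in a few lines. This is a minimal sketch that uses only the stated assumptions (5M orders/day, 1KB per order, 500B per menu item, decimal units); the variable names are illustrative.

```python
# Back-of-envelope capacity figures from the stated assumptions.
ORDERS_PER_DAY = 5_000_000
YEARS = 5
ORDER_SIZE_BYTES = 1_000            # 1 KB per order (decimal units)
MENU_ITEMS = 100_000 * 50           # restaurants * menu items per restaurant
MENU_ITEM_SIZE_BYTES = 500

total_orders = ORDERS_PER_DAY * 365 * YEARS
order_storage_tb = total_orders * ORDER_SIZE_BYTES / 1e12
menu_storage_gb = MENU_ITEMS * MENU_ITEM_SIZE_BYTES / 1e9

avg_qps = ORDERS_PER_DAY / 86_400
# Peak assumption: 1M users * 5 orders inside a 2,400 s window.
peak_qps = 1_000_000 * 5 / 2_400

print(f"orders over 5 years: {total_orders:,}")
print(f"order storage: {order_storage_tb} TB, menu storage: {menu_storage_gb} GB")
print(f"avg QPS: {avg_qps:.0f}, peak QPS: {peak_qps:.0f}")
```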

2. Core Entities

User: {id, name, address, payment_method}
Restaurant: {id, name, location (lat/lon), menu_items[]}
MenuItem: {id, restaurant_id, name, price}
Order: {id, user_id, restaurant_id, items[], total_amount, status (pending/confirmed/delivered), idempotency_key, timestamp}
Delivery: {id, order_id, driver_id, status (assigned/en_route/delivered), location (lat/lon), timestamp}
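
As an illustration, the Order entity could be modeled as below. This is a sketch only: the OrderItem type, the integer ids, and storing prices in cents are assumptions not stated above, and total_amount is derived from the items here rather than stored.

```python
from dataclasses import dataclass
from enum import Enum


class OrderStatus(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    DELIVERED = "delivered"


@dataclass(frozen=True)
class OrderItem:
    # Hypothetical line-item shape; the entity list only says items[].
    menu_item_id: int
    quantity: int
    unit_price_cents: int


@dataclass
class Order:
    id: int
    user_id: int
    restaurant_id: int
    items: list[OrderItem]
    idempotency_key: str
    status: OrderStatus = OrderStatus.PENDING

    @property
    def total_amount_cents(self) -> int:
        # Integer cents avoid floating-point rounding on money.
        return sum(i.quantity * i.unit_price_cents for i in self.items)


order = Order(
    id=1,
    user_id=7,
    restaurant_id=3,
    items=[OrderItem(10, 2, 450), OrderItem(11, 1, 300)],
    idempotency_key="key-abc",
)
print(order.status.value, order.total_amount_cents)
```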

3. APIs

GET /restaurants?location={lat},{lon}&radius={radius}: Returns nearby restaurants.
GET /restaurants/{restaurant_id}/menu: Returns menu items.
POST /orders/place: {restaurant_id, items[], idempotency_key}, places order.
POST /orders/confirm: {order_id, payment_method}, confirms payment.
GET /orders/{order_id}/track: Returns delivery status/location.

4. High-Level Design

Client: Mobile/web app.
API Gateway: Routes, auth (JWT), rate-limiting.
Microservices:
- Restaurant Service: Manages restaurant/menu browsing.
- Order Service: Handles order placement/confirmation.
- Delivery Service: Tracks delivery status.
Data Stores:
- PostgreSQL: Orders, users, restaurants (ACID for consistency).
- Redis: Idempotency, caching, real-time tracking.
- Elasticsearch: Restaurant search.
External: Stripe for payments.
Flow:
- Browse → Restaurant Service → Elasticsearch.
- Order → Order Service → Redis → PostgreSQL → Stripe.
- Track → Delivery Service → Redis → PostgreSQL.

Why PostgreSQL?

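PostgreSQL provides ACID transactions for orders and payments. A UNIQUE constraint on idempotency_key rejects duplicate orders, while row-level locking and MVCC keep concurrent stock updates consistent. With sharding, it scales to the ~2K QPS peak.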
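
Duplicate prevention through a unique idempotency key can be sketched as follows. To keep the example self-contained it uses Python's built-in sqlite3 as a stand-in for PostgreSQL; the table layout and the place_order helper are assumptions. In PostgreSQL the same effect is commonly achieved with INSERT ... ON CONFLICT (idempotency_key) DO NOTHING.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        idempotency_key TEXT NOT NULL UNIQUE,
        status TEXT NOT NULL DEFAULT 'pending'
    )
    """
)


def place_order(user_id: int, idempotency_key: str) -> int:
    """Insert the order once; a retry with the same key returns the existing id."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO orders (user_id, idempotency_key) VALUES (?, ?)",
                (user_id, idempotency_key),
            )
            return cur.lastrowid
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: this key was already used.
        row = conn.execute(
            "SELECT id FROM orders WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0]


first = place_order(1, "key-abc")
retry = place_order(1, "key-abc")   # e.g. a client retry after a timeout
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(first == retry, order_count)
```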

5. Deep Dives (Functional Focus)

1. Ordering Food (Place + Confirm)

Problem: Enable reliable food ordering at 2,083 QPS peak without duplicates or stock issues.
Approaches & Tradeoffs:
- Single-Step Order
  - How: POST /orders {restaurant_id, items[], payment_method}; BEGIN; INSERT INTO orders; UPDATE restaurant_stock; COMMIT; Stripe processes payment.
  - Pros: Simple, one call (~200ms), ACID-safe.
  - Cons: No reservation period, payment failures waste stock, high contention (~10ms).
  - Use Case: Low-demand restaurants.
- Two-Step (Place + Confirm)
  - How:
    - Place: POST /orders/place; BEGIN; SELECT stock FROM restaurant_stock WHERE id={item_id} FOR UPDATE; UPDATE restaurant_stock SET stock = stock - 1 WHERE id={item_id}; INSERT INTO orders (status='pending'); COMMIT;
    - Confirm: POST /orders/confirm; Stripe payment, UPDATE orders SET status='confirmed';
  - Pros: Reservation period (e.g., 10min), better UX, reduces contention (~5ms/step).
  - Cons: Timeout logic, stock rollback complexity.
  - Use Case: Standard delivery (e.g., Foodpanda).
- Optimistic Ordering
  - How: Add version to restaurant_stock; UPDATE restaurant_stock SET stock = stock - 1, version = version + 1 WHERE id={item_id} AND stock > 0 AND version={old_version};
  - Pros: High concurrency (~2ms), no locks.
  - Cons: Retries on conflict (10-20% at peak), complex client logic.
  - Use Case: High-read, low-conflict systems.
- Pessimistic Locking
  - How: SELECT * FROM restaurant_stock WHERE id={item_id} AND stock > 0 FOR UPDATE NOWAIT; then UPDATE stock; INSERT orders;
  - Pros: Immediate rejection, ACID-safe.
  - Cons: Lock contention (~10ms), scales poorly at 2K QPS.
  - Use Case: Small-scale systems.
- Hybrid (Redis + PostgreSQL)
  - How:
    - Place: Check idempotency_key in Redis (TTL=10min), reserve stock in Redis, then insert the pending order in PostgreSQL after locking the stock row with SELECT FOR UPDATE.
    - Confirm: Stripe payment, UPDATE orders SET status='confirmed'; delete Redis key.
  - Pros: Redis (<1ms) scales to 100K QPS, PostgreSQL ensures durability, idempotent.
  - Cons: Redis failure risks over-ordering (mitigated by PostgreSQL), sync complexity.
  - Use Case: High-throughput delivery (e.g., DoorDash).
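
The optimistic variant in the list above can be sketched as a read, a versioned UPDATE, and a retry when no row was changed. The example uses sqlite3 in place of PostgreSQL so that it runs standalone; the reserve_one helper and the retry limit are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurant_stock (id INTEGER PRIMARY KEY, stock INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO restaurant_stock VALUES (1, 2, 0)")  # 2 units in stock
conn.commit()


def reserve_one(item_id: int, max_retries: int = 3) -> bool:
    """Decrement stock only if the row version is unchanged since it was read."""
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT stock, version FROM restaurant_stock WHERE id = ?", (item_id,)
        ).fetchone()
        if row is None or row[0] <= 0:
            return False  # unknown item or sold out
        with conn:
            cur = conn.execute(
                "UPDATE restaurant_stock "
                "SET stock = stock - 1, version = version + 1 "
                "WHERE id = ? AND stock > 0 AND version = ?",
                (item_id, row[1]),
            )
        if cur.rowcount == 1:
            return True
        # rowcount == 0 means another writer bumped the version: re-read and retry.
    return False


results = [reserve_one(1) for _ in range(3)]
final = conn.execute("SELECT stock, version FROM restaurant_stock WHERE id = 1").fetchone()
print(results, final)
```

No lock is held between the read and the write, which is where the retry cost under contention comes from.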
Industry Example (Uber Eats): Uses a two-step process with distributed locks (e.g., Redis) for stock reservation, confirming via payment gateways.
Optimal Solution: Hybrid Two-Step—Redis for idempotency and stock reservation (<1ms), PostgreSQL with FOR UPDATE and UNIQUE (idempotency_key) for confirmation (~5ms).
Why Optimal: Balances speed (Redis: 100K ops/s), consistency (PostgreSQL: 10 shards, 200 QPS/shard), and UX (10min reservation), meets <500ms latency.
Tradeoffs: Adds Redis dependency (mitigated by PostgreSQL fallback), slight latency overhead (5ms vs. 2ms for optimistic).
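
The place/confirm flow with a reservation window can be sketched with an in-memory stand-in for Redis. Everything here is illustrative: the ReservationStore class, its return strings, and passing the clock in as `now` are assumptions made so the timeout logic is easy to follow. With real Redis, the idempotency check would typically be a single SET with the NX and EX options.

```python
RESERVATION_TTL_S = 600  # 10-minute reservation window


class ReservationStore:
    """In-memory stand-in for Redis: idempotency keys with expiry plus stock counters."""

    def __init__(self, stock: dict[str, int]):
        self.stock = dict(stock)
        self.pending: dict[str, tuple[str, float]] = {}  # key -> (item_id, expires_at)

    def _expire(self, now: float) -> None:
        # Timed-out reservations give their stock back.
        for key, (item_id, expires_at) in list(self.pending.items()):
            if expires_at <= now:
                self.stock[item_id] += 1
                del self.pending[key]

    def place(self, key: str, item_id: str, now: float) -> str:
        self._expire(now)
        if key in self.pending:
            return "duplicate"
        if self.stock.get(item_id, 0) <= 0:
            return "out_of_stock"
        self.stock[item_id] -= 1
        self.pending[key] = (item_id, now + RESERVATION_TTL_S)
        return "pending"

    def confirm(self, key: str, now: float, payment_ok: bool) -> str:
        self._expire(now)
        if key not in self.pending:
            return "expired"
        item_id, _ = self.pending.pop(key)  # confirmation deletes the key
        if not payment_ok:
            self.stock[item_id] += 1        # failed payment releases the stock
            return "payment_failed"
        return "confirmed"


store = ReservationStore({"burger": 1})
steps = [
    store.place("k1", "burger", now=0),                 # reserves the last unit
    store.place("k1", "burger", now=1),                 # same key again
    store.place("k2", "burger", now=2),                 # nothing left to reserve
    store.confirm("k1", now=700, payment_ok=True),      # too late, window passed
    store.place("k3", "burger", now=701),               # stock was returned
    store.confirm("k3", now=710, payment_ok=True),
]
print(steps)
```

Because the key is deleted on confirmation, a replay after that point is not caught here; in the design above that case falls to the UNIQUE (idempotency_key) constraint in PostgreSQL.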

2. Tracking Delivery (Real-Time Updates)

Problem: Provide real-time delivery tracking at 2,083 QPS with <1s latency.
Approaches & Tradeoffs:
- Polling
  - How: Client polls GET /orders/{order_id}/track every 5s; SELECT * FROM deliveries WHERE order_id={order_id};
  - Pros: Simple (~100ms), no infra.
  - Cons: High request load that grows with active orders (2K clients polling 12 times/min = 24K requests/min ≈ 400 QPS), slow updates (~5s).
  - Use Case: Low-scale systems.
- Long Polling
  - How: Client sends GET /orders/{order_id}/track, server holds request (30s), responds on update.
  - Pros: Lower QPS (~2K clients / 30s ≈ 67 QPS when no updates arrive), near-real-time (~1s).
  - Cons: Server resource use, timeouts (~30s).
  - Use Case: Medium-scale apps.
- Server-Sent Events (SSE)
  - How: Open SSE connection; server pushes delivery:status updates via Kafka events.
  - Pros: Real-time (<1s), efficient (2K QPS sustainable).
  - Cons: Connection overhead (long-lived connections, up to ~1M per server), infra cost.
  - Use Case: High-traffic delivery (e.g., Foodpanda).
- WebSockets
  - How: Bidirectional connection; server pushes location/status, client sends queries.
  - Pros: Real-time (<1s), interactive, 2K QPS scalable.
  - Cons: Higher resource use (up to ~500K connections per server), complexity.
  - Use Case: Premium tracking (e.g., Uber Eats).
- Hybrid (Redis + WebSockets)
  - How: Redis caches driver location (order_id:lat,lon, TTL=1min), WebSockets push updates, PostgreSQL persists history.
  - Pros: Ultra-fast (<1ms cache), real-time (<1s), durable, 2K QPS.
  - Cons: Redis volatility (mitigated by PostgreSQL), dual-system sync.
  - Use Case: Scalable, real-time tracking.
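
Request rates for the two polling variants follow from the number of clients and the interval. A minimal sketch, assuming 2,000 concurrently tracked orders (a hypothetical figure) and treating the long-polling number as a floor that applies when no updates arrive:

```python
def polling_qps(clients: int, interval_s: float) -> float:
    """Each client issues one request per polling interval."""
    return clients / interval_s


def long_polling_floor_qps(clients: int, hold_s: float) -> float:
    """Lower bound: with no updates, each client reconnects once per hold timeout."""
    return clients / hold_s


CLIENTS = 2_000  # hypothetical number of concurrently tracked orders
print(round(polling_qps(CLIENTS, 5)))              # 400 requests/s at a 5 s interval
print(round(long_polling_floor_qps(CLIENTS, 30)))  # 67 requests/s with a 30 s hold
```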
Industry Example (DoorDash): Uses WebSockets with in-memory caching (e.g., Redis) for live driver tracking, ensuring <1s updates.
Optimal Solution: Hybrid—Redis for real-time location caching (<1ms), WebSockets for push updates (<1s), PostgreSQL for persistence.
Why Optimal: Meets <1s latency, scales to 2K QPS (Redis: 100K ops/s), durable with PostgreSQL (10 nodes, 200 QPS/node).
Tradeoffs: Adds Redis/WebSocket infra (mitigated by load balancers), cache staleness tolerable for UX.
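
The hybrid tracking path (cache the latest location, push it to subscribers, append to durable history) can be sketched without any network code. The TrackingHub class is hypothetical: the dict stands in for Redis, the list for PostgreSQL, and plain callbacks for WebSocket connections.

```python
from collections import defaultdict
from typing import Callable, Optional

LOCATION_TTL_S = 60  # cache entries expire after 1 minute


class TrackingHub:
    def __init__(self) -> None:
        self.cache: dict[str, tuple[float, float, float]] = {}  # order_id -> (lat, lon, expires_at)
        self.subscribers: dict[str, list[Callable[[float, float], None]]] = defaultdict(list)
        self.history: list[tuple[str, float, float, float]] = []  # stand-in for PostgreSQL

    def subscribe(self, order_id: str, callback: Callable[[float, float], None]) -> None:
        self.subscribers[order_id].append(callback)

    def update_location(self, order_id: str, lat: float, lon: float, now: float) -> None:
        self.cache[order_id] = (lat, lon, now + LOCATION_TTL_S)  # hot path
        self.history.append((order_id, lat, lon, now))           # durable path
        for callback in self.subscribers[order_id]:              # push to clients
            callback(lat, lon)

    def latest(self, order_id: str, now: float) -> Optional[tuple[float, float]]:
        entry = self.cache.get(order_id)
        if entry is None or entry[2] <= now:
            return None  # missing or stale: a caller would fall back to history
        return entry[:2]


hub = TrackingHub()
received: list[tuple[float, float]] = []
hub.subscribe("order-1", lambda lat, lon: received.append((lat, lon)))
hub.update_location("order-1", 23.78, 90.41, now=0)
print(received, hub.latest("order-1", now=30), hub.latest("order-1", now=100))
```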

3. Browsing Restaurants (Discovery)

Problem: Enable fast, location-based restaurant browsing at 2,083 QPS.
Approaches & Tradeoffs:
- SQL Query
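  - How: SELECT * FROM restaurants WHERE ST_DWithin(location::geography, ST_SetSRID(ST_MakePoint({lon}, {lat}), 4326)::geography, {radius}); (radius in meters when comparing geography values).
  - Pros: Simple, no extra infra (~200ms on small tables).
  - Cons: Slow at scale without a spatial (GiST) index (O(n) scan), ~1s for 100K restaurants.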
  - Use Case: Tiny datasets (<10K).
- Full-Text Search (PostgreSQL)
  - How: SELECT * FROM restaurants WHERE to_tsvector(name) @@ to_tsquery('{term}') AND ST_DWithin(...);
  - Pros: Built-in, decent speed (~150ms).
  - Cons: Limited scalability (~1K QPS), geospatial overhead.
  - Use Case: Small-scale search.
- Elasticsearch
  - How: Index restaurants with a geo_point field; GET /restaurants/_search {"query": {"geo_distance": {"distance": "{radius}", "location": {"lat": {lat}, "lon": {lon}}}}}, where location is the geo_point field.
  - Pros: Fast (~50ms), geospatial support, 2K QPS scalable.
  - Cons: Sync complexity, higher storage (~2x PostgreSQL).
  - Use Case: Large-scale discovery (e.g., Foodpanda).
- Geohash (PostGIS)
  - How: Store geohash in PostgreSQL, SELECT * FROM restaurants WHERE geohash LIKE '{prefix}%';
  - Pros: Precise (~100ms), SQL-integrated, compact index.
  - Cons: Slower than Elasticsearch; points just across a cell boundary share no prefix, so neighboring cells must also be queried (edge cases, ~500m error).
  - Use Case: Location-focused apps.
- Hybrid (Elasticsearch + Redis)
  - How: Elasticsearch for search, Redis caches top 1K queries (TTL=1h).
  - Pros: Ultra-fast (<10ms with cache), 2K QPS sustainable, geospatial/text support.
  - Cons: Cache invalidation, dual-system sync.
  - Use Case: High-traffic, popular areas.
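
For the geohash approach in the list above, the prefix that a LIKE query matches comes from interleaving longitude and latitude bits. The sketch below is a plain-Python version of the standard geohash encoding; PostGIS offers ST_GeoHash for the same purpose, and the function name here is illustrative.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a point by repeatedly halving the lon/lat ranges, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars: list[str] = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=5))  # u4pru
```

Longer shared prefixes mean closer points, but the reverse does not hold: two points on opposite sides of a cell boundary can be meters apart with different prefixes.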
Industry Example (Grubhub): Uses Elasticsearch with caching for fast, location-based restaurant discovery.
Optimal Solution: Hybrid—Elasticsearch for geospatial/text search (~50ms), Redis for caching (<10ms).
Why Optimal: Handles 2K QPS, <200ms latency, scalable with Elasticsearch (10 shards, 200 QPS/shard) and Redis (100K ops/s).
Tradeoffs: Adds Elasticsearch/Redis infra (mitigated by CDC sync), cache staleness acceptable for UX.
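
The search-plus-cache pattern can be sketched with a haversine radius filter standing in for Elasticsearch and a dict with expiry times standing in for Redis. The CachedSearch class, the sample coordinates, and the choice to round coordinates to three decimals for the cache key are all assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


class CachedSearch:
    def __init__(self, restaurants: list[tuple[str, float, float]], ttl_s: float = 3600):
        self.restaurants = restaurants  # (name, lat, lon)
        self.ttl_s = ttl_s              # 1 h TTL, as in the hybrid approach
        self.cache: dict[tuple, tuple[list[str], float]] = {}
        self.backend_calls = 0

    def _search(self, lat: float, lon: float, radius_km: float) -> list[str]:
        self.backend_calls += 1  # counts how often the slow path runs
        return [
            name
            for name, rlat, rlon in self.restaurants
            if haversine_km(lat, lon, rlat, rlon) <= radius_km
        ]

    def nearby(self, lat: float, lon: float, radius_km: float, now: float) -> list[str]:
        # Rounding groups nearby users onto one cache key (roughly 100 m cells).
        key = (round(lat, 3), round(lon, 3), radius_km)
        hit = self.cache.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]
        result = self._search(lat, lon, radius_km)
        self.cache[key] = (result, now + self.ttl_s)
        return result


search = CachedSearch([("A", 23.78, 90.41), ("B", 23.79, 90.42), ("C", 24.50, 91.00)])
first = search.nearby(23.78, 90.41, 5, now=0)
second = search.nearby(23.78, 90.41, 5, now=10)    # served from cache
third = search.nearby(23.78, 90.41, 5, now=4000)   # TTL passed, backend hit again
print(first, search.backend_calls)
```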

System Design Diagram

Components by layer, top to bottom:
- Client (Mobile/Web) <-> API Gateway (JWT, Routing)
- Services behind the gateway: Restaurant Service (Browse, Menu), Order Service (Place, Confirm), Delivery Service (Tracking)
- Data stores: Redis (Cache, Track), PostgreSQL (Orders, Users), Elasticsearch (Restaurant Search)
- Real-time and external: WebSockets (Real-Time), Stripe (Payments)
- Messaging: Kafka (Events), SQS (Fallback)

Summary of Solutions and Industry Practices

Authentication: PostgreSQL + JWT – DoorDash’s secure login.
Ordering Food: Redis + PostgreSQL Two-Step – Uber Eats’ order flow.
Tracking Delivery: Redis + WebSockets – DoorDash’s real-time updates.
Browsing Restaurants: Elasticsearch + Redis – Grubhub’s fast discovery.
Consistency: PostgreSQL (ACID) – Deliveroo’s duplicate prevention.
Scalability: Kafka + Redis – Foodpanda’s surge handling.