Food Delivery - ashtishad/system-design GitHub Wiki

1. Requirements

Functional Requirements:

  • Users can browse restaurants and menus.
  • Users can order food (place order, confirm payment).
  • Users can track delivery in real time.

Non-Functional Requirements (Brief):

  • Consistency: No duplicate orders or payments.
  • Scalability: Handle 10M DAU, with peak surges in which 10% of DAU order at the same time (e.g., weekends).
  • Low Latency: Order <500ms, tracking <1s.
  • Durability: No data loss over 5 years.
  • Capacity Estimation (5 years):
    • DAU: 10M users.
    • Orders: 5M/day * 365 * 5 = 9.125B orders.
    • Restaurants: 100K * 50 menu items = 5M items.
    • Storage:
      • Orders: 9.125B * 1KB = 9.125TB.
      • Restaurants/Menus: 5M * 500B = 2.5GB.
      • Total: ~10TB raw, ~100TB with replication/indexes.
    • QPS: Avg: 5M/day ÷ 86,400s ≈ 58 QPS; Peak (10% of DAU = 1M users, 5 orders each, within a 2,400s window): 1M * 5 ÷ 2,400s ≈ 2,083 QPS.
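
The arithmetic above can be re-derived with a short script. This is only a sketch: every input is a figure quoted in the estimate, and the variable names are chosen here for illustration.

```python
# Inputs are the figures stated in the capacity estimate above.
DAU = 10_000_000
ORDERS_PER_DAY = 5_000_000
YEARS = 5
ORDER_BYTES = 1_000            # 1 KB per order row
RESTAURANTS = 100_000
ITEMS_PER_RESTAURANT = 50
ITEM_BYTES = 500               # 500 B per menu item

total_orders = ORDERS_PER_DAY * 365 * YEARS
order_storage_tb = total_orders * ORDER_BYTES / 1e12
menu_items = RESTAURANTS * ITEMS_PER_RESTAURANT
menu_storage_gb = menu_items * ITEM_BYTES / 1e9

avg_qps = ORDERS_PER_DAY / 86_400

# Peak assumption from the text: 10% of DAU each place 5 orders
# inside a 2,400 s (40-minute) window.
peak_users = DAU // 10
peak_qps = peak_users * 5 / 2_400

print(f"orders over {YEARS} years: {total_orders:,}")
print(f"order storage: {order_storage_tb} TB, menu storage: {menu_storage_gb} GB")
print(f"avg QPS: {avg_qps:.0f}, peak QPS: {peak_qps:.0f}")
```

Changing a single input (for example the peak window length) shows how sensitive the 2,083 QPS target is to that assumption.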

2. Core Entities

  • User: {id, name, address, payment_method}
  • Restaurant: {id, name, location (lat/lon), menu_items[]}
  • MenuItem: {id, restaurant_id, name, price}
  • Order: {id, user_id, restaurant_id, items[], total_amount, status (pending/confirmed/delivered), idempotency_key, timestamp}
  • Delivery: {id, order_id, driver_id, status (assigned/en_route/delivered), location (lat/lon), timestamp}
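
As a rough illustration, the Order entity and its status field could be modeled as below. The `OrderItem` shape, the use of integer cents, and the transition table are assumptions made for this sketch; the entity list above only names the fields and the three statuses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Set


class OrderStatus(str, Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    DELIVERED = "delivered"


# Hypothetical transition table: which status may follow which.
ALLOWED_TRANSITIONS: Dict[OrderStatus, Set[OrderStatus]] = {
    OrderStatus.PENDING: {OrderStatus.CONFIRMED},
    OrderStatus.CONFIRMED: {OrderStatus.DELIVERED},
    OrderStatus.DELIVERED: set(),
}


@dataclass
class OrderItem:
    # Assumed shape of one entry in Order.items[].
    menu_item_id: str
    quantity: int
    unit_price_cents: int


@dataclass
class Order:
    id: str
    user_id: str
    restaurant_id: str
    items: List[OrderItem]
    idempotency_key: str
    status: OrderStatus = OrderStatus.PENDING

    @property
    def total_amount_cents(self) -> int:
        # Derived from the items so it cannot drift out of sync with them.
        return sum(i.quantity * i.unit_price_cents for i in self.items)

    def transition(self, new_status: OrderStatus) -> None:
        # Reject jumps such as pending -> delivered.
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(f"cannot go from {self.status.value} to {new_status.value}")
        self.status = new_status
```

Keeping the allowed transitions in one table makes it harder for a retry or an out-of-order event to move an order backwards.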

3. APIs

  • GET /restaurants?location={lat},{lon}&radius={radius}: Returns nearby restaurants.
  • GET /restaurants/{restaurant_id}/menu: Returns menu items.
  • POST /orders/place: {restaurant_id, items[], idempotency_key}, places order.
  • POST /orders/confirm: {order_id, payment_method}, confirms payment.
  • GET /orders/{order_id}/track: Returns delivery status/location.

4. High-Level Design

  • Client: Mobile/web app.
  • API Gateway: Routes, auth (JWT), rate-limiting.
  • Microservices:
    • Restaurant Service: Manages restaurant/menu browsing.
    • Order Service: Handles order placement/confirmation.
    • Delivery Service: Tracks delivery status.
  • Data Stores:
    • PostgreSQL: Orders, users, restaurants (ACID for consistency).
    • Redis: Idempotency, caching, real-time tracking.
    • Elasticsearch: Restaurant search.
  • External: Stripe for payments.
  • Flow:
    • Browse → Restaurant Service → Elasticsearch.
    • Order → Order Service → Redis → PostgreSQL → Stripe.
    • Track → Delivery Service → Redis → PostgreSQL.

Why PostgreSQL?

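PostgreSQL provides ACID transactions for orders and payments. A UNIQUE constraint on idempotency_key rejects duplicate orders, row-level locking and MVCC keep concurrent stock updates consistent, and sharding spreads the ~2K QPS peak across nodes.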
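
A minimal sketch of shard routing, assuming orders are sharded by user_id (the page does not name a shard key) and using the 10-shard figure quoted later in the deep dives. `shard_for` is a hypothetical helper, not part of any library.

```python
import hashlib

NUM_SHARDS = 10  # matches the "10 shards" figure used in the deep dives


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a hash that is stable across processes."""
    # Python's built-in hash() is salted per process, so use SHA-256 instead.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for user_id in ("user-1", "user-2", "user-3"):
        print(user_id, "->", shard_for(user_id))
```

With user_id as the key, a client retry carrying the same idempotency_key lands on the same shard, so the UNIQUE constraint can still catch the duplicate locally.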


5. Deep Dives (Functional Focus)

1. Ordering Food (Place + Confirm)

  • Problem: Enable reliable food ordering at 2,083 QPS peak without duplicates or stock issues.
  • Approaches & Tradeoffs:
    • Single-Step Order
      • How: POST /orders {restaurant_id, items[], payment_method}; BEGIN; INSERT INTO orders; UPDATE restaurant_stock; COMMIT; Stripe processes payment.
      • Pros: Simple, one call (~200ms), ACID-safe.
      • Cons: No reservation period, payment failures waste stock, high contention (~10ms).
      • Use Case: Low-demand restaurants.
    • Two-Step (Place + Confirm)
      • How:
        • Place: POST /orders/place; BEGIN; SELECT stock FROM restaurant_stock WHERE id={item_id} FOR UPDATE; UPDATE restaurant_stock SET stock = stock - 1 WHERE id={item_id}; INSERT INTO orders (status='pending'); COMMIT;
        • Confirm: POST /orders/confirm; Stripe payment, UPDATE orders SET status='confirmed';
      • Pros: Reservation period (e.g., 10min), better UX, reduces contention (~5ms/step).
      • Cons: Timeout logic, stock rollback complexity.
      • Use Case: Standard delivery (e.g., Foodpanda).
    • Optimistic Ordering
      • How: Add version to restaurant_stock; UPDATE restaurant_stock SET stock = stock - 1, version = version + 1 WHERE id={item_id} AND stock > 0 AND version={old_version};
      • Pros: High concurrency (~2ms), no locks.
      • Cons: Retries on conflict (10-20% at peak), complex client logic.
      • Use Case: High-read, low-conflict systems.
    • Pessimistic Locking
      • How: SELECT * FROM restaurant_stock WHERE id={item_id} AND stock > 0 FOR UPDATE NOWAIT; then UPDATE stock; INSERT orders;
      • Pros: Immediate rejection, ACID-safe.
      • Cons: Lock contention (~10ms), scales poorly at 2K QPS.
      • Use Case: Small-scale systems.
    • Hybrid (Redis + PostgreSQL)
      • How:
        • Place: Check idempotency_key in Redis (TTL=10min), reserve stock in Redis, INSERT orders (pending) in PostgreSQL with FOR UPDATE on stock.
        • Confirm: Stripe payment, UPDATE orders SET status='confirmed'; delete Redis key.
      • Pros: Redis (<1ms) scales to 100K QPS, PostgreSQL ensures durability, idempotent.
      • Cons: Redis failure risks over-ordering (mitigated by PostgreSQL), sync complexity.
      • Use Case: High-throughput delivery (e.g., DoorDash).
  • Industry Example (Uber Eats): Uses a two-step process with distributed locks (e.g., Redis) for stock reservation, confirming via payment gateways.
  • Optimal Solution: Hybrid Two-Step—Redis for idempotency and stock reservation (<1ms), PostgreSQL with FOR UPDATE and UNIQUE (idempotency_key) for confirmation (~5ms).
  • Why Optimal: Balances speed (Redis: 100K ops/s), consistency (PostgreSQL: 10 shards, 200 QPS/shard), and UX (10min reservation), meets <500ms latency.
  • Tradeoffs: Adds Redis dependency (mitigated by PostgreSQL fallback), slight latency overhead (5ms vs. 2ms for optimistic).
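
The database side of the two-step flow can be sketched as follows. Assumptions: in-memory SQLite stands in for PostgreSQL, a conditional UPDATE replaces SELECT ... FOR UPDATE, the Redis layer is omitted, `charge` is a stand-in for the payment gateway call, and the `expired` status is hypothetical (it is not in the entity list).

```python
import sqlite3
import time
from typing import Callable, Optional

RESERVATION_TTL_S = 600  # the 10-minute reservation window mentioned above


def connect() -> sqlite3.Connection:
    """In-memory SQLite database standing in for PostgreSQL."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE restaurant_stock (item_id TEXT PRIMARY KEY, stock INTEGER NOT NULL);
        CREATE TABLE orders (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            item_id         TEXT NOT NULL,
            status          TEXT NOT NULL,
            idempotency_key TEXT NOT NULL UNIQUE,
            expires_at      REAL NOT NULL
        );
    """)
    return db


def _find(db: sqlite3.Connection, key: str) -> Optional[int]:
    row = db.execute("SELECT id FROM orders WHERE idempotency_key = ?", (key,)).fetchone()
    return None if row is None else row[0]


def place_order(db: sqlite3.Connection, item_id: str, key: str,
                now: Optional[float] = None) -> Optional[int]:
    """Reserve one unit and create a pending order; None means out of stock."""
    now = time.time() if now is None else now
    existing = _find(db, key)
    if existing is not None:
        return existing  # client retry: return the original order
    try:
        with db:  # one transaction: commit on success, roll back on error
            reserved = db.execute(
                "UPDATE restaurant_stock SET stock = stock - 1 "
                "WHERE item_id = ? AND stock > 0", (item_id,))
            if reserved.rowcount == 0:
                return None
            created = db.execute(
                "INSERT INTO orders (item_id, status, idempotency_key, expires_at) "
                "VALUES (?, 'pending', ?, ?)", (item_id, key, now + RESERVATION_TTL_S))
            return created.lastrowid
    except sqlite3.IntegrityError:
        # Another request with the same key won the race; the rollback
        # also undid this request's stock decrement.
        return _find(db, key)


def confirm_order(db: sqlite3.Connection, order_id: int,
                  charge: Callable[[int], bool], now: Optional[float] = None) -> bool:
    """Charge and confirm a pending, unexpired order."""
    now = time.time() if now is None else now
    row = db.execute("SELECT status, expires_at FROM orders WHERE id = ?",
                     (order_id,)).fetchone()
    if row is None:
        return False
    status, expires_at = row
    if status == "confirmed":
        return True  # repeated confirm is a no-op, no second charge
    if status != "pending" or now > expires_at or not charge(order_id):
        return False
    with db:
        db.execute("UPDATE orders SET status = 'confirmed' "
                   "WHERE id = ? AND status = 'pending'", (order_id,))
    return True


def release_expired(db: sqlite3.Connection, now: Optional[float] = None) -> int:
    """Return stock for pending orders whose reservation window has passed."""
    now = time.time() if now is None else now
    with db:
        rows = db.execute("SELECT id, item_id FROM orders "
                          "WHERE status = 'pending' AND expires_at < ?", (now,)).fetchall()
        for order_id, item_id in rows:
            db.execute("UPDATE orders SET status = 'expired' WHERE id = ?", (order_id,))
            db.execute("UPDATE restaurant_stock SET stock = stock + 1 "
                       "WHERE item_id = ?", (item_id,))
    return len(rows)
```

In this sketch a retry with the same idempotency_key returns the original order id, and `release_expired` would run periodically to hand back stock from abandoned reservations. A production version would also need to handle a charge that succeeds just as the reservation expires.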

2. Tracking Delivery (Real-Time Updates)

  • Problem: Provide real-time delivery tracking at 2,083 QPS with <1s latency.
  • Approaches & Tradeoffs:
    • Polling
      • How: Client polls GET /orders/{order_id}/track every 5s; SELECT * FROM deliveries WHERE order_id={order_id};
      • Pros: Simple (~100ms), no infra.
      • Cons: High request load (2K clients * 12 polls/min = 24K requests/min ≈ 400 QPS, growing linearly with active deliveries), slow updates (up to ~5s stale).
      • Use Case: Low-scale systems.
    • Long Polling
      • How: Client sends GET /orders/{order_id}/track, server holds request (30s), responds on update.
      • Pros: Lower QPS (2K clients ÷ 30s ≈ 67 QPS when idle), near-real-time (~1s).
      • Cons: Server resource use, timeouts (~30s).
      • Use Case: Medium-scale apps.
    • Server-Sent Events (SSE)
      • How: Open SSE connection; server pushes delivery:status updates via Kafka events.
      • Pros: Real-time (<1s), efficient (2K QPS sustainable).
      • Cons: One long-lived connection per client (up to ~1M connections/server), infra cost.
      • Use Case: High-traffic delivery (e.g., Foodpanda).
    • WebSockets
      • How: Bidirectional connection; server pushes location/status, client sends queries.
      • Pros: Real-time (<1s), interactive, 2K QPS scalable.
      • Cons: Higher resource use (~500K connections/server), complexity.
      • Use Case: Premium tracking (e.g., Uber Eats).
    • Hybrid (Redis + WebSockets)
      • How: Redis caches driver location (order_id:lat,lon, TTL=1min), WebSockets push updates, PostgreSQL persists history.
      • Pros: Ultra-fast (<1ms cache), real-time (<1s), durable, 2K QPS.
      • Cons: Redis volatility (mitigated by PostgreSQL), dual-system sync.
      • Use Case: Scalable, real-time tracking.
  • Industry Example (DoorDash): Uses WebSockets with in-memory caching (e.g., Redis) for live driver tracking, ensuring <1s updates.
  • Optimal Solution: Hybrid—Redis for real-time location caching (<1ms), WebSockets for push updates (<1s), PostgreSQL for persistence.
  • Why Optimal: Meets <1s latency, scales to 2K QPS (Redis: 100K ops/s), durable with PostgreSQL (10 nodes, 200 QPS/node).
  • Tradeoffs: Adds Redis/WebSocket infra (mitigated by load balancers), cache staleness tolerable for UX.
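
The cache-plus-push pattern can be sketched in a few lines. Everything here is an in-process stand-in: a dict with expiry plays the role of Redis (TTL=1min), callbacks play the role of WebSocket connections, and a list plays the role of the PostgreSQL history table. `LocationTracker` is a hypothetical name.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Location = Tuple[float, float]  # (lat, lon)
Subscriber = Callable[[str, Location], None]


class LocationTracker:
    def __init__(self, ttl_s: float = 60.0) -> None:
        self.ttl_s = ttl_s
        # order_id -> (location, expiry time); stand-in for Redis keys with a TTL.
        self._cache: Dict[str, Tuple[Location, float]] = {}
        # order_id -> callbacks; stand-in for open WebSocket connections.
        self._subscribers: Dict[str, List[Subscriber]] = {}
        # Append-only log; stand-in for the PostgreSQL history table.
        self.history: List[Tuple[str, Location, float]] = []

    def subscribe(self, order_id: str, callback: Subscriber) -> None:
        self._subscribers.setdefault(order_id, []).append(callback)

    def update(self, order_id: str, lat: float, lon: float,
               now: Optional[float] = None) -> None:
        """Called when the driver app reports a new position."""
        now = time.time() if now is None else now
        location = (lat, lon)
        self._cache[order_id] = (location, now + self.ttl_s)
        self.history.append((order_id, location, now))
        for callback in self._subscribers.get(order_id, ()):
            callback(order_id, location)  # push instead of waiting for a poll

    def latest(self, order_id: str, now: Optional[float] = None) -> Optional[Location]:
        """Read path for GET /orders/{order_id}/track; None once the entry expired."""
        now = time.time() if now is None else now
        entry = self._cache.get(order_id)
        if entry is None or now >= entry[1]:
            self._cache.pop(order_id, None)
            return None
        return entry[0]
```

When `latest` returns None (cache miss or expired entry), a real service would fall back to the persisted history, which is the "mitigated by PostgreSQL" part of the tradeoff.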

3. Browsing Restaurants (Discovery)

  • Problem: Enable fast, location-based restaurant browsing at 2,083 QPS.
  • Approaches & Tradeoffs:
    • SQL Query
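      • How: SELECT * FROM restaurants WHERE ST_DWithin(location, ST_SetSRID(ST_MakePoint({lon}, {lat}), 4326)::geography, {radius}); (location stored as geography, radius in meters).
      • Pros: Simple, no extra infra (~200ms).
      • Cons: Slow at scale without a spatial (GiST) index: O(n) scan, ~1s for 100K restaurants.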
      • Use Case: Tiny datasets (<10K).
    • Full-Text Search (PostgreSQL)
      • How: SELECT * FROM restaurants WHERE to_tsvector(name) @@ to_tsquery('{term}') AND ST_DWithin(...);
      • Pros: Built-in, decent speed (~150ms).
      • Cons: Limited scalability (~1K QPS), geospatial overhead.
      • Use Case: Small-scale search.
    • Elasticsearch
      • How: Index restaurants with a geo_point field; GET /restaurants/_search {query: {geo_distance: {distance: {radius}, location: {lat, lon}}}}.
      • Pros: Fast (~50ms), geospatial support, 2K QPS scalable.
      • Cons: Sync complexity, higher storage (~2x PostgreSQL).
      • Use Case: Large-scale discovery (e.g., Foodpanda).
    • Geohash (PostGIS)
      • How: Store geohash in PostgreSQL, SELECT * FROM restaurants WHERE geohash LIKE '{prefix}%';
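      • Pros: SQL-integrated, compact index, prefix lookups in ~100ms.
      • Cons: Slower than Elasticsearch; points on either side of a cell boundary share no prefix, so nearby restaurants (within ~500m) can be missed unless neighboring cells are also queried.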
      • Use Case: Location-focused apps.
    • Hybrid (Elasticsearch + Redis)
      • How: Elasticsearch for search, Redis caches top 1K queries (TTL=1h).
      • Pros: Ultra-fast (<10ms with cache), 2K QPS sustainable, geospatial/text support.
      • Cons: Cache invalidation, dual-system sync.
      • Use Case: High-traffic, popular areas.
  • Industry Example (Grubhub): Uses Elasticsearch with caching for fast, location-based restaurant discovery.
  • Optimal Solution: Hybrid—Elasticsearch for geospatial/text search (~50ms), Redis for caching (<10ms).
  • Why Optimal: Handles 2K QPS, <200ms latency, scalable with Elasticsearch (10 shards, 200 QPS/shard) and Redis (100K ops/s).
  • Tradeoffs: Adds Elasticsearch/Redis infra (mitigated by CDC sync), cache staleness acceptable for UX.
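
The cache-in-front-of-search idea can be sketched as a cache-aside lookup. Assumptions: a linear haversine filter stands in for the Elasticsearch geo query, a dict with expiry stands in for Redis (TTL=1h), and rounding coordinates to 3 decimals (cells of roughly 100m) is an illustrative way to let nearby users share a cache entry.

```python
import math
import time
from typing import Dict, List, Optional, Tuple

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


class CachedSearch:
    def __init__(self, restaurants: List[Tuple[str, float, float]],
                 ttl_s: float = 3600.0) -> None:
        self.restaurants = restaurants  # (name, lat, lon)
        self.ttl_s = ttl_s
        self._cache: Dict[Tuple[float, float, float], Tuple[List[str], float]] = {}
        self.backend_calls = 0  # counts cache misses, for illustration

    def _search_backend(self, lat: float, lon: float, radius_km: float) -> List[str]:
        # Stand-in for the Elasticsearch geo_distance query.
        self.backend_calls += 1
        return sorted(name for name, rlat, rlon in self.restaurants
                      if haversine_km(lat, lon, rlat, rlon) <= radius_km)

    def nearby(self, lat: float, lon: float, radius_km: float,
               now: Optional[float] = None) -> List[str]:
        now = time.time() if now is None else now
        key = (round(lat, 3), round(lon, 3), radius_km)
        hit = self._cache.get(key)
        if hit is not None and now < hit[1]:
            return hit[0]  # served from cache
        result = self._search_backend(lat, lon, radius_km)
        self._cache[key] = (result, now + self.ttl_s)
        return result
```

Users within the same rounded cell see the result computed for the first requester until the entry expires, which is the cache staleness the tradeoff above accepts.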

System Design Diagram


+----------------------+    +----------------------+
|        Client        |<-->|     API Gateway      |
|     (Mobile/Web)     |    |    (JWT, Routing)    |
+----------------------+    +----------------------+
                                       |
           +---------------------------+---------------------------+
           |                           |                           |
+----------------------+    +----------------------+    +----------------------+
|  Restaurant Service  |    |    Order Service     |    |   Delivery Service   |
|    (Browse, Menu)    |    |   (Place, Confirm)   |    |      (Tracking)      |
+----------------------+    +----------------------+    +----------------------+
           |                           |                           |
           +---------------------------+---------------------------+
           |                           |                           |
+----------------------+    +----------------------+    +----------------------+
|        Redis         |    |      PostgreSQL      |    |    Elasticsearch     |
|    (Cache, Track)    |    |   (Orders, Users)    |    | (Restaurant Search)  |
+----------------------+    +----------------------+    +----------------------+
           |                           |
+----------------------+    +----------------------+
|      WebSockets      |    |        Stripe        |
|     (Real-Time)      |    |      (Payments)      |
+----------------------+    +----------------------+
           |                           |
+----------------------+    +----------------------+
|        Kafka         |    |         SQS          |
|       (Events)       |    |      (Fallback)      |
+----------------------+    +----------------------+


Summary of Solutions and Industry Practices

  • Authentication: PostgreSQL + JWT – DoorDash’s secure login.
  • Ordering Food: Redis + PostgreSQL Two-Step – Uber Eats’ order flow.
  • Tracking Delivery: Redis + WebSockets – DoorDash’s real-time updates.
  • Browsing Restaurants: Elasticsearch + Redis – Grubhub’s fast discovery.
  • Consistency: PostgreSQL (ACID) – Deliveroo’s duplicate prevention.
  • Scalability: Kafka + Redis – Foodpanda’s surge handling.