Food Delivery - ashtishad/system-design GitHub Wiki
Functional Requirements:
- Users can browse restaurants and menus.
- Users can order food (place order, confirm payment).
- Users can track delivery in real time.
Non-Functional Requirements (Brief):
- Consistency: No duplicate orders or payments.
- Scalability: Handle 10M DAU, plus peak surges from ~10% of users ordering in a short window (e.g., weekends).
- Low Latency: Order <500ms, tracking <1s.
- Durability: No data loss over 5 years.
Capacity Estimation (5 years):
- DAU: 10M users.
- Orders: 5M/day * 365 * 5 = 9.125B orders.
- Restaurants: 100K * 50 menu items = 5M items.
Storage:
- Orders: 9.125B * 1KB = 9.125TB.
- Restaurants/Menus: 5M * 500B = 2.5GB.
- Total: ~10TB raw, ~100TB with replication/indexes.
- QPS: Avg: 5M orders/day ÷ 86,400s ≈ 58 QPS; Peak (10% of DAU surging): 1M users * 5 orders ÷ 2,400s (40 min) ≈ 2,083 QPS.
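The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 KB = 1,000 bytes) and reads the peak figure as 1M surge users placing 5 orders each within a 2,400-second window; the variable names are illustrative.

```python
# Back-of-envelope capacity numbers, using the figures from the estimate above.
ORDERS_PER_DAY = 5_000_000
YEARS = 5
ORDER_SIZE_BYTES = 1_000        # assumption: 1 KB per order, decimal units
MENU_ITEMS = 100_000 * 50       # restaurants * menu items per restaurant
MENU_ITEM_SIZE_BYTES = 500

total_orders = ORDERS_PER_DAY * 365 * YEARS
order_storage_tb = total_orders * ORDER_SIZE_BYTES / 1e12
menu_storage_gb = MENU_ITEMS * MENU_ITEM_SIZE_BYTES / 1e9

avg_qps = ORDERS_PER_DAY / 86_400
# Peak: 1M surge users * 5 orders each, compressed into 2,400 s (40 minutes).
peak_qps = 1_000_000 * 5 / 2_400

print(f"orders over {YEARS} years: {total_orders:,}")
print(f"order storage: {order_storage_tb} TB, menu storage: {menu_storage_gb} GB")
print(f"avg QPS: {avg_qps:.0f}, peak QPS: {peak_qps:.0f}")
```

Changing the window length or orders-per-user assumption shifts the peak figure proportionally, so those two inputs are worth validating first.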
Data Model:
- User: {id, name, address, payment_method}
- Restaurant: {id, name, location (lat/lon), menu_items[]}
- MenuItem: {id, restaurant_id, name, price}
- Order: {id, user_id, restaurant_id, items[], total_amount, status (pending/confirmed/delivered), idempotency_key, timestamp}
- Delivery: {id, order_id, driver_id, status (assigned/en_route/delivered), location (lat/lon), timestamp}
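The entities above map naturally onto typed records. The following is a minimal sketch of the Order entity only; `OrderItem`, the `*_cents` integer fields, and computing the total from line items are assumptions made for illustration, not part of the original schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List


class OrderStatus(str, Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    DELIVERED = "delivered"


@dataclass
class OrderItem:
    # Hypothetical line-item shape; the source only lists items[].
    menu_item_id: str
    quantity: int
    unit_price_cents: int  # assumption: integer cents to avoid float rounding


@dataclass
class Order:
    id: str
    user_id: str
    restaurant_id: str
    items: List[OrderItem]
    idempotency_key: str
    status: OrderStatus = OrderStatus.PENDING
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def total_amount_cents(self) -> int:
        """Derive total_amount from the line items instead of storing it twice."""
        return sum(item.quantity * item.unit_price_cents for item in self.items)


order = Order(
    id="o1",
    user_id="u1",
    restaurant_id="r1",
    items=[OrderItem("m1", 2, 450), OrderItem("m2", 1, 300)],
    idempotency_key="key-123",
)
print(order.status.value, order.total_amount_cents)
```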
API Design:
- GET /restaurants?location={lat},{lon}&radius={radius}: Returns nearby restaurants.
- GET /restaurants/{restaurant_id}/menu: Returns menu items.
- POST /orders/place: {restaurant_id, items[], idempotency_key}, places order.
- POST /orders/confirm: {order_id, payment_method}, confirms payment.
- GET /orders/{order_id}/track: Returns delivery status/location.
High-Level Design:
- Client: Mobile/web app.
- API Gateway: Routes, auth (JWT), rate-limiting.
Microservices:
- Restaurant Service: Manages restaurant/menu browsing.
- Order Service: Handles order placement/confirmation.
- Delivery Service: Tracks delivery status.
Data Stores:
- PostgreSQL: Orders, users, restaurants (ACID for consistency).
- Redis: Idempotency, caching, real-time tracking.
- Elasticsearch: Restaurant search.
- External: Stripe for payments.
Flow:
- Browse → Restaurant Service → Elasticsearch.
- Order → Order Service → Redis → PostgreSQL → Stripe.
- Track → Delivery Service → Redis → PostgreSQL.
Why PostgreSQL?
PostgreSQL provides ACID transactions for orders and payments. A UNIQUE constraint on idempotency_key rejects duplicate orders, row-level locking and MVCC keep concurrent stock updates consistent, and sharding scales writes to the ~2K QPS peak.
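Duplicate prevention through a unique idempotency key can be sketched as follows. The example uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs anywhere; the table layout and the `place_order` helper are hypothetical. In PostgreSQL the same effect is commonly achieved with `INSERT ... ON CONFLICT (idempotency_key) DO NOTHING`.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL,
        idempotency_key TEXT NOT NULL UNIQUE,
        status TEXT NOT NULL DEFAULT 'pending'
    )
    """
)


def place_order(user_id: str, idempotency_key: str) -> int:
    """Insert the order once; a retry with the same key returns the existing id."""
    try:
        with conn:  # commits on success, rolls back on exception
            cur = conn.execute(
                "INSERT INTO orders (user_id, idempotency_key) VALUES (?, ?)",
                (user_id, idempotency_key),
            )
            return cur.lastrowid
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: this is a retry of an earlier request.
        row = conn.execute(
            "SELECT id FROM orders WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0]


first = place_order("u1", "key-123")
retry = place_order("u1", "key-123")  # e.g., the client retried after a timeout
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(first, retry, count)
```

The database, not the application, is the final arbiter here, so the guarantee holds even if two retries race each other.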
1. Ordering Food (Place + Confirm)
- Problem: Enable reliable food ordering at 2,083 QPS peak without duplicates or stock issues.
Approaches & Tradeoffs:
Single-Step Order
- How: POST /orders {restaurant_id, items[], payment_method}; BEGIN; INSERT INTO orders; UPDATE restaurant_stock; COMMIT; Stripe processes payment.
- Pros: Simple, one call (~200ms), ACID-safe.
- Cons: No reservation period, payment failures waste stock, high contention (~10ms).
- Use Case: Low-demand restaurants.
Two-Step (Place + Confirm)
How:
- Place: POST /orders/place; BEGIN; SELECT stock FROM restaurant_stock WHERE id={item_id} FOR UPDATE; UPDATE restaurant_stock SET stock = stock - 1 WHERE id={item_id}; INSERT INTO orders (status='pending'); COMMIT;
- Confirm: POST /orders/confirm; Stripe payment, UPDATE orders SET status='confirmed';
- Pros: Reservation period (e.g., 10min), better UX, reduces contention (~5ms/step).
- Cons: Timeout logic, stock rollback complexity.
- Use Case: Standard delivery (e.g., Foodpanda).
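The timeout and stock-rollback logic mentioned in the cons can be sketched in memory. This is an illustrative model only: the `TwoStepOrders` class, its method names, and the injectable clock are assumptions, and a real system would keep this state in the database rather than in a Python dict.

```python
import time
from typing import Callable, Dict, Set, Tuple

RESERVATION_TTL_S = 600  # the 10-minute reservation period from the text


class TwoStepOrders:
    """Place reserves stock; confirm finalizes it; expiry or failed payment releases it."""

    def __init__(self, stock: Dict[str, int], clock: Callable[[], float] = time.monotonic):
        self.stock = dict(stock)
        self.pending: Dict[str, Tuple[str, float]] = {}  # order_id -> (item_id, expires_at)
        self.confirmed: Set[str] = set()
        self.clock = clock

    def release_expired(self) -> None:
        """Return stock for reservations whose window has passed."""
        now = self.clock()
        for order_id, (item_id, expires_at) in list(self.pending.items()):
            if expires_at <= now:
                del self.pending[order_id]
                self.stock[item_id] += 1

    def place(self, order_id: str, item_id: str) -> bool:
        self.release_expired()
        if self.stock.get(item_id, 0) <= 0:
            return False  # nothing left to reserve
        self.stock[item_id] -= 1
        self.pending[order_id] = (item_id, self.clock() + RESERVATION_TTL_S)
        return True

    def confirm(self, order_id: str, payment_ok: bool) -> bool:
        self.release_expired()
        if order_id not in self.pending:
            return False  # unknown order or reservation already expired
        item_id, _ = self.pending.pop(order_id)
        if not payment_ok:
            self.stock[item_id] += 1  # roll the reservation back
            return False
        self.confirmed.add(order_id)
        return True
```

Passing the clock in as a parameter keeps the expiry path testable without sleeping for ten minutes.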
Optimistic Ordering
- How: Add version to restaurant_stock; UPDATE restaurant_stock SET stock = stock - 1, version = version + 1 WHERE id={item_id} AND stock > 0 AND version={old_version};
- Pros: High concurrency (~2ms), no locks.
- Cons: Retries on conflict (10-20% at peak), complex client logic.
- Use Case: High-read, low-conflict systems.
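The versioned UPDATE above can be exercised end to end. The sketch below again uses sqlite3 in place of PostgreSQL (an assumption for runnability); `read_stock` and `try_decrement` are hypothetical helper names. The key point is that the caller inspects the affected row count to detect a lost race.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurant_stock (id TEXT PRIMARY KEY, stock INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO restaurant_stock VALUES ('burger', 1, 0)")
conn.commit()


def read_stock(item_id: str):
    """Return (stock, version) as currently stored."""
    return conn.execute(
        "SELECT stock, version FROM restaurant_stock WHERE id = ?", (item_id,)
    ).fetchone()


def try_decrement(item_id: str, seen_version: int) -> bool:
    """Apply the update only if nobody changed the row since it was read."""
    with conn:
        cur = conn.execute(
            "UPDATE restaurant_stock "
            "SET stock = stock - 1, version = version + 1 "
            "WHERE id = ? AND stock > 0 AND version = ?",
            (item_id, seen_version),
        )
    return cur.rowcount == 1  # 0 rows means a conflict or no stock


# Two clients read the same version; only the first write succeeds.
_, version_a = read_stock("burger")
_, version_b = read_stock("burger")
a_ok = try_decrement("burger", version_a)
b_ok = try_decrement("burger", version_b)
print(a_ok, b_ok, read_stock("burger"))
```

A client that receives `False` would re-read the row and retry, which is the retry cost listed in the cons.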
Pessimistic Locking
- How: SELECT * FROM restaurant_stock WHERE id={item_id} AND stock > 0 FOR UPDATE NOWAIT; then UPDATE stock; INSERT orders;
- Pros: Immediate rejection, ACID-safe.
- Cons: Lock contention (~10ms), scales poorly at 2K QPS.
- Use Case: Small-scale systems.
Hybrid (Redis + PostgreSQL)
How:
- Place: Check idempotency_key in Redis (TTL=10min), reserve stock in Redis, INSERT orders (pending) in PostgreSQL with FOR UPDATE on stock.
- Confirm: Stripe payment, UPDATE orders SET status='confirmed'; delete Redis key.
- Pros: Redis (<1ms) scales to 100K QPS, PostgreSQL ensures durability, idempotent.
- Cons: Redis failure risks over-ordering (mitigated by PostgreSQL), sync complexity.
- Use Case: High-throughput delivery (e.g., DoorDash).
- Industry Example (Uber Eats): Uses a two-step process with distributed locks (e.g., Redis) for stock reservation, confirming via payment gateways.
- Optimal Solution: Hybrid Two-Step—Redis for idempotency and stock reservation (<1ms), PostgreSQL with FOR UPDATE and UNIQUE (idempotency_key) for confirmation (~5ms).
- Why Optimal: Balances speed (Redis: 100K ops/s), consistency (PostgreSQL: 10 shards, 200 QPS/shard), and UX (10min reservation), meets <500ms latency.
- Tradeoffs: Adds Redis dependency (mitigated by PostgreSQL fallback), slight latency overhead (5ms vs. 2ms for optimistic).
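The Redis side of the hybrid flow relies on a set-if-absent write with an expiry, which in Redis is the `SET key value NX EX seconds` form. The sketch below models that behaviour with an in-memory class so it runs without a Redis server; `TtlKeyStore`, `place`, and the `idem:` key prefix are illustrative names, not part of the original design.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlKeyStore:
    """In-memory stand-in for Redis set-if-absent with a TTL (assumption)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)
        self._clock = clock

    def set_if_absent(self, key: str, value: str, ttl_s: float) -> bool:
        now = self._clock()
        existing = self._data.get(key)
        if existing is not None and existing[1] > now:
            return False  # a live entry already owns this key
        self._data[key] = (value, now + ttl_s)
        return True

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def place(store: TtlKeyStore, idempotency_key: str, order_id: str) -> Optional[str]:
    """Return the order id that owns the key: the new one, or the earlier one on a retry."""
    key = f"idem:{idempotency_key}"
    if store.set_if_absent(key, order_id, ttl_s=600):  # TTL=10min from the text
        return order_id
    return store.get(key)
```

Because the cache entry can expire or be lost, the UNIQUE constraint in PostgreSQL remains the authoritative duplicate check; the cache only absorbs the common retry case cheaply.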
2. Tracking Delivery (Real-Time Updates)
- Problem: Provide real-time delivery tracking at 2,083 QPS with <1s latency.
Approaches & Tradeoffs:
Polling
- How: Client polls GET /orders/{order_id}/track every 5s; SELECT * FROM deliveries WHERE order_id={order_id};
- Pros: Simple (~100ms), no infra.
- Cons: High request volume (2K clients * 12 polls/min = 24K requests/min ≈ 400 QPS), slow updates (~5s).
- Use Case: Low-scale systems.
Long Polling
- How: Client sends GET /orders/{order_id}/track, server holds request (30s), responds on update.
- Pros: Lower QPS (2K clients ÷ 30s ≈ 67 QPS), near-real-time (~1s).
- Cons: Server resource use, timeouts (~30s).
- Use Case: Medium-scale apps.
Server-Sent Events (SSE)
- How: Open SSE connection; server pushes delivery:status updates via Kafka events.
- Pros: Real-time (<1s), efficient (2K QPS sustainable).
- Cons: One long-lived connection per client (up to ~1M connections/server), infra cost.
- Use Case: High-traffic delivery (e.g., Foodpanda).
WebSockets
- How: Bidirectional connection; server pushes location/status, client sends queries.
- Pros: Real-time (<1s), interactive, 2K QPS scalable.
- Cons: Higher resource use (~500K connections/server), complexity.
- Use Case: Premium tracking (e.g., Uber Eats).
Hybrid (Redis + WebSockets)
- How: Redis caches driver location (order_id:lat,lon, TTL=1min), WebSockets push updates, PostgreSQL persists history.
- Pros: Ultra-fast (<1ms cache), real-time (<1s), durable, 2K QPS.
- Cons: Redis volatility (mitigated by PostgreSQL), dual-system sync.
- Use Case: Scalable, real-time tracking.
- Industry Example (DoorDash): Uses WebSockets with in-memory caching (e.g., Redis) for live driver tracking, ensuring <1s updates.
- Optimal Solution: Hybrid—Redis for real-time location caching (<1ms), WebSockets for push updates (<1s), PostgreSQL for persistence.
- Why Optimal: Meets <1s latency, scales to 2K QPS (Redis: 100K ops/s), durable with PostgreSQL (10 nodes, 200 QPS/node).
- Tradeoffs: Adds Redis/WebSocket infra (mitigated by load balancers), cache staleness tolerable for UX.
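The three roles in the hybrid design (latest-location cache, push to subscribers, durable history) can be modelled in a few lines. This is a sketch under stated assumptions: `TrackingHub` is a hypothetical class, plain callbacks stand in for WebSocket connections, a dict with expiry stands in for Redis, and a list stands in for the PostgreSQL history table.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Location = Tuple[float, float]  # (lat, lon)
Listener = Callable[[str, Location], None]


class TrackingHub:
    def __init__(self, ttl_s: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._latest: Dict[str, Tuple[Location, float]] = {}  # cache with TTL=1min
        self._subscribers: Dict[str, List[Listener]] = defaultdict(list)
        self.history: List[Tuple[str, Location]] = []  # stand-in for persisted history
        self._ttl_s = ttl_s
        self._clock = clock

    def subscribe(self, order_id: str, on_update: Listener) -> None:
        """Register a listener, analogous to a client opening a WebSocket."""
        self._subscribers[order_id].append(on_update)

    def publish(self, order_id: str, location: Location) -> None:
        """Driver update: refresh the cache, persist, then push to listeners."""
        self._latest[order_id] = (location, self._clock() + self._ttl_s)
        self.history.append((order_id, location))
        for on_update in self._subscribers[order_id]:
            on_update(order_id, location)

    def latest(self, order_id: str) -> Optional[Location]:
        """Cached position, or None once the entry has expired."""
        entry = self._latest.get(order_id)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]
```

When `latest` returns `None`, a real service would fall back to the persisted history, which is the mitigation for cache volatility noted in the cons.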
3. Browsing Restaurants (Discovery)
- Problem: Enable fast, location-based restaurant browsing at 2,083 QPS.
Approaches & Tradeoffs:
SQL Query
- How: SELECT * FROM restaurants WHERE ST_DWithin(location, ST_MakePoint({lon}, {lat}), {radius});
- Pros: Simple, no extra infra (~200ms).
- Cons: Slow at scale: without a spatial (GiST) index every row is scanned (O(n)), ~1s for 100K restaurants.
- Use Case: Tiny datasets (<10K).
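To make the cost of the unindexed query concrete, here is what a linear radius scan does, written out with the haversine great-circle formula. It is a sketch, not a PostGIS equivalent: the function names, the dict-shaped restaurant records, and the sample coordinates are all illustrative assumptions.

```python
import math
from typing import Dict, List

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby(restaurants: List[Dict], lat: float, lon: float, radius_m: float) -> List[Dict]:
    """O(n) scan: every restaurant is distance-checked, as in an unindexed query."""
    return [
        r for r in restaurants
        if haversine_m(lat, lon, r["lat"], r["lon"]) <= radius_m
    ]


# Hypothetical sample data.
restaurants = [
    {"id": "r1", "lat": 23.7808, "lon": 90.4070},
    {"id": "r2", "lat": 23.8103, "lon": 90.4125},
    {"id": "r3", "lat": 22.3569, "lon": 91.7832},
]
print([r["id"] for r in nearby(restaurants, 23.7808, 90.4070, 5_000)])
```

The work grows linearly with the number of restaurants, which is why the indexed and search-engine approaches below exist.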
Full-Text Search (PostgreSQL)
- How: SELECT * FROM restaurants WHERE to_tsvector(name) @@ to_tsquery('{term}') AND ST_DWithin(...);
- Pros: Built-in, decent speed (~150ms).
- Cons: Limited scalability (~1K QPS), geospatial overhead.
- Use Case: Small-scale search.
Elasticsearch
- How: Index restaurants with a geo_point field; GET /restaurants/_search {"query": {"geo_distance": {"distance": "{radius}", "location": {"lat": {lat}, "lon": {lon}}}}}.
- Pros: Fast (~50ms), geospatial support, 2K QPS scalable.
- Cons: Sync complexity, higher storage (~2x PostgreSQL).
- Use Case: Large-scale discovery (e.g., Foodpanda).
Geohash (PostGIS)
- How: Store geohash in PostgreSQL, SELECT * FROM restaurants WHERE geohash LIKE '{prefix}%';
- Pros: Precise (~100ms), SQL-integrated, compact index.
- Cons: Slower than Elasticsearch, edge cases (~500m error).
- Use Case: Location-focused apps.
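The prefix query works because a geohash interleaves longitude and latitude bits, so points in the same cell share leading characters. The following is a minimal sketch of the standard geohash encoding; the function name and default precision are choices made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a point by repeatedly halving the lon/lat ranges, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def in_cell(geohash: str, prefix: str) -> bool:
    """Equivalent of the SQL filter geohash LIKE '{prefix}%'."""
    return geohash.startswith(prefix)


full = geohash_encode(57.64911, 10.40744, precision=6)
print(full, in_cell(full, full[:4]))
```

Two points on opposite sides of a cell boundary can be physically close yet share no prefix, which is the edge case listed in the cons; querying the neighbouring cells as well is the usual workaround.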
Hybrid (Elasticsearch + Redis)
- How: Elasticsearch for search, Redis caches top 1K queries (TTL=1h).
- Pros: Ultra-fast (<10ms with cache), 2K QPS sustainable, geospatial/text support.
- Cons: Cache invalidation, dual-system sync.
- Use Case: High-traffic, popular areas.
- Industry Example (Grubhub): Uses Elasticsearch with caching for fast, location-based restaurant discovery.
- Optimal Solution: Hybrid—Elasticsearch for geospatial/text search (~50ms), Redis for caching (<10ms).
- Why Optimal: Handles 2K QPS, <200ms latency, scalable with Elasticsearch (10 shards, 200 QPS/shard) and Redis (100K ops/s).
- Tradeoffs: Adds Elasticsearch/Redis infra (mitigated by CDC sync), cache staleness acceptable for UX.
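The caching layer in front of search can be sketched as a TTL cache keyed on a coarsened location, so that nearby users hit the same entry. Everything here is an assumption made for illustration: the `QueryCache` class, rounding coordinates to two decimal places (roughly 1 km of latitude), and passing the search backend in as a plain function.

```python
import time
from typing import Callable, Dict, List, Tuple

SearchFn = Callable[[float, float, int], List[str]]
CacheKey = Tuple[float, float, int]


class QueryCache:
    """TTL cache for search results, keyed by rounded (lat, lon) and radius."""

    def __init__(self, ttl_s: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self._entries: Dict[CacheKey, Tuple[List[str], float]] = {}
        self._ttl_s = ttl_s  # TTL=1h from the text
        self._clock = clock

    @staticmethod
    def make_key(lat: float, lon: float, radius_m: int) -> CacheKey:
        # Coarsen coordinates so nearby queries share one cache entry.
        return (round(lat, 2), round(lon, 2), radius_m)

    def get_or_compute(self, lat: float, lon: float, radius_m: int, search: SearchFn) -> List[str]:
        key = self.make_key(lat, lon, radius_m)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit
        result = search(lat, lon, radius_m)  # cache miss: query the search backend
        self._entries[key] = (result, now + self._ttl_s)
        return result
```

Coarser rounding raises the hit rate but returns results computed for a slightly different origin, which is one form of the staleness tradeoff noted above.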
Architecture Diagram (components):
- Client (Mobile/Web) <-> API Gateway (JWT, Routing)
- Services behind the gateway: Restaurant Service (Browse, Menu), Order Service (Place, Confirm), Delivery Service (Tracking)
- Data stores: Redis (Cache, Track), PostgreSQL (Orders, Users), Elasticsearch (Restaurant Search)
- Real-time and external: WebSockets (Real-Time), Stripe (Payments)
- Messaging: Kafka (Events), SQS (Fallback)
- Authentication: PostgreSQL + JWT – DoorDash’s secure login.
- Ordering Food: Redis + PostgreSQL Two-Step – Uber Eats’ order flow.
- Tracking Delivery: Redis + WebSockets – DoorDash’s real-time updates.
- Browsing Restaurants: Elasticsearch + Redis – Grubhub’s fast discovery.
- Consistency: PostgreSQL (ACID) – Deliveroo’s duplicate prevention.
- Scalability: Kafka + Redis – Foodpanda’s surge handling.