Ride Sharing
Functional Requirements:
- Users can request a ride (specify pickup/drop-off).
- System can match drivers to ride requests in real time.
- Users can track rides (driver location, ETA).
Non-Functional Requirements (Brief):
- Consistency: No double booking of drivers.
- Scalability: Handle 10M DAU and rush-hour surge peaks (~36× the average request rate).
- Low Latency: Request <500ms, tracking <1s.
- Durability: No data loss over 5 years.
Capacity Estimation (5 years):
- DAU: 10M users.
- Rides: 5M/day * 365 * 5 = 9.125B rides.
- Drivers: 1M active drivers.
Storage:
- Rides: 9.125B * 1KB = 9.125TB.
- Users/Drivers: 10M * 500B + 1M * 500B = 5.5GB.
- Total: ~10TB raw; ~100TB with replication, indexes, and trip-update history.
QPS:
- Avg: 5M rides/day ÷ 86,400s ≈ 58 QPS.
- Peak (rush-hour surge): assume the day's 5M requests compress into a ~40-minute window: 5M ÷ 2,400s ≈ 2,083 QPS (~36× average).
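A quick sanity check of these figures as a throwaway Go sketch; the inputs are exactly the assumptions stated above:

```go
package main

import "fmt"

// Back-of-envelope check of the capacity numbers in this section.
func main() {
	const (
		ridesPerDay   = 5_000_000
		years         = 5
		rideRowBytes  = 1_000 // ~1KB per ride record
		secondsPerDay = 86_400
		surgeWindowS  = 2_400 // assume the day's load lands in a ~40-minute window
	)

	totalRides := float64(ridesPerDay) * 365 * years
	fmt.Printf("rides over 5y:  %.3fB\n", totalRides/1e9)               // 9.125B
	fmt.Printf("ride storage:   %.2fTB\n", totalRides*rideRowBytes/1e12) // ~9.1TB
	fmt.Printf("avg QPS:        %.0f\n", float64(ridesPerDay)/secondsPerDay) // ~58
	fmt.Printf("peak QPS:       %.0f\n", float64(ridesPerDay)/surgeWindowS)  // ~2,083
}
```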
Core Entities:
- User: {id, name, location (lat/lon), payment_method}
- Driver: {id, name, location (lat/lon), status (available/busy), vehicle}
- Ride: {id, user_id, driver_id, pickup (lat/lon), dropoff (lat/lon), status (requested/matched/en_route/completed), idempotency_key, timestamp}
- TripUpdate: {id, ride_id, driver_location (lat/lon), eta, timestamp}
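A minimal Go sketch of these entities; field names are illustrative, not a fixed schema:

```go
package model

import "time"

type LatLon struct {
	Lat, Lon float64
}

type User struct {
	ID            string
	Name          string
	Location      LatLon
	PaymentMethod string
}

type Driver struct {
	ID       string
	Name     string
	Location LatLon
	Status   string // "available" | "busy"
	Vehicle  string
}

type Ride struct {
	ID             string
	UserID         string
	DriverID       string // empty until matched
	Pickup         LatLon
	Dropoff        LatLon
	Status         string // requested -> matched -> en_route -> completed
	IdempotencyKey string
	CreatedAt      time.Time
}

type TripUpdate struct {
	ID             string
	RideID         string
	DriverLocation LatLon
	ETA            time.Duration
	Timestamp      time.Time
}
```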
API:
- POST /rides/request: {pickup_lat, pickup_lon, dropoff_lat, dropoff_lon, idempotency_key}, requests a ride.
- GET /rides/{ride_id}/track: Returns driver location/ETA.
- POST /rides/{ride_id}/accept: {driver_id}, driver accepts ride.
High-Level Components:
- Client: Mobile app (rider/driver).
- API Gateway: Routing, auth (JWT), rate limiting.
Microservices:
- Ride Service: Manages ride requests/acceptance.
- Matching Service: Pairs riders with drivers.
- Tracking Service: Updates ride status/location.
Data Stores:
- PostgreSQL: Rides, users (ACID for consistency).
- Redis: Driver availability, real-time tracking.
- Elasticsearch: Geospatial driver search.
- External: Stripe for payments.
Flow:
- Request → Ride Service → Redis → PostgreSQL.
- Match → Matching Service → Elasticsearch → PostgreSQL.
- Track → Tracking Service → Redis → PostgreSQL.
Why PostgreSQL?
PostgreSQL provides ACID transactions for ride assignment: row-level locking (SELECT ... FOR UPDATE) plus MVCC prevents double booking, and sharding scales it to the ~2K QPS peak.
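A sketch of that locking pattern with Go's database/sql; table and column names are assumed from the entity model above:

```go
package rides

import (
	"context"
	"database/sql"
	"fmt"
)

// claimDriver assigns a driver to a ride atomically. FOR UPDATE locks the
// driver row for the life of the transaction, so two concurrent matches
// serialize: the loser sees status='busy' and backs off.
func claimDriver(ctx context.Context, db *sql.DB, rideID, driverID string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	var status string
	if err := tx.QueryRowContext(ctx,
		`SELECT status FROM drivers WHERE id = $1 FOR UPDATE`,
		driverID).Scan(&status); err != nil {
		return err
	}
	if status != "available" {
		return fmt.Errorf("driver %s is already booked", driverID)
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE drivers SET status = 'busy' WHERE id = $1`, driverID); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE rides SET driver_id = $1, status = 'matched' WHERE id = $2`,
		driverID, rideID); err != nil {
		return err
	}
	return tx.Commit()
}
```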
1. Requesting a Ride (Ride Creation)
- Problem: Enable reliable ride requests at 2,083 QPS peak without duplicates.
Approaches & Tradeoffs:
Direct Request
- How: POST /rides/request → synchronous INSERT INTO rides (user_id, pickup, dropoff, status) VALUES (..., 'requested'); notify nearby drivers.
- Pros: Simple (~100ms), ACID-safe.
- Cons: No idempotency, duplicates on retries, slow driver matching (~200ms).
- Use Case: Low-scale systems (<100 QPS).
Idempotent Request
- How: Check idempotency_key in Redis first; then INSERT INTO rides (..., status, idempotency_key) VALUES (..., 'requested', {key}) ON CONFLICT (idempotency_key) DO NOTHING;
- Pros: Prevents duplicates (~5ms with Redis), scalable.
- Cons: A Redis outage only degrades to the slower DB constraint check (the UNIQUE key still blocks duplicates); no async decoupling.
- Use Case: Medium-scale apps (e.g., Lyft).
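A sketch of the idempotent write, assuming a UNIQUE constraint on idempotency_key and the Ride struct from the entity sketch (the Redis fast path appears in the hybrid sketch below):

```go
package rides

import (
	"context"
	"database/sql"
)

// createRideIdempotent inserts a ride exactly once per idempotency key.
// A retry with the same key hits the UNIQUE constraint, inserts nothing,
// and we return the existing ride's id instead of creating a duplicate.
func createRideIdempotent(ctx context.Context, db *sql.DB, r Ride) (string, error) {
	var id string
	err := db.QueryRowContext(ctx, `
		INSERT INTO rides (user_id, pickup, dropoff, status, idempotency_key)
		VALUES ($1, point($2,$3), point($4,$5), 'requested', $6)
		ON CONFLICT (idempotency_key) DO NOTHING
		RETURNING id`,
		r.UserID, r.Pickup.Lon, r.Pickup.Lat, r.Dropoff.Lon, r.Dropoff.Lat,
		r.IdempotencyKey).Scan(&id)
	if err == sql.ErrNoRows {
		// Conflict path: RETURNING yields no row, so fetch the original.
		err = db.QueryRowContext(ctx,
			`SELECT id FROM rides WHERE idempotency_key = $1`,
			r.IdempotencyKey).Scan(&id)
	}
	return id, err
}
```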
Event-Driven Request
- How: Write request to Kafka, Ride Service consumes, inserts into PostgreSQL with UNIQUE (idempotency_key).
- Pros: Decouples request intake from DB writes, scales to 10K+ QPS, retryable from the log.
- Cons: Higher end-to-end latency (~500ms), eventual consistency between accept and visibility.
- Use Case: High-throughput systems with latency tolerance.
Optimistic Request
- How: Add a version column to rides; writers use compare-and-swap (UPDATE rides SET ... WHERE id = {id} AND version = {old_version}) and retry when zero rows are affected.
- Pros: High concurrency (~2ms), no locks.
- Cons: Retries on conflict (10-20% at peak), complex client logic.
- Use Case: Low-conflict systems.
Hybrid (Redis + PostgreSQL + Kafka)
- How: Check idempotency_key in Redis (TTL=5min), write intent to Kafka, process with INSERT ON CONFLICT DO NOTHING in PostgreSQL, notify via Kafka.
- Pros: Real-time (<500ms), idempotent, scales to 2K QPS, durable.
- Cons: Multi-system complexity, Redis failure risks (mitigated by PostgreSQL).
- Use Case: Uber-scale ride-sharing.
Recommendation:
- Industry Example (Uber): Uses idempotency keys with a distributed cache (e.g., Redis) and a relational DB for ride requests, ensuring no duplicates.
- Optimal Solution: Hybrid—Redis for idempotency (<1ms), Kafka for decoupling, PostgreSQL with UNIQUE (idempotency_key) for consistency (~5ms).
- Why Optimal: Balances speed (Redis: 100K ops/s), scalability (Kafka: 10K msg/s), and safety (PostgreSQL: 10 nodes, 200 QPS/node), meets <500ms latency.
- Tradeoffs: Adds infra complexity (mitigated by retries), slight latency overhead (5ms vs. 2ms for optimistic).
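A sketch of that hybrid hot path using go-redis and segmentio/kafka-go; topic and key names are illustrative, Ride is the struct from the entity sketch, and the Kafka consumer (not shown) would run the ON CONFLICT insert from the idempotent sketch above:

```go
package rides

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
	"github.com/segmentio/kafka-go"
)

// RequestRide is the hot path: dedupe in Redis, enqueue to Kafka, return.
// Redis loss degrades to the slower DB UNIQUE check, never to a duplicate.
func RequestRide(ctx context.Context, rdb *redis.Client, w *kafka.Writer, r Ride) error {
	// SET NX with TTL: the first writer wins; retries within 5 min stop here.
	ok, err := rdb.SetNX(ctx, "idem:"+r.IdempotencyKey, 1, 5*time.Minute).Result()
	if err == nil && !ok {
		return nil // duplicate retry; the original request is already in flight
	}
	// On a Redis error we fall through: the DB constraint still protects us.

	payload, err := json.Marshal(r)
	if err != nil {
		return err
	}
	return w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(r.IdempotencyKey), // same key -> same partition -> ordered
		Value: payload,
	})
}
```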
2. Matching Drivers (Driver Assignment)
- Problem: Match riders to available drivers in real time at 2,083 QPS.
Approaches & Tradeoffs:
Simple Nearest Driver
- How: SELECT * FROM drivers WHERE status='available' ORDER BY ST_Distance(location, ST_MakePoint({pickup_lon}, {pickup_lat})) LIMIT 1; assign ride.
- Pros: Simple (~100ms), no extra infra.
- Cons: Slow at scale (O(n) scan), ~1s for 1M drivers, no load balancing.
- Use Case: Tiny fleets (<1K drivers).
Geohash (PostGIS)
- How: Store geohash in PostgreSQL, SELECT * FROM drivers WHERE geohash LIKE '{pickup_prefix}%' AND status='available' LIMIT 1;
- Pros: Fast (~50ms), geospatial-aware, compact index.
- Cons: Cell-boundary edge cases (a driver just across a cell border is missed unless neighbor prefixes are also queried); scales to ~1K QPS.
- Use Case: Medium-scale systems.
Elasticsearch Geospatial
- How: Index driver locations; query /drivers/_search with a bool filter (term status='available' plus geo_distance within 5km of pickup), sorted by _geo_distance.
- Pros: Ultra-fast (~20ms), scales to 2K QPS, flexible filtering.
- Cons: Sync complexity, higher storage (~2x PostgreSQL).
- Use Case: Large-scale ride-sharing (e.g., Uber).
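The query body, sketched in Go as a plain map so it stays client-library agnostic (the drivers index and location geo_point field are assumed mappings); marshal it with encoding/json and POST to /drivers/_search:

```go
package matching

// nearbyDriversQuery builds the Elasticsearch body: filter to available
// drivers within 5km of pickup, nearest first.
func nearbyDriversQuery(lat, lon float64) map[string]any {
	pickup := map[string]float64{"lat": lat, "lon": lon}
	return map[string]any{
		"size": 10,
		"query": map[string]any{
			"bool": map[string]any{
				"filter": []any{
					map[string]any{"term": map[string]any{"status": "available"}},
					map[string]any{"geo_distance": map[string]any{
						"distance": "5km",
						"location": pickup,
					}},
				},
			},
		},
		"sort": []any{
			map[string]any{"_geo_distance": map[string]any{
				"location": pickup,
				"order":    "asc",
				"unit":     "m",
			}},
		},
	}
}
```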
Redis Sorted Set
- How: Keep drivers in a Redis geo set (GEOADD drivers:available {lon} {lat} {driver_id}; geo keys are sorted sets under the hood, with geohash-derived scores), fetch nearest with GEOSEARCH ... FROMLONLAT {lon} {lat} BYRADIUS 5 km ASC.
- Pros: Real-time (<1ms), 100K QPS scalable.
- Cons: Volatile unless AOF/RDB persistence is enabled, very high update volume from location pings (~1M writes/s).
- Use Case: High-frequency matching.
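A sketch with go-redis; the key name drivers:available is an assumption:

```go
package matching

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// updateDriverLocation is called on every location ping.
func updateDriverLocation(ctx context.Context, rdb *redis.Client, driverID string, lat, lon float64) error {
	return rdb.GeoAdd(ctx, "drivers:available", &redis.GeoLocation{
		Name: driverID, Latitude: lat, Longitude: lon,
	}).Err()
}

// nearestDrivers returns up to 10 available drivers within 5km of pickup,
// closest first.
func nearestDrivers(ctx context.Context, rdb *redis.Client, lat, lon float64) ([]string, error) {
	return rdb.GeoSearch(ctx, "drivers:available", &redis.GeoSearchQuery{
		Latitude:   lat,
		Longitude:  lon,
		Radius:     5,
		RadiusUnit: "km",
		Sort:       "ASC",
		Count:      10,
	}).Result()
}
```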
Hybrid (Elasticsearch + Redis)
- How: Elasticsearch for initial geospatial search (~20ms), Redis for real-time availability/status (TTL=30s), assign via PostgreSQL (FOR UPDATE).
- Pros: Real-time (<50ms), durable, scales to 2K QPS.
- Cons: Multi-system sync, Redis failure risks (mitigated by PostgreSQL).
- Use Case: Uber-scale dynamic matching.
Recommendation:
- Industry Example (Uber): Uses a hybrid approach with cell-based geospatial indexing (e.g., Google S2 or Uber's own H3 hex grid) and in-memory caching for driver matching, optimized for speed and scale.
- Optimal Solution: Hybrid—Elasticsearch for geospatial search (<20ms), Redis for availability (<1ms), PostgreSQL with FOR UPDATE for assignment (~5ms).
- Why Optimal: Meets <500ms latency, handles 2K QPS (Elasticsearch: 10 shards, 200 QPS/shard; Redis: 100K ops/s), ensures consistency.
- Tradeoffs: Adds Elasticsearch/Redis infra (mitigated by CDC sync), slight precision trade-off vs. cell-based indexes like S2/H3.
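A glue sketch of that pipeline; searchNearby and isAvailable are hypothetical wrappers over the Elasticsearch and Redis sketches above, and claimDriver is the FOR UPDATE function from the "Why PostgreSQL?" sketch:

```go
package matching

import (
	"context"
	"database/sql"
	"errors"
)

// Deps bundles the three stores; the function fields are hypothetical
// wrappers, here only to keep the sketch self-contained.
type Deps struct {
	db           *sql.DB
	searchNearby func(ctx context.Context, lat, lon float64) ([]string, error)
	isAvailable  func(ctx context.Context, driverID string) bool
}

// MatchDriver walks the nearest candidates and claims the first one that is
// marked available in Redis and still unclaimed in PostgreSQL.
func MatchDriver(ctx context.Context, d Deps, rideID string, lat, lon float64) (string, error) {
	candidates, err := d.searchNearby(ctx, lat, lon) // Elasticsearch, ~20ms
	if err != nil {
		return "", err
	}
	for _, driverID := range candidates {
		if !d.isAvailable(ctx, driverID) { // Redis TTL'd status, <1ms
			continue
		}
		// PostgreSQL is the source of truth; losing the FOR UPDATE race
		// just means trying the next candidate.
		if err := claimDriver(ctx, d.db, rideID, driverID); err == nil {
			return driverID, nil
		}
	}
	return "", errors.New("no available driver within radius")
}
```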
3. Tracking Rides (Real-Time Updates)
- Problem: Provide real-time ride tracking (driver location/ETA) at 2,083 QPS with <1s latency.
Approaches & Tradeoffs:
Polling
- How: Client polls GET /rides/{ride_id}/track every 5s; SELECT * FROM trip_updates WHERE ride_id={ride_id};
- Pros: Simple (~100ms), no infra.
- Cons: Read volume explodes with concurrency (each active ride polls 12×/min, so ~100K concurrent rides ≈ 20K QPS); updates lag up to 5s.
- Use Case: Low-scale systems.
Long Polling
- How: Client sends GET /rides/{ride_id}/track, server holds (30s), responds on update.
- Pros: Far fewer requests (one held connection per ride per 30s window: 2K concurrent rides ≈ 66 QPS), near-real-time (~1s).
- Cons: Each held request ties up server resources; ~30s timeouts need re-issue logic on the client.
- Use Case: Medium-scale apps.
Server-Sent Events (SSE)
- How: Open SSE connection; server pushes ride:location updates via Kafka events.
- Pros: Real-time (<1s), efficient (2K QPS sustainable).
- Cons: Large numbers of long-lived connections per server to manage, one-way only (client can't query over the same channel), infra cost.
- Use Case: High-traffic tracking (e.g., Lyft).
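A minimal SSE handler in Go's standard library; the updates channel stands in for a Kafka consumer feeding pre-serialized location events:

```go
package tracking

import (
	"fmt"
	"net/http"
	"time"
)

// trackSSE streams location updates for one ride over Server-Sent Events.
func trackSSE(updates <-chan string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		for {
			select {
			case <-r.Context().Done():
				return // client disconnected
			case loc := <-updates:
				fmt.Fprintf(w, "event: ride:location\ndata: %s\n\n", loc)
				flusher.Flush()
			case <-time.After(15 * time.Second):
				fmt.Fprint(w, ": keep-alive\n\n") // comment frame keeps proxies from closing us
				flusher.Flush()
			}
		}
	}
}
```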
WebSockets
- How: Bidirectional connection; server pushes location/ETA, client queries ETA.
- Pros: Real-time (<1s), interactive, 2K QPS scalable.
- Cons: Heavier per-connection resource use than SSE, stateful connections complicate load balancing.
- Use Case: Premium tracking (e.g., Uber).
Hybrid (Redis + WebSockets)
- How: Redis caches driver location (ride_id:lat,lon, TTL=1min), WebSockets push updates, PostgreSQL persists history.
- Pros: Ultra-fast (<1ms cache), real-time (<1s), durable, 2K QPS.
- Cons: Redis volatility (mitigated by PostgreSQL), dual-system sync.
- Use Case: Scalable, real-time tracking.
Recommendation:
- Industry Example (Uber): Uses WebSockets with in-memory caching (e.g., Redis) for live driver tracking, ensuring <1s updates with ETA calculations.
- Optimal Solution: Hybrid—Redis for real-time location caching (<1ms), WebSockets for push updates (<1s), PostgreSQL for persistence.
- Why Optimal: Meets <1s latency, scales to 2K QPS (Redis: 100K ops/s), durable with PostgreSQL (10 nodes, 200 QPS/node).
- Tradeoffs: Adds Redis/WebSocket infra (mitigated by load balancers), cache staleness tolerable for UX.
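A sketch of that hybrid with go-redis and gorilla/websocket; key names and the 1s push cadence are assumptions, and a production version would subscribe to updates rather than poll the cache:

```go
package tracking

import (
	"context"
	"net/http"
	"time"

	"github.com/gorilla/websocket"
	"github.com/redis/go-redis/v9"
)

var upgrader = websocket.Upgrader{} // default options; tighten CheckOrigin in production

// OnDriverPing caches the latest location with a short TTL; persisting
// trip_updates to PostgreSQL happens asynchronously (not shown).
func OnDriverPing(ctx context.Context, rdb *redis.Client, rideID, latLonJSON string) error {
	return rdb.Set(ctx, "ride:loc:"+rideID, latLonJSON, time.Minute).Err()
}

// TrackWS upgrades to a WebSocket and pushes the cached location every second.
func TrackWS(rdb *redis.Client, rideID string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		conn, err := upgrader.Upgrade(w, r, nil)
		if err != nil {
			return
		}
		defer conn.Close()

		tick := time.NewTicker(time.Second)
		defer tick.Stop()
		for range tick.C {
			loc, err := rdb.Get(r.Context(), "ride:loc:"+rideID).Result()
			if err != nil {
				continue // cache miss: no ping yet, or TTL expired
			}
			if err := conn.WriteMessage(websocket.TextMessage, []byte(loc)); err != nil {
				return // client gone
			}
		}
	}
}
```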
High-Level Architecture:

```text
+------------------+      +------------------+
|      Client      |<---->|   API Gateway    |
|  (Rider/Driver)  |      |  (JWT, Routing)  |
+------------------+      +------------------+
                                   |
        +--------------------------+--------------------------+
        |                          |                          |
+------------------+     +-------------------+     +-------------------+
|   Ride Service   |     | Matching Service  |     | Tracking Service  |
| (Request/Accept) |     | (Driver Matching) |     |  (Ride Updates)   |
+------------------+     +-------------------+     +-------------------+
        |                          |                          |
+------------------+     +-------------------+     +-------------------+
|    PostgreSQL    |     |  Elasticsearch    |     |       Redis       |
|  (Rides, Users)  |     |  (Driver Search)  |     |  (Cache, Track)   |
+------------------+     +-------------------+     +-------------------+

+------------------+     +-------------------+     +-------------------+
|  Kafka (Events)  |     |    WebSockets     |     | Stripe (Payments) |
|  SQS (Fallback)  |     | (Real-Time Push)  |     |    (External)     |
+------------------+     +-------------------+     +-------------------+
```
Summary:
- Authentication: PostgreSQL + JWT – Uber’s secure login.
- Requesting a Ride: Redis + PostgreSQL + Kafka – Uber’s ride creation.
- Matching Drivers: Elasticsearch + Redis – Uber’s geospatial matching.
- Tracking Rides: Redis + WebSockets – Lyft’s real-time updates.
- Consistency: PostgreSQL (ACID) – Didi’s double-booking prevention.
- Scalability: Kafka + Redis – Grab’s surge handling.