Ride Sharing
Functional Requirements:
- Users can request a ride (specify pickup/drop-off).
- System can match drivers to ride requests in real time.
- Users can track rides (driver location, ETA).
Non-Functional Requirements (Brief):
- Consistency: No double booking of drivers.
- Scalability: Handle 10M DAU and rush-hour surge peaks (~36× the average request rate).
- Low Latency: Request <500ms, tracking <1s.
- Durability: No data loss over 5 years.
Capacity Estimation (5 years):
- DAU: 10M users.
- Rides: 5M/day * 365 * 5 = 9.125B rides.
- Drivers: 1M active drivers.
Storage:
- Rides: 9.125B * 1KB = 9.125TB.
- Users/Drivers: 10M * 500B + 1M * 500B = 5.5GB.
- Total: ~10TB raw; ~100TB with replication, indexes, and trip-update history.
QPS:
- Avg: 5M rides/day ÷ 86,400s ≈ 58 QPS.
- Peak (rush-hour surge): assume the day's 5M requests compress into a ~40-minute window: 5M ÷ 2,400s ≈ 2,083 QPS (~36× average).
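A quick sanity check of these figures as a throwaway Go sketch; the inputs are exactly the assumptions stated above:

```go
package main

import "fmt"

// Back-of-envelope check of the capacity numbers in this section.
func main() {
	const (
		ridesPerDay   = 5_000_000
		years         = 5
		rideRowBytes  = 1_000 // ~1KB per ride record
		secondsPerDay = 86_400
		surgeWindowS  = 2_400 // assume the day's load lands in a ~40-minute window
	)

	totalRides := float64(ridesPerDay) * 365 * years
	fmt.Printf("rides over 5y:  %.3fB\n", totalRides/1e9)               // 9.125B
	fmt.Printf("ride storage:   %.2fTB\n", totalRides*rideRowBytes/1e12) // ~9.1TB
	fmt.Printf("avg QPS:        %.0f\n", float64(ridesPerDay)/secondsPerDay) // ~58
	fmt.Printf("peak QPS:       %.0f\n", float64(ridesPerDay)/surgeWindowS)  // ~2,083
}
```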
Core Entities:
- User: {id, name, location (lat/lon), payment_method}
- Driver: {id, name, location (lat/lon), status (available/busy), vehicle}
- Ride: {id, user_id, driver_id, pickup (lat/lon), dropoff (lat/lon), status (requested/matched/en_route/completed), idempotency_key, timestamp}
- TripUpdate: {id, ride_id, driver_location (lat/lon), eta, timestamp}
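A minimal Go sketch of these entities; field names are illustrative, not a fixed schema:

```go
package model

import "time"

type LatLon struct {
	Lat, Lon float64
}

type User struct {
	ID            string
	Name          string
	Location      LatLon
	PaymentMethod string
}

type Driver struct {
	ID       string
	Name     string
	Location LatLon
	Status   string // "available" | "busy"
	Vehicle  string
}

type Ride struct {
	ID             string
	UserID         string
	DriverID       string // empty until matched
	Pickup         LatLon
	Dropoff        LatLon
	Status         string // requested -> matched -> en_route -> completed
	IdempotencyKey string
	CreatedAt      time.Time
}

type TripUpdate struct {
	ID             string
	RideID         string
	DriverLocation LatLon
	ETA            time.Duration
	Timestamp      time.Time
}
```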
API:
- POST /rides/request: {pickup_lat, pickup_lon, dropoff_lat, dropoff_lon, idempotency_key}, requests a ride.
- GET /rides/{ride_id}/track: Returns driver location/ETA.
- POST /rides/{ride_id}/accept: {driver_id}, driver accepts ride.
High-Level Components:
- Client: Mobile app (rider/driver).
- API Gateway: Routing, auth (JWT), rate limiting.
Microservices:
- Ride Service: Manages ride requests/acceptance.
- Matching Service: Pairs riders with drivers.
- Tracking Service: Updates ride status/location.
Data Stores:
- PostgreSQL: Rides, users (ACID for consistency).
- Redis: Driver availability, real-time tracking.
- Elasticsearch: Geospatial driver search.
- External: Stripe for payments.
Flow:
- Request → Ride Service → Redis → PostgreSQL.
- Match → Matching Service → Elasticsearch → PostgreSQL.
- Track → Tracking Service → Redis → PostgreSQL.
Why PostgreSQL?
PostgreSQL provides ACID transactions for ride assignment: row-level locking (SELECT ... FOR UPDATE) plus MVCC prevents double booking, and sharding scales it to the ~2K QPS peak.
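A sketch of that locking pattern with Go's database/sql; table and column names are assumed from the entity model above:

```go
package rides

import (
	"context"
	"database/sql"
	"fmt"
)

// claimDriver assigns a driver to a ride atomically. FOR UPDATE locks the
// driver row for the life of the transaction, so two concurrent matches
// serialize: the loser sees status='busy' and backs off.
func claimDriver(ctx context.Context, db *sql.DB, rideID, driverID string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	var status string
	if err := tx.QueryRowContext(ctx,
		`SELECT status FROM drivers WHERE id = $1 FOR UPDATE`,
		driverID).Scan(&status); err != nil {
		return err
	}
	if status != "available" {
		return fmt.Errorf("driver %s is already booked", driverID)
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE drivers SET status = 'busy' WHERE id = $1`, driverID); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE rides SET driver_id = $1, status = 'matched' WHERE id = $2`,
		driverID, rideID); err != nil {
		return err
	}
	return tx.Commit()
}
```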
1. Requesting a Ride (Ride Creation)
- Problem: Enable reliable ride requests at 2,083 QPS peak without duplicates.
Approaches & Tradeoffs:
Direct Request
- How: POST /rides/request → synchronous INSERT INTO rides (user_id, pickup, dropoff, status) VALUES (..., 'requested'); notify nearby drivers.
- Pros: Simple (~100ms), ACID-safe.
- Cons: No idempotency, duplicates on retries, slow driver matching (~200ms).
- Use Case: Low-scale systems (<100 QPS).
Idempotent Request
- How: Check idempotency_key in Redis first; then INSERT INTO rides (..., status, idempotency_key) VALUES (..., 'requested', {key}) ON CONFLICT (idempotency_key) DO NOTHING;
- Pros: Prevents duplicates (~5ms with Redis), scalable.
- Cons: A Redis outage only degrades to the slower DB constraint check (the UNIQUE key still blocks duplicates); no async decoupling.
- Use Case: Medium-scale apps (e.g., Lyft).
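A sketch of the idempotent write, assuming a UNIQUE constraint on idempotency_key and the Ride struct from the entity sketch (the Redis fast path appears in the hybrid sketch below):

```go
package rides

import (
	"context"
	"database/sql"
)

// createRideIdempotent inserts a ride exactly once per idempotency key.
// A retry with the same key hits the UNIQUE constraint, inserts nothing,
// and we return the existing ride's id instead of creating a duplicate.
func createRideIdempotent(ctx context.Context, db *sql.DB, r Ride) (string, error) {
	var id string
	err := db.QueryRowContext(ctx, `
		INSERT INTO rides (user_id, pickup, dropoff, status, idempotency_key)
		VALUES ($1, point($2,$3), point($4,$5), 'requested', $6)
		ON CONFLICT (idempotency_key) DO NOTHING
		RETURNING id`,
		r.UserID, r.Pickup.Lon, r.Pickup.Lat, r.Dropoff.Lon, r.Dropoff.Lat,
		r.IdempotencyKey).Scan(&id)
	if err == sql.ErrNoRows {
		// Conflict path: RETURNING yields no row, so fetch the original.
		err = db.QueryRowContext(ctx,
			`SELECT id FROM rides WHERE idempotency_key = $1`,
			r.IdempotencyKey).Scan(&id)
	}
	return id, err
}
```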
Event-Driven Request
- How: Write request to Kafka, Ride Service consumes, inserts into PostgreSQL with UNIQUE (idempotency_key).
- Pros: Decouples request intake from DB writes, scales to 10K+ QPS, retryable from the log.
- Cons: Higher end-to-end latency (~500ms), eventual consistency between accept and visibility.
- Use Case: High-throughput systems with latency tolerance.
Optimistic Request
- How: Add a version column to rides; writers use compare-and-swap (UPDATE rides SET ... WHERE id = {id} AND version = {old_version}) and retry when zero rows are affected.
- Pros: High concurrency (~2ms), no locks.
- Cons: Retries on conflict (10-20% at peak), complex client logic.
- Use Case: Low-conflict systems.
Hybrid (Redis + PostgreSQL + Kafka)
- How: Check idempotency_key in Redis (TTL=5min), write intent to Kafka, process with INSERT ON CONFLICT DO NOTHING in PostgreSQL, notify via Kafka.
- Pros: Real-time (<500ms), idempotent, scales to 2K QPS, durable.
- Cons: Multi-system complexity, Redis failure risks (mitigated by PostgreSQL).
- Use Case: Uber-scale ride-sharing.
Recommendation:
- Industry Example (Uber): Uses idempotency keys with a distributed cache (e.g., Redis) and a relational DB for ride requests, ensuring no duplicates.
- Optimal Solution: Hybrid—Redis for idempotency (<1ms), Kafka for decoupling, PostgreSQL with UNIQUE (idempotency_key) for consistency (~5ms).
- Why Optimal: Balances speed (Redis: 100K ops/s), scalability (Kafka: 10K msg/s), and safety (PostgreSQL: 10 nodes, 200 QPS/node), meets <500ms latency.
- Tradeoffs: Adds infra complexity (mitigated by retries), slight latency overhead (5ms vs. 2ms for optimistic).
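A sketch of that hybrid hot path using go-redis and segmentio/kafka-go; topic and key names are illustrative, Ride is the struct from the entity sketch, and the Kafka consumer (not shown) would run the ON CONFLICT insert from the idempotent sketch above:

```go
package rides

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
	"github.com/segmentio/kafka-go"
)

// RequestRide is the hot path: dedupe in Redis, enqueue to Kafka, return.
// Redis loss degrades to the slower DB UNIQUE check, never to a duplicate.
func RequestRide(ctx context.Context, rdb *redis.Client, w *kafka.Writer, r Ride) error {
	// SET NX with TTL: the first writer wins; retries within 5 min stop here.
	ok, err := rdb.SetNX(ctx, "idem:"+r.IdempotencyKey, 1, 5*time.Minute).Result()
	if err == nil && !ok {
		return nil // duplicate retry; the original request is already in flight
	}
	// On a Redis error we fall through: the DB constraint still protects us.

	payload, err := json.Marshal(r)
	if err != nil {
		return err
	}
	return w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(r.IdempotencyKey), // same key -> same partition -> ordered
		Value: payload,
	})
}
```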
2. Matching Drivers (Driver Assignment)
- Problem: Match riders to available drivers in real time at 2,083 QPS.
Approaches & Tradeoffs:
Simple Nearest Driver
- How: SELECT * FROM drivers WHERE status='available' ORDER BY ST_Distance(location, ST_MakePoint({pickup_lon}, {pickup_lat})) LIMIT 1; assign ride.
- Pros: Simple (~100ms), no extra infra.
- Cons: Slow at scale (O(n) scan), ~1s for 1M drivers, no load balancing.
- Use Case: Tiny fleets (<1K drivers).
Geohash (PostGIS)
- How: Store geohash in PostgreSQL, SELECT * FROM drivers WHERE geohash LIKE '{pickup_prefix}%' AND status='available' LIMIT 1;
- Pros: Fast (~50ms), geospatial-aware, compact index.
- Cons: Cell-boundary edge cases (a driver just across a cell border is missed unless neighbor prefixes are also queried); scales to ~1K QPS.
- Use Case: Medium-scale systems.
Elasticsearch Geospatial
- How: Index driver locations; query /drivers/_search with a bool filter (term status='available' plus geo_distance within 5km of pickup), sorted by _geo_distance.
- Pros: Ultra-fast (~20ms), scales to 2K QPS, flexible filtering.
- Cons: Sync complexity, higher storage (~2x PostgreSQL).
- Use Case: Large-scale ride-sharing (e.g., Uber).
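The query body, sketched in Go as a plain map so it stays client-library agnostic (the drivers index and location geo_point field are assumed mappings); marshal it with encoding/json and POST to /drivers/_search:

```go
package matching

// nearbyDriversQuery builds the Elasticsearch body: filter to available
// drivers within 5km of pickup, nearest first.
func nearbyDriversQuery(lat, lon float64) map[string]any {
	pickup := map[string]float64{"lat": lat, "lon": lon}
	return map[string]any{
		"size": 10,
		"query": map[string]any{
			"bool": map[string]any{
				"filter": []any{
					map[string]any{"term": map[string]any{"status": "available"}},
					map[string]any{"geo_distance": map[string]any{
						"distance": "5km",
						"location": pickup,
					}},
				},
			},
		},
		"sort": []any{
			map[string]any{"_geo_distance": map[string]any{
				"location": pickup,
				"order":    "asc",
				"unit":     "m",
			}},
		},
	}
}
```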
Redis Sorted Set
- How: Keep drivers in a Redis geo set (GEOADD drivers:available {lon} {lat} {driver_id}; geo keys are sorted sets under the hood, with geohash-derived scores), fetch nearest with GEOSEARCH ... FROMLONLAT {lon} {lat} BYRADIUS 5 km ASC.
- Pros: Real-time (<1ms), 100K QPS scalable.
- Cons: Volatile unless AOF/RDB persistence is enabled, very high update volume from location pings (~1M writes/s).
- Use Case: High-frequency matching.
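A sketch with go-redis; the key name drivers:available is an assumption:

```go
package matching

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// updateDriverLocation is called on every location ping.
func updateDriverLocation(ctx context.Context, rdb *redis.Client, driverID string, lat, lon float64) error {
	return rdb.GeoAdd(ctx, "drivers:available", &redis.GeoLocation{
		Name: driverID, Latitude: lat, Longitude: lon,
	}).Err()
}

// nearestDrivers returns up to 10 available drivers within 5km of pickup,
// closest first.
func nearestDrivers(ctx context.Context, rdb *redis.Client, lat, lon float64) ([]string, error) {
	return rdb.GeoSearch(ctx, "drivers:available", &redis.GeoSearchQuery{
		Latitude:   lat,
		Longitude:  lon,
		Radius:     5,
		RadiusUnit: "km",
		Sort:       "ASC",
		Count:      10,
	}).Result()
}
```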
Hybrid (Elasticsearch + Redis)
- How: Elasticsearch for initial geospatial search (~20ms), Redis for real-time availability/status (TTL=30s), assign via PostgreSQL (FOR UPDATE).
- Pros: Real-time (<50ms), durable, scales to 2K QPS.
- Cons: Multi-system sync, Redis failure risks (mitigated by PostgreSQL).
- Use Case: Uber-scale dynamic matching.
Recommendation:
- Industry Example (Uber): Uses a hybrid approach with cell-based geospatial indexing (e.g., Google S2 or Uber's own H3 hex grid) and in-memory caching for driver matching, optimized for speed and scale.
- Optimal Solution: Hybrid—Elasticsearch for geospatial search (<20ms), Redis for availability (<1ms), PostgreSQL with FOR UPDATE for assignment (~5ms).
- Why Optimal: Meets <500ms latency, handles 2K QPS (Elasticsearch: 10 shards, 200 QPS/shard; Redis: 100K ops/s), ensures consistency.
- Tradeoffs: Adds Elasticsearch/Redis infra (mitigated by CDC sync), slight precision trade-off vs. cell-based indexes like S2/H3.
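A glue sketch of that pipeline; searchNearby and isAvailable are hypothetical wrappers over the Elasticsearch and Redis sketches above, and claimDriver is the FOR UPDATE function from the "Why PostgreSQL?" sketch:

```go
package matching

import (
	"context"
	"database/sql"
	"errors"
)

// Deps bundles the three stores; the function fields are hypothetical
// wrappers, here only to keep the sketch self-contained.
type Deps struct {
	db           *sql.DB
	searchNearby func(ctx context.Context, lat, lon float64) ([]string, error)
	isAvailable  func(ctx context.Context, driverID string) bool
}

// MatchDriver walks the nearest candidates and claims the first one that is
// marked available in Redis and still unclaimed in PostgreSQL.
func MatchDriver(ctx context.Context, d Deps, rideID string, lat, lon float64) (string, error) {
	candidates, err := d.searchNearby(ctx, lat, lon) // Elasticsearch, ~20ms
	if err != nil {
		return "", err
	}
	for _, driverID := range candidates {
		if !d.isAvailable(ctx, driverID) { // Redis TTL'd status, <1ms
			continue
		}
		// PostgreSQL is the source of truth; losing the FOR UPDATE race
		// just means trying the next candidate.
		if err := claimDriver(ctx, d.db, rideID, driverID); err == nil {
			return driverID, nil
		}
	}
	return "", errors.New("no available driver within radius")
}
```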
3. Tracking Rides (Real-Time Updates)
- Problem: Provide real-time ride tracking (driver location/ETA) at 2,083 QPS with <1s latency.
Approaches & Tradeoffs:
Polling
- How: Client polls GET /rides/{ride_id}/track every 5s; SELECT * FROM trip_updates WHERE ride_id={ride_id};
- Pros: Simple (~100ms), no infra.
- Cons: Read volume explodes with concurrency (each active ride polls 12×/min, so ~100K concurrent rides ≈ 20K QPS); updates lag up to 5s.
- Use Case: Low-scale systems.
Long Polling
- How: Client sends GET /rides/{ride_id}/track, server holds (30s), responds on update.
- Pros: Far fewer requests (one held connection per ride per 30s window: 2K concurrent rides ≈ 66 QPS), near-real-time (~1s).
- Cons: Each held request ties up server resources; ~30s timeouts need re-issue logic on the client.
- Use Case: Medium-scale apps.
Server-Sent Events (SSE)
- How: Open SSE connection; server pushes ride:location updates via Kafka events.
- Pros: Real-time (<1s), efficient (2K QPS sustainable).
- Cons: Large numbers of long-lived connections per server to manage, one-way only (client can't query over the same channel), infra cost.
- Use Case: High-traffic tracking (e.g., Lyft).
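A minimal SSE handler in Go's standard library; the updates channel stands in for a Kafka consumer feeding pre-serialized location events:

```go
package tracking

import (
	"fmt"
	"net/http"
	"time"
)

// trackSSE streams location updates for one ride over Server-Sent Events.
func trackSSE(updates <-chan string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		for {
			select {
			case <-r.Context().Done():
				return // client disconnected
			case loc := <-updates:
				fmt.Fprintf(w, "event: ride:location\ndata: %s\n\n", loc)
				flusher.Flush()
			case <-time.After(15 * time.Second):
				fmt.Fprint(w, ": keep-alive\n\n") // comment frame keeps proxies from closing us
				flusher.Flush()
			}
		}
	}
}
```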
WebSockets
- How: Bidirectional connection; server pushes location/ETA, client queries ETA.
- Pros: Real-time (<1s), interactive, 2K QPS scalable.
- Cons: Heavier per-connection resource use than SSE, stateful connections complicate load balancing.
- Use Case: Premium tracking (e.g., Uber).
Hybrid (Redis + WebSockets)
- How: Redis caches driver location (ride_id:lat,lon, TTL=1min), WebSockets push updates, PostgreSQL persists history.
- Pros: Ultra-fast (<1ms cache), real-time (<1s), durable, 2K QPS.
- Cons: Redis volatility (mitigated by PostgreSQL), dual-system sync.
- Use Case: Scalable, real-time tracking.
Recommendation:
- Industry Example (Uber): Uses WebSockets with in-memory caching (e.g., Redis) for live driver tracking, ensuring <1s updates with ETA calculations.
- Optimal Solution: Hybrid—Redis for real-time location caching (<1ms), WebSockets for push updates (<1s), PostgreSQL for persistence.
- Why Optimal: Meets <1s latency, scales to 2K QPS (Redis: 100K ops/s), durable with PostgreSQL (10 nodes, 200 QPS/node).
- Tradeoffs: Adds Redis/WebSocket infra (mitigated by load balancers), cache staleness tolerable for UX.
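A sketch of that hybrid with go-redis and gorilla/websocket; key names and the 1s push cadence are assumptions, and a production version would subscribe to updates rather than poll the cache:

```go
package tracking

import (
	"context"
	"net/http"
	"time"

	"github.com/gorilla/websocket"
	"github.com/redis/go-redis/v9"
)

var upgrader = websocket.Upgrader{} // default options; tighten CheckOrigin in production

// OnDriverPing caches the latest location with a short TTL; persisting
// trip_updates to PostgreSQL happens asynchronously (not shown).
func OnDriverPing(ctx context.Context, rdb *redis.Client, rideID, latLonJSON string) error {
	return rdb.Set(ctx, "ride:loc:"+rideID, latLonJSON, time.Minute).Err()
}

// TrackWS upgrades to a WebSocket and pushes the cached location every second.
func TrackWS(rdb *redis.Client, rideID string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		conn, err := upgrader.Upgrade(w, r, nil)
		if err != nil {
			return
		}
		defer conn.Close()

		tick := time.NewTicker(time.Second)
		defer tick.Stop()
		for range tick.C {
			loc, err := rdb.Get(r.Context(), "ride:loc:"+rideID).Result()
			if err != nil {
				continue // cache miss: no ping yet, or TTL expired
			}
			if err := conn.WriteMessage(websocket.TextMessage, []byte(loc)); err != nil {
				return // client gone
			}
		}
	}
}
```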
High-Level Architecture:

```text
+------------------+      +------------------+
|      Client      |<---->|   API Gateway    |
|  (Rider/Driver)  |      |  (JWT, Routing)  |
+------------------+      +------------------+
                                   |
        +--------------------------+--------------------------+
        |                          |                          |
+------------------+     +-------------------+     +-------------------+
|   Ride Service   |     | Matching Service  |     | Tracking Service  |
| (Request/Accept) |     | (Driver Matching) |     |  (Ride Updates)   |
+------------------+     +-------------------+     +-------------------+
        |                          |                          |
+------------------+     +-------------------+     +-------------------+
|    PostgreSQL    |     |  Elasticsearch    |     |       Redis       |
|  (Rides, Users)  |     |  (Driver Search)  |     |  (Cache, Track)   |
+------------------+     +-------------------+     +-------------------+

+------------------+     +-------------------+     +-------------------+
|  Kafka (Events)  |     |    WebSockets     |     | Stripe (Payments) |
|  SQS (Fallback)  |     | (Real-Time Push)  |     |    (External)     |
+------------------+     +-------------------+     +-------------------+
```
Summary:
- Authentication: PostgreSQL + JWT – Uber’s secure login.
- Requesting a Ride: Redis + PostgreSQL + Kafka – Uber’s ride creation.
- Matching Drivers: Elasticsearch + Redis – Uber’s geospatial matching.
- Tracking Rides: Redis + WebSockets – Lyft’s real-time updates.
- Consistency: PostgreSQL (ACID) – Didi’s double-booking prevention.
- Scalability: Kafka + Redis – Grab’s surge handling.