Scaling Strategy - yuzvak/flashsale-service GitHub Wiki

Scaling Strategy

Current Architecture Limits

Single Instance Design

  • Target: 10,000 RPS for flash sale events
  • Database: PostgreSQL with 100 max connections
  • Cache: Redis with 200 connection pool
  • Concurrency: Optimized for highly concurrent checkout and purchase operations

Performance Optimizations

1. Connection Pool Configuration

// PostgreSQL
MaxOpenConns: 100
MaxIdleConns: 50
ConnMaxLifetime: 1 hour
ConnMaxIdleTime: 30 minutes

// Redis  
PoolSize: 100
MaxIdleConns: 50
Timeout: 30 seconds
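
The PostgreSQL values above map onto the pool setters of Go's standard `database/sql` package. The following is a minimal sketch of applying them, not the service's actual wiring: `PoolConfig`, `DefaultPostgresPool`, `poolTuner`, and `recorder` are hypothetical names introduced here, and `recorder` only stands in for `*sql.DB` so the example runs without a database driver.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// poolTuner is the subset of *sql.DB methods used for pool tuning.
type poolTuner interface {
	SetMaxOpenConns(n int)
	SetMaxIdleConns(n int)
	SetConnMaxLifetime(d time.Duration)
	SetConnMaxIdleTime(d time.Duration)
}

// Compile-time check that *sql.DB provides these setters.
var _ poolTuner = (*sql.DB)(nil)

// PoolConfig is a hypothetical holder for the PostgreSQL values listed above.
type PoolConfig struct {
	MaxOpenConns    int
	MaxIdleConns    int
	ConnMaxLifetime time.Duration
	ConnMaxIdleTime time.Duration
}

// DefaultPostgresPool returns the documented values: 100/50, 1 hour, 30 minutes.
func DefaultPostgresPool() PoolConfig {
	return PoolConfig{
		MaxOpenConns:    100,
		MaxIdleConns:    50,
		ConnMaxLifetime: time.Hour,
		ConnMaxIdleTime: 30 * time.Minute,
	}
}

// Apply pushes the configuration into anything with the database/sql setters.
func (c PoolConfig) Apply(db poolTuner) {
	db.SetMaxOpenConns(c.MaxOpenConns)
	db.SetMaxIdleConns(c.MaxIdleConns)
	db.SetConnMaxLifetime(c.ConnMaxLifetime)
	db.SetConnMaxIdleTime(c.ConnMaxIdleTime)
}

// recorder stands in for *sql.DB so the sketch runs without a driver.
type recorder struct{ calls []string }

func (r *recorder) SetMaxOpenConns(n int) {
	r.calls = append(r.calls, fmt.Sprintf("MaxOpenConns=%d", n))
}
func (r *recorder) SetMaxIdleConns(n int) {
	r.calls = append(r.calls, fmt.Sprintf("MaxIdleConns=%d", n))
}
func (r *recorder) SetConnMaxLifetime(d time.Duration) {
	r.calls = append(r.calls, fmt.Sprintf("ConnMaxLifetime=%s", d))
}
func (r *recorder) SetConnMaxIdleTime(d time.Duration) {
	r.calls = append(r.calls, fmt.Sprintf("ConnMaxIdleTime=%s", d))
}

func main() {
	r := &recorder{}
	DefaultPostgresPool().Apply(r)
	for _, c := range r.calls {
		fmt.Println(c)
	}
}
```

In a real service the same `Apply` call would receive the `*sql.DB` returned by `sql.Open`.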

2. Database Optimizations

Indexing Strategy

-- Hot path indexes for performance
CREATE INDEX idx_items_sale_sold ON items(sale_id) WHERE sold = FALSE;
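-- PostgreSQL requires IMMUTABLE functions in a partial-index predicate, so
-- CURRENT_TIMESTAMP cannot be used there; index ended_at directly instead
CREATE INDEX idx_sales_active ON sales(ended_at);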
CREATE INDEX idx_checkout_code ON checkout_attempts(checkout_code);

Transaction Isolation

  • SERIALIZABLE isolation for purchase operations
  • Atomic updates with conditional WHERE clauses
  • Minimal transaction scope to reduce lock contention

3. Bloom Filter Performance

// Parameters for 10,000 items with 0.1% false positive rate
BloomSize: 128KB (1 << 17)
BloomHashes: 7
UpdateInterval: 10 seconds
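
The parameters above can be checked against the standard Bloom filter approximation p ≈ (1 − e^(−kn/m))^k. The sketch below evaluates it for 10,000 items and 7 hashes under both readings of `1 << 17`, since the listing does not say whether the size is counted in bits or bytes. This is an assumption-labelled calculation and the function names are hypothetical. Under this formula, 2^17 bits (16 KiB) gives roughly 0.2%, while 2^17 bytes (128 KiB) gives a rate far below 0.1%.

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate returns the usual approximation (1 - e^(-k*n/m))^k
// for a filter of mBits bits holding nItems items with kHashes hashes.
func falsePositiveRate(mBits, nItems float64, kHashes int) float64 {
	k := float64(kHashes)
	return math.Pow(1-math.Exp(-k*nItems/mBits), k)
}

// optimalBits returns m = -n*ln(p) / (ln 2)^2, rounded up.
func optimalBits(nItems, p float64) float64 {
	return math.Ceil(-nItems * math.Log(p) / (math.Ln2 * math.Ln2))
}

// optimalHashes returns k = (m/n)*ln 2, rounded to the nearest integer.
func optimalHashes(mBits, nItems float64) int {
	return int(math.Round(mBits / nItems * math.Ln2))
}

func main() {
	const n = 10000    // items, from the comment above
	const size = 1 << 17 // unit is not stated: try bits and bytes

	fmt.Printf("size as bits:  p = %.5f\n", falsePositiveRate(size, n, 7))
	fmt.Printf("size as bytes: p = %.2e\n", falsePositiveRate(size*8, n, 7))

	m := optimalBits(n, 0.001)
	fmt.Printf("optimal for p = 0.001: m = %.0f bits, k = %d\n", m, optimalHashes(m, n))
}
```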

Load Testing Results

Test Configuration

# Load testing commands available
make load-test-light    # 200 users, 120s
make load-test          # 400 users, 300s  
make load-test-heavy    # 800 users, 600s
make load-test-stress   # 1500 users, 900s

Realistic Load Testing

  • User Profiles: 10% aggressive, 60% normal, 30% browsers
  • Behavior Simulation: Checkout probability, purchase delays
  • Item Distribution: Popular vs normal item selection
  • Database Integration: Real item availability tracking
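
The 10/60/30 profile split can be sketched as a weighted pick over a uniform random value. This is an illustrative sketch rather than the load tester's actual code: the `Profile` type and `pickProfile` function are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Profile names a simulated user behaviour class.
type Profile string

const (
	Aggressive Profile = "aggressive"
	Normal     Profile = "normal"
	Browser    Profile = "browser"
)

// pickProfile maps a uniform value u in [0, 1) onto the 10/60/30 split
// by comparing it against cumulative weights.
func pickProfile(u float64) Profile {
	switch {
	case u < 0.10:
		return Aggressive
	case u < 0.70:
		return Normal
	default:
		return Browser
	}
}

func main() {
	rng := rand.New(rand.NewSource(1)) // fixed seed keeps runs repeatable
	counts := map[Profile]int{}
	for i := 0; i < 100000; i++ {
		counts[pickProfile(rng.Float64())]++
	}
	fmt.Println(counts)
}
```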

Horizontal Scaling Considerations

1. Application Layer

Current Limitations for Multi-Instance:

  • Bloom filter consistency across instances
  • Distributed counter synchronization
  • User checkout code sharing

Scaling Solutions:

  • Redis as shared state store
  • Database-driven item availability
  • Distributed locking for purchases

2. Database Scaling

Current Approach:

  • Single PostgreSQL instance with connection pooling
  • Optimized queries with proper indexing
  • Serializable transactions for data consistency

Future Options:

  • Read replicas for sale/item queries
  • Connection pooling with PgBouncer
  • Partitioning by sale_id for historical data

3. Cache Layer Scaling

Current Redis Usage:

  • Atomic operations via Lua scripts
  • Distributed locks for purchase coordination
  • Bloom filters for fast item rejection

Scaling Path:

  • Redis Cluster for horizontal scaling
  • Redis Sentinel for high availability
  • Separate Redis instances by function

Monitoring and Observability

1. Performance Metrics

# Four Golden Signals implemented
http_request_duration_seconds     # Latency
http_requests_total              # Traffic  
http_requests_total{status=~"5.."} # Errors
process_resident_memory_bytes    # Saturation

2. Business Metrics

sale_items_sold_total           # Items sold rate
checkout_attempts_total         # Checkout funnel
purchase_success_total          # Purchase conversion
redis_lock_duration_seconds     # Lock contention

3. Database Performance

db_query_duration_seconds       # Query performance
db_connections_active          # Connection usage
db_connections_idle            # Pool efficiency

Bottleneck Analysis

1. Database Write Operations

Current Mitigation:

  • Batch item creation (1000 items per batch)
  • Prepared statements for all queries
  • Minimal transaction scope
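
Batching in groups of 1000 amounts to slicing the item list into fixed-size chunks and writing each chunk in its own short transaction. The sketch below shows only the chunking step. The `chunk` helper and `batchSize` constant are hypothetical names, and generics require Go 1.18 or later.

```go
package main

import "fmt"

const batchSize = 1000 // items per batch, as stated above

// chunk splits items into consecutive slices of at most size elements.
// The returned slices share the backing array of items.
func chunk[T any](items []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	var batches [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		batches = append(batches, items[start:end])
	}
	return batches
}

func main() {
	ids := make([]int, 2500)
	for i := range ids {
		ids[i] = i + 1
	}
	for i, batch := range chunk(ids, batchSize) {
		fmt.Printf("batch %d: %d items\n", i, len(batch))
	}
}
```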

2. Redis Lock Contention

Current Strategy:

  • 3-second lock timeout
  • Single purchase attempt per checkout code
  • Immediate lock release after operation

3. Checkout Code Generation

Performance Approach:

  • Cryptographic randomness for uniqueness
  • 256-bit entropy in checkout codes
  • In-memory code validation cache

Crisis Response Procedures

1. High Load Scenarios

  • Monitor /health endpoint for service degradation
  • Raise the database connection limit only if the database has CPU headroom
  • Increase Redis connection pool size
  • Enable request rate limiting if needed

2. Database Issues

  • Graceful degradation to read-only operations
  • Bloom filter fallback for sold item checks
  • Cache-first approach for user limits

3. Redis Failures

  • Disable bloom filter optimization
  • Direct database queries for all operations
  • Maintain core functionality without cache

Future Scaling Paths

1. Microservice Decomposition

  • Checkout Service: User cart management
  • Purchase Service: Atomic transaction processing
  • Inventory Service: Item availability tracking
  • Analytics Service: Checkout attempt logging

2. Event-Driven Architecture

  • Event sourcing for purchase history
  • CQRS for read/write separation
  • Message queues for async processing

3. Geographic Distribution

  • Regional Redis clusters
  • Database read replicas per region
  • CDN for static content delivery