Design Decisions & Architecture Rationale - yuzvak/flashsale-service GitHub Wiki
Overview
This document explains the key architectural decisions made in the flash sale microservice, including trade-offs, alternatives considered, and implementation rationale.
Domain-Driven Design (DDD) Implementation
Decision: Clean Architecture with DDD
Rationale: Separation of concerns, testability, and maintainability for complex business logic.
Project Structure
internal/
├── domain/              # Pure business logic, no external dependencies
│   ├── sale/            # Sale aggregate root
│   ├── user/            # User value objects
│   └── errors/          # Domain errors
├── application/         # Use cases and orchestration
│   ├── commands/        # Command handlers
│   ├── use_cases/       # Business workflows
│   └── ports/           # Interface definitions
└── infrastructure/      # External concerns
    ├── persistence/     # Database implementations
    ├── http/            # HTTP handlers
    └── monitoring/      # Observability
Benefits:
- Clear separation between business logic and infrastructure
- Easy to test domain logic in isolation
- Framework-agnostic design
- Scalable codebase structure
Trade-offs:
- More boilerplate code than simple layered architecture
- Learning curve for developers unfamiliar with DDD
Database Design Decisions
Decision: PostgreSQL with Normalized Schema
Alternative Considered: NoSQL databases (MongoDB, DynamoDB)
Rationale:
- ACID transactions required for inventory management
- Complex queries for analytics and reporting
- Strong consistency guarantees for financial operations
- Mature ecosystem and operational tooling
Schema Normalization
-- Separate tables for checkout attempts and items
CREATE TABLE checkout_attempts (
    id            VARCHAR(255) PRIMARY KEY,
    sale_id       VARCHAR(20)  NOT NULL,
    user_id       VARCHAR(255) NOT NULL,
    checkout_code VARCHAR(64)  NOT NULL,
    created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE checkout_items (
    id                  VARCHAR(255) PRIMARY KEY,
    checkout_attempt_id VARCHAR(255) NOT NULL,
    item_id             VARCHAR(255) NOT NULL,
    added_at            TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (checkout_attempt_id, item_id)
);
Benefits:
- Typically faster filtering and joins than querying items stored in a JSONB column
- Referential integrity can be enforced with foreign keys between the two tables
- Easier analytics and reporting
- Standard SQL optimization techniques apply
Trade-offs:
- More complex queries for retrieving complete checkout data
- Additional JOIN operations
Decision: Serializable Isolation for Critical Operations
tx, err := r.db.BeginTx(ctx, &sql.TxOptions{
	Isolation: sql.LevelSerializable,
})
Rationale:
- Prevents race conditions in high-concurrency scenarios
- Ensures atomic operations for inventory updates
- Maintains data consistency under load
Trade-offs:
- Higher latency for write operations
- Potential for serialization failures requiring retries
Caching Strategy Decisions
Decision: Redis as Cache + Atomic Operations Coordinator
Alternative Considered: In-memory caching only
Rationale:
- Distributed system requires shared state
- Atomic operations via Lua scripts
- High-performance data structures (sets, counters)
- Persistence and durability options
Lua Scripts for Atomicity
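-- Atomic purchase validation and counter updates
local sale_key = KEYS[1]
local user_key = KEYS[2]
local item_count = tonumber(ARGV[1])
local max_sale_items = tonumber(ARGV[2])
local max_user_items = tonumber(ARGV[3])
-- Read the current counters; a missing key counts as zero
-- (assumed step: the original excerpt omits these reads)
local current_sale_count = tonumber(redis.call('GET', sale_key) or '0')
local current_user_count = tonumber(redis.call('GET', user_key) or '0')
-- Check limits and update counters atomically
if current_sale_count + item_count > max_sale_items then
  return 0
end
if current_user_count + item_count > max_user_items then
  return 0
end
redis.call('INCRBY', sale_key, item_count)
redis.call('INCRBY', user_key, item_count)
return 1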
Benefits:
- Eliminates race conditions between limit checks and updates
- Single network round trip for complex operations
- Consistent behavior across multiple service instances
Trade-offs:
- Lua script complexity
- Redis becomes critical dependency
- Debugging distributed state can be challenging
Decision: Bloom Filter for Performance Optimization
Alternative Considered: Database-only item availability checks
Rationale:
- Fast negative lookups with no false negatives (stated accuracy: 99.9%)
- Reduces database load for popular items
- Memory-efficient (128KB for 100,000 items)
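// 10-second update interval balances accuracy and performance
const BloomUpdateInterval = 10 * time.Second

// UpdateBloomFilter rebuilds the sold-items filter on every tick.
// db.Query and redis.Set are simplified placeholders for the real calls.
func UpdateBloomFilter(saleID string) {
	ticker := time.NewTicker(BloomUpdateInterval)
	defer ticker.Stop() // release the ticker if the loop ever exits
	for range ticker.C {
		soldItems := db.Query(`SELECT id FROM items WHERE sale_id = $1 AND sold = TRUE`, saleID)
		// Build a fresh filter each time so stale entries never accumulate
		bloom := NewBloomFilter(BloomSize, BloomHashes)
		for _, itemID := range soldItems {
			bloom.Add([]byte(itemID))
		}
		redis.Set("sale:" + saleID + ":bloom:sold", bloom.Serialize())
	}
}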
Benefits:
- 90%+ reduction in database queries for sold items
- Sub-millisecond response times
- Scales well with item count
Trade-offs:
- 10-second lag for item availability updates
- False positives possible (but acceptable for this use case)
- Additional complexity in cache management
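To make the false-positive trade-off concrete, here is a minimal Bloom filter sketch together with the standard estimate p ≈ (1 − e^(−kn/m))^k. It is not the service's `NewBloomFilter`: the `Bloom` type, the FNV-based double hashing, and the choice of k = 7 are assumptions for illustration, since this page does not state `BloomSize` or `BloomHashes`.

```go
package main

import (
	"hash/fnv"
	"math"
)

// Bloom is a fixed-size Bloom filter with m bits and k probes per item.
type Bloom struct {
	bits []uint64
	m    uint64
	k    uint64
}

// NewBloom allocates a filter with mBits bits and k hash probes.
func NewBloom(mBits, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (mBits+63)/64), m: mBits, k: k}
}

// hashes derives two base hashes; probe i uses h1 + i*h2 (double hashing).
func (b *Bloom) hashes(data []byte) (uint64, uint64) {
	h1 := fnv.New64a()
	h1.Write(data)
	h2 := fnv.New64()
	h2.Write(data)
	return h1.Sum64(), h2.Sum64() | 1 // force a non-zero step
}

// Add sets the k bits that correspond to data.
func (b *Bloom) Add(data []byte) {
	h1, h2 := b.hashes(data)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= uint64(1) << (pos % 64)
	}
}

// MayContain returns false only if data was definitely never added.
func (b *Bloom) MayContain(data []byte) bool {
	h1, h2 := b.hashes(data)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(uint64(1)<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

// FalsePositiveRate estimates (1 - e^(-k*n/m))^k for n inserted items.
func FalsePositiveRate(mBits, k, n uint64) float64 {
	return math.Pow(1-math.Exp(-float64(k)*float64(n)/float64(mBits)), float64(k))
}
```

With the 128KB (1,048,576 bits) and 100,000-item figures above and the assumed k = 7, `FalsePositiveRate` comes out at roughly 0.65%. The rate the service actually sees depends on its real `BloomSize` and `BloomHashes` values. A "sold" answer from the filter is therefore best treated as "probably sold", while a "not sold" answer is definite as of the last rebuild.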
Concurrency Control Decisions
Decision: Optimistic Locking with Conditional Updates
Alternative Considered: Pessimistic locking with SELECT FOR UPDATE
-- Atomic item purchase with conditional update
UPDATE items
SET sold = TRUE, sold_to_user_id = $1, sold_at = NOW()
WHERE id = $2 AND sale_id = $3 AND sold = FALSE;
Rationale:
- Better performance under high contention
- No lock timeouts or deadlocks
- Natural failure mode for race conditions
Benefits:
- High throughput even with many concurrent purchases
- Simple error handling (0 rows affected = already sold)
- Database handles optimization automatically
Trade-offs:
- Requires application-level retry logic
- Failed attempts still consume resources
Decision: Distributed Locks for Purchase Sessions
lockKey := fmt.Sprintf("purchase:%s", checkoutCode)
locked, err := uc.cache.DistributedLock(ctx, lockKey, uc.lockTimeout)
Rationale:
- Prevents duplicate purchase attempts for same checkout code
- Ensures idempotency across service instances
- Graceful handling of client retry storms
Benefits:
- Eliminates duplicate processing
- Predictable behavior during failures
- Protection against client misbehavior
Trade-offs:
- Adds latency to purchase flow
- Redis becomes critical dependency
- Lock timeouts need careful tuning
API Design Decisions
Decision: Two-Phase Flow (Checkout → Purchase)
Alternative Considered: Single-phase immediate purchase
Rationale:
- Allows users to accumulate items before committing
- Better user experience for mobile/web clients
- Reduces inventory pressure during peak traffic
Flow Design
// Phase 1: Add items to checkout (no inventory reservation)
POST /checkout?user_id=user123&id=item456
{
  "code": "CHK-S-abc123-xyz789",
  "items_count": 3,
  "sale_ends_at": "2024-11-02T16:00:00Z"
}

// Phase 2: Atomic purchase of all checkout items
POST /purchase?code=CHK-S-abc123-xyz789
{
  "success": true,
  "purchased_items": [...],
  "total_purchased": 2,
  "failed_count": 1
}
Benefits:
- Users can browse and select multiple items
- Reduces database write pressure during browsing
- Clear separation between selection and commitment
- Supports complex client workflows
Trade-offs:
- Items not reserved during checkout phase
- Possibility of items becoming unavailable between phases
- More complex client state management
Decision: No Web Framework (Standard Library Only)
Alternative Considered: Gin, Echo, or other frameworks
Rationale:
- Minimal dependencies as per requirements
- Full control over request handling
- Better performance characteristics
- Easier to understand and debug
func (s *Server) setupRoutes() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", s.healthHandler.HandleHealth())
	mux.HandleFunc("/checkout", s.checkoutHandler.HandleCheckout())
	mux.HandleFunc("/purchase", s.purchaseHandler.HandlePurchase())

	// Apply middleware manually
	handler := middleware.NewRecoveryMiddleware(s.logger)(mux)
	handler = middleware.NewLoggingMiddleware(s.logger)(handler)
	handler = monitoring.WrapHandler(handler)
	return handler
}
Benefits:
- Zero framework dependencies
- Predictable performance characteristics
- Easy to optimize specific endpoints
- Reduced attack surface
Trade-offs:
- More boilerplate code for common functionality
- Manual implementation of middleware chain
- Less community ecosystem around patterns
Error Handling Decisions
Decision: Domain Error Mapping to HTTP Status Codes
var errorMappings = map[error]ErrorMapping{
	domainErrors.ErrItemAlreadySold: {
		HTTPStatus: http.StatusConflict,
		Status:     StatusConflict,
		Message:    "Items already sold",
	},
	domainErrors.ErrUserLimitExceeded: {
		HTTPStatus: http.StatusBadRequest,
		Status:     StatusError,
		Message:    "User has reached maximum items limit",
	},
}
Rationale:
- Clean separation between domain and HTTP concerns
- Consistent error responses across endpoints
- Easy to add new error types without HTTP knowledge
Benefits:
- Domain errors remain HTTP-agnostic
- Consistent client experience
- Easy testing of business logic
Trade-offs:
- Additional mapping layer
- Potential for unmapped errors
Decision: Structured JSON Error Responses
type ErrorResponse struct {
	Message string `json:"message"`
	Error   string `json:"error,omitempty"`
	Code    string `json:"code,omitempty"`
}
Rationale:
- Machine-readable error information
- Consistent structure for client handling
- Debugging information when appropriate
Monitoring Design Decisions
Decision: Prometheus + Grafana Stack
Alternative Considered: Custom metrics, DataDog, New Relic
Rationale:
- Open source and self-hosted
- Industry standard for microservices
- Rich ecosystem and community support
- Cost-effective for high-volume metrics
Four Golden Signals Implementation
// Latency
HTTPRequestDuration = promauto.NewHistogramVec(...)
// Traffic
HTTPRequestsTotal = promauto.NewCounterVec(...)
// Errors
CheckoutFailureTotal = promauto.NewCounterVec(...)
// Saturation
DBConnectionsActive = promauto.NewGauge(...)
Benefits:
- Comprehensive observability out of the box
- Historical data retention and analysis
- Alerting capabilities
- Cost-effective scaling
Trade-offs:
- Additional infrastructure to manage
- Learning curve for Prometheus query language
- Storage requirements for metrics retention
Decision: Business Metrics as First-Class Citizens
// Flash sale specific metrics
SaleItemsSoldTotal = promauto.NewCounter(...)
CheckoutSuccessTotal = promauto.NewCounter(...)
PurchaseSuccessTotal = promauto.NewCounter(...)
Rationale:
- Business stakeholders need real-time visibility
- Technical metrics alone insufficient for flash sales
- Enables data-driven optimization decisions
Benefits:
- Business and technical teams share common metrics
- Real-time business intelligence
- Easier debugging of business logic issues
Performance Optimization Decisions
Decision: Connection Pooling Strategy
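db.SetMaxOpenConns(100)          // Per-instance cap; the sum across all instances must stay under PostgreSQL's max_connections
db.SetMaxIdleConns(50)           // Keep connections warm
db.SetConnMaxLifetime(time.Hour) // Recycle connections so stale ones do not linger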
Rationale:
- Balance between connection overhead and resource usage
- Account for multiple service instances sharing database
- Prevent connection leaks and stale connections
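Because `SetMaxOpenConns` limits a single `*sql.DB` pool, the per-instance value has to be derived from the database-wide budget. A minimal sketch of that arithmetic follows; `PerInstanceMaxOpen` and the idea of reserving some connections for migrations and admin sessions are assumptions for illustration.

```go
package main

// PerInstanceMaxOpen splits a database-wide connection budget evenly across
// service instances, after setting aside reserved connections. It never
// returns less than 1 so that every instance can still reach the database.
func PerInstanceMaxOpen(dbMaxConnections, reserved, instances int) int {
	if instances < 1 {
		instances = 1
	}
	budget := dbMaxConnections - reserved
	if budget < instances {
		return 1
	}
	return budget / instances
}
```

For example, a database allowing 500 connections with 20 reserved and 4 service instances gives each instance a cap of 120.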
Decision: Prepared Statements for All Queries
stmt, err := tx.PrepareContext(ctx, `
	INSERT INTO items (id, sale_id, name, image_url, sold, created_at)
	VALUES ($1, $2, $3, $4, $5, $6)
`)
if err != nil {
	return err
}
defer stmt.Close() // release the server-side statement when done
Rationale:
- Better performance for repeated queries
- Protection against SQL injection
- Query plan caching at database level
Benefits:
- 20-30% performance improvement for bulk operations
- Enhanced security posture
- Consistent query performance
Trade-offs:
- Additional complexity in query handling
- Memory usage for statement caching
Security Design Decisions
Decision: Input Validation at Handler Level
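func (h *CheckoutHandler) HandleCheckout() http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		userID := r.URL.Query().Get("user_id")
		itemID := r.URL.Query().Get("id")
		// Named validationErrors to avoid shadowing the standard errors package
		validationErrors := make(map[string]string)
		if userID == "" {
			validationErrors["user_id"] = "user_id is required"
		}
		if itemID == "" {
			// Keyed by the actual query parameter name, which is "id"
			validationErrors["id"] = "id is required"
		}
		if len(validationErrors) > 0 {
			response.WriteValidationError(w, "Validation failed", validationErrors)
			return
		}
		// ... continue with the checkout use case using userID and itemID
	}
}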
Rationale:
- Defense in depth strategy
- Clear validation error messages
- Prevents invalid data from reaching business logic
Decision: No Authentication/Authorization Layer
Rationale:
- Focus on core business logic as per requirements
- Authentication typically handled by API gateway
- Simplifies testing and demonstration
Note: Production deployment would require:
- JWT or OAuth2 integration
- Rate limiting per user
- API key management
- Request signing