Payment System Integration - ashtishad/system-design GitHub Wiki
Functional Requirements (from Interviewer):
- Users can pay for orders using credit cards (via Stripe).
- System supports merchant payouts (e.g., monthly to sellers).
- System performs reconciliation to fix inconsistencies across services.
Non-Functional Requirements:
- Transaction Consistency: No double payments or missed payouts.
- PCI DSS Compliance: Secure storage/handling of card data.
- Scalability: Handle 1M transactions/day, with surges (10% peak, e.g., Black Friday).
- Low Latency: Payment <500ms, payout <1s.
- Availability: 99.99% uptime.
- Durability: No transaction loss over 5 years.
-
Capacity Estimation (5 years):
- DAU: 10M users (assuming 10% of Amazon-scale).
- Transactions: 1M/day * 365 * 5 = 1.825B transactions.
-
Storage:
- Transactions: 1.825B * 1KB = 1.825TB.
- Users/Merchants: 10M * 1KB + 1M * 1KB = 11GB.
- Total: ~2TB raw, ~20TB with replication/indexes.
-
QPS:
- Avg: 1M/day ÷ 86,400s ≈ 12 QPS.
- Peak (10% surges): 100K users * 5 tx ÷ 2,400s ≈ 208 QPS.
- User: {id, name, token (hashed card data), payment_history}
- Merchant: {id, name, payout_account}
- Transaction: {id, user_id, merchant_id, amount, status (pending/confirmed/failed), idempotency_key, timestamp}
- Payout: {id, merchant_id, amount, status (pending/completed), timestamp}
- ReconciliationLog: {id, tx_id/payout_id, discrepancy, resolution_status}
- POST /payments/init: {user_id, merchant_id, amount, card_token, idempotency_key}, initiates payment.
- POST /payments/confirm: {tx_id, payment_method}, confirms via Stripe.
- POST /payouts/schedule: {merchant_id, amount}, schedules payout.
- GET /transactions/{user_id}: Returns transaction history.
- Client: E-commerce checkout page.
- API Gateway: Routes, auth (JWT), rate-limiting.
-
Microservices:
- Payment Service: Handles payment initiation/confirmation.
- Payout Service: Manages merchant payouts.
- Reconciliation Service: Fixes inconsistencies.
- History Service: Retrieves transaction/payout history.
-
Data Stores:
- PostgreSQL: Transactions/users (ACID for consistency).
- Redis: Idempotency, caching.
- Kafka: Event streaming for async processing.
- External: Stripe for credit card processing.
-
Flow:
- Init → Payment Service → Redis (idempotency) → PostgreSQL → Stripe.
- Confirm → Payment Service → Stripe → PostgreSQL → Kafka (accounting/analytics).
- Payout → Payout Service → PostgreSQL → Stripe → Kafka.
- Reconciliation → Reconciliation Service → PostgreSQL/Kafka.
Why PostgreSQL?
PostgreSQL ensures ACID compliance for financial transactions, critical for consistency (e.g., no double payments). Its MVCC and row-level locking handle 208 QPS efficiently, and it supports PCI DSS with encrypted columns (AES-256).
1. Transaction Consistency (Prevent Double Payments)
- Problem: Ensure no duplicate payments at 208 QPS peak.
-
Approaches & Tradeoffs:
-
Isolation Level: Serializable
- How: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; BEGIN; SELECT ... INSERT ... COMMIT;
- Pros: Prevents all anomalies (e.g., phantom reads), absolute safety.
- Cons: High contention, ~50% throughput drop (100 QPS max), aborts frequent at peak.
- Use Case: Small-scale, ultra-critical systems.
-
Isolation Level: Read Committed
- How: BEGIN; SELECT FOR UPDATE ON transactions WHERE idempotency_key = {key}; INSERT; COMMIT;
- Pros: Better concurrency (~5ms), default in PostgreSQL.
- Cons: Risks dirty reads without explicit locks, manageable with FOR UPDATE.
- Use Case: Medium-scale systems with locking.
-
Pessimistic Locking
- How: SELECT * FROM transactions WHERE idempotency_key = {key} FOR UPDATE NOWAIT; reject if exists.
- Pros: Immediate rejection, ACID-compliant.
- Cons: Lock contention (~10ms), scales poorly beyond 100 QPS without sharding.
- Use Case: Low-concurrency systems.
-
Optimistic Locking
- How: Add version to transactions; INSERT ... WHERE NOT EXISTS (SELECT ... WHERE idempotency_key = {key} AND version = {old_version});
- Pros: High concurrency (~2ms), no locks.
- Cons: Retries on conflict (5-10% at 208 QPS), complex retry logic.
- Use Case: Read-heavy systems with low write conflicts.
-
Database Constraints
- How: UNIQUE (idempotency_key) index, reject duplicate INSERT.
- Pros: Simple, zero app logic, instant rejection.
- Cons: Index overhead (~20% write penalty), no retry handling.
- Use Case: Basic deduplication without retries.
-
Hybrid (Redis + PostgreSQL)
- How: Check idempotency_key in Redis (TTL=5min), then INSERT with UNIQUE in PostgreSQL.
- Pros: Redis (<1ms) filters 99% duplicates, PostgreSQL ensures durability.
- Cons: Redis failure risks temporary duplicates (mitigated by PostgreSQL), sync complexity.
- Use Case: High-throughput fintech systems.
-
Isolation Level: Serializable
- Industry Example (Stripe): Stripe uses idempotency keys with a distributed cache (e.g., Redis) and a relational DB for final consistency, handling millions of QPS.
- Optimal Solution: Hybrid—Redis for fast idempotency checks (208 QPS, <1ms), PostgreSQL with UNIQUE constraint and Read Committed + FOR UPDATE for safety.
- Why Optimal: Balances speed (Redis: 100K ops/s) and durability (PostgreSQL: 20 nodes, 10 QPS/node), tolerates Redis downtime, aligns with PCI DSS via PostgreSQL encryption.
- Tradeoffs: Adds Redis dependency (mitigated by PostgreSQL fallback), slight latency increase (5ms vs. 2ms for optimistic).
2. PCI DSS Compliance
- Problem: Securely store/process card data per PCI DSS (Level 1).
-
Solutions & Tradeoffs:
-
Tokenization (In-House)
- How: Hash card data into ES256 JWT tokens with salting, store in PostgreSQL encrypted column.
- Pros: Full control, no third-party dependency.
- Cons: Complex key management, audit burden (~20% dev overhead).
-
Third-Party Tokenization (Stripe)
- How: Use Stripe’s tokenization, store only tokens (not card data).
- Pros: PCI scope reduced, simpler compliance (~50% less audit).
- Cons: Vendor lock-in, latency to Stripe (~50ms).
-
Encryption Only
- How: AES-256 encrypt card data at rest/in transit, no tokenization.
- Pros: Simple, no external calls.
- Cons: High PCI scope, key rotation complexity.
-
Tokenization (In-House)
- Optimal Solution: Stripe Tokenization—Store tokens in PostgreSQL, encrypt with AES-256, sync device time for integrity, use MFA + fingerprinting for auth.
- Why Optimal: Minimizes PCI scope (only tokens), leverages Stripe’s compliance, meets 500ms latency with <50ms overhead.
- Tradeoffs: Dependency on Stripe (mitigated by failover to backup processor), token refresh complexity.
3. Scalability for Surges
- Problem: 208 QPS during Black Friday.
-
Solutions & Tradeoffs:
-
Horizontal Scaling
- How: Auto-scale Payment Service (CPU > 70%), shard PostgreSQL by user_id.
- Pros: Linear scalability, ~10K QPS potential.
- Cons: Sharding complexity, cross-shard queries slow (~100ms).
-
Queueing
- How: Kafka buffers payment requests, processes async.
- Pros: Decouples load, handles 1M QPS bursts.
- Cons: Adds latency (~1s), retry logic needed.
-
Caching
- How: Redis caches user/merchant data, idempotency checks.
- Pros: Reduces DB load by 90%, <1ms.
- Cons: Cache invalidation complexity.
-
Horizontal Scaling
- Optimal Solution: Hybrid—Horizontal scaling + Kafka + Redis.
- Why Optimal: Scales to 208 QPS (20 nodes), buffers surges (Kafka: 10K msg/s), speeds reads (Redis: 90% hit rate).
- Tradeoffs: Increased infra cost (~30% more), manageable with AWS auto-scaling.
4. Merchant Payouts
- Problem: Reliable monthly payouts at scale.
-
Solutions & Tradeoffs:
-
Batch Processing
- How: Cron job aggregates sales, pays out via Stripe every 30 days.
- Pros: Simple, low overhead.
- Cons: Delayed payouts (30-day lag), manual reconciliation.
-
Event-Driven
- How: Kafka streams confirmed transactions, pays out daily in batches.
- Pros: Faster (1-day lag), automated.
- Cons: Higher throughput (~1K msg/s), complex failure handling.
-
Real-Time
- How: Pay out per transaction via Stripe Connect.
- Pros: Instant for merchants, great UX.
- Cons: High cost (2% fee/tx), 208 QPS strain.
-
Batch Processing
- Optimal Solution: Event-Driven—Kafka daily batches to Stripe.
- Why Optimal: Balances speed (1-day) and cost (~0.5% fee), handles 208 QPS with 1K msg/s capacity.
- Tradeoffs: Adds Kafka latency (~1s), mitigated by batching.
5. Reconciliation (Fix Inconsistencies)
- Problem: Resolve failures across services (e.g., Stripe vs. accounting).
-
Solutions & Tradeoffs:
-
Manual Reconciliation
- How: Daily SQL queries compare PostgreSQL vs. Stripe logs.
- Pros: Simple, no infra cost.
- Cons: Slow (~hours), error-prone.
-
Periodic Batch
- How: Cron job every 6h compares Kafka events vs. PostgreSQL/Stripe.
- Pros: Automated, catches 99% issues.
- Cons: 6h lag, misses real-time fixes.
-
Real-Time Event Sourcing
- How: Kafka streams all events, Reconciliation Service resolves live.
- Pros: Instant fixes (<1min), 100% consistency.
- Cons: High complexity, 10x infra cost.
-
Manual Reconciliation
- Optimal Solution: Periodic Batch—6h Cron with Kafka logs.
- Why Optimal: Balances cost and consistency (99.9% uptime), resolves 208 QPS failures within 6h.
- Tradeoffs: Lag acceptable for e-commerce, real-time overkill unless revenue-critical.
6. Low Latency
- Problem: <500ms at peak.
- Solution: Redis caching (user/merchant), CDN for static UI, Stripe async callbacks.
- Tech Details: Redis: <1ms, CDN: 80% hit rate, Stripe: 50ms.
7. High Availability
- Problem: 99.99% uptime.
- Solution: Multi-region PostgreSQL (3 replicas), Redis sentinel, SQS retries.
- Tech Details: Failover: 5s, SQS: 1K retries/s.
text
CollapseWrapCopy
+----------------+ +----------------+ | Client |<------->| API Gateway | | (E-commerce) | | (JWT, Routing) | +----------------+ +----------------+ | | | +----------------+ +-----+ +-----+-------+ | Payment Service |<---+ | | | (Tx Processing) | | | +----------------+ | | +----------------+ +-----+ +-----+-------+ | Payout Service |<--------| | | | | (Merchant Payouts)| | | | | +----------------+ | | | | +----------------+ | | +----------+ | | Reconciliation |<--------| | | | | (Consistency) | | | | | +----------------+ | | | | +----------------+ | | | | | History Service |<--------| | | | | (Tx History) | +-----+ | | | | | +----------------+ +----------------+ +----------------+ | Redis |<------->| PostgreSQL | | Kafka | | (Cache, Idemp.)| | (Tx, Users) | | (Event Stream) | +----------------+ +----------------+ +----------------+ | | | +----------------+ +-----+ +-----+ +----------------+ | Stripe |<--------| | | | SQS | | (Payments) | | | | | (Retry Queue) | +----------------+ +-----+---------+ +----------------+
- Authentication: PostgreSQL + JWT + Redis – PayPal’s secure login.
- Transaction Consistency: PostgreSQL (ACID) + Redis – Stripe’s idempotency.
- PCI DSS: Stripe Tokenization + AES-256 – Amazon’s card security.
- Scalability: Kafka + PostgreSQL Replicas – Klarna’s surge handling.
- Payouts: Kafka + Stripe – Shopify’s merchant payments.
- Reconciliation: Periodic Kafka Batch – PayPal’s consistency checks.
- Low Latency: Redis + CDN – Afterpay’s fast checkout.
- Availability: Multi-Region PostgreSQL – Visa’s reliable uptime.