Payment System Integration - ashtishad/system-design GitHub Wiki

1. Requirements

Functional Requirements (from Interviewer):

  • Users can pay for orders using credit cards (via Stripe).
  • System supports merchant payouts (e.g., monthly to sellers).
  • System performs reconciliation to fix inconsistencies across services.

Non-Functional Requirements:

  • Transaction Consistency: No double payments or missed payouts.
  • PCI DSS Compliance: Secure storage/handling of card data.
  • Scalability: Handle 1M transactions/day, with surges (10% peak, e.g., Black Friday).
  • Low Latency: Payment <500ms, payout <1s.
  • Availability: 99.99% uptime.
  • Durability: No transaction loss over 5 years.
  • Capacity Estimation (5 years):
    • DAU: 10M users (assuming 10% of Amazon-scale).
    • Transactions: 1M/day * 365 * 5 = 1.825B transactions.
    • Storage:
      • Transactions: 1.825B * 1KB = 1.825TB.
      • Users/Merchants: 10M * 1KB + 1M * 1KB = 11GB.
      • Total: ~2TB raw, ~20TB with replication/indexes.
    • QPS:
      • Avg: 1M/day ÷ 86,400s ≈ 12 QPS.
      • Peak (10% surges): 100K users * 5 tx ÷ 2,400s ≈ 208 QPS.

2. Core Entities

  • User: {id, name, token (hashed card data), payment_history}
  • Merchant: {id, name, payout_account}
  • Transaction: {id, user_id, merchant_id, amount, status (pending/confirmed/failed), idempotency_key, timestamp}
  • Payout: {id, merchant_id, amount, status (pending/completed), timestamp}
  • ReconciliationLog: {id, tx_id/payout_id, discrepancy, resolution_status}

3. APIs

  • POST /payments/init: {user_id, merchant_id, amount, card_token, idempotency_key}, initiates payment.
  • POST /payments/confirm: {tx_id, payment_method}, confirms via Stripe.
  • POST /payouts/schedule: {merchant_id, amount}, schedules payout.
  • GET /transactions/{user_id}: Returns transaction history.

4. High-Level Design

  • Client: E-commerce checkout page.
  • API Gateway: Routes, auth (JWT), rate-limiting.
  • Microservices:
    • Payment Service: Handles payment initiation/confirmation.
    • Payout Service: Manages merchant payouts.
    • Reconciliation Service: Fixes inconsistencies.
    • History Service: Retrieves transaction/payout history.
  • Data Stores:
    • PostgreSQL: Transactions/users (ACID for consistency).
    • Redis: Idempotency, caching.
    • Kafka: Event streaming for async processing.
  • External: Stripe for credit card processing.
  • Flow:
    • Init → Payment Service → Redis (idempotency) → PostgreSQL → Stripe.
    • Confirm → Payment Service → Stripe → PostgreSQL → Kafka (accounting/analytics).
    • Payout → Payout Service → PostgreSQL → Stripe → Kafka.
    • Reconciliation → Reconciliation Service → PostgreSQL/Kafka.

Why PostgreSQL?

PostgreSQL ensures ACID compliance for financial transactions, critical for consistency (e.g., no double payments). Its MVCC and row-level locking handle 208 QPS efficiently, and it supports PCI DSS with encrypted columns (AES-256).


5. Deep Dives (90-Minute Conversation)

1. Transaction Consistency (Prevent Double Payments)

  • Problem: Ensure no duplicate payments at 208 QPS peak.
  • Approaches & Tradeoffs:
    • Isolation Level: Serializable
      • How: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; BEGIN; SELECT ... INSERT ... COMMIT;
      • Pros: Prevents all anomalies (e.g., phantom reads), absolute safety.
      • Cons: High contention, ~50% throughput drop (100 QPS max), aborts frequent at peak.
      • Use Case: Small-scale, ultra-critical systems.
    • Isolation Level: Read Committed
      • How: BEGIN; SELECT FOR UPDATE ON transactions WHERE idempotency_key = {key}; INSERT; COMMIT;
      • Pros: Better concurrency (~5ms), default in PostgreSQL.
      • Cons: Risks dirty reads without explicit locks, manageable with FOR UPDATE.
      • Use Case: Medium-scale systems with locking.
    • Pessimistic Locking
      • How: SELECT * FROM transactions WHERE idempotency_key = {key} FOR UPDATE NOWAIT; reject if exists.
      • Pros: Immediate rejection, ACID-compliant.
      • Cons: Lock contention (~10ms), scales poorly beyond 100 QPS without sharding.
      • Use Case: Low-concurrency systems.
    • Optimistic Locking
      • How: Add version to transactions; INSERT ... WHERE NOT EXISTS (SELECT ... WHERE idempotency_key = {key} AND version = {old_version});
      • Pros: High concurrency (~2ms), no locks.
      • Cons: Retries on conflict (5-10% at 208 QPS), complex retry logic.
      • Use Case: Read-heavy systems with low write conflicts.
    • Database Constraints
      • How: UNIQUE (idempotency_key) index, reject duplicate INSERT.
      • Pros: Simple, zero app logic, instant rejection.
      • Cons: Index overhead (~20% write penalty), no retry handling.
      • Use Case: Basic deduplication without retries.
    • Hybrid (Redis + PostgreSQL)
      • How: Check idempotency_key in Redis (TTL=5min), then INSERT with UNIQUE in PostgreSQL.
      • Pros: Redis (<1ms) filters 99% duplicates, PostgreSQL ensures durability.
      • Cons: Redis failure risks temporary duplicates (mitigated by PostgreSQL), sync complexity.
      • Use Case: High-throughput fintech systems.
  • Industry Example (Stripe): Stripe uses idempotency keys with a distributed cache (e.g., Redis) and a relational DB for final consistency, handling millions of QPS.
  • Optimal Solution: Hybrid—Redis for fast idempotency checks (208 QPS, <1ms), PostgreSQL with UNIQUE constraint and Read Committed + FOR UPDATE for safety.
  • Why Optimal: Balances speed (Redis: 100K ops/s) and durability (PostgreSQL: 20 nodes, 10 QPS/node), tolerates Redis downtime, aligns with PCI DSS via PostgreSQL encryption.
  • Tradeoffs: Adds Redis dependency (mitigated by PostgreSQL fallback), slight latency increase (5ms vs. 2ms for optimistic).

2. PCI DSS Compliance

  • Problem: Securely store/process card data per PCI DSS (Level 1).
  • Solutions & Tradeoffs:
    • Tokenization (In-House)
      • How: Hash card data into ES256 JWT tokens with salting, store in PostgreSQL encrypted column.
      • Pros: Full control, no third-party dependency.
      • Cons: Complex key management, audit burden (~20% dev overhead).
    • Third-Party Tokenization (Stripe)
      • How: Use Stripe’s tokenization, store only tokens (not card data).
      • Pros: PCI scope reduced, simpler compliance (~50% less audit).
      • Cons: Vendor lock-in, latency to Stripe (~50ms).
    • Encryption Only
      • How: AES-256 encrypt card data at rest/in transit, no tokenization.
      • Pros: Simple, no external calls.
      • Cons: High PCI scope, key rotation complexity.
  • Optimal Solution: Stripe Tokenization—Store tokens in PostgreSQL, encrypt with AES-256, sync device time for integrity, use MFA + fingerprinting for auth.
  • Why Optimal: Minimizes PCI scope (only tokens), leverages Stripe’s compliance, meets 500ms latency with <50ms overhead.
  • Tradeoffs: Dependency on Stripe (mitigated by failover to backup processor), token refresh complexity.

3. Scalability for Surges

  • Problem: 208 QPS during Black Friday.
  • Solutions & Tradeoffs:
    • Horizontal Scaling
      • How: Auto-scale Payment Service (CPU > 70%), shard PostgreSQL by user_id.
      • Pros: Linear scalability, ~10K QPS potential.
      • Cons: Sharding complexity, cross-shard queries slow (~100ms).
    • Queueing
      • How: Kafka buffers payment requests, processes async.
      • Pros: Decouples load, handles 1M QPS bursts.
      • Cons: Adds latency (~1s), retry logic needed.
    • Caching
      • How: Redis caches user/merchant data, idempotency checks.
      • Pros: Reduces DB load by 90%, <1ms.
      • Cons: Cache invalidation complexity.
  • Optimal Solution: Hybrid—Horizontal scaling + Kafka + Redis.
  • Why Optimal: Scales to 208 QPS (20 nodes), buffers surges (Kafka: 10K msg/s), speeds reads (Redis: 90% hit rate).
  • Tradeoffs: Increased infra cost (~30% more), manageable with AWS auto-scaling.

4. Merchant Payouts

  • Problem: Reliable monthly payouts at scale.
  • Solutions & Tradeoffs:
    • Batch Processing
      • How: Cron job aggregates sales, pays out via Stripe every 30 days.
      • Pros: Simple, low overhead.
      • Cons: Delayed payouts (30-day lag), manual reconciliation.
    • Event-Driven
      • How: Kafka streams confirmed transactions, pays out daily in batches.
      • Pros: Faster (1-day lag), automated.
      • Cons: Higher throughput (~1K msg/s), complex failure handling.
    • Real-Time
      • How: Pay out per transaction via Stripe Connect.
      • Pros: Instant for merchants, great UX.
      • Cons: High cost (2% fee/tx), 208 QPS strain.
  • Optimal Solution: Event-Driven—Kafka daily batches to Stripe.
  • Why Optimal: Balances speed (1-day) and cost (~0.5% fee), handles 208 QPS with 1K msg/s capacity.
  • Tradeoffs: Adds Kafka latency (~1s), mitigated by batching.

5. Reconciliation (Fix Inconsistencies)

  • Problem: Resolve failures across services (e.g., Stripe vs. accounting).
  • Solutions & Tradeoffs:
    • Manual Reconciliation
      • How: Daily SQL queries compare PostgreSQL vs. Stripe logs.
      • Pros: Simple, no infra cost.
      • Cons: Slow (~hours), error-prone.
    • Periodic Batch
      • How: Cron job every 6h compares Kafka events vs. PostgreSQL/Stripe.
      • Pros: Automated, catches 99% issues.
      • Cons: 6h lag, misses real-time fixes.
    • Real-Time Event Sourcing
      • How: Kafka streams all events, Reconciliation Service resolves live.
      • Pros: Instant fixes (<1min), 100% consistency.
      • Cons: High complexity, 10x infra cost.
  • Optimal Solution: Periodic Batch—6h Cron with Kafka logs.
  • Why Optimal: Balances cost and consistency (99.9% uptime), resolves 208 QPS failures within 6h.
  • Tradeoffs: Lag acceptable for e-commerce, real-time overkill unless revenue-critical.

6. Low Latency

  • Problem: <500ms at peak.
  • Solution: Redis caching (user/merchant), CDN for static UI, Stripe async callbacks.
  • Tech Details: Redis: <1ms, CDN: 80% hit rate, Stripe: 50ms.

7. High Availability

  • Problem: 99.99% uptime.
  • Solution: Multi-region PostgreSQL (3 replicas), Redis sentinel, SQS retries.
  • Tech Details: Failover: 5s, SQS: 1K retries/s.

System Design Diagram

text

CollapseWrapCopy

+----------------+ +----------------+ | Client |<------->| API Gateway | | (E-commerce) | | (JWT, Routing) | +----------------+ +----------------+ | | | +----------------+ +-----+ +-----+-------+ | Payment Service |<---+ | | | (Tx Processing) | | | +----------------+ | | +----------------+ +-----+ +-----+-------+ | Payout Service |<--------| | | | | (Merchant Payouts)| | | | | +----------------+ | | | | +----------------+ | | +----------+ | | Reconciliation |<--------| | | | | (Consistency) | | | | | +----------------+ | | | | +----------------+ | | | | | History Service |<--------| | | | | (Tx History) | +-----+ | | | | | +----------------+ +----------------+ +----------------+ | Redis |<------->| PostgreSQL | | Kafka | | (Cache, Idemp.)| | (Tx, Users) | | (Event Stream) | +----------------+ +----------------+ +----------------+ | | | +----------------+ +-----+ +-----+ +----------------+ | Stripe |<--------| | | | SQS | | (Payments) | | | | | (Retry Queue) | +----------------+ +-----+---------+ +----------------+


Summary of Solutions and Industry Practices

  • Authentication: PostgreSQL + JWT + Redis – PayPal’s secure login.
  • Transaction Consistency: PostgreSQL (ACID) + Redis – Stripe’s idempotency.
  • PCI DSS: Stripe Tokenization + AES-256 – Amazon’s card security.
  • Scalability: Kafka + PostgreSQL Replicas – Klarna’s surge handling.
  • Payouts: Kafka + Stripe – Shopify’s merchant payments.
  • Reconciliation: Periodic Kafka Batch – PayPal’s consistency checks.
  • Low Latency: Redis + CDN – Afterpay’s fast checkout.
  • Availability: Multi-Region PostgreSQL – Visa’s reliable uptime.
⚠️ **GitHub.com Fallback** ⚠️