BNPL Checkout Experience - ashtishad/system-design GitHub Wiki

1. Requirements

Functional Requirements:

  • Users can select BNPL at checkout.
  • Users can split payments into installments (e.g., 4 payments).
  • Merchants can receive instant payouts.

Non-Functional Requirements:

  • Transaction Consistency: No double payments or missed payouts.
  • Scalability: Handle traffic surges (e.g., Black Friday peaks of roughly 65× average load).
  • Low Latency: Checkout <500ms.
  • Availability: 99.99% uptime.
  • Capacity Estimation (5 years):
    • DAU: 10M users.
    • Transactions: 1B/year * 5 = 5B transactions.
    • Storage: ~3TB raw, ~30TB with replication/indexes.
    • QPS: Avg: ~32 QPS; Peak: ~2,083 QPS during surge events (~65× average).
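The capacity figures above can be sanity-checked with quick arithmetic. The per-row size below is an assumption chosen to be consistent with the stated ~3TB raw figure, not a measured value:

```python
# Back-of-the-envelope check for the capacity estimates above.
# Assumptions: 1B transactions/year; ~600 bytes per transaction row.

SECONDS_PER_YEAR = 365 * 24 * 3600

tx_per_year = 1_000_000_000
avg_qps = tx_per_year / SECONDS_PER_YEAR           # ~31.7 QPS average
peak_qps = 2_083                                   # stated peak target
surge_factor = peak_qps / avg_qps                  # ~66x average load

bytes_per_tx = 600                                 # assumed row size
raw_storage_tb = 5 * tx_per_year * bytes_per_tx / 1e12  # ~3 TB over 5 years

print(round(avg_qps, 1), round(surge_factor), round(raw_storage_tb, 1))
```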

2. Core Entities

  • User: {id, name, credit_limit, payment_history}
  • Merchant: {id, name, payout_account}
  • Transaction: {id, user_id, merchant_id, amount, installments[], status}
  • Installment: {id, tx_id, amount, due_date, status}

3. APIs

  • POST /checkout/init: {user_id, merchant_id, amount}, initiates BNPL.
  • POST /checkout/confirm: {tx_id, payment_method}, confirms transaction.
  • GET /transactions/{user_id}: Returns history.
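A minimal sketch of the installment split that `/checkout/init` would perform for a pay-in-4 plan. Amounts are handled in integer cents to avoid floating-point rounding; the function and field names are illustrative, not from the actual service:

```python
# Illustrative sketch: split a checkout amount into n equal installments,
# with any remainder cents added to the first payment so totals reconcile.

def split_installments(amount_cents: int, n: int = 4) -> list[int]:
    """Split amount_cents into n installments; remainder goes on payment 1."""
    base = amount_cents // n
    remainder = amount_cents - base * n
    return [base + remainder] + [base] * (n - 1)

print(split_installments(10000))   # $100.00 -> [2500, 2500, 2500, 2500]
print(split_installments(10003))   # $100.03 -> [2503, 2500, 2500, 2500]
```

Putting the remainder on the first installment keeps later (riskier) payments no larger than the first, a common convention in pay-in-4 products.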

4. High-Level Design

  • Client: Web/mobile.
  • API Gateway: Routes, auth (JWT), rate-limiting.
  • Microservices:
    • Checkout Service: BNPL flow, consistency.
    • Payment Service: Installments, payouts (Stripe).
    • History Service: Transaction history.
  • Data Stores:
    • PostgreSQL: Transactions/users (ACID).
    • Redis: Session caching, idempotency.
  • Flow:
    • Init → Checkout Service → PostgreSQL.
    • Confirm → Payment Service → Stripe → PostgreSQL.
    • History → History Service → PostgreSQL.

5. Deep Dives

1. Transaction Consistency (Prevent Double Payments)

  • Problem: Ensure no duplicate transactions at 2,083 QPS peak.
  • Approaches & Tradeoffs:
    • Isolation Level: Serializable
      • How: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; BEGIN; SELECT ... INSERT ... COMMIT;
      • Pros: Prevents all concurrency anomalies (e.g., phantom reads), guarantees no duplicates.
      • Cons: High contention, ~50% throughput drop (1K QPS max), aborts on conflicts.
    • Isolation Level: Read Committed
      • How: Default in PostgreSQL, BEGIN; SELECT FOR UPDATE; INSERT; COMMIT;
      • Pros: Better concurrency than Serializable, ~5ms latency.
      • Cons: Allows non-repeatable reads and lost updates between statements; safe only with explicit locking (SELECT FOR UPDATE).
    • Pessimistic Locking
      • How: SELECT * FROM transactions WHERE tx_id = {tx_id} FOR UPDATE NOWAIT;
      • Pros: Immediate rejection of duplicates, ACID-compliant.
      • Cons: Lock contention at 2K QPS (~10ms), scales poorly without sharding.
    • Optimistic Locking
      • How: Add version to transactions; UPDATE ... WHERE tx_id = {tx_id} AND version = {old_version};
      • Pros: No locks, high concurrency (~2ms), great for reads.
      • Cons: Retries on conflict (5-10% at peak), client complexity.
    • Database Constraints
      • How: UNIQUE (tx_id) index, reject duplicates on INSERT.
      • Pros: Simple, zero application logic, instant rejection.
      • Cons: Rejects duplicates but cannot distinguish a legitimate retry from an error (caller must handle the unique violation); index adds ~20% write overhead.
    • Hybrid (Redis + PostgreSQL)
      • How: Check tx_id in Redis (TTL=1h), then INSERT with UNIQUE in PostgreSQL.
      • Pros: Redis (<1ms) filters duplicates, PostgreSQL ensures durability.
      • Cons: Redis failure risks temporary duplicates (mitigated by PostgreSQL).
  • Industry Example: Klarna uses idempotency tokens (e.g., tx_id) with a relational DB (likely PostgreSQL) plus caching to deduplicate retries, ensuring consistency.
  • Optimal Solution: Hybrid—Redis for fast idempotency checks (2K QPS, <1ms), PostgreSQL with UNIQUE constraint and Read Committed + FOR UPDATE for final safety. Handles 2,083 QPS with <5ms latency, tolerates Redis downtime.
  • Tech Details: Redis: 100K ops/s, PostgreSQL: 10 nodes, 200 QPS/node.
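The hybrid approach above can be sketched with in-process stand-ins: a dict simulating Redis (fast TTL-based filter) and a set simulating PostgreSQL's UNIQUE (tx_id) index (durable safety net). Names and return strings are illustrative:

```python
# Sketch of the hybrid idempotency check: Redis filters most duplicate
# retries quickly; the PostgreSQL unique constraint catches anything Redis
# misses (e.g., after an outage or TTL expiry).
import time

redis_cache: dict[str, float] = {}   # tx_id -> expiry timestamp (TTL ~1h)
pg_unique_index: set[str] = set()    # simulates UNIQUE constraint on tx_id
TTL_SECONDS = 3600

def process_transaction(tx_id: str) -> str:
    now = time.time()
    # Fast path: Redis check (<1ms in the real system).
    expiry = redis_cache.get(tx_id)
    if expiry is not None and expiry > now:
        return "duplicate (redis)"
    # Durable path: INSERT would fail on the unique index if Redis missed it.
    if tx_id in pg_unique_index:
        return "duplicate (postgres)"
    pg_unique_index.add(tx_id)
    redis_cache[tx_id] = now + TTL_SECONDS
    return "processed"

print(process_transaction("tx-1"))   # processed
print(process_transaction("tx-1"))   # duplicate (redis)
redis_cache.clear()                  # simulate a Redis failure/flush
print(process_transaction("tx-1"))   # duplicate (postgres)
```

The last call shows why PostgreSQL remains the source of truth: even with Redis wiped, the duplicate is still rejected.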

2. Scalability for Surges

  • Problem: 2,083 QPS during Black Friday.
  • Solution: Load balancers, PostgreSQL read replicas, Kafka for payouts.
  • Tech Details: Replicas: 2K QPS, Kafka: 10K msg/s.

3. Low-Latency Checkout

  • Problem: <500ms at peak.
  • Solution: Redis caching (user/merchant), CDN for UI.
  • Industry Example (Afterpay): Caches eligibility for speed.
  • Tech Details: Redis: <1ms, CDN: 80% hit rate.

4. Merchant Payouts

  • Problem: Near-instant merchant payouts.
  • Solution: Kafka streams payout events to Stripe in 10-minute micro-batches (near-instant from the merchant's perspective, batched for API efficiency).
  • Tech Details: Kafka: 2K events/s.
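The batching step can be sketched as an aggregation over one batch window: payout events accumulate (on Kafka in the design) and are summed per merchant before a single Stripe transfer each. The event shape and function name are assumptions:

```python
# Sketch of payout micro-batching: aggregate one window of payout events
# into per-merchant totals (in cents) before calling the payment provider.
from collections import defaultdict

def batch_payouts(events: list[dict]) -> dict[str, int]:
    """Aggregate one batch window (e.g., 10 minutes) into per-merchant totals."""
    totals: dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["merchant_id"]] += e["amount_cents"]
    return dict(totals)

window = [
    {"merchant_id": "m1", "amount_cents": 2500},
    {"merchant_id": "m2", "amount_cents": 1000},
    {"merchant_id": "m1", "amount_cents": 7500},
]
print(batch_payouts(window))   # {'m1': 10000, 'm2': 1000}
```

Batching trades a few minutes of latency for far fewer provider API calls, which matters at 2K events/s.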

5. High Availability

  • Problem: 99.99% uptime.
  • Solution: Multi-region PostgreSQL, SQS retries.
  • Tech Details: Failover: 5s, SQS: 1K retries/s.
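The SQS retry path can be sketched as exponential backoff around a flaky provider call. The stub, delay values, and dead-letter behavior are assumptions for illustration:

```python
# Sketch of retry-with-exponential-backoff for failed payout calls.
# A real worker would sleep (with jitter) between attempts; here we only
# record the delays so the logic is easy to verify.

def retry_with_backoff(fn, max_attempts: int = 5, base_delay: float = 0.5):
    delays = []
    for attempt in range(max_attempts):
        try:
            return fn(), delays
        except RuntimeError:
            delays.append(base_delay * 2 ** attempt)   # 0.5s, 1s, 2s, ...
    raise RuntimeError("exhausted retries; send to dead-letter queue")

failures = iter([True, True, False])   # fail twice, then succeed

def flaky_stripe_call():
    if next(failures):
        raise RuntimeError("transient Stripe error")
    return "payout-ok"

result, delays = retry_with_backoff(flaky_stripe_call)
print(result, delays)   # payout-ok [0.5, 1.0]
```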

System Design Diagram

```text
+----------------+         +------------------+
|     Client     |<------->|   API Gateway    |
|  (Web/Mobile)  |         |  (JWT, Routing)  |
+----------------+         +------------------+
                                    |
          +-------------------------+-------------------------+
          |                         |                         |
+------------------+      +-------------------+      +------------------+
| Checkout Service |      |  Payment Service  |      | History Service  |
|   (BNPL Logic)   |      | (Stripe, Payouts) |      |   (Tx History)   |
+------------------+      +-------------------+      +------------------+
          |                         |                         |
+----------------+        +----------------+        +----------------+
|     Redis      |<------>|   PostgreSQL   |        |     Kafka      |
| (Cache, Idemp.)|        |  (Tx, Users)   |        | (Payout Queue) |
+----------------+        +----------------+        +----------------+
                                                            |
                                                   +----------------+
                                                   |     Stripe     |
                                                   |   (Payments)   |
                                                   +----------------+
                                                            |
                                                   +----------------+
                                                   |      SQS       |
                                                   |  (Retry Queue) |
                                                   +----------------+
```


Summary of Solutions and Industry Practices

  • Authentication: PostgreSQL + JWT + Redis – Klarna’s secure checkout.
  • Transaction Consistency: PostgreSQL (ACID) + Redis – Klarna’s idempotent payments.
  • Scalability: Kafka + PostgreSQL Replicas – Afterpay’s surge handling.
  • Low Latency: Redis + CDN – Afterpay’s fast checkout.
  • Payouts: Kafka + Stripe – Tabby.ai’s instant merchant payouts.
  • Availability: Multi-Region PostgreSQL – Stripe’s reliable uptime.