BNPL Checkout Experience - ashtishad/system-design GitHub Wiki
Functional Requirements:
- Users can select BNPL at checkout.
- Users can split payments into installments (e.g., 4 payments).
- Merchants can receive instant payouts.
Non-Functional Requirements:
- Transaction Consistency: No double payments or missed payouts.
- Scalability: Handle traffic surges (e.g., Black Friday, roughly 65× average load).
- Low Latency: Checkout <500ms.
- Availability: 99.99% uptime.
Capacity Estimation (5 years):
- DAU: 10M users.
- Transactions: 1B/year * 5 = 5B transactions.
- Storage: ~3TB raw, ~30TB with replication/indexes.
- QPS: Avg: ~32 QPS; Peak: ~2,083 QPS during surge events (~65× average).
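The estimates above can be sanity-checked with quick back-of-envelope arithmetic. The ~600 bytes/transaction row size and the 10× replication/index multiplier are assumptions introduced here to make the stated 3 TB / 30 TB figures work out:

```python
# Back-of-envelope check for the capacity numbers above.
# Assumed (not given in the source): ~600 bytes per transaction row,
# 10x overhead for replication + indexes, ~65x surge multiplier.

SECONDS_PER_YEAR = 365 * 24 * 3600

tx_per_year = 1_000_000_000
years = 5
total_tx = tx_per_year * years             # 5B transactions over 5 years

avg_qps = tx_per_year / SECONDS_PER_YEAR   # ~32 QPS average
peak_qps = avg_qps * 65                    # ~2,000+ QPS at peak

bytes_per_tx = 600                         # assumed average row size
raw_tb = total_tx * bytes_per_tx / 1e12    # ~3 TB raw
with_overhead_tb = raw_tb * 10             # ~30 TB with replicas + indexes

print(f"avg={avg_qps:.0f} QPS, peak={peak_qps:.0f} QPS, "
      f"raw={raw_tb:.1f} TB, total={with_overhead_tb:.0f} TB")
```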
Core Entities:
- User: {id, name, credit_limit, payment_history}
- Merchant: {id, name, payout_account}
- Transaction: {id, user_id, merchant_id, amount, installments[], status}
- Installment: {id, tx_id, amount, due_date, status}
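The entity shapes above translate directly into record types. A minimal sketch (field types and defaults are assumptions, not prescribed by the design):

```python
from dataclasses import dataclass, field

# Sketch of the core entities; types and default statuses are assumed.

@dataclass
class User:
    id: str
    name: str
    credit_limit: float
    payment_history: list = field(default_factory=list)

@dataclass
class Merchant:
    id: str
    name: str
    payout_account: str

@dataclass
class Installment:
    id: str
    tx_id: str
    amount: float
    due_date: str            # e.g. ISO date "2025-01-15"
    status: str = "pending"

@dataclass
class Transaction:
    id: str
    user_id: str
    merchant_id: str
    amount: float
    installments: list = field(default_factory=list)
    status: str = "initiated"
```

For the 4-installment split, each Installment carries amount / 4 and the Transaction's installments list holds all four.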
API Endpoints:
- POST /checkout/init: {user_id, merchant_id, amount}, initiates BNPL.
- POST /checkout/confirm: {tx_id, payment_method}, confirms transaction.
- GET /transactions/{user_id}: Returns history.
High-Level Design:
- Client: Web/mobile.
- API Gateway: Routes, auth (JWT), rate-limiting.
Microservices:
- Checkout Service: BNPL flow, consistency.
- Payment Service: Installments, payouts (Stripe).
- History Service: Transaction history.
Data Stores:
- PostgreSQL: Transactions/users (ACID).
- Redis: Session caching, idempotency.
Flow:
- Init → Checkout Service → PostgreSQL.
- Confirm → Payment Service → Stripe → PostgreSQL.
- History → History Service → PostgreSQL.
1. Transaction Consistency (Prevent Double Payments)
- Problem: Ensure no duplicate transactions at 2,083 QPS peak.
Approaches & Tradeoffs:
Isolation Level: Serializable
- How: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; BEGIN; SELECT ... INSERT ... COMMIT;
- Pros: Prevents all concurrency anomalies (e.g., phantom reads), guarantees no duplicates.
- Cons: High contention, ~50% throughput drop (1K QPS max), aborts on conflicts.
Isolation Level: Read Committed
- How: Default in PostgreSQL, BEGIN; SELECT FOR UPDATE; INSERT; COMMIT;
- Pros: Better concurrency than Serializable, ~5ms latency.
- Cons: Allows non-repeatable reads and lost updates (it does prevent dirty reads), so explicit locking (FOR UPDATE) is required for safety.
Pessimistic Locking
- How: SELECT * FROM transactions WHERE tx_id = {tx_id} FOR UPDATE NOWAIT;
- Pros: Immediate rejection of duplicates, ACID-compliant.
- Cons: Lock contention at 2K QPS (~10ms), scales poorly without sharding.
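The fail-fast behavior of FOR UPDATE NOWAIT can be modeled in-process: a second worker touching the same tx_id is rejected immediately instead of queueing. This is only an analogy for the row-lock semantics, not a database client:

```python
import threading

# In-process analogy for SELECT ... FOR UPDATE NOWAIT: the second worker
# trying to lock the same tx_id fails immediately rather than blocking.
# Names (try_lock_tx, unlock_tx) are illustrative, not from the design.

_row_locks: dict = {}
_registry_guard = threading.Lock()

def try_lock_tx(tx_id: str) -> bool:
    """True if we acquired the 'row lock'; False means busy (NOWAIT reject)."""
    with _registry_guard:
        lock = _row_locks.setdefault(tx_id, threading.Lock())
    return lock.acquire(blocking=False)

def unlock_tx(tx_id: str) -> None:
    _row_locks[tx_id].release()

assert try_lock_tx("tx-1") is True
assert try_lock_tx("tx-1") is False   # duplicate attempt rejected at once
unlock_tx("tx-1")
```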
Optimistic Locking
- How: Add version to transactions; UPDATE ... WHERE tx_id = {tx_id} AND version = {old_version};
- Pros: No locks, high concurrency (~2ms), great for reads.
- Cons: Retries on conflict (5-10% at peak), client complexity.
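The version-check UPDATE above is a compare-and-swap: it succeeds only if the row still carries the version the writer read, and a zero rowcount signals a conflict. A runnable sketch, with sqlite3 standing in for PostgreSQL and an illustrative schema:

```python
import sqlite3

# Optimistic locking sketch: the UPDATE only matches if the version is
# unchanged; rowcount == 0 means another writer won and the caller
# should re-read and retry. sqlite3 stands in for PostgreSQL here.

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE transactions (tx_id TEXT PRIMARY KEY, status TEXT, version INTEGER)"
)
db.execute("INSERT INTO transactions VALUES ('tx-1', 'initiated', 1)")

def confirm(conn, tx_id, expected_version):
    cur = conn.execute(
        "UPDATE transactions SET status = 'confirmed', version = version + 1 "
        "WHERE tx_id = ? AND version = ?",
        (tx_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1   # False => version conflict, caller retries

assert confirm(db, "tx-1", 1) is True    # first writer wins
assert confirm(db, "tx-1", 1) is False   # stale version -> conflict
```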
Database Constraints
- How: UNIQUE (tx_id) index, reject duplicates on INSERT.
- Pros: Simple, zero application logic, instant rejection.
- Cons: No graceful dedup handling (a legitimate client retry gets a hard constraint error instead of the original result); unique index adds ~20% write overhead.
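The constraint approach needs zero application logic beyond catching the violation. A minimal demonstration, again with sqlite3 standing in for PostgreSQL:

```python
import sqlite3

# A UNIQUE index makes the second INSERT of the same tx_id fail at the
# database layer, with no deduplication code in the application.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transactions (tx_id TEXT UNIQUE, amount REAL)")

def insert_once(conn, tx_id, amount):
    try:
        conn.execute("INSERT INTO transactions VALUES (?, ?)", (tx_id, amount))
        return True
    except sqlite3.IntegrityError:
        return False   # duplicate rejected; no double charge recorded

assert insert_once(db, "tx-1", 100.0) is True
assert insert_once(db, "tx-1", 100.0) is False  # blocked by the unique index
```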
Hybrid (Redis + PostgreSQL)
- How: Check tx_id in Redis (TTL=1h), then INSERT with UNIQUE in PostgreSQL.
- Pros: Redis (<1ms) filters duplicates, PostgreSQL ensures durability.
- Cons: Redis failure risks temporary duplicates (mitigated by PostgreSQL).
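The hybrid flow can be sketched end to end: a TTL'd in-memory set plays the role of the Redis fast path, and a UNIQUE constraint in the relational store (sqlite3 standing in for PostgreSQL) is the durable backstop when the cache is cold or has failed. Class and function names are illustrative:

```python
import sqlite3
import time

# Hybrid idempotency sketch: fast TTL cache first, durable UNIQUE
# constraint second. Losing the cache degrades latency, not correctness.

class TtlCache:
    def __init__(self, ttl_seconds=3600):
        self.ttl, self.seen = ttl_seconds, {}

    def add_if_new(self, key):
        now = time.monotonic()
        expiry = self.seen.get(key)
        if expiry is not None and expiry > now:
            return False              # duplicate within TTL
        self.seen[key] = now + self.ttl
        return True

cache = TtlCache(ttl_seconds=3600)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transactions (tx_id TEXT UNIQUE, amount REAL)")

def process(tx_id, amount):
    if not cache.add_if_new(tx_id):            # fast path (<1ms in Redis)
        return "duplicate (cache)"
    try:
        db.execute("INSERT INTO transactions VALUES (?, ?)", (tx_id, amount))
        return "accepted"
    except sqlite3.IntegrityError:
        return "duplicate (db)"                # backstop if cache missed it

assert process("tx-1", 100.0) == "accepted"
assert process("tx-1", 100.0) == "duplicate (cache)"
cache.seen.clear()                              # simulate Redis outage/flush
assert process("tx-1", 100.0) == "duplicate (db)"
```

Note how clearing the cache (the Redis-failure scenario) still cannot produce a double insert: the database constraint catches what the cache misses.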
Recommendation
- Industry Example (Klarna): Klarna uses idempotency tokens (e.g., tx_id) with a relational DB (likely PostgreSQL) and caching to deduplicate retries, ensuring consistency.
- Optimal Solution: Hybrid—Redis for fast idempotency checks (2K QPS, <1ms), PostgreSQL with UNIQUE constraint and Read Committed + FOR UPDATE for final safety. Handles 2,083 QPS with <5ms latency, tolerates Redis downtime.
- Tech Details: Redis: 100K ops/s, PostgreSQL: 10 nodes, 200 QPS/node.
2. Scalability for Surges
- Problem: 2,083 QPS during Black Friday.
- Solution: Load balancers, PostgreSQL read replicas, Kafka for payouts.
- Tech Details: Replicas: 2K QPS, Kafka: 10K msg/s.
3. Low-Latency Checkout
- Problem: <500ms at peak.
- Solution: Redis caching (user/merchant), CDN for UI.
- Industry Example (Afterpay): Caches eligibility for speed.
- Tech Details: Redis: <1ms, CDN: 80% hit rate.
4. Merchant Payouts
- Problem: Near-instant merchant payouts at scale.
- Solution: Kafka streams payout events to Stripe in ~10-minute batches (near-instant from the merchant's perspective, without a Stripe call per transaction).
- Tech Details: Kafka: 2K events/s.
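The batching step can be sketched as an accumulator that a Kafka consumer flushes on a ~10-minute timer, aggregating one transfer per merchant. The Stripe call is stubbed; class and event names are illustrative:

```python
from collections import defaultdict

# Payout batching sketch: events accumulate per merchant; flush() is
# invoked periodically (e.g., every ~10 min) and would issue one Stripe
# transfer per merchant instead of one per transaction.

class PayoutBatcher:
    def __init__(self):
        self.pending = defaultdict(float)   # merchant_id -> amount owed

    def add_event(self, merchant_id, amount):
        self.pending[merchant_id] += amount

    def flush(self):
        """Return transfers to send; in production each entry would be
        a Stripe transfer call (stubbed out here)."""
        transfers = dict(self.pending)
        self.pending.clear()
        return transfers

batcher = PayoutBatcher()
batcher.add_event("m-1", 100.0)
batcher.add_event("m-1", 50.0)
batcher.add_event("m-2", 75.0)
assert batcher.flush() == {"m-1": 150.0, "m-2": 75.0}
assert batcher.flush() == {}   # nothing pending after a flush
```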
5. High Availability
- Problem: 99.99% uptime.
- Solution: Multi-region PostgreSQL, SQS retries.
- Tech Details: Failover: 5s, SQS: 1K retries/s.
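The SQS retry path amounts to retry-with-backoff plus a dead-letter fallback. A sketch of that control flow (delays are computed rather than slept so it runs instantly; function names are illustrative):

```python
# Retry sketch for the SQS-driven path: a failed payout call is retried
# with exponentially growing delays, then dead-lettered after the limit.

def backoff_schedule(base=1.0, factor=2.0, max_attempts=5):
    """Delay (seconds) before each retry: 1, 2, 4, 8, ..."""
    return [base * factor**i for i in range(max_attempts)]

def run_with_retries(operation, max_attempts=5):
    for _attempt in range(max_attempts):
        try:
            return operation()
        except RuntimeError:
            continue           # in production: re-enqueue to SQS with delay
    return "dead-letter"       # give up; goes to a dead-letter queue

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

assert backoff_schedule() == [1.0, 2.0, 4.0, 8.0, 16.0]
assert run_with_retries(flaky) == "ok"
assert calls["n"] == 3         # succeeded on the third attempt
```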
```text
Client (Web/Mobile) <------> API Gateway (JWT, Routing)
                                  |
        +-------------------------+--------------------------+
        |                         |                          |
 Checkout Service          Payment Service            History Service
  (BNPL Logic)            (Stripe, Payouts)            (Tx History)
        |                         |                          |
        +------------+------------+-------------+------------+
                     |                          |
     Redis <----> PostgreSQL             Kafka (Payout Queue)
 (Cache, Idemp.)  (Tx, Users)                   |
                                   Stripe (Payments) <----> SQS (Retry Queue)
```
Tech Stack Summary:
- Authentication: PostgreSQL + JWT + Redis – Klarna’s secure checkout.
- Transaction Consistency: PostgreSQL (ACID) + Redis – Klarna’s idempotent payments.
- Scalability: Kafka + PostgreSQL Replicas – Afterpay’s surge handling.
- Low Latency: Redis + CDN – Afterpay’s fast checkout.
- Payouts: Kafka + Stripe – Tabby.ai’s instant merchant payouts.
- Availability: Multi-Region PostgreSQL – Stripe’s reliable uptime.