System design QA - rs-hash/GETTHATJOB GitHub Wiki

1. Handling Millions of Requests per Second

Interviewer: “Your system needs to handle millions of requests per second. What’s your approach?”

You: I’d start with a load balancer that distributes traffic evenly across multiple application servers. A reverse proxy such as NGINX, or a managed Layer 7 load balancer such as AWS ALB, adds routing intelligence like path-based routing and TLS termination.
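As a rough illustration of "evenly distribute traffic", here is a minimal round-robin sketch. The class name `RoundRobinBalancer` and the backend names are hypothetical; real load balancers such as NGINX or ALB also handle health checks, weights, and connection draining, none of which is modeled here.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed rotation (illustrative only)."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        # cycle() repeats the backend list endlessly, in order.
        self._rotation = cycle(list(backends))

    def pick(self):
        """Return the backend that should receive the next request."""
        return next(self._rotation)


# Hypothetical backend names, used only for the demonstration.
balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
assignments = [balancer.pick() for _ in range(6)]
print(assignments)
```

Each backend receives every third request, which is the simplest form of even distribution.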

2. Server Failures and High Availability

Interviewer: “What happens if a server goes down?”

You: I’d replicate data across multiple servers and use a heartbeat mechanism to detect failures. The load balancer stops routing traffic to a server that misses its heartbeats, and cloud auto-scaling replaces it with a fresh instance.
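A heartbeat detector can be sketched in a few lines. This is an assumption-laden toy: `HeartbeatMonitor`, the 5-second timeout, and the server names are all made up, and timestamps are passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Marks a server as failed when its last heartbeat is too old."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # server name -> time of last heartbeat

    def record_heartbeat(self, server, now):
        self._last_seen[server] = now

    def failed_servers(self, now):
        """Return servers whose last heartbeat is older than the timeout."""
        return sorted(
            server
            for server, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("app-1", now=0.0)
monitor.record_heartbeat("app-2", now=0.0)
monitor.record_heartbeat("app-1", now=4.0)  # app-2 stays silent

failed = monitor.failed_servers(now=6.0)
print(failed)
```

In a real deployment the list of failed servers would feed the load balancer's health state and the auto-scaler's replacement logic.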

3. Scaling for Traffic Spikes

Interviewer: “What if traffic spikes overnight?”

You: I’d scale horizontally, adding more servers dynamically behind the load balancer using auto-scaling groups or Kubernetes clusters.

I’d also put a caching layer such as Redis or Memcached in front of the backend to reduce its load.
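One common way to use such a cache is the cache-aside pattern: check the cache first and only hit the backend on a miss. The sketch below uses a plain dictionary as a stand-in for Redis or Memcached; `CacheAside` and `slow_lookup` are hypothetical names, and expiry and invalidation are left out.

```python
class CacheAside:
    """Cache-aside reads: serve from cache, fall back to the backend on a miss."""

    def __init__(self, load_from_backend):
        self._load = load_from_backend
        self._cache = {}  # stand-in for an external cache such as Redis
        self.backend_calls = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]  # cache hit: the backend is not touched
        self.backend_calls += 1
        value = self._load(key)      # cache miss: query the backend once
        self._cache[key] = value     # populate the cache for later reads
        return value


def slow_lookup(key):
    """Hypothetical backend query."""
    return f"profile:{key}"


store = CacheAside(slow_lookup)
first = store.get("user-42")
second = store.get("user-42")
print(first, second, store.backend_calls)
```

The second read is served from the cache, so the backend is queried only once for the same key.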

4. Storing Large Volumes of Data

Interviewer: “How would you store terabytes or petabytes of data?”

You: I’d shard the database and distribute data across multiple nodes. A technique like consistent hashing keeps the shards balanced and limits how much data has to move when nodes are added or removed.
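A minimal consistent-hashing sketch, assuming SHA-256 as the hash function and 100 virtual nodes per physical node (both arbitrary choices for illustration). `ConsistentHashRing` and the `db-*` node names are hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Virtual nodes spread each physical node around the ring,
        # which evens out the share of keys each node owns.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        """Return the first node clockwise from the key's position."""
        if not self._ring:
            raise LookupError("ring has no nodes")
        index = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[index % len(self._ring)][1]  # wrap around


ring = ConsistentHashRing(["db-1", "db-2", "db-3"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("db-4")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

The useful property is that every key that moves lands on the new node; keys never shuffle between the existing nodes, unlike with a plain `hash(key) % node_count` scheme.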

5. Ensuring Data Durability

Interviewer: “How do you ensure no data is lost?”

You: Through replication. I’d keep multiple copies of the data using a primary-replica setup or a leaderless replication model. For mission-critical systems, multi-region replication also provides disaster recovery.

6. Dealing with Write Performance Issues

Interviewer: “Won’t replication slow down writes?”

You: It depends on whether we prioritize strong consistency or eventual consistency.

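For strong consistency, wait until a write has propagated to all replicas (or to a quorum of them) before returning success, accepting the extra write latency.

For high write throughput, go with eventual consistency, as offered by systems like DynamoDB or Cassandra: acknowledge the write early and let the replicas catch up asynchronously.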

7. Conflict Resolution in Distributed Systems

Interviewer: “How do you handle conflicting writes in distributed databases?”

You: I’d track versions of the data with vector clocks or timestamps. Timestamps allow a simple last-write-wins policy, while vector clocks detect concurrent writes so that conflicts can be resolved at read time using application logic.
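To make the vector clock idea concrete, here is a small sketch that represents a clock as a dictionary of per-node counters. The function names `compare` and `merge` and the node names `A` and `B` are assumptions for illustration, not any particular database's API.

```python
def compare(a, b):
    """Classify clock a relative to clock b.

    Returns "before", "after", "equal", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither version saw the other: a real conflict
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def merge(a, b):
    """Combine two clocks by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


newer = compare({"A": 2, "B": 1}, {"A": 1, "B": 1})
conflict = compare({"A": 2}, {"B": 1})
merged = merge({"A": 2}, {"B": 1})
print(newer, conflict, merged)
```

Only the "concurrent" case needs application-level resolution; in the other cases one version simply supersedes the other.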

8. Write-Heavy Use Cases

Interviewer: “When would you use leaderless replication?”

You: Leaderless systems like Cassandra are great for write-heavy use cases where speed matters more than consistency, e.g., logging systems or IoT data collection.

9. Quorums for Read/Write Operations

Interviewer: “What’s a quorum, and when would you use it?”

You: A quorum is the minimum number of nodes that must confirm an operation for it to succeed. For example:

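A write quorum (W) ensures data is stored safely on enough replicas before the write is acknowledged.

A read quorum (R) ensures the most recent data is retrieved, provided that R + W > N for N replicas, so that every read overlaps at least one replica holding the latest write.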
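The usual rule of thumb is that read and write quorums must overlap, which holds when R + W > N for N replicas. The brute-force check below is only an illustration of that rule; the function name `quorums_always_overlap` is made up.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every possible write set shares a node with every read set."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


# N=3, W=2, R=2: any read touches at least one node with the latest write.
strict = quorums_always_overlap(3, 2, 2)
# N=3, W=1, R=1: a read may hit a node that never saw the write.
loose = quorums_always_overlap(3, 1, 1)
print(strict, loose)
```

With N=3, choosing W=2 and R=2 guarantees overlap, while W=1 and R=1 trades that guarantee for lower latency.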

10. Real-World Example

Interviewer: “Can you give an example where this design is applied?”

You: Imagine a video streaming platform like YouTube:

When a user uploads a video, it’s stored in a distributed file system.

A pub-sub system (e.g., Kafka) publishes an upload event that triggers tasks like transcoding the video to different resolutions (720p, 1080p).

Replication ensures videos are available in multiple regions for faster playback.
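The upload-then-process flow above can be sketched with a tiny in-memory broker. This is not the Kafka API: `InMemoryBroker`, the `video.uploaded` topic, and the `v123` video ID are all hypothetical, and delivery here is synchronous, whereas a real broker would deliver asynchronously and durably.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy pub-sub broker: each subscriber to a topic gets every message."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)


jobs = []  # records the transcoding work that would be scheduled


def make_transcoder(resolution):
    """Build a handler that queues a transcoding job for one resolution."""
    def handle(event):
        jobs.append((event["video_id"], resolution))
    return handle


broker = InMemoryBroker()
for resolution in ("720p", "1080p"):
    broker.subscribe("video.uploaded", make_transcoder(resolution))

# The upload service publishes one event; both transcoders react to it.
broker.publish("video.uploaded", {"video_id": "v123"})
print(jobs)
```

The uploader does not need to know which consumers exist, so new processing steps (thumbnails, captions) can be added by subscribing another handler.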