System Design Concepts - ashishranjandev/interview-wiki GitHub Wiki
- Capability of a system, process, or network to grow and manage increased demand.
- A system that can evolve to support a growing amount of work is considered scalable.
- Still, system performance decreases with scale because of:
  - Slower network due to the distance between machines.
  - Tasks that are atomic in nature and cannot be parallelised.
  - Flaws in the system design.
- Vertical scalability has a hard limit. Beyond that, scaling up is prohibitively expensive or not possible at all.
- Vertical scalability is simpler: no application changes or abstraction layer are needed.
- The OS design or the application itself may limit vertical scaling. MySQL can suffer lock contention (especially with the old MyISAM storage engine).
- I/O capacity can be added by adding more hard drives in RAID (Redundant Array of Independent Disks). RAID 10 provides both redundancy and throughput, and the application treats it as a single disk.
- SSDs have faster random reads and writes. However, MySQL mostly performs sequential I/O, and databases like Cassandra always use sequential I/O for writes and for most reads. Hence SSDs do not help much here, as their sequential reads and writes are not much faster.
- Horizontal scalability is harder to achieve. It is either built in from scratch or added later (which requires a lot of development effort).
- No hard limit.
- Cost efficient: just add more servers when needed.
- A lot of development and design effort is needed, hence we should scale out in sequence and only as far as needed:
  Web servers -> Cache -> DBs -> Other persistence stores.
API Design
Data-modifying operations on the data store and the cache are not in a single transaction, so they can become inconsistent.
A single cache server represents a single point of failure. We over-provision memory by a certain percentage.
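As an illustration, here is a minimal cache-aside sketch (the `db` and `cache` dicts are hypothetical in-memory stand-ins, not a real cache client): the database write and the cache invalidation are two separate steps, so a failure between them leaves stale data in the cache.

```python
# Minimal cache-aside sketch; `db` and `cache` are hypothetical
# in-memory stand-ins for a real data store and cache server.
db, cache = {}, {}

def write(key, value):
    db[key] = value        # step 1: update the data store
    # A crash right here leaves the old value in the cache -
    # the two steps are not wrapped in a single transaction.
    cache.pop(key, None)   # step 2: invalidate the cached copy

def read(key):
    if key not in cache:         # cache miss
        cache[key] = db[key]     # repopulate from the data store
    return cache[key]

write("user:1", "Alice")
print(read("user:1"))  # Alice (read from the data store, then cached)
```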
A CDN is a network of geographically dispersed servers used to deliver static content.
CDNs now support dynamic content caching as well - https://aws.amazon.com/cloudfront/dynamic-content/
Considerations:
- Cost
- Cache Expiry
- CDN Fallback
- Invalidating Files
Challenges
- Resharding Data - consistent hashing can help us avoid constant resharding.
- Celebrity Problem - one shard contains a record with a very high volume of traffic; we may need to allocate a dedicated shard for such records.
- Joins and De-normalisation - joins across shards do not work, hence we need to denormalise.
In a distributed system, data has to be stored on multiple servers to enable horizontal scaling.
How do we distribute data evenly among servers? The common answer is simple hashing:
we pass the key through a hash function to find the server number.
serverIndex = hash(key) % N
where N is the size of the server pool.
A good hash function distributes keys evenly across the entire range.
Assumption: as long as the number of servers remains the same, an object key will always map to the same server.
The problem arises when a server is added or removed.
To keep the mapping consistent, most keys would need to be moved, even those that did not belong to the server that went down.
If the addition/removal of servers happens frequently, the resulting cache misses would make the system unusable.
Hence we need a scheme that is not impacted so much by frequent addition and removal of servers.
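A quick sketch of how badly simple modulo hashing behaves on resize (illustrative Python; the key names and server counts are arbitrary choices): going from 4 to 5 servers remaps far more than 1/5 of the keys.

```python
import hashlib

def server_index(key: str, n_servers: int) -> int:
    # serverIndex = hash(key) % N
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % n_servers

keys = [f"user:{i}" for i in range(1000)]
before = {k: server_index(k, 4) for k in keys}  # 4 servers
after = {k: server_index(k, 5) for k in keys}   # one server added

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys had to move")
```

A key stays put only when hash % 4 == hash % 5, which holds for roughly 1 key in 5, so about 80% of the keys move even though only one server was added.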
Consistent hashing is a special kind of hashing such that when the hash table is resized, only k/n keys need to be remapped on average, where k is the number of keys and n is the number of slots.
Along with hashing object keys, we hash server names as well.
We choose a hash function whose output falls in a range x0 to xn, e.g. for SHA-1 it is 0 to 2^160 - 1, so x0 = 0 and xn = 2^160 - 1.
We connect x0 and xn to form a ring.

Now we place all the keys and servers on the ring based on the hash output.

In which server will a key be stored? We go clockwise from the key's position until we reach a server.
This way, if a server is removed, we only need to move the keys that were stored on that server. Likewise, if a new server is added, we only need to move the keys that lie just anti-clockwise of the new server, up to the previous existing server. Other keys are not affected.


Problem with this design? The servers and keys might not be distributed evenly around the ring. Solution? Virtual nodes.
We place multiple virtual nodes for a single server on the ring. The more virtual nodes, the more even the distribution, but the more space is needed to store the metadata - hence a tradeoff.
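The ring, the clockwise lookup, and virtual nodes can be sketched as follows (an illustrative Python implementation; the server names and the vnode count of 100 are arbitrary choices, not from the text):

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # SHA-1 gives a value in the range 0 .. 2^160 - 1.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self._ring = []  # sorted list of (hash, server): positions on the ring
        self._vnodes = vnodes
        for server in servers:
            self.add(server)

    def add(self, server):
        # Place `vnodes` virtual nodes per server for an even spread.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#vn{i}"), server))

    def remove(self, server):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, key):
        # Go clockwise: first virtual node at or after the key's hash.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        if idx == len(self._ring):
            idx = 0  # wrap around: x0 and xn are connected
        return self._ring[idx][1]

ring = ConsistentHashRing(["server-1", "server-2", "server-3"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("server-4")  # only keys just anti-clockwise of server-4 move
after = {k: ring.lookup(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved")  # roughly 1/4, not ~3/4
```

Adding a fourth server moves only the keys it takes over (about a quarter of them), in contrast to the ~80% churn of simple modulo hashing.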

Uses:
| Product | Use |
|---|---|
| Dynamo DB, Cassandra | Data Partitioning |
| CDN | Distribute Web Content Evenly |
| Google Load Balancers | Distribute Persistent Connections evenly among backend servers |