System Design - Core Principles
**SOLID Principles**
The SOLID principles are well known in OOP. There are five of them (a brief sketch of SRP and DIP follows the list):
- SRP (Single Responsibility Principle)
- OCP (Open/Closed Principle)
- LSP (Liskov Substitution Principle)
- ISP (Interface Segregation Principle)
- DIP (Dependency Inversion Principle)
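To make the list concrete, here is a minimal TypeScript sketch touching two of the principles, SRP and DIP; the names (`ReportService`, `ReportStore`) are invented for illustration:

```ts
// DIP: the high-level ReportService depends on this abstraction,
// not on any concrete storage technology.
interface ReportStore {
  save(name: string, content: string): void;
}

// One concrete implementation; others (file, S3, database, ...) could be
// added without touching ReportService.
class InMemoryReportStore implements ReportStore {
  private reports = new Map<string, string>();
  save(name: string, content: string): void {
    this.reports.set(name, content);
  }
}

// SRP: this class has exactly one reason to change -- how a report is built.
// Persistence is delegated to the injected ReportStore.
class ReportService {
  constructor(private store: ReportStore) {}

  publish(name: string, rows: string[]): void {
    this.store.save(name, rows.join("\n"));
  }
}

// Wiring happens at the edge of the application.
const service = new ReportService(new InMemoryReportStore());
service.publish("daily-summary", ["row 1", "row 2"]);
```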
**KISS (Keep It Simple, Stupid)**
"Keep it simple, stupid!" is a design principle first noted by the U.S. Navy in 1960. It states that most systems work best if they are kept simple.
There are many areas where this rule applies in programming. Two very important ones are:
a) Subprogram behavior and length: Subprograms should do precisely ONE conceptual task and no more. The length of a subprogram should allow it to be easily visually inspected; generally no more than one page in length. Similarly, you should generally not mix input/output and algorithmic logic in the same subprogram; it is always a goal to separate I/O from logic (see the sketch at the end of this section).
b) If a problem can be decomposed into two or more independently solvable problems, then solve them independently; after you have implemented and tested the independent solutions, combine them into the larger result. This is sometimes known as "Gall's Law":
"A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system."
**Rule of Three**
The Rule of Three is a code refactoring rule of thumb for deciding when a replicated piece of code should be replaced by a new procedure. It states that you are allowed to copy and paste the code once, but when the same code is replicated three times, it should be extracted into a new procedure.
The rule was introduced by Martin Fowler in his text "Refactoring" and attributed to Don Roberts.
Duplication in programming is almost always an indication of poorly designed code or poor coding habits. Duplication is a bad practice because it makes code harder to maintain.
When the rule encoded in a replicated piece of code changes, whoever maintains the code will have to change it in all places correctly. This process is error-prone and often leads to problems.
If the code exists in only one place, then it can easily be changed there. This rule can even be applied to a small number of lines of code, or even single lines of code.
For example, if you want to call a function, and then call it again when it fails, it's OK to have two call sites; however, if you want to try it five times before giving up, there should only be one call site inside a loop rather than 5 independent calls.
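A minimal sketch of that idea in TypeScript; the retry count and the wrapped call are purely illustrative:

```ts
// One call site inside a loop instead of five copied calls.
async function callWithRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(); // the single call site
    } catch (err) {
      lastError = err;   // remember the failure and try again
    }
  }
  throw lastError;       // give up after maxAttempts failures
}

// Usage (illustrative): retry a flaky call up to five times.
// const data = await callWithRetry(() => fetch("https://example.com/api").then(r => r.json()));
```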
**Design Cheat Sheet**
- Consider using a cache - if the system is read-heavy
- Use a cache & CDN - if you need a low-latency system
- Use a message queue for async processing - if the system is write-heavy
- Use RDBMS / SQL databases - if the system needs to be ACID-compliant
- Use a NoSQL database - if the data is unstructured and does not require ACID properties
- Use blob/object storage - if the system needs to store images, files, or videos
- Use Elasticsearch or a search index - if the system requires searching through high volumes of data
- Use database sharding - if the database needs to scale
- Use a load balancer - if the system requires high availability, performance, and throughput
- Use a graph database - if the system has data with nodes, edges, and relationships, such as friends lists or road connections
**ACID vs. BASE**
The ACID (Atomicity, Consistency, Isolation, Durability) model used in relational databases is too strict for NoSQL databases. The BASE model (Basically Available, Soft state, Eventual consistency) offers more flexibility, choosing availability over consistency: it states that the system will eventually reach a consistent state.
- **High Availability**: This means we need to ensure an agreed, high level of uptime. We often describe the design target as “3 nines” or “4 nines”. “4 nines”, i.e. 99.99% uptime, means the service can only be down for about 8.64 seconds per day.
To achieve high availability, we need to design redundancy in the system. There are several ways to do this:
- Hot-hot: two instances receive the same input and both send output to the downstream service. If one side goes down, the other side can immediately take over. Since both sides send output to the downstream, the downstream system needs to deduplicate.
- Hot-warm: two instances receive the same input, but only the hot side sends output to the downstream service. If the hot side goes down, the warm side takes over and starts sending output to the downstream service.
- Single-leader cluster: one leader instance receives data from the upstream system and replicates it to the other replicas.
- Leaderless cluster: there is no leader in this type of cluster; any write gets replicated to other instances. As long as the number of replicas written to (W) plus the number of replicas read from (R) is greater than the total number of replicas (N), i.e. W + R > N, a read will see the latest value (see the sketch below).
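The following TypeScript sketch illustrates that quorum condition with N = 3, W = 2, R = 2; the versioned in-memory "replicas" are a stand-in for real nodes:

```ts
// Leaderless replication: writing to W of N replicas and reading from R of
// them guarantees the read and write sets overlap when W + R > N.
type Versioned = { version: number; value: string };

const N = 3; // total replicas
const replicas: Versioned[] = Array.from({ length: N }, () => ({ version: 0, value: "" }));

const W = 2; // replicas that must acknowledge a write
const R = 2; // replicas consulted on a read; 2 + 2 > 3

function write(value: string, version: number): void {
  for (let i = 0; i < W; i++) {
    replicas[i] = { version, value }; // only W replicas get the update
  }
}

function read(): string {
  // Read from the *last* R replicas so the read set differs from the write
  // set, then keep the copy with the highest version number.
  const consulted = replicas.slice(N - R);
  const freshest = consulted.reduce((a, b) => (a.version >= b.version ? a : b));
  return freshest.value;
}

write("v1", 1);
console.log(read()); // "v1" -- the overlapping replica supplies the latest value
```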
- **High Throughput**: This means the service needs to handle a high number of requests in a given period of time. Commonly used metrics are QPS (queries per second) and TPS (transactions per second).
To achieve high throughput, we often add caches to the architecture so that requests can return without hitting slower I/O devices like databases or disks (a minimal cache-aside sketch follows). We can also increase the number of threads for computation-intensive tasks; however, adding too many threads can degrade performance, so we need to identify the bottlenecks in the system and increase their throughput. Asynchronous processing can often effectively isolate heavy-lifting components.
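A minimal cache-aside sketch in TypeScript; `loadFromDatabase` is a placeholder for whatever slow I/O the system actually performs, and a production cache would also need expiry and invalidation:

```ts
// Cache-aside: answer from the in-memory cache when possible and only fall
// back to the slower data store on a miss.
const cache = new Map<string, string>();

// Stand-in for a slow I/O call (database, disk, remote service).
async function loadFromDatabase(key: string): Promise<string> {
  return `value-for-${key}`;
}

async function get(key: string): Promise<string> {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;   // cache hit: no I/O

  const value = await loadFromDatabase(key); // cache miss: take the slow path once
  cache.set(key, value);
  return value;
}

// Usage: const v = await get("user:42");
```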
- **High Scalability**: This means a system can quickly and easily be extended to accommodate more volume (horizontal scalability) or more functionality (vertical scalability). Normally we watch the response time to decide whether we need to scale the system.
**CAP Theorem**
The CAP theorem states that any distributed data store can only provide two of the following three guarantees:
- Consistency: every read receives the most recent write or an error.
- Availability: every request receives a response.
- Partition tolerance: the system continues to operate despite network faults.
**Scalability**
Scalability is the ability of a system to handle an increasing workload, either by adding more resources (scaling out) or by upgrading the capacity of existing resources (scaling up). In distributed systems, scalability is essential to ensure that the system can effectively manage the growing demands of users, data, and processing power. Here's an overview of the different aspects of scalability:
**Horizontal Scaling**
Horizontal scaling, also known as scaling out, involves adding more machines or nodes to a system to distribute the workload evenly. This approach allows the system to handle an increased number of requests without overloading individual nodes. Horizontal scaling is particularly useful in distributed systems because it provides a cost-effective way to manage fluctuating workloads and maintain high availability.
**Vertical Scaling**
Vertical scaling, or scaling up, refers to increasing the capacity of individual nodes within a system. This can be achieved by upgrading the hardware, such as adding more CPU, memory, or storage. Vertical scaling can help improve the performance of a system by allowing it to handle more workloads on a single node. However, this approach has limitations, as there is a physical limit to the amount of resources that can be added to a single machine, and it can also lead to single points of failure.
**Horizontal vs. Vertical Scaling**
With horizontal scaling, it is often easier to scale dynamically by adding more machines to the existing pool; vertical scaling is usually limited to the capacity of a single server, and scaling beyond that capacity often involves downtime and comes with an upper limit. Good examples of horizontal scaling are Cassandra and MongoDB, as both make it easy to scale horizontally by adding more machines to meet growing needs. A good example of vertical scaling is MySQL, where you can scale up by switching from smaller to bigger machines; however, this process often involves downtime.
**Availability**
Availability is a measure of how accessible and reliable a system is to its users. In distributed systems, high availability is crucial to ensure that the system remains operational even in the face of failures or increased demand. It is the backbone that enables businesses to provide uninterrupted services to their users, regardless of any unforeseen circumstances. In today’s fast-paced digital world, where downtime can lead to significant financial losses and reputational damage, high availability has become a critical requirement for organizations across various industries.
**What Is High Availability?**
In any IT ecosystem, “availability” is the ability of a system to respond to a request. High availability, as the name suggests, refers to a system capable of responding to excessive requests with minimal downtime.
High-availability systems are not completely immune to failure and downtime. Rather, a high-availability system is designed to be responsive as regularly as possible. High availability also doesn’t reflect the speed or quality of a system’s output. It only refers to the ability of the IT system to respond to requests.
High availability as a feature is commonly seen in the case of cloud service providers, who typically assign a service level agreement (SLA) score to the availability of their cloud systems. For instance, blob storage services such as Azure Blob Storage, Google Cloud Storage, and AWS S3 all feature an availability SLA of 99.99%.
How exactly does this 99.99% availability translate to the real world? The percentage is calculated in terms of annual availability, which means that over any 365-day period, the system is guaranteed to be online 99.99% of the time.
A quick calculation would reveal that 99.99% uptime equals 0.01% downtime. This means that out of the 525,600 minutes in any 365-day period, approximately 53 minutes of downtime can be expected. That’s not even an hour in a full year!
Now comes an interesting calculation: what if a system offers 99.9% uptime instead of 99.99% uptime? Not that big a difference — or is it?
A system with 99.9% availability would feature a downtime of 0.1%, which, if we follow the same calculation as above, translates to about 8.8 hours per year. That’s a big jump from 53 minutes!
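A small TypeScript helper reproduces these numbers (rounded):

```ts
// Allowed downtime per year for a given availability SLA.
function downtimePerYear(availabilityPercent: number): string {
  const minutesPerYear = 365 * 24 * 60; // 525,600 minutes
  const downMinutes = minutesPerYear * (1 - availabilityPercent / 100);
  return `${downMinutes.toFixed(1)} minutes (~${(downMinutes / 60).toFixed(1)} hours)`;
}

console.log(downtimePerYear(99.99));  // "52.6 minutes (~0.9 hours)"
console.log(downtimePerYear(99.9));   // "525.6 minutes (~8.8 hours)"
console.log(downtimePerYear(99.999)); // "5.3 minutes (~0.1 hours)"
```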
So, is 99.9% availability high or low? It depends on the application. For instance, for a super-critical system like air traffic control or payments processing, 8 hours of downtime per year is simply unacceptable. On the other hand, some applications (such as blob storage) would face limited flexibility in achieving a very high availability SLA (say 99.999%).
Simply put, the higher the aim for a system’s availability, the more complex and expensive it becomes. Let’s understand why.
Say, there exists an infrastructure as a service (IaaS) provider that requires its IT systems to be available 24/7/365. Such an enterprise would not be able to achieve its intended 100% availability SLA if it has only one server, right? This is because one server can only handle so much traffic. Plus, it is likely to experience downtime due to hardware failures or maintenance requirements.
To increase system availability, the IaaS vendor could commission more servers to handle more traffic through workload distribution. The greater the number of servers commissioned, the closer the vendor reaches their intended 100% availability mark.
However, no matter how many servers are commissioned to handle active traffic, availability can never be truly 100% because servers are bound to experience downtime at some point. The solution here would be to commission standby servers that don’t actively provide IaaS services but simply await an opportunity to take over if one of the primary servers goes down. While this increases system availability even further, it increases costs because standby servers also need to be paid for. They can’t, however, be given data to process at the same throughput as primary servers.
The higher the number of backup servers added, the closer any IT system would reach 100% availability (at an ever-increasing cost). However, as mentioned earlier, achieving true 100% availability is virtually impossible.
**Load Balancer Redundancy and Failover**
To ensure high availability and fault tolerance, load balancers should be designed and deployed with redundancy in mind. This means having multiple instances of load balancers that can take over if one fails. Redundancy can be achieved through several failover strategies:
Active-passive configuration:
In this setup, one load balancer (the active instance) handles all incoming traffic while the other (the passive instance) remains on standby. If the active load balancer fails, the passive instance takes over and starts processing requests. This configuration provides a simple and reliable failover mechanism but does not utilize the resources of the passive instance during normal operation.
Active-active configuration:
In this setup, multiple load balancer instances actively process incoming traffic simultaneously. Traffic is distributed among the instances using methods such as DNS load balancing or an additional load balancer layer. If one instance fails, the others continue to process traffic with minimal disruption. This configuration provides better resource utilization and increased fault tolerance compared to the active-passive setup.
Effective health checks and monitoring are essential components of high availability and fault tolerance for load balancers. Health checks are periodic tests performed by the load balancer to determine the availability and performance of backend servers. By monitoring the health of backend servers, load balancers can automatically remove unhealthy servers from the server pool and avoid sending traffic to them, ensuring a better user experience and preventing cascading failures.
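A simplified health-check loop might look like the following TypeScript sketch; the backend addresses, the `/health` endpoint, and the 2-second timeout are assumptions, and it relies on a runtime with a global `fetch` (e.g. Node 18+):

```ts
// Periodic health checks: probe each backend and keep only the healthy ones
// in the pool that receives traffic.
const backends = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]; // example addresses
let healthyBackends = [...backends];

async function isHealthy(url: string): Promise<boolean> {
  try {
    const res = await fetch(`${url}/health`, { signal: AbortSignal.timeout(2000) });
    return res.ok;   // 2xx within 2 seconds counts as healthy
  } catch {
    return false;    // timeout or connection error
  }
}

async function runHealthChecks(): Promise<void> {
  const results = await Promise.all(backends.map((url) => isHealthy(url)));
  healthyBackends = backends.filter((_, i) => results[i]);
}

// Re-check every 10 seconds; traffic is only routed to healthyBackends.
setInterval(runHealthChecks, 10_000);
```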
Monitoring the load balancer itself is also crucial. By keeping track of performance metrics, such as response times, error rates, and resource utilization, we can detect potential issues and take corrective action before they lead to failures or service degradation.
In addition to regular health checks and monitoring, it is essential to have proper alerting and incident response procedures in place. This ensures that the appropriate personnel are notified of any issues and can take action to resolve them quickly.
In active-active and active-passive configurations, it is crucial to ensure that the load balancer instances maintain a consistent view of the system's state, including the status of backend servers, session data, and other configuration settings. This can be achieved through various mechanisms, such as:
Centralized configuration management:
Using a centralized configuration store (e.g., etcd, Consul, or ZooKeeper) to maintain and distribute configuration data among load balancer instances ensures that all instances are using the same settings and are aware of changes.
State sharing and replication:
In scenarios where load balancers must maintain session data or other state information, it is crucial to ensure that this data is synchronized and replicated across instances. This can be achieved through database replication, distributed caching systems (e.g., Redis or Memcached), or built-in state-sharing mechanisms provided by the load balancer software or hardware.
**Fault Tolerance**
- Load balancing: Distribute network traffic across multiple servers, containers, or cloud instances. This helps to optimize resource utilization and avoid performance degradation.
- Replication: Use multiple identical versions of systems and subsystems to ensure that they provide the same results. If the results differ, a procedure can identify the faulty system.
- Backup components: Use backup components to automatically replace failed components and prevent service loss.
- Circuit breaker pattern: Wrap calls to external dependencies in a circuit breaker to monitor their health and prevent cascading failures (a minimal sketch follows this list).
- Fault isolation: Isolate faulty components or nodes to prevent failures from spreading to other parts of the system.
- Data partitioning: Distribute data across multiple machines.
- Data replication: Store each partition on multiple nodes for redundancy.
- Failure detection and recovery: Implement strategies for detecting and recovering from failures.
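A minimal circuit-breaker sketch in TypeScript; the failure threshold and cooldown values are arbitrary examples:

```ts
// Minimal circuit breaker: after `threshold` consecutive failures the circuit
// opens and calls fail fast; after `cooldownMs` one trial call is allowed again.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 3, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    const open = this.failures >= this.threshold;
    if (open && Date.now() - this.openedAt < this.cooldownMs) {
      throw new Error("circuit open: failing fast"); // protect the dependency
    }
    try {
      const result = await fn();
      this.failures = 0;          // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage: wrap calls to an external dependency.
// const breaker = new CircuitBreaker();
// const data = await breaker.call(() => fetch("https://example.com/api").then(r => r.json()));
```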
**What Is Fault Tolerance?**
As we’ve established already, failure occurs in all IT systems at some point. When this happens, availability is maintained if another system steps in to take the failed system's place. However, what if the system within which the failure occurred continues operating without downtime? Such a system is known as fault tolerant.
This is the main difference between high availability and fault tolerance. In a highly available system, failures can lead to downtime and the denial of requests. However, this would happen rarely, perhaps only a few minutes or hours per year.
On the other hand, fault-tolerant systems can recover from failure and will be able to continue responding to requests without a similar backup system having to take over.
Imagine the IaaS service provider example from above again. This provider has stocked up their data center with several servers and achieved high availability. But suddenly, the data center experiences a power outage. No number of primary servers or backup servers can respond to user requests since they all need a power supply to operate.
But now, our IaaS vendor has installed a well-connected backup power generator that instantly fulfills the data center’s power supply needs in case of power loss. This makes the IaaS IT systems both highly available and fault tolerant.
Common practices for building and verifying fault tolerance include:
- Identifying and analyzing potential failure points
- Monitoring system performance in real-time
- Implementing automated tests
- Testing the system in a simulated environment
- Using distributed systems
- Taking advantage of fault detection tools
**High Availability vs. Fault Tolerance Differences**
- High availability is defined as the ability of a system to operate continuously with minimal risk of failure.
- Fault tolerance is defined as the ability of a system to continue operating without interruption, even if several components fail.


