Principles and Use Cases - Pranav-SA/thesis-support-examples GitHub Wiki

The Principles

Build a hypothesis around steady-state: We usually want to build a hypothesis around steady-state behavior. That means defining what our system, or a part of it, looks like. Then we perform potentially damaging actions on the network, applications, nodes, or any other component of the system. These actions are, most of the time, very destructive. We want to create violent situations and then confirm that the steady-state hypothesis still holds. In other words, we validate that our system is in a specific state, perform some actions, and finish with the same validation to confirm that the state of our system did not change.
Simulate real-world events: We want to base chaos engineering on real-world events. It would be pointless to test things that are not likely to happen. Instead, we focus on replicating events that are likely to happen in our system. Our applications will go down, our networking will be disrupted, and our nodes will not be fully available all the time, and we want to check how our system behaves in these situations.
Run experiments in production: We want to run chaos experiments in production. We could do it in a non-production system, but that is mostly for practice and for gaining confidence in chaos experiments. We want to experiment in production because that’s the “real” system. That’s the system at its best, and our real users are interacting with it. If we only perform chaos experiments in staging or integration environments, we cannot get a real picture of how the system in production behaves.
Automate experiments and run them continuously: We want to automate our experiments so that they run continuously. It would be pointless to run an experiment only once because we could never be sure when the right moment is: when is the system in conditions under which the experiment would produce some negative effect? Therefore, we should run the experiments continuously. That can mean every hour, every few hours, every day, every week, or every time some event happens in our cluster. Maybe we want to run experiments every time we deploy a new release or every time we upgrade the cluster. In other words, experiments are either scheduled to run periodically or executed as part of continuous delivery pipelines.
Minimize blast radius: Finally, we want to reduce the blast radius. In the beginning, we want to start small and keep the blast radius of the things that might explode relatively small. Over time, as our confidence in our work grows, we might expand that radius. Eventually, we might reach a level where we’re doing experiments across the whole system, but that comes later. Until then, we want our scope to be tiny.

The summary of the principles we discussed is as follows.

Build a hypothesis around a steady-state
Simulate real-world events
Run experiments in production
Automate experiments and run them continuously
Minimize blast radius
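
One way to picture "minimize blast radius" is to cap how many components a single experiment is allowed to touch. The following is a minimal sketch under stated assumptions: `select_targets`, its parameters, and the pod names are hypothetical and not part of any chaos tool. It picks a random sample of candidates, bounded both by a fraction of the total and by a hard maximum.

```python
import math
import random
from typing import List, Optional


def select_targets(
    candidates: List[str],
    fraction: float,
    max_targets: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick a small random subset of candidates to experiment on.

    Hypothetical helper: `fraction` limits the share of candidates,
    `max_targets` is an absolute cap regardless of cluster size.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    if max_targets < 1:
        raise ValueError("max_targets must be at least 1")

    rng = random.Random(seed)
    # At least one target (if any exist), never more than the hard cap.
    count = min(max_targets, max(1, math.floor(len(candidates) * fraction)))
    count = min(count, len(candidates))
    return rng.sample(candidates, count)


# Illustrative names only; a real experiment would list actual targets.
pods = [f"pod-{i}" for i in range(20)]
targets = select_targets(pods, fraction=0.25, max_targets=3, seed=42)
print(len(targets))  # 3
```

Starting with a small `fraction` and a low `max_targets`, and raising them only as confidence grows, mirrors the advice to expand the radius over time.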

The process

Now that we have defined chaos engineering and the principles behind it, we can turn our attention toward the process. To begin, we want to define a steady-state hypothesis. We want to know how the system looks before and after some actions. We confirm the steady-state and then simulate some real-world events. After the events, we confirm the steady-state again. We also want to collect metrics, observe dashboards, and have alerts that notify us when our system misbehaves. Ultimately, we’re trying very hard to disrupt the steady-state, and the less damage we’re able to do, the more confidence we will have in our system.

The summary of the process we discussed is as follows.

Define the steady-state hypothesis
Confirm the steady-state
Produce or simulate “real world” events
Confirm the steady-state
Use metrics, dashboards, and alerts to confirm that the system as a whole is behaving correctly.
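
The steps above can be sketched as a small experiment runner. This is an illustration under stated assumptions, not the API of any particular chaos tool: `run_experiment`, `ExperimentResult`, and the in-memory `FakeService` are hypothetical names introduced only for this example. The probe plays the role of the steady-state check, and the action plays the role of the simulated real-world event.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_after: bool
    action_ran: bool

    @property
    def hypothesis_held(self) -> bool:
        # The hypothesis holds only if the state was valid both times.
        return self.steady_before and self.steady_after


def run_experiment(
    probe: Callable[[], bool], action: Callable[[], None]
) -> ExperimentResult:
    # Confirm the steady-state first; skip the action if it is already broken.
    if not probe():
        return ExperimentResult(False, False, False)
    # Produce or simulate the "real world" event.
    action()
    # Confirm the steady-state again with the same validation.
    return ExperimentResult(True, probe(), True)


class FakeService:
    """Stand-in for a real system: healthy while one replica is left."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas

    def healthy(self) -> bool:
        return self.replicas >= 1

    def kill_one(self) -> None:
        self.replicas -= 1


service = FakeService(replicas=3)
result = run_experiment(service.healthy, service.kill_one)
print(result.hypothesis_held)  # True
```

With three replicas the service survives losing one, so the hypothesis holds. Running the same experiment against a single replica would report that the steady-state was disrupted, which is the kind of weakness the process is meant to surface.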

Use cases for chaos engineering

There are numerous permutations and combinations of experiments we can run, and with each one we hope to uncover how our application reacts. For example, we can validate what happens when a service is unavailable and the fallback settings are improper. Typical questions include the following.

What happens when a service is not accessible, one way or another?
What happens if an app is retrying indefinitely to reach a service without having properly tuned timeouts?
What is the result of outages when an application or a downstream dependency receives too much traffic or when it is not available?
Will we experience cascading errors when a single point of failure crashes an app?
What happens when our application goes down?
What happens when there is something wrong with networking?
What happens when a node is not available?
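
The question about indefinite retries is easier to reason about next to the behavior an experiment would hope to find. The following is a minimal sketch, assuming a caller that treats `ConnectionError` as "service not accessible"; `call_with_retries` and its parameters are hypothetical and chosen only for illustration. It bounds both the number of attempts and the total time spent, instead of retrying forever.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    deadline: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on ConnectionError with bounded effort."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    start = time.monotonic()
    last_error: Optional[Exception] = None
    attempts_made = 0
    for attempt in range(max_attempts):
        attempts_made = attempt + 1
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
        # Exponential backoff: 0.1s, 0.2s, 0.4s, ... with the defaults.
        delay = base_delay * (2 ** attempt)
        out_of_attempts = attempts_made == max_attempts
        out_of_time = time.monotonic() - start + delay > deadline
        if out_of_attempts or out_of_time:
            break
        sleep(delay)
    raise TimeoutError(f"gave up after {attempts_made} attempt(s)") from last_error


# A fake dependency that is unavailable for the first two calls.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"


# The no-op sleep keeps the example fast; real code would use the default.
print(call_with_retries(flaky, sleep=lambda _: None))  # ok
```

An experiment that makes the dependency unavailable would then check that callers give up within the deadline rather than piling up requests.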
