Observability - Pranav-SA/thesis-support-examples GitHub Wiki

Setup

A chaos experiment is incomplete without observing its impact on the microservices or applications under test. Open source tools can provide that visibility. A ready-to-use sample stack, built from the Istio add-on manifests, can be installed as follows:

$ export ISTIO_RELEASE_URL=https://raw.githubusercontent.com/istio/istio/release-1.14
# Change the release version as needed. There is no trailing slash, because the commands below append /samples/...

Prometheus: Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.

$ kubectl apply -f $ISTIO_RELEASE_URL/samples/addons/prometheus.yaml

Grafana: Grafana is an open source analytics and visualization tool. It queries metrics from data sources such as Prometheus and presents them on customizable dashboards, which makes large volumes of monitoring data easier to interpret.

$ kubectl apply -f $ISTIO_RELEASE_URL/samples/addons/grafana.yaml

Kiali: Kiali is a management console for the Istio service mesh. Kiali can be installed as an Istio add-on or trusted as a part of your production environment. It lets the user observe how traffic flows between the services in the mesh.

$ kubectl apply -f $ISTIO_RELEASE_URL/samples/addons/kiali.yaml

Jaeger: Jaeger is a popular open-source distributed tracing tool. It is used to monitor and troubleshoot applications based on microservices architecture. It provides instrumentation libraries that were built on OpenTracing standards.

$ kubectl apply -f $ISTIO_RELEASE_URL/samples/addons/jaeger.yaml

Port-forward to each service to access its UI from a local browser.
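The port-forwarding step can be sketched as a small shell helper. This is a minimal sketch, not part of the original setup: it assumes the add-ons were installed unchanged from the Istio sample manifests, so that they live in the `istio-system` namespace under the service names and ports shown. `forward_ui`, `KUBECTL`, and `ADDON_NAMESPACE` are hypothetical names introduced here for illustration.

```shell
# Assumption: add-ons installed from the Istio sample manifests, unchanged.
KUBECTL="${KUBECTL:-kubectl}"                 # set KUBECTL=echo to print the arguments instead
ADDON_NAMESPACE="${ADDON_NAMESPACE:-istio-system}"

# forward_ui NAME: forward a local port to one add-on UI (blocks until Ctrl-C).
forward_ui() {
  case "$1" in
    prometheus) svc=prometheus ports=9090:9090 ;;
    grafana)    svc=grafana    ports=3000:3000 ;;
    kiali)      svc=kiali      ports=20001:20001 ;;
    jaeger)     svc=tracing    ports=16686:80 ;;   # the Jaeger UI is served by the "tracing" service
    *) echo "unknown addon: $1" >&2; return 1 ;;
  esac
  "$KUBECTL" port-forward -n "$ADDON_NAMESPACE" "svc/$svc" "$ports"
}

# Usage (one terminal per UI):
#   forward_ui grafana     # then open http://localhost:3000
#   forward_ui kiali       # then open http://localhost:20001
```

If the add-ons were installed in a different namespace, or the manifests were customized, check the actual names and ports with `kubectl get svc -n <namespace>` before forwarding.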

Metrics

For application performance monitoring (APM), commonly tracked metrics include the four golden signals (latency, errors, traffic, and saturation), among many others. For observing the impact of chaos engineering over the long term, the following metrics are useful:

MTTR: Mean time to repair (MTTR) is a basic measure of the maintainability of repairable items. It represents the average time required to repair a failed component or device. The crucial word is repairable: items that cannot be repaired fall outside the scope of MTTR. MTTR reflects the time it takes an organization to react to unplanned incidents and put its equipment and devices back to work. It is measured from the beginning of an incident until the moment it is resolved.
MTBF: Mean time between failures (MTBF) is, like MTTR, a KPI concerned with malfunctioning devices or assets. Unlike MTTR, however, MTBF is about the assets themselves: while MTTR represents how quickly an organization can react when unexpected problems occur, MTBF indicates the quality and reliability of its assets.
MTTF: Mean time to failure (MTTF) is the average life span of a given asset. The only major difference between MTBF and MTTF is that MTBF is usually reserved for repairable items, whereas MTTF is used where fixing an item is not an option. MTTF therefore represents an expectation: the amount of time a given asset can be expected to work reliably before it fails.
MTTA: Mean time to acknowledge (MTTA) is the average time from when an alert is triggered to when work begins on the issue. This metric is useful for tracking the team's responsiveness and the effectiveness of the alerting system.
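As an illustration of how these averages could be derived from incident records, here is a minimal sketch. The input format (one incident per line as `triggered,acknowledged,resolved` in epoch seconds), the function name `incident_stats`, and the observation window argument are all assumptions made for this example. MTBF is computed with one common convention, uptime in the window divided by the number of failures, which the text above does not prescribe.

```shell
# incident_stats WINDOW_SECONDS < incidents.csv
# Each input line: triggered,acknowledged,resolved (epoch seconds; hypothetical format).
incident_stats() {
  awk -F, -v window="$1" '
    NF == 3 {
      n++
      tta += $2 - $1        # alert triggered -> work begins
      ttr += $3 - $1        # incident start  -> incident resolved
    }
    END {
      if (n == 0) { print "no incidents"; exit 1 }
      # MTBF here: (window - total downtime) / number of failures
      printf "MTTA=%ds MTTR=%ds MTBF=%ds\n", tta / n, ttr / n, (window - ttr) / n
    }'
}

# Usage: two incidents observed over one day (86400 s)
#   printf "0,60,600\n10000,10120,10900\n" | incident_stats 86400
```

Running the same calculation over the window before and after a series of chaos experiments gives a rough way to see whether these indicators are moving in the right direction.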

Note: MTTR is about the organization: it indicates how promptly it reacts when unexpected problems happen. MTBF, on the other hand, is about assets: it indicates the expected time before a given repairable item presents issues. MTTF is closely related to MTBF, to the point of being mistaken for it or used interchangeably. The two metrics are almost the same, with one major difference: MTBF refers to repairable items, while MTTF refers to the average life span of a nonrepairable asset. MTTF is therefore the metric to use for things that cannot be repaired. Improving one of these metrics can also set off a chain reaction that benefits the whole organization. For instance, MTTD (mean time to detect) might be considered the base of all the other metrics: an organization that wants to react to problems quickly (MTTR) and to improve the expected life span of its assets (MTBF) must first be able to detect incidents quickly.

From the perspective of IT management and business SLAs, a few other metrics that are widely used in reliability assessment also apply in this context:

RPO: The recovery point objective (RPO) is the period of data loss that applications can suffer, measured back from the time of the incident to the last known good state of the data that is available for recovery. It can be understood as a service level objective or a measure of loss tolerance: what period of data loss can the organization realistically accept when a storage failure affects data access? For example, if the RPO is defined as 12 hours in the business continuity plan and the last available data backup before the outage is from 9 hours ago, then the RPO threshold has not been violated.
RTO: The recovery time objective (RTO) is another service level objective. It sets the target for the IT team to get the service operational again: the period of time, counted from the disruption (in our case, a storage issue), within which the organization expects the affected service to be restored. For example, the RTO for a high availability scenario can be set to 5 minutes for a small incident such as a disk failure, which requires a mirror copy to be made active. In a disaster recovery (DR) scenario, where the primary site and the DR site are separated by a long distance, terabytes of backup data need to be made available at the DR site (typically through remote replication), many connections have to be reconfigured, and services have to be restarted, so the RTO can be many hours or even days.
RTA: The recovery time actual (RTA) is the time that actually elapses to complete the data recovery and make the storage copy available for application access. While the RTO is the estimated value set as a target, the RTA is the actual time measured against it. For good data governance and compliance, the RTA achieved must not exceed the RTO set in the business continuity/disaster recovery plan. In some cases, IT teams simulate a DR-like scenario in a test environment (parallel to and independent of production) and examine the effectiveness of their backup and recovery tooling by measuring the RTA. If there is a significant gap between the estimated RTO and the actual RTA, the failover strategy may need to be revisited so that the switch from source to target happens faster.
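The relationship between these three values can be sketched as two small checks: one that compares the age of the last backup against the RPO, and one that times a recovery command and compares the measured RTA against the RTO. This is a minimal sketch under stated assumptions. The function names `rpo_check` and `measure_rta` are hypothetical, timestamps are assumed to be epoch seconds, and the recovery command is whatever procedure is used in the test environment.

```shell
# rpo_check LAST_BACKUP_EPOCH INCIDENT_EPOCH RPO_HOURS
rpo_check() (
  loss=$(( $2 - $1 ))                    # seconds of data at risk
  if [ "$loss" -le $(( $3 * 3600 )) ]; then
    echo "RPO met: ${loss}s of data at risk"
  else
    echo "RPO violated: ${loss}s of data at risk"
  fi
)

# measure_rta RTO_SECONDS COMMAND [ARGS...]
# Runs the recovery command, measures the elapsed time (RTA), compares it to the RTO.
measure_rta() (
  rto=$1; shift
  start=$(date +%s)
  "$@" || exit 1                         # the recovery command itself failed
  rta=$(( $(date +%s) - start ))
  if [ "$rta" -le "$rto" ]; then
    echo "RTA=${rta}s RTO=${rto}s: target met"
  else
    echo "RTA=${rta}s RTO=${rto}s: target missed"
  fi
)

# Usage:
#   rpo_check 0 32400 12          # backup 9 h before the incident, RPO of 12 h
#   measure_rta 300 ./failover.sh # ./failover.sh is a hypothetical recovery script
```

The first usage line mirrors the example from the RPO entry: a backup taken 9 hours before the incident stays within a 12-hour RPO.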