Observability - bcgov/common-service-showcase GitHub Wiki

Observability

Observability involves activities such as measuring, collecting, and analyzing many diagnostic signals from a system. Signals such as metrics, traces, logs, events and profiles are pieces of information that can assist with understanding and diagnosing the behavior of the system. Of course, there are many facets of observability as it can range from simply the responsiveness and availability of a service to the inner workings of memory and CPU utilization breakdowns of the service itself.

Concepts

A key part of effective performance monitoring is understanding which aspects of observability are needed. For some services, it may suffice to know that the service is available and reachable. Other services may need higher-resolution information, such as how much memory and CPU pressure the service is under, the volume of network traffic it generates, and where that traffic is going. In general, we can conceptually break observability down into three categories: metrics, logging and tracing.

Metrics

Metrics involve the quantitative aspects of the system, such as CPU and memory usage, the number of threads and processes in use, how many open connections it has, and the average latency of a network request. This information is useful for Resource Allocation Tuning and general management of a system. Metrics can also prove useful for understanding general system behavior and for detecting anomalies, such as when a system begins to use an unexpectedly high amount of CPU or memory.

Metrics answer the question "How is the system performance changing over a finite span of time?"
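As an illustration, a latency metric can be as simple as a collection of samples plus an aggregate computed over them. The sketch below is tool-agnostic and hand-rolled for clarity; in practice a metrics client library (for example, a Prometheus client) would handle collection and export:

```python
from statistics import mean

class LatencyMetric:
    """Minimal sketch of a latency metric: record samples, report an average."""

    def __init__(self):
        self.samples_ms = []

    def observe(self, value_ms):
        # Record one latency sample, in milliseconds.
        self.samples_ms.append(value_ms)

    def average(self):
        # Aggregate over the recorded samples; 0.0 when nothing was observed.
        return mean(self.samples_ms) if self.samples_ms else 0.0

metric = LatencyMetric()
for ms in (12.0, 18.0, 15.0):
    metric.observe(ms)
print(metric.average())  # 15.0
```

A real metrics system would also timestamp each sample and expose aggregates (averages, percentiles, counters) over configurable time windows, which is what lets you observe how performance changes over a span of time.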

Logging

Logging focuses on providing application-specific insights by collecting the messages emitted by the system processes. These context-sensitive messages can provide a view into how a system reacts to requests and generally provide diagnostic data to developers on whether the system is doing the correct thing or not. Normally this would be the console log output of an application, but logs may also be written to a specified file or through other means as well.

Logging answers the question "What happened in the system at this moment in time?"
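For example, an application typically emits its context-sensitive messages through a logging framework to the console. The sketch below uses Python's standard logging module; the service name, function and messages are hypothetical:

```python
import logging

# Configure a console logger. In a real deployment the output might instead be
# structured (e.g. JSON on stdout) so a log collector can aggregate it.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
log = logging.getLogger("payment-service")  # hypothetical service name

def process_payment(order_id, amount):
    # Log what the system is doing and why, so developers can later verify
    # whether it behaved correctly for a given request.
    log.info("processing payment for order %s, amount %.2f", order_id, amount)
    if amount <= 0:
        log.warning("rejected order %s: non-positive amount", order_id)
        return False
    return True

process_payment("A-1001", 25.00)
process_payment("A-1002", 0)
```

Each log line captures what happened at a moment in time; the log level (INFO, WARNING, ERROR) lets maintainers filter for the messages that matter during an incident.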

Tracing

Tracing provides insight into the lifecycle of a specific transaction, or set of transaction requests, through a "trace". Unlike metrics and logging, which slice information from a system perspective, tracing slices information on a per-event basis. In more complex environments, a request may pass through multiple endpoints and microservices before it is satisfied. Tracing keeps track of a specific event, how long it spends on each relevant system, and how much time it takes to traverse the entire ecosystem.

Tracing answers the question "How are the multiple system components interacting with each other?"
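As a rough sketch, a trace ties together timed "spans" that share a common trace id, so one request can be followed across components. The example below is a hand-rolled illustration of the idea, not a real tracing library such as OpenTelemetry or Jaeger:

```python
import time
import uuid
from contextlib import contextmanager

# Collected spans; a real tracer would export these to a backend for analysis.
spans = []

@contextmanager
def span(trace_id, name):
    """Time a unit of work and record it as a span belonging to a trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

trace_id = str(uuid.uuid4())
with span(trace_id, "handle-request"):       # outer span: the whole request
    with span(trace_id, "query-database"):   # inner span: one downstream call
        time.sleep(0.01)                     # stand-in for real work

# Inner spans finish first, so they are recorded before their parents.
print([s["name"] for s in spans])  # ['query-database', 'handle-request']
```

Because every span carries the same trace id, a tracing backend can reassemble the full journey of a request and show where its time was spent, even when the spans were produced by different services.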

Performance Monitoring

One of the main goals of performance monitoring is to notify and alert system maintainers when something within the system may have started to misbehave. Especially in mission-critical environments where services must meet some level of Service Level Agreement (SLA), it is critical to operations that any system degradation incidents are immediately detected and alerted on so that the maintainers can begin mitigation.

When system degradation incidents occur, the maintenance team needs to be alerted to the situation as fast as possible. Monitoring tools must therefore be able to set up measurable, triggerable alerts based on the collected data. When something goes out of bounds, we ideally want the monitoring system to dispatch alerts to the appropriate channels so that the maintenance team can react.
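A simple form of this is a threshold check over collected metrics: anything out of bounds produces an alert that is dispatched to a channel. The metric names, limits and dispatch target below are hypothetical stand-ins for what a real monitoring tool would configure:

```python
def check_thresholds(metrics, thresholds, dispatch):
    """Compare collected metrics against limits; dispatch one alert per breach."""
    alerts = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            alerts.append(f"{name} is {value}, exceeds limit {limit}")
    for alert in alerts:
        dispatch(alert)  # e.g. a webhook into the team's chat channel
    return alerts

sent = []
check_thresholds(
    {"cpu_percent": 97, "memory_percent": 60},  # hypothetical current readings
    {"cpu_percent": 90, "memory_percent": 85},  # hypothetical configured limits
    sent.append,                                # stand-in for a real dispatcher
)
print(sent)  # ['cpu_percent is 97, exceeds limit 90']
```

Real monitoring systems add important refinements on top of this, such as evaluating thresholds over time windows rather than single readings, and de-duplicating alerts so a sustained breach does not flood the channel.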

Once a maintenance team is on the incident case, they will want to identify the potential root causes. Questions such as "Did a dependent service go down?", "Was there a bug in the designed system?", "Did a platform-level system fail?" and "Was there any anomalous behavior leading up to the incident?" will need to be answered. Depending on the issue, a combination of metrics, logging and/or tracing data will be needed in order to identify the root cause and begin applying mitigation.

Tool Selection Considerations

In order to achieve this, we need to first ensure that metrics, logging and tracing data are collected and aggregated correctly. There are numerous tools that could achieve this. A few notable tools are the Elastic/ELK Stack, Prometheus, OpenTelemetry, UptimeRobot, Jaeger, Splunk and Zipkin.

While each tool has its specific strengths in each of the aspects of observability, the choice will ultimately come down to what kinds of questions your team needs answered in order to provide SLA-level assurances for a system. For example, if you have a relatively atomic ecosystem, tracing may be less of a concern, and focus should be spent on gathering metrics and logging. On the other hand, if your ecosystem has many small, moving and interconnected parts, your team would want to prioritize tracing and logging instead. Below we quickly classify each of the aforementioned tools by which of the three observability concepts it appears to focus on:

  • Elastic/ELK Stack - metrics, logging
  • Prometheus - metrics
  • OpenTelemetry - metrics, tracing, logging
  • UptimeRobot - metrics (superficial)
  • Jaeger - tracing
  • Splunk - logging, metrics
  • Zipkin - tracing