Mastering Distributed Tracing
Mastering Distributed Tracing by Yuri Shkuro, 2019 edition
All page references refer to this edition.
Introduction
- Chapter 1 Why Distributed Tracing
- Chapter 2 HotROD (tutorial)
- Chapter 3 Distributed Tracing Fundamentals
Data Gathering Problems
- Chapter 4 Instrumenting with OpenTracing (OpenTracing has since merged into OpenTelemetry) (*skip)
- Chapter 5 Instrumenting Async Applications (*skip)
- Chapter 6 Tracing Standards and Ecosystems
- Chapter 7 Service Mesh
- Chapter 8 Sampling
Getting Value from Tracing
- Chapter 9 Turning the Lights On
- Chapter 10 Distributed Context Propagation
- Chapter 11 Integration with Metrics and Logs
- Chapter 12 Gathering Insights with Data Mining
Deploying and Operating Tracing Infrastructure
- Chapter 13 Implementing Tracing in Large Organizations
- Chapter 14 Under the Hood of Distributed Tracing Systems
Why do we need Distributed Tracing
"Modern, internet-scale, cloud-native applications are very complex distributed systems. Building them is hard, and debugging them is even harder. The growing popularity of microservices and function-as-service only exacerbates the problem..." (p3)
"Traditional monitoring tools were designed for monolith systems, observing the health and behavior of a single application instance. They may be able to tell us a story about that single instance, but they know almost nothing about the distributed transaction that passed through it. These tools [metrics and logging] lack the context of the request." (p11)
As soon as we start building a distributed system, traditional monitoring tools begin struggling with providing observability for the whole system, because they were designed to observe a single component, such as a program, a server, or a network switch. The story of a single component may no doubt be very interesting, but it tells us very little about the story of a request that touches many of those components. We need to know what happens to the request in all of them, end-to-end, if we want to understand why a system is behaving pathologically. In other words, we first want a macro view. (p13)
What is 'Observability'
The term "observability" in control theory states that the system is observable if the internal states of the system and, accordingly, its behavior, can be determined by only looking at its inputs and outputs...Metrics, logs, and traces can all be used as a means to extract those signals from the application. We can then reserve the term "observability" for situations when we have a human operator proactively asking questions that were not predefined. (p7)
What is Monitoring
"Monitoring does not require a human operator; it can and should be fully automated...[It] is better thought of as the process of observing certain a priori defined performance indicators of our software system, such as those measuring an import on the end user experience, like latency or error counts, and using their values to alert us when these signals indicate an abnormal behavior of the system...Traditionally the term monitoring was used to describe metrics collection and alerting.
What details are captured in each 'trace'
- A trace is "a graph of events and causal edges between them". Visualization of traces includes Gantt charts.
- A trace is composed of (a minimal sketch follows this list):
- contextual metadata: metadata is passed throughout the request (e.g. metadata is passed from one component to another over a network)
- trace points: events with relevant data (e.g. url of the http request, or sql statement of a query)
- causality references: reference points to prior events
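To make the three ingredients above concrete, here is a minimal Go sketch of a single span record as a trace point might emit it; the struct and field names are my own illustration, not the book's data model.

```go
package main

import (
	"fmt"
	"time"
)

// SpanContext is the contextual metadata that travels with the request
// from component to component (for example, inside HTTP headers).
type SpanContext struct {
	TraceID string // global execution identifier shared by every event in one request
	SpanID  string // identifier of this particular event
}

// Span is one trace point (an event with its relevant data) plus its causality reference.
type Span struct {
	Context      SpanContext
	ParentSpanID string            // causality reference: the prior event that caused this one
	Operation    string            // what happened, e.g. "GET /checkout"
	Start        time.Time
	Duration     time.Duration
	Tags         map[string]string // relevant data, e.g. the URL or the SQL statement
}

func main() {
	span := Span{
		Context:      SpanContext{TraceID: "abc123", SpanID: "span-2"},
		ParentSpanID: "span-1",
		Operation:    "SELECT orders",
		Start:        time.Now(),
		Duration:     42 * time.Millisecond,
		Tags:         map[string]string{"db.statement": "SELECT * FROM orders WHERE id = ?"},
	}
	fmt.Printf("%+v\n", span)
}
```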
JZ summary
Need for Distributed Tracing
The architecture of applications is moving towards distributed systems (e.g. microservices). Correspondingly, we need a way to debug distributed systems, because traditional debugging tools, such as metrics and logging, were created for monolithic applications.
Monitoring vs. Observability
Colloquially, monitoring and observability are used interchangeably, but there is a technical difference. Monitoring requires predefined performance indicators and should be automated, while observability has no predefined performance indicators and needs a human operator. Monitoring is typically used for metrics collection and alerting; an alert is triggered when a performance indicator crosses a preset value, like 1s latency or 100 error counts. These preset values alert us to abnormal behavior in our system. In comparison, observability needs a human operator to ask questions when encountering abnormal behavior we haven't seen before.
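As a toy illustration of the monitoring half of this distinction, a fully automated check against predefined thresholds (the 1s and 100 values are the same examples as above) could look roughly like this Go sketch:

```go
package main

import (
	"fmt"
	"time"
)

// checkIndicators compares a priori defined performance indicators against
// preset thresholds and raises alerts with no human in the loop.
func checkIndicators(p99Latency time.Duration, errorCount int) {
	if p99Latency > time.Second {
		fmt.Println("ALERT: p99 latency above 1s")
	}
	if errorCount > 100 {
		fmt.Println("ALERT: error count above 100")
	}
}

func main() {
	// Values observed over some measurement window; only the latency alert fires here.
	checkIndicators(1200*time.Millisecond, 42)
}
```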
Distributed Tracing is End-to-End Observability
Distributed Tracing, compared to traditional monitoring/observability tools like metrics and logging, allows a macro-view into a system. It allows a request to be followed from end to end, from component to component. It allows us to establish causal relationships between components and sniff out failures or inefficiencies in our system.
JZ Notes: This chapter (Chapter 3, Distributed Tracing Fundamentals) provides a framework for distributed tracing. It gives an architectural overview and navigates us from the instrumentation of trace points to the visualization of trace data.
"The basic concepts of distributed tracing appears to be very straightforward:
- instrumentation is inserted into chosen points of the program's code (tracepoints) and produces profiling data when executed
- the profiling data is collected in a central location, correlated to a specific execution (request), arranged in the causality order, and combined into a trace that can be visualized or further analyzed" (p63)
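As a rough illustration of the first quoted step, here is a hand-rolled Go sketch of instrumentation inserted around a chosen point in the code; the startSpan helper is made up for the example and is not the OpenTracing or OpenTelemetry API.

```go
package main

import (
	"fmt"
	"time"
)

// startSpan is a stand-in trace point: it notes when an operation begins and
// returns a finish function that emits the profiling data when the operation ends.
func startSpan(traceID, operation string) func() {
	start := time.Now()
	return func() {
		// A real tracer would send this record to a collector instead of printing it.
		fmt.Printf("trace=%s op=%s duration=%s\n", traceID, operation, time.Since(start))
	}
}

// fetchUser is an instrumented business function; the trace point is the
// startSpan/finish pair wrapped around the interesting work.
func fetchUser(traceID string, id int) {
	finish := startSpan(traceID, fmt.Sprintf("fetchUser(%d)", id))
	defer finish()

	time.Sleep(10 * time.Millisecond) // placeholder for a real database call
}

func main() {
	fetchUser("abc123", 42)
}
```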
"...identify causally-related activities, is arguably the most distinctive feature of distributed tracing" (p63)
"The global execution identifier needs to be passed along the execution flow. This is achieved via a process known as metadata propagation or distributed context propagation." (p64)
"The key disadvantage of metadata propagation-based tracing is the expectation of a white-box system whose components can be modified accordingly" (p65)
- Trace points: We instrument applications with trace points to create breadcrumbs for causality between events and to capture profiling data.
- Trace API/Tracing Library: We then use a tracing API and tracing library to hand the trace point data off to the next step, collection/normalization.
- Collection/Normalization (Trace Model): We take all the trace point data, which can arrive in different formats, and fit it into a common trace model representation.
- Trace Storage: Traces are indexed by the execution identifier.
- Trace Reconstruction: The execution identifier allows us to reconstruct the execution flow (a minimal sketch follows this list).
- Data Mining: The execution identifier also allows for further data analysis like data mining.
- Presentation/Visualization: Once the trace is reconstructed or aggregated for a particular query, we can visualize this data. For example, Gantt charts.
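Below is a minimal Go sketch (my own illustration, not the book's implementation) of the reconstruction step: spans fetched from storage by execution identifier are stitched back into an execution tree using their causality references, which is also the shape a Gantt-chart visualization is built from.

```go
package main

import "fmt"

// Span is a stored trace point; ParentID is its causality reference.
type Span struct {
	SpanID   string
	ParentID string // empty for the root span
	Op       string
}

// printTree walks parent -> child edges to recover the execution flow.
func printTree(spans []Span, parentID string, depth int) {
	for _, s := range spans {
		if s.ParentID == parentID {
			fmt.Printf("%*s%s (%s)\n", depth*2, "", s.Op, s.SpanID)
			printTree(spans, s.SpanID, depth+1)
		}
	}
}

func main() {
	// All spans below were retrieved together because they share one execution identifier.
	spans := []Span{
		{SpanID: "a", ParentID: "", Op: "HTTP GET /checkout"},
		{SpanID: "b", ParentID: "a", Op: "auth-service: verify token"},
		{SpanID: "c", ParentID: "a", Op: "order-service: create order"},
		{SpanID: "d", ParentID: "c", Op: "SQL INSERT orders"},
	}
	printTree(spans, "", 0)
}
```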
[TODO -- insert Figure 3.4 Anatomy of distributed tracing, from p66]