Distributed Systems Observability - kimschles/schlesinger-knowledge GitHub Wiki

Distributed Systems Observability by Cindy Sridharan

Context: Cloud native technologies, serverless paradigms, and all the new hotness make system failures hard to see, because so many things are abstractions layered on top of abstractions on top of abstractions.

Chapter 1: The Need for Observability

Observability is a property of a system built on the following assumptions:

  • No system is 100% healthy

  • Distributed systems are unpredictable in how they fail

  • We cannot predict all the ways that systems will fail

  • Failure must be embraced

  • Easier debugging is required for the maintenance and growth of systems

  • Observability != monitoring

Observable Systems should be designed with the following practices in place:

  • Tested
  • Failure modes should surface during testing
  • Deployed incrementally, and rolled back if metrics show weirdness
  • Post-release, systems should report data about their health and behavior

Chapter 2: Monitoring and Observability

(See Figure 2-1.)

Observability is testing and monitoring combined

Chapter 3: Coding and Testing for Observability

Paradigm shifts:

  • Companies used to have QA departments that would test software pre-production
  • Now, most modern dev teams write their own tests, and test software in an environment that is as close to production as possible, but not production
  • The next step: test in production, and write code and tests for failure instead of for success

Coding for Failure