Distributed Systems Observability - kimschles/schlesinger-knowledge GitHub Wiki
Distributed Systems Observability by Cindy Sridharan
Context: Cloud Native Technologies, serverless paradigms, and all the new hotness make it hard to see system failures because so many things are abstractions on top of abstractions on top of abstractions.
Chapter 1: The Need for Observability
Observability is a property of a system that rests on the following assumptions:
- No system is 100% healthy
- Distributed systems are unpredictable in how they fail
- We cannot predict all the ways a system will fail
- Failure must be embraced
- Easier debugging is required for the maintenance and growth of systems
- Observability != monitoring
Observable Systems should be designed with the following practices in place:
- Tested
- Failure modes should surface during testing
- Deployed incrementally, and rolled back if metrics show anomalies
- Post-release, systems should report data about their health and behavior
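The post-release practice above (systems reporting data about their own health and behavior) could be sketched as a small in-process health reporter. This is an illustrative assumption, not an API from the book; the class name, counters, and the 5% error budget are all hypothetical:

```python
import json
import time

class HealthReporter:
    """Hypothetical health snapshot, e.g. to back a /healthz endpoint."""

    def __init__(self):
        self.start_time = time.monotonic()
        self.request_count = 0
        self.error_count = 0

    def record_request(self, ok: bool) -> None:
        # Called by request-handling code after each request completes.
        self.request_count += 1
        if not ok:
            self.error_count += 1

    def report(self) -> str:
        # Return a JSON snapshot of health and behavior since startup.
        error_rate = (self.error_count / self.request_count) if self.request_count else 0.0
        return json.dumps({
            "uptime_seconds": round(time.monotonic() - self.start_time, 1),
            "requests": self.request_count,
            "error_rate": error_rate,
            # "healthy" is a judgment call; a 5% error budget is an assumption
            "healthy": error_rate < 0.05,
        })
```

The same counters that drive the health flag can also feed the incremental-deployment practice: if the error rate climbs after a rollout, that is the "weirdness" that triggers a rollback.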
Chapter 2: Monitoring and Observability
Observability is testing and monitoring combined
Chapter 3: Coding and Testing for Observability
Paradigm shifts:
- Companies used to have QA departments that would test software pre-production
- Now, most modern dev teams write their own tests, and test software in an environment that is as close to production as possible, but not production
- The next step: test in production, and write code and testing for failure instead of for success
Coding for Failure
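One way to read "write code for failure instead of for success": treat every remote call as something that can fail, and make the failure path a first-class code path. A minimal sketch, assuming a transient network error model; the function name and parameters are hypothetical, not from the book:

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying transient errors with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                # Out of retries: surface the failure to the caller,
                # which must also be written to expect it.
                raise
            # Jittered exponential backoff avoids synchronized retry storms.
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            sleep(delay)
```

Note that the caller still has to handle the final exception: in a world where no system is 100% healthy, bounded retries only narrow the failure window, they do not eliminate it.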