Distributed Systems Observability - kimschles/schlesinger-knowledge GitHub Wiki

Distributed Systems Observability by Cindy Sridharan

Context: Cloud native technologies, serverless paradigms, and all the new hotness make system failures hard to see, because so many things are abstractions layered on top of abstractions on top of abstractions.

Chapter 1: The Need for Observability

Observability is a property of a system built on the following assumptions:

  • No system is 100% healthy

  • Distributed systems are unpredictable in how they fail

  • We cannot predict all the ways that systems will fail

  • Failure must be embraced

  • Easier debugging is required for the maintenance and growth of systems

  • Observability != monitoring

Observable Systems should be designed with the following practices in place:

  • Tested
  • Failure modes should surface during testing
  • Deployed incrementally, and rolled back if metrics show weirdness
  • Post-release, systems should report data about their health and behavior

Chapter 2: Monitoring and Observability

(See Figure 2-1.)

Observability is testing and monitoring combined

Chapter 3: Coding and Testing for Observability

Paradigm shifts:

  • Companies used to have QA departments that would test software pre-production
  • Now, most modern dev teams write their own tests, and test software in an environment that is as close to production as possible, but not production
  • The next step: test in production, and write code and tests for failure instead of for success

Coding for Failure