SRE - bobbae/gcp GitHub Wiki

SRE is what you get when you treat operations as if it’s a software problem.

Site reliability engineering (SRE) is a software engineering approach to IT operations. SRE teams use software as a tool to manage systems, solve problems, and automate operations tasks. ... In this way, SRE helps to improve the reliability of a system today, while also improving it as it grows over time.

SRE Classroom is a collection of workshops developed by Google's Site Reliability Engineering group. The goals of these workshops are to (1) introduce participants to the principles of non-abstract large systems design (NALSD), and (2) provide hands-on experience applying these principles to the design and evaluation of large systems. We consider NALSD a concept fundamental to SRE, and understanding its principles provides a basis for meaningful conversations about the design and operation of large software systems.

Terminology

https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-sli-vs-slo-vs-sla
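As a rough illustration of how these terms relate: an SLI is commonly measured as the ratio of good events to total events, an SLO is a target for that ratio, and the error budget is the unreliability the SLO permits (1 minus the target). The sketch below assumes a simple request-based availability SLI. The function names and the sample numbers are hypothetical and are not taken from the linked article.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the SLI as the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int,
                           slo_target: float) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo_target) * total_events  # bad events the SLO tolerates
    actual_bad = total_events - good_events          # bad events observed
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 of them failed, 99.9% SLO.
    sli = availability_sli(999_500, 1_000_000)
    remaining = error_budget_remaining(999_500, 1_000_000, slo_target=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.1%}")
```

In this hypothetical window the SLO tolerates about 1,000 failed requests, so 500 failures use up roughly half of the budget.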

Cloud Ops Sandbox

https://github.com/GoogleCloudPlatform/cloud-ops-sandbox

Examples

Creating an SLO in the sandbox

https://medium.com/google-cloud/measuring-reliability-in-gcp-step-by-step-slo-creation-guide-using-cloud-operation-sandbox-99043bd0e70f

SLO Risk analysis

https://cloud.google.com/blog/products/devops-sre/how-sres-analyze-risks-to-evaluate-slos
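One common way to frame this kind of analysis is to estimate, for each risk, how many "bad minutes" it is expected to cost per year and compare the total against the error budget the SLO allows. The sketch below assumes the cost of a risk is (time to detect + time to resolve) × fraction of users affected × incidents per year. The `Risk` class, its field names, and all sample values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass

DAYS_PER_YEAR = 365.25
MINUTES_PER_YEAR = DAYS_PER_YEAR * 24 * 60


@dataclass
class Risk:
    name: str
    detect_minutes: float            # estimated time to detect an incident
    resolve_minutes: float           # estimated time to resolve it
    impact_fraction: float           # fraction of users affected, 0.0 to 1.0
    days_between_failures: float     # estimated time between failures

    def bad_minutes_per_year(self) -> float:
        """Expected user-impacting minutes per year from this risk."""
        incidents_per_year = DAYS_PER_YEAR / self.days_between_failures
        outage_minutes = self.detect_minutes + self.resolve_minutes
        return outage_minutes * self.impact_fraction * incidents_per_year


def budget_minutes_per_year(slo_target: float) -> float:
    """Error budget expressed as minutes of full outage per year."""
    return (1.0 - slo_target) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical risk catalog.
    risks = [
        Risk("bad config push", 30, 90, 0.5, 90),
        Risk("database failover", 5, 25, 1.0, 180),
    ]
    budget = budget_minutes_per_year(0.999)
    total = sum(r.bad_minutes_per_year() for r in risks)
    for r in sorted(risks, key=Risk.bad_minutes_per_year, reverse=True):
        print(f"{r.name}: {r.bad_minutes_per_year():.1f} bad min/year")
    print(f"total {total:.1f} of {budget:.1f} budget min/year")
```

Sorting the risks by expected cost shows which ones consume the most budget, and a total above the budget suggests that either the SLO is unrealistic or the largest risks need mitigation.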