SRECon19 Day 1
Day 1: Comprehension, Understandability and Predictability
What Breaks our Systems: A Taxonomy of Black Swans by Laura Nolan of Slack
- A black swan is an outlier event: it's hard to predict and has a severe impact
- A term from Nassim Taleb
- White swans are 'easy' to resolve
- Every black swan is unique, but there are patterns
- Hitting limits
- Spreading slowness
- Thundering herds
- Automation interactions
- Cyber attacks
- Dependency problems
Hitting Limits
- Physical system limits
- Defense: load and capacity testing
- Include cloud services (let your provider know)
- Include write loads
- Use a replica of prod
- Go beyond your current size
- Defense: monitoring
- When the monitor is triggered, have instructions about what to do to fix it
Spreading Slowness
- Defense: fail fast
- Enforce deadlines for all requests - in and out
- Consider circuit breaker pattern (see the sketch below)
- Limit retries from a client
- Defense: Dashboards
- Problem: some resource is saturated
- Track this on a dashboard: utilization, saturation and errors
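A minimal sketch (mine, not from the talk) of the fail-fast defenses above: a per-request deadline, a small capped retry count, and a simple circuit breaker that stops calling a dependency after repeated failures. `call_dependency` and all the thresholds are hypothetical.

```python
import time
import random


class CircuitOpenError(Exception):
    """Raised when the circuit breaker refuses to make a call."""


class CircuitBreaker:
    """Trips open after max_failures consecutive failures, then rejects
    calls until reset_after seconds have passed (illustrative values)."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Half-open: allow one trial call through.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result


def call_with_deadline_and_retries(breaker, fn, deadline_s=1.0, max_retries=2):
    """Enforce an overall deadline and a bounded number of retries."""
    start = time.monotonic()
    for attempt in range(max_retries + 1):
        if time.monotonic() - start > deadline_s:
            raise TimeoutError("deadline exceeded; giving up")
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # don't retry into an open circuit
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(0.05 * (attempt + 1))  # brief backoff between retries


def call_dependency():
    """Hypothetical flaky dependency, used only for illustration."""
    if random.random() < 0.5:
        raise RuntimeError("dependency error")
    return "ok"


if __name__ == "__main__":
    breaker = CircuitBreaker()
    try:
        print(call_with_deadline_and_retries(breaker, call_dependency))
    except Exception as exc:
        print("request failed fast:", exc)
```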
Thundering Herds
- Can be users
- More often from systems: cron jobs, mobile clients updating at once, large batch jobs
- Defense: plan and test
- any internet-facing service can face a thundering herd
- Plan for degraded modes: which requests can be dropped? Queue input that can be processed asynchronously (sketched below)
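A rough sketch of the "queue input that can be processed asynchronously" idea: the internet-facing handler enqueues work and returns immediately, a background worker drains the queue at its own pace, and a bounded queue gives you a degraded mode (shed load) when a herd arrives. All names and numbers are illustrative, not from the talk.

```python
import queue
import threading
import time

# Bounded queue: if a thundering herd fills it, we shed load instead of melting down.
work_queue = queue.Queue(maxsize=1000)


def handle_request(payload):
    """Internet-facing handler: enqueue and acknowledge, or drop when saturated."""
    try:
        work_queue.put_nowait(payload)
        return "accepted"
    except queue.Full:
        return "rejected: try again later"  # degraded mode: drop non-critical work


def worker():
    """Background worker drains the queue at a steady rate."""
    while True:
        payload = work_queue.get()
        time.sleep(0.01)  # stand-in for real processing of `payload`
        work_queue.task_done()


if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    for i in range(5):
        print(handle_request({"job": i}))
    work_queue.join()
```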
Automation Interactions
- Defense: control
- Create constraints to limit automation operations (see the sketch below)
- Provide ways to disable automation
- All automation should log to one searchable place
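A hedged illustration of these three defenses together: before an automated action runs, check it against a blast-radius limit and an operator-controlled kill switch, and log every decision to a single place. The limit, the kill-switch file path, and `drain_hosts` are all made up for the example.

```python
import logging
import os

# One searchable destination for all automation decisions.
logging.basicConfig(filename="automation.log",
                    format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

MAX_HOSTS_PER_RUN = 5                          # illustrative blast-radius constraint
KILL_SWITCH_FILE = "/tmp/automation-disabled"  # hypothetical way to disable automation


def drain_hosts(hosts):
    """Drain hosts only if automation is enabled and the batch is small enough."""
    if os.path.exists(KILL_SWITCH_FILE):
        logging.warning("automation disabled by kill switch; skipping drain of %s", hosts)
        return False
    if len(hosts) > MAX_HOSTS_PER_RUN:
        logging.error("refusing to drain %d hosts (limit %d)", len(hosts), MAX_HOSTS_PER_RUN)
        return False
    for host in hosts:
        logging.info("draining %s", host)
        # ... the real drain call would go here ...
    return True


if __name__ == "__main__":
    print(drain_hosts(["web-%d" % i for i in range(3)]))   # allowed
    print(drain_hosts(["web-%d" % i for i in range(20)]))  # refused: too large
```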
Cyberattacks
- Defense: smaller blast radius
- Separate prod from non-prod as much as possible
- Split prod systems across multiple zones
Dependency Problems
- Defense: layer and test
- Layer your infrastructure
- Regularly test the process of starting your infrastructure up
- Beware of soft dependencies: they can easily become hard dependencies
Further Reading:
- Release It! by Michael T. Nygard
Complexity: The Crucial Ingredient in Your Kitchen by Casey Rosenthal of Verica.io
Question: how do we make systems reliable?
Challenger Case Study
- Functionality has redundancy
- Deviation is experience based (works on my machine)
- Issue is self-limiting
Avoiding Risk
- Exposure to risk is how we learn to deal with it, so don't avoid risk
Simplicity
- Accidental complexity: it is added gradually over time
- Essential complexity: it is there on purpose
Economic Pillars of Complexity
- States (adding features)
- Relationships (microservices and k8s increase the number of relationships)
- Environment (cloud provider or on prem?)
- Reversibility (build features in chunks so you can rollback)
Software Engineering: the Bureaucratic Profession
- Our industry separates who decides what will be done from who builds it
Think of a well-run kitchen
- Lots of tasks are
tl;dr
- Embrace complexity and navigate it
- Provide opportunities for teams to practice working together
- Tolerate inefficiencies
Case Study: Implementing SLOs for a New Service by Arnaud Lawson of Squarespace
Definitions
- Ceph Object Storage (COS)
- S3-compatible
- geo-distributed
- SLOs and SLIs
- Service level objectives
- Set performance and reliability targets for a service as seen by its users over a period of time
- Service level indicators
- Example SLO: API availability SLO: 99.9% of API requests will not fail over n weeks
- Example SLI: The percentage of API requests that do not fail
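To make the example availability SLI concrete, here is a tiny sketch (mine, not from the talk) that computes "% of API requests that do not fail" from a list of response status codes, assuming "fail" means a 5xx response.

```python
def availability_sli(status_codes):
    """Percentage of requests that did not fail (here: non-5xx responses)."""
    if not status_codes:
        return 100.0
    ok = sum(1 for code in status_codes if code < 500)
    return 100.0 * ok / len(status_codes)


# Example: 999 successes and 1 server error -> 99.9% availability.
print(availability_sli([200] * 999 + [503]))
```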
SLO implementation process
- Determine SLI types that best capture our users' experience
- Understand how users interact with COS
- Understand COS components and choose SLI types that best reflect users' experience
- request-driven RESTful interface
- Define SLIs, the thing to measure
- For the request-driven HTTP server
- Availability SLI: % of requests that do not fail
- Latency SLI: % of requests that complete in less than x seconds
- Choose how to measure these SLIs
- Collect SLIs from COS load balancer logs
- Instrument COS S3 client programs
- Deploy probers which perform common user actions
- Collect SLIs for a few weeks to get a baseline
- Deployed probers
- Record success and latency metrics per request type
- Infer error budgets from initial SLOs
- Example: 99.9% availability over 4 weeks -> 0.1% of requests can fail over 4 weeks
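A quick worked version of the error-budget inference above. The 4-week request volume is invented purely for illustration.

```python
def error_budget(slo_percent, total_requests):
    """Number of requests that may fail without breaching the SLO."""
    return int(total_requests * (100.0 - slo_percent) / 100.0)


# Assume (illustratively) 10 million requests in a 4-week window.
print(error_budget(99.9, 10_000_000))  # -> 10000 requests may fail
```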
Conclusion
- SLIs inform decisions for prioritizing reliability projects, doing capacity planning, etc
- SLI graphs help identify service issues
- Users easily determine whether our service is appropriate for a particular use case based on SLOs
- Use SLIs for monitoring; no one has to be paged while we're within the SLO
- Choose a metrics collection service with a powerful query language
- Data durability SLO implementation for storage systems can be tricky
Tips for SLOs
- Never strive for 100% reliability
- Understand the components of the system
- Know how users interact with the system
- Collect SLIs that measure the aspects of the system that matter to users
Fixing On-Call When Nobody Thinks It's Too Broken by Tony Lykke of Hudson River Trading
Why so much noise?
- That's how it's always been
- 'Snowflake noise': special systems or integrations
- We can't reduce noise unless we've got big corp money
- It's better than it used to be
9 Really Hard Steps to Reduce Pager Noise
- Understand your audience
- Consider why the team has its current attitude toward pages
- Understand the problem
- Find the data
- Look at your incident history in PagerDuty
- Use graphs to help your data analysis
- Understand the system
- What technologies are you using?
- What does the code look like?
- How is automation involved?
- Devise a Game Plan
- This doesn't have to be comprehensive
- Go after low-risk, high-impact changes first
- Communicate the plan and ask for feedback
- Listen to the data
- Get Permission (optional)
- Ask for forgiveness instead of permission?
- Use the data you've collected
- Over-communicate
- You will break things. Let the on-call person know what you're trying to do
- Lay the Groundwork
- Neglect creates technical debt
- Make your changes
- Set up CI/CD
- Fix the Lowest Hanging Fruit
- A data visualization may show you the low hanging fruit
- Communicate, Communicate, Communicate
- Blog posts
- RFCs
- Documentation
- Announcements
- Go Back to Step 7
Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value by Aaron Wieczorek, USDS
- The USDS was created after the redo of healthcare.gov
- Currently, USDS steps in when there is a crisis
- Example: airnow.gov during 2018 California Wildfires
How do we find these problems before they are a crisis?
- Monitor every .gov service
- There are ~25,000 services and apps for .gov and .mil
Custom solution as MVP
- Scripts that send requests
- Python requests and CLI
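A sketch of what such an MVP might look like ("scripts that send requests" with the Python requests library): probe a list of endpoints, record status and latency, and flag anything that looks down. The endpoint list and thresholds are placeholders, not what USDS actually ran.

```python
import time
import requests  # third-party: pip install requests

# Placeholder endpoints; the real system covered ~25,000 .gov/.mil services.
ENDPOINTS = [
    "https://airnow.gov",
    "https://www.healthcare.gov",
]


def probe(url, timeout=10):
    """Return (status_code_or_None, latency_seconds) for a single GET probe."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)
        return resp.status_code, time.monotonic() - start
    except requests.RequestException:
        return None, time.monotonic() - start


if __name__ == "__main__":
    for url in ENDPOINTS:
        status, latency = probe(url)
        if status is None or status >= 500:
            print(f"DOWN  {url} (status={status}, {latency:.2f}s)")
        else:
            print(f"OK    {url} (status={status}, {latency:.2f}s)")
```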
To build out the monitoring
- Prometheus, Grafana and InfluxDB
Lessons Learned
- Proactive monitoring allows immediate incident response
- Sometimes targets don't like it when you send a lot of requests in 3-5 minutes
- Dashboards with this many endpoints are hard
- What kind of time-series data are you pulling down?
- Alerting is hard
- Tuning monitoring settings for a large system is hard
Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way by Michael Kehoe and Todd Palino of LinkedIn
- slides
- Code Yellow: the term LinkedIn uses to identify when an SRE team needs extra attention, time, or support
- Code Yellow at LinkedIn
- Liz Fong-Jones's summary of the talk
When You've Declared Code Yellow
- Problem Statement:
- Admit there is a problem
- Measure it
- Understand it
- Determine the underlying causes that need to be fixed
- Exit Criteria
- Define concrete goals
- Define success criteria
- Define timelines
- Get the help you require
- Ask other teams for help
- Get dedicated engineers, PMs, etc.
- Timebound
- Planning
- Plan out short-term work
- Plan long-term projects
- Prioritize work that will reduce toil and burnout
- Communication and Partnerships
- Communicate problem statement and exit criteria
- Send regular progress updates
- Ensure that stakeholders understand delays and expected outcomes
Create a Code Review Culture by Jonathan Turner of Squarespace
Code Reviews are Useful Because...
- they ensure higher quality code
- they are a communication platform
- they provide an opportunity to teach
Be intentional about your culture by...
- explicitly describing what your culture entails
- establishing a community of experts
- developing new experts
- training code reviewers
Advice for Code Authors
- Make the reviewer's life easier by communicating as much context as you can
- Establish your PR style with a PR guide
- If no guide, write a good description
- Make the PR a manageable size
- What's the smallest vertical slice of functionality meaningful to your users?
Advice for Code Reviewers
- Automate the nits
- Know when to take the PR review offline
- Communicate mutual respect
- be as thorough as the PR needs
- Review in passes. (Make a master PR review checklist)
- John's PR Review Checklist:
- Size it up (what's the shape of the PR, is the PR the right size?)
- Context
- What is the PR trying to accomplish?
- Why is this PR trying to accomplish that?
- Does the PR accomplish what it says?
- Relevance
- Is the change necessary?
- Is code the right solution?
- Are there other people that should be aware of this PR?
- Readability
- Is the change understood without knowing the specific language?
- Are any esoteric language features being used?
- Production Readiness
- How will we know when this breaks?
- Is there new documentation required?
- Are there tests that prevent regression?
- Is the change secure?
- Naming
- Do names communicate what things do?
- Are the names of things idiomatic to the language?
- Do the names leak implementation details?
- Gotchas
- What are ways the code can break?
- Is the code subject to any common programming gotchas?
- Is spelling correct and consistent?
- Language specific
- Is the code well designed?
- Is the code idiomatic to the language?
- Are new patterns introduced?
- Does the code fall prey to common pitfalls of the language?
- John's Code Review Checklist
Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance by Lynn Root of Spotify
Tracing Overview
- A trace follows a complete workflow: the start of a request to its end
- It's easy to trace a simple request
- When you have services, there are a lot more places the data flows through
- Historically, tracing has been machine-centric. We want workflow-centric tracing
- Workflow-centric tracing lets you see dependencies
Why trace?
- Performance analysis
- Anomaly detection
- Profiling (interested in just one component)
- Resource attribution
- Workload modeling
- You can begin asking 'what if?' questions
Approaches to Tracing
- Manual
- Blackbox
- Metadata propagation
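A minimal illustration of the metadata-propagation approach: each incoming request gets (or keeps) a trace ID, and that ID is passed along on every outgoing call so the workflow can be stitched back together. This is a hand-rolled sketch, not any particular tracing library's API; the header name is invented.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # illustrative header name


def handle_incoming(headers):
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    print(f"[trace {trace_id}] handling request in service A")
    call_downstream(trace_id)
    return trace_id


def call_downstream(trace_id):
    """Propagate the trace ID on the outgoing call's headers."""
    outgoing_headers = {TRACE_HEADER: trace_id}
    # A real service would make an HTTP/RPC call here with outgoing_headers.
    print(f"[trace {trace_id}] calling service B with {outgoing_headers}")


if __name__ == "__main__":
    handle_incoming({})                        # new workflow: fresh trace ID
    handle_incoming({TRACE_HEADER: "abc123"})  # continues an existing trace
```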
Four Things to Think About
- What relationships will you track?
- How to track them
- Which sampling approach to take
- How to visualize
How to Sample
- Head-based
- Makes random sampling decisions at the beginning of the workflow (see the sketch below)
- Tail-based
- Makes decisions at the end of the workflow
- Unitary
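A sketch of head-based sampling as described above: the keep/drop decision is made once, randomly, at the start of the workflow and then carried with the trace so every downstream hop agrees. The 1% rate is just an example.

```python
import random

SAMPLE_RATE = 0.01  # example: keep ~1% of workflows


def start_trace():
    """Head-based: decide at the start of the workflow and carry the decision along."""
    return {"sampled": random.random() < SAMPLE_RATE}


def record_span(trace_context, name):
    """Downstream services honor the decision made at the head."""
    if trace_context["sampled"]:
        print(f"recording span: {name}")


if __name__ == "__main__":
    ctx = start_trace()
    record_span(ctx, "frontend.request")
    record_span(ctx, "backend.query")
```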
What to visualize?
- Gantt charts only show requests from a single trace
- Request flow graph
- Context calling tree