SRECon19 Day 1
Day 1: Comprehension, Understandability and Predictability
What Breaks our Systems: A Taxonomy of Black Swans by Laura Nolan of Slack
- A black swan is an outlier event: it's hard to predict and has a severe impact
- A term from Nassim Taleb
- White swans are 'easy' to resolve
- Every black swan is unique, but there are patterns
- Hitting limits
- Spreading slowness
- Thundering herds
- Automation interactions
- Cyber attacks
- Dependency problems
Hitting Limits
- Physical system limits
- Defense: load and capacity testing
- Include cloud services (let your provider know)
- Include write loads
- Use a replica of prod
- Go beyond your current size
- Defense: monitoring
- When the monitor is triggered, have instructions about what to do to fix it
Spreading Slowness
- Defense: fail fast
- Enforce deadlines for all requests - in and out
- Consider circuit breaker pattern (see the sketch below)
- Limit retries from a client
- Defense: Dashboards
- Problem: some resource is saturated
- Track this on a dashboard: utilization, saturation and errors
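A minimal sketch (mine, not from the talk) of the fail-fast defenses above: a per-request deadline, a small capped retry count, and a simple circuit breaker that stops calling a dependency after repeated failures. `call_dependency` and all the thresholds are hypothetical.

```python
import time
import random


class CircuitOpenError(Exception):
    """Raised when the circuit breaker refuses to make a call."""


class CircuitBreaker:
    """Trips open after max_failures consecutive failures, then rejects
    calls until reset_after seconds have passed (illustrative values)."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Half-open: allow one trial call through.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result


def call_with_deadline_and_retries(breaker, fn, deadline_s=1.0, max_retries=2):
    """Enforce an overall deadline and a bounded number of retries."""
    start = time.monotonic()
    for attempt in range(max_retries + 1):
        if time.monotonic() - start > deadline_s:
            raise TimeoutError("deadline exceeded; giving up")
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # don't retry into an open circuit
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(0.05 * (attempt + 1))  # brief backoff between retries


def call_dependency():
    """Hypothetical flaky dependency, used only for illustration."""
    if random.random() < 0.5:
        raise RuntimeError("dependency error")
    return "ok"


if __name__ == "__main__":
    breaker = CircuitBreaker()
    try:
        print(call_with_deadline_and_retries(breaker, call_dependency))
    except Exception as exc:
        print("request failed fast:", exc)
```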
Thundering Herds
- Can be users
- More often from systems: cron jobs, mobile clients updating at once, large batch jobs
- Defense: plan and test
- any internet-facing service can face a thundering herd
- Plan for degraded modes: which requests can be dropped? Queue input that can be processed asynchronously (sketched below)
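A rough sketch of the "queue input that can be processed asynchronously" idea: the internet-facing handler enqueues work and returns immediately, a background worker drains the queue at its own pace, and a bounded queue gives you a degraded mode (shed load) when a herd arrives. All names and numbers are illustrative, not from the talk.

```python
import queue
import threading
import time

# Bounded queue: if a thundering herd fills it, we shed load instead of melting down.
work_queue = queue.Queue(maxsize=1000)


def handle_request(payload):
    """Internet-facing handler: enqueue and acknowledge, or drop when saturated."""
    try:
        work_queue.put_nowait(payload)
        return "accepted"
    except queue.Full:
        return "rejected: try again later"  # degraded mode: drop non-critical work


def worker():
    """Background worker drains the queue at a steady rate."""
    while True:
        payload = work_queue.get()
        time.sleep(0.01)  # stand-in for real processing of `payload`
        work_queue.task_done()


if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    for i in range(5):
        print(handle_request({"job": i}))
    work_queue.join()
```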
Automation Interactions
- Defense: control
- Create constraints to limit automation operations (see the sketch below)
- Provide ways to disable automation
- All automation should log to one searchable place
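A hedged illustration of these three defenses together: before an automated action runs, check it against a blast-radius limit and an operator-controlled kill switch, and log every decision to a single place. The limit, the kill-switch file path, and `drain_hosts` are all made up for the example.

```python
import logging
import os

# One searchable destination for all automation decisions.
logging.basicConfig(filename="automation.log",
                    format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

MAX_HOSTS_PER_RUN = 5                          # illustrative blast-radius constraint
KILL_SWITCH_FILE = "/tmp/automation-disabled"  # hypothetical way to disable automation


def drain_hosts(hosts):
    """Drain hosts only if automation is enabled and the batch is small enough."""
    if os.path.exists(KILL_SWITCH_FILE):
        logging.warning("automation disabled by kill switch; skipping drain of %s", hosts)
        return False
    if len(hosts) > MAX_HOSTS_PER_RUN:
        logging.error("refusing to drain %d hosts (limit %d)", len(hosts), MAX_HOSTS_PER_RUN)
        return False
    for host in hosts:
        logging.info("draining %s", host)
        # ... the real drain call would go here ...
    return True


if __name__ == "__main__":
    print(drain_hosts(["web-%d" % i for i in range(3)]))   # allowed
    print(drain_hosts(["web-%d" % i for i in range(20)]))  # refused: too large
```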
Cyberattacks
- Defense: smaller blast radius
- Separate prod from non-prod as much as possible
- Split prod systems across multiple zones
Dependency Problems
- Defense: layer and test
- Layer your infrastructure
- Regularly test the process of starting your infrastructure up
- Beware of soft dependencies: they can easily become hard dependencies
Further Reading:
- Release It! by Michael T. Nygard
Complexity: The Crucial Ingredient in Your Kitchen by Casey Rosenthal of Verica.io
Question: how do we make systems reliable?
Challenger Case Study
- Functionality has redundancy
- Deviation is experience based (works on my machine)
- Issue is self-limiting
Avoiding Risk
- Exposure to risk is how we learn to deal with it, so don't avoid risk
Simplicity
- Accidental complexity: it is added gradually over time
- Essential complexity: it is there on purpose
Economic Pillars of Complexity
- States (adding features)
- Relationships (microservices and k8s increase the number of relationships)
- Environment (cloud provider or on prem?)
- Reversibility (build features in chunks so you can rollback)
Software Engineering: the Bureaucratic Profession
- Our industry separates who decides what will be done from who builds it
Think of a well-run kitchen
- Lots of tasks are
tl;dr
- Embrace complexity and navigate it
- Provide opportunities for teams to practice working together
- Tolerate inefficiencies
Case Study: Implementing SLOs for a New Service by Arnaud Lawson of Squarespace
Definitions
- Ceph Object Storage (COS)
- S3-compatible
- geo-distributed
- SLOs and SLIs
- Service level objectives
- Set performance and reliability targets for a service as seen by its users over a period of time
- Service level indicators
- Example SLO: API availability SLO: 99.9% of API requests will not fail over n weeks
- Example SLI: The percentage of API requests that do not fail
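To make the example availability SLI concrete, here is a tiny sketch (mine, not from the talk) that computes "% of API requests that do not fail" from a list of response status codes, assuming "fail" means a 5xx response.

```python
def availability_sli(status_codes):
    """Percentage of requests that did not fail (here: non-5xx responses)."""
    if not status_codes:
        return 100.0
    ok = sum(1 for code in status_codes if code < 500)
    return 100.0 * ok / len(status_codes)


# Example: 999 successes and 1 server error -> 99.9% availability.
print(availability_sli([200] * 999 + [503]))
```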
SLO implementation process
- Determine SLI types that best capture our users' experience
- Understand how users interact with COS
- Understand COS components and choose SLI types that best reflect users' experience
- request-driven RESTful interface
- Define SLIs, the thing to measure
- For the request-driven HTTP server
- Availability SLI: % of requests that do not fail
- Latency SLI: % of requests that complete in less than x seconds
- Choose how to measure these SLIs
- Collect SLIs from COS load balancer logs
- Instrument COS S3 client programs
- Deploy probers which perform common user actions
- Collect SLIs for a few weeks to get a baseline
- Deployed probers
- Record success and latency metrics per request type
- Infer error budgets from initial SLOs
- Example: 99.9% availability over 4 weeks -> 0.1% of requests can fail over 4 weeks
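A quick worked version of the error-budget inference above. The 4-week request volume is invented purely for illustration.

```python
def error_budget(slo_percent, total_requests):
    """Number of requests that may fail without breaching the SLO."""
    return int(total_requests * (100.0 - slo_percent) / 100.0)


# Assume (illustratively) 10 million requests in a 4-week window.
print(error_budget(99.9, 10_000_000))  # -> 10000 requests may fail
```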
Conclusion
- SLIs inform decisions for prioritizing reliability projects, doing capacity planning, etc
- SLI graphs help identify service issues
- Users easily determine whether our service is appropriate for a particular use case based on SLOs
- Use SLIs for monitoring; no one has to be paged while we're within the SLO
- Choose a metrics collection service with a powerful query language
- Data durability SLO implementation for storage systems can be tricky
Tips for SLOs
- Never strive for 100% reliability
- Understand the components of the system
- Know how users interact with the system
- Collect SLIs that measure the aspects of the system that matter to users
Fixing On-Call When Nobody Thinks It's Too Broken by Tony Lykke of Hudson River Trading
Why so much noise?
- That's how it's always been
- 'Snowflake noise': special systems or integrations
- We can't reduce noise unless we've got big corp money
- It's better than it used to be
9 Really Hard Steps to Reduce Pager Noise
- Understand your audience
- Consider why the team has its current attitude toward pages
- Understand the problem
- Find the data
- Look at your incident history in PagerDuty
- Use graphs to help your data analysis
- Understand the system
- What technologies are you using?
- What does the code look like?
- How is automation involved?
- Devise a Game Plan
- This doesn't have to be comprehensive
- Go after low-risk, high-impact changes first
- Communicate the plan and ask for feedback
- Listen to the data
- Get Permission (optional)
- Ask for forgiveness instead of permission?
- Use the data you've collected
- Over-communicate
- You will break things. Let the on-call person know what you're trying to do
- Lay the Groundwork
- Neglect creates technical debt
- Make your changes
- Set up CI/CD
- Fix the Lowest Hanging Fruit
- A data visualization may show you the low hanging fruit
- Communicate, Communicate, Communicate
- Blog posts
- RFCs
- Documentation
- Announcements
- Go Back to Step 7
Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value by Aaron Wieczorek, USDS
- The USDS was created after the redo of healthcare.gov
- Currently, USDS steps in when there is a crisis
- Example: airnow.gov during 2018 California Wildfires
How do we find these problems before they are a crisis?
- Monitor every .gov service
- There are ~25,000 services and apps for .gov and .mil
Custom solution as MVP
- Scripts that send requests
- Python requests and CLI
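A sketch of what such an MVP might look like ("scripts that send requests" with the Python requests library): probe a list of endpoints, record status and latency, and flag anything that looks down. The endpoint list and thresholds are placeholders, not what USDS actually ran.

```python
import time
import requests  # third-party: pip install requests

# Placeholder endpoints; the real system covered ~25,000 .gov/.mil services.
ENDPOINTS = [
    "https://airnow.gov",
    "https://www.healthcare.gov",
]


def probe(url, timeout=10):
    """Return (status_code_or_None, latency_seconds) for a single GET probe."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)
        return resp.status_code, time.monotonic() - start
    except requests.RequestException:
        return None, time.monotonic() - start


if __name__ == "__main__":
    for url in ENDPOINTS:
        status, latency = probe(url)
        if status is None or status >= 500:
            print(f"DOWN  {url} (status={status}, {latency:.2f}s)")
        else:
            print(f"OK    {url} (status={status}, {latency:.2f}s)")
```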
To build out the monitoring
- Prometheus, Grafana and InfluxDB
Lessons Learned
- Proactive monitoring allows immediate incident response
- Sometimes targets don't like it when you send a lot of requests in 3-5 minutes
- Dashboards with this many endpoints are hard
- What kind of time-series data are you pulling down?
- Alerting is hard
- Tuning monitoring settings for a large system is hard
Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way by Michael Kehoe and Todd Palino of LinkedIn
- slides
- Code Yellow: the term LinkedIn uses to identify when an SRE team needs extra attention, time, or support
- Code Yellow at LinkedIn
- Liz Fong-Jones's summary of the talk
When You've Declared Code Yellow
- Problem Statement:
- Admit there is a problem
- Measure it
- Understand it
- Determine the underlying causes that need to be fixed
- Exit Criteria
- Define concrete goals
- Define success criteria
- Define timelines
- Get the help you require
- Ask other teams for help
- Get dedicated engineers, PMs, etc.
- Timebound
- Planning
- Plan out short-term work
- Plan long-term projects
- Prioritize work that will reduce toil and burnout
- Communication and Partnerships
- Communicate problem statement and exit criteria
- Send regular progress updates
- Ensure that stakeholders understand delays and expected outcomes
Create a Code Review Culture by Jonathan Turner of Squarespace
Code Reviews are Useful Because...
- they ensure higher quality code
- they are a communication platform
- they provide an opportunity to teach
Be intentional about your culture by...
- explicitly describing what your culture entails
- establishing a community of experts
- developing new experts
- training code reviewers
Advice for Code Authors
- Make the reviewer's life easier by communicating as much context as you can
- Establish your PR style with a PR guide
- If no guide, write a good description
- Make the PR a manageable size
- What's the smallest vertical slice of functionality meaningful to your users?
Advice for Code Reviewers
- Automate the nits
- Know when to take the PR review offline
- Communicate mutual respect
- be as thorough as the PR needs
- Review in passes. (Make a master PR review checklist)
- John's PR Review Checklist:
- Size it up (what's the shape of the PR, is the PR the right size?)
- Context
- What is the PR trying to accomplish?
- Why is this PR trying to accomplish that?
- Does the PR accomplish what it says?
- Relevance
- Is the change necessary?
- Is code the right solution?
- Are there other people that should be aware of this PR?
- Readability
- Is the change understood without knowing the specific language?
- Are any esoteric language features being used?
- Production Readiness
- How will we know when this breaks?
- Is there new documentation required?
- Are there tests that prevent regression?
- Is the change secure?
- Naming
- Do names communicate what things do?
- Are the names of things idiomatic to the language?
- Do the names leak implementation details?
- Gotchas
- What are ways the code can break?
- Is the code subject to any common programming gotchas?
- Is spelling correct and consistent?
- Language specific
- Is the code well designed?
- Is the code idiomatic to the language?
- Are new patterns introduced?
- Does the code fall prey to common pitfalls of the language?
- John's Code Review Checklist
Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance by Lynn Root of Spotify
Tracing Overview
- A trace follows a complete workflow: the start of a request to its end
- It's easy to trace a simple request
- When you have services, there are a lot more places the data flows through
- Historically, tracing has been machine-centric. We want workflow-centric tracing
- Workflow-centric tracing lets you see dependencies
Why trace?
- Performance analysis
- Anomaly detection
- Profiling (interested in just one component)
- Resource attribution
- Workload modeling
- You can begin asking 'what if?' questions
Approaches to Tracing
- Manual
- Blackbox
- Metadata propagation
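A minimal illustration of the metadata-propagation approach: each incoming request gets (or keeps) a trace ID, and that ID is passed along on every outgoing call so the workflow can be stitched back together. This is a hand-rolled sketch, not any particular tracing library's API; the header name is invented.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # illustrative header name


def handle_incoming(headers):
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    print(f"[trace {trace_id}] handling request in service A")
    call_downstream(trace_id)
    return trace_id


def call_downstream(trace_id):
    """Propagate the trace ID on the outgoing call's headers."""
    outgoing_headers = {TRACE_HEADER: trace_id}
    # A real service would make an HTTP/RPC call here with outgoing_headers.
    print(f"[trace {trace_id}] calling service B with {outgoing_headers}")


if __name__ == "__main__":
    handle_incoming({})                        # new workflow: fresh trace ID
    handle_incoming({TRACE_HEADER: "abc123"})  # continues an existing trace
```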
Four Things to Think About
- What relationships will you track?
- How to track them
- Which sampling approach to take
- How to visualize
How to Sample
- Head-based
- Makes random sampling decisions at the beginning of the workflow (see the sketch below)
- Tail-based
- Makes decisions at the end of the workflow
- Unitary
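A sketch of head-based sampling as described above: the keep/drop decision is made once, randomly, at the start of the workflow and then carried with the trace so every downstream hop agrees. The 1% rate is just an example.

```python
import random

SAMPLE_RATE = 0.01  # example: keep ~1% of workflows


def start_trace():
    """Head-based: decide at the start of the workflow and carry the decision along."""
    return {"sampled": random.random() < SAMPLE_RATE}


def record_span(trace_context, name):
    """Downstream services honor the decision made at the head."""
    if trace_context["sampled"]:
        print(f"recording span: {name}")


if __name__ == "__main__":
    ctx = start_trace()
    record_span(ctx, "frontend.request")
    record_span(ctx, "backend.query")
```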
What to visualize?
- Gantt charts only show requests from a single trace
- Request flow graph
- Context calling tree