Chaos Architecture
Adrian Cockcroft (AWS), Gluecon: May 16, 2018
The cloud offers benefits
- Fast: Companies can get up and running quickly
- Scale: You can grow more easily
- Strategic: Datacenter replacements
- Geographically distributed systems
Architecture Questions
- What should your system do when something fails?
- Stop or
- Carry on with reduced functionality
- If a permissions lookup fails, should you stop or continue?
- Paper: Memories, Guesses and Apologies by Pat Helland
- Do you have a backup datacenter?
- How often do you failover apps to it?
- How do you know that your system works?
- Drift Into Failure by Sidney Dekker
- Release It! by Michael Nygard
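The "stop or carry on" question for a failed permissions lookup can be made concrete with a small sketch. Everything here is hypothetical and not from the talk: the `OnFailure` policy, the `PermissionServiceDown` exception, and the `check_permission` helper are illustrative names, and the cache stands in for the "memory" that a degraded system guesses from.

```python
from dataclasses import dataclass
from enum import Enum


class OnFailure(Enum):
    STOP = "stop"          # fail closed: refuse the request
    CARRY_ON = "carry_on"  # degrade: fall back to the last known answer


class PermissionServiceDown(Exception):
    """Raised by the (hypothetical) permission backend when it is unreachable."""


@dataclass
class Decision:
    allowed: bool
    degraded: bool  # True when the backend could not be reached


def check_permission(user, lookup, cache, on_failure):
    """Return a Decision for `user`, applying `on_failure` if `lookup` raises."""
    try:
        allowed = lookup(user)
        cache[user] = allowed  # remember the answer for later guesses
        return Decision(allowed, degraded=False)
    except PermissionServiceDown:
        if on_failure is OnFailure.STOP:
            return Decision(False, degraded=True)
        # Carry on: guess from memory; users never seen before are still denied.
        return Decision(cache.get(user, False), degraded=True)
```

Which policy is right depends on the cost of a wrong guess: carrying on keeps the service usable during an outage, but a revoked permission may be honored until the backend returns. The `degraded` flag lets callers log or later "apologize" for decisions made from stale state.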
You can't predict and prepare for every failure
- Instead, get good at fast detection and response (Chris Pinkham)
Chaos Architecture
- 4 layers: Users, Application, Switching, Infrastructure
Chaos Engineering
- Chaos engineers are responsible for creating 'fire drills' for users
- You find the weaknesses in a system
- Chaos engineering tools:
- Game days
- Simian Army (open source project from Netflix)
- Chaos Toolkit
- ChAP (Chaos Automation Platform)
- Gremlin
- Red Team tools:
- SafeStack AVA
- Infection Monkey
- ChaoSlingr
- AttackIQ
- SafeBreach
- A new trend: Blending the industrial view of safety with the software view
- Todd Conklin
- John Allspaw
- Synoptic illegibility: parts of the process are invisible and cannot be written down. You can't write a synopsis.
- Hypothesis testing
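Hypothesis testing in this context usually means: state what "healthy" looks like, inject a failure, and check whether the system still looks healthy. A minimal sketch of that loop follows. The `run_experiment` function and its three callables are hypothetical names for illustration, not the API of any of the tools listed above.

```python
def run_experiment(steady_state, inject_fault, rollback):
    """Run one chaos experiment and report whether the hypothesis held.

    steady_state: callable returning True when the system looks healthy
    inject_fault: callable that introduces the failure
    rollback:     callable that removes the failure
    """
    if not steady_state():
        # Never inject faults into a system that is already unhealthy.
        return "aborted: system not in steady state before the experiment"
    inject_fault()
    try:
        held = steady_state()
    finally:
        rollback()  # always undo the fault, even if the probe raised
    return "hypothesis held" if held else "weakness found"
```

For example, with a steady-state probe of "at least one replica is serving traffic" and a fault that removes one replica, a two-replica service should report "hypothesis held", while a single-replica service should report "weakness found".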
AWS
- Amazon Aurora DB Cluster Fault Injection Queries
- You can crash the master or a replica
- IAM Region Restriction
- Simulate regional API outages by changing the list of permitted regions
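Both AWS techniques above can be sketched as small helpers. This is a rough illustration under stated assumptions: the `ALTER SYSTEM CRASH` statement is the Aurora MySQL fault injection query form as best I recall it, the policy relies on the `aws:RequestedRegion` IAM condition key, and the helper names (`aurora_crash_statement`, `region_restriction_policy`, `simulate_region_outage`) are hypothetical. Check the current AWS documentation before running either against a real account.

```python
def aurora_crash_statement(component="INSTANCE"):
    """Build an Aurora fault injection query that crashes one component.

    Assumption: the Aurora MySQL syntax ALTER SYSTEM CRASH [INSTANCE|DISPATCHER|NODE].
    """
    allowed = {"INSTANCE", "DISPATCHER", "NODE"}
    if component not in allowed:
        raise ValueError(f"component must be one of {sorted(allowed)}")
    return f"ALTER SYSTEM CRASH {component};"


def region_restriction_policy(permitted_regions):
    """Build an IAM policy document denying API calls outside the permitted regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsidePermittedRegions",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": list(permitted_regions)
                    }
                },
            }
        ],
    }


def simulate_region_outage(permitted_regions, failed_region):
    """Return a policy with `failed_region` removed from the permitted list."""
    remaining = [r for r in permitted_regions if r != failed_region]
    if not remaining:
        raise ValueError("at least one region must stay permitted")
    return region_restriction_policy(remaining)
```

The crash statement would be run on the master or on a replica connection, depending on which one the drill targets. The policy document can be serialized with `json.dumps` and attached to the role under test; note that a blanket deny like this may also block global services that are served from a single region, so a real drill would likely need exceptions.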
- Kubernetes
- Gremlin attacks
- Open Source Chaos Toolkit
- CNCF Chaos Working Group
Bottom Line
- Expensive disaster recovery is being replaced by low-cost, automated chaos engineering
Things to research:
- What is a security certificate?
- DB and application torture