Chaos Architecture

Adrian Cockcroft (AWS), Gluecon, May 16, 2018

What should your system do when something fails?
- Stop or
- Carry on with reduced functionality
If a permissions lookup fails, should you stop or continue?
- Paper: Memories, Guesses and Apologies by Pat Helland
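One way to picture the "stop or carry on" choice for a failed permissions lookup is the sketch below. It is only an illustration, not something shown in the talk: `PermissionLookupError`, `check_permission`, and the in-memory `_last_known` cache are hypothetical names. The idea is to answer from the last known value (a memory), fall back to a configurable default (a guess), and report where the answer came from so it can be reconciled later if it was wrong (an apology).

```python
class PermissionLookupError(Exception):
    """Raised when the (hypothetical) permissions service cannot be reached."""


# Hypothetical cache of the last answers we saw: the "memory".
_last_known = {}


def check_permission(user, action, lookup, fail_open=False):
    """Return (allowed, source), where source is 'live', 'cached', or 'default'.

    `lookup` is any callable (user, action) -> bool that may raise
    PermissionLookupError. `fail_open` decides what to do with no memory:
    False means stop (deny), True means carry on (allow).
    """
    key = (user, action)
    try:
        allowed = lookup(user, action)
    except PermissionLookupError:
        if key in _last_known:
            # Carry on with reduced confidence: guess from memory.
            return _last_known[key], "cached"
        # No memory at all: fall back to the configured default.
        return fail_open, "default"
    _last_known[key] = allowed
    return allowed, "live"
```

Whether `fail_open` should be true depends on the cost of a wrong answer: a read-only dashboard might carry on, while a payment action probably should stop.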
Do you have a backup datacenter?
- How often do you fail over apps to it?
How do you know that your system works?
- Drift Into Failure by Sidney Dekker
- Release It! by Michael Nygard

Layers: Users, Application, Switching, Infrastructure

Chaos engineers are responsible for creating 'fire drills' for users
- You find the weaknesses in a system
Chaos engineering tools:
- Game days
- Simian Army (open source project from Netflix)
- Chaos Toolkit
- ChAP (Chaos Automation Platform)
- Gremlin
Red Team tools:
- SafeStack AVA
- Infection Monkey
- ChaoSlingr
- AttackIQ
- SafeBreach
A new trend: Blending the industrial view of safety with the software view
- Todd Conklin
- John Allspaw
Synoptic illegibility: parts of the process are invisible and cannot be written down. You can't write a synopsis.
Hypothesis testing
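
Hypothesis testing in a chaos experiment can be sketched roughly as follows: state what "steady state" means, inject a fault, and check whether the steady state still holds. This is a minimal illustration under assumptions, not a tool from the talk; `run_experiment` and its three callables are hypothetical names.

```python
def run_experiment(steady_state, inject_fault, rollback):
    """Run one chaos experiment and report whether the hypothesis held.

    steady_state: callable returning True when the system looks healthy.
    inject_fault: callable that introduces the failure under test.
    rollback:     callable that undoes the fault; always runs after injection.
    """
    # Only experiment on a system that is healthy to begin with.
    if not steady_state():
        return "aborted: system not in steady state before the experiment"
    inject_fault()
    try:
        # Hypothesis: the steady state survives the fault.
        held = steady_state()
    finally:
        rollback()
    return "hypothesis held" if held else "weakness found"
```

For example, with a steady state of "at least two replicas are serving" and a fault that removes one replica, a three-replica system should return "hypothesis held" and a two-replica system should return "weakness found".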

Amazon Aurora DB Cluster Fault Injection Queries
- You can crash the master or a replica
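
As a rough sketch, the helpers below build fault injection statements in the form I understand Aurora MySQL to accept (`ALTER SYSTEM CRASH ...` and `ALTER SYSTEM SIMULATE ... READ REPLICA FAILURE ...`). The function names are hypothetical, the statements are only constructed and not executed, and the exact syntax should be checked against the Aurora documentation for your engine version before running anything.

```python
import re


def crash_query(target="INSTANCE"):
    """Build a statement that crashes the instance it is run against.

    Assumed valid targets: INSTANCE, DISPATCHER, NODE.
    """
    if target not in {"INSTANCE", "DISPATCHER", "NODE"}:
        raise ValueError(f"unknown crash target: {target}")
    return f"ALTER SYSTEM CRASH {target};"


def replica_failure_query(percent, seconds, replica=None):
    """Build a statement that simulates read replica failure for a while.

    percent: share of requests to fail (0-100).
    seconds: how long the simulated failure lasts.
    replica: a replica name, or None to target all replicas.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if replica is None:
        scope = "TO ALL"
    else:
        # Keep the name to simple characters so it is safe to quote.
        if not re.fullmatch(r"[A-Za-z0-9-]+", replica):
            raise ValueError("unexpected characters in replica name")
        scope = f'TO "{replica}"'
    return (
        f"ALTER SYSTEM SIMULATE {percent} PERCENT READ REPLICA FAILURE "
        f"{scope} FOR INTERVAL {seconds} SECOND;"
    )
```

Running the crash statement on the writer versus a reader is what selects which instance fails, so the connection endpoint matters as much as the statement itself.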
IAM Region Restriction
- Simulate regional API outages by changing the list of permitted regions
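
A minimal sketch of the region restriction idea: generate an IAM policy document that denies requests to any region not in a permitted list, then "fail" a region by dropping it from the list. The `aws:RequestedRegion` condition key is a real IAM global condition key; the function names and the `Sid` are hypothetical, and attaching the policy (for example as a service control policy) is left out. Note that a blanket deny like this can also block global services, so a real policy would likely need exemptions.

```python
import json


def region_restriction_policy(permitted_regions):
    """Return a policy dict that denies API calls outside permitted_regions."""
    if not permitted_regions:
        raise ValueError("at least one region must stay permitted")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsidePermittedRegions",  # hypothetical name
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(permitted_regions)
                    }
                },
            }
        ],
    }


def simulate_outage(permitted_regions, failed_region):
    """Return the policy with failed_region removed from the permitted list."""
    remaining = [r for r in permitted_regions if r != failed_region]
    return region_restriction_policy(remaining)


if __name__ == "__main__":
    policy = simulate_outage(["us-east-1", "us-west-2"], "us-east-1")
    print(json.dumps(policy, indent=2))
```

Restoring the original list ends the simulated outage, which makes this a comparatively easy experiment to roll back.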
Kubernetes
- Gremlin attacks
- Open Source Chaos Toolkit
CNCF Chaos Working Group

Things to research: