Chaos Architecture

Adrian Cockcroft (AWS), Gluecon: May 16, 2018

The cloud offers benefits

  • Fast: Companies can get up and running quickly
  • Scale: You can grow more easily
  • Strategic: Datacenter replacement
    • Geographically distributed systems

Architecture Questions

  • What should your system do when something fails?
    • Stop or
    • Carry on with reduced functionality
  • If a permissions lookup fails, should you stop or continue?
    • Paper: Memories, Guesses and Apologies by Pat Helland
  • Do you have a backup datacenter?
    • How often do you failover apps to it?
  • How do you know that your system works?
    • Drift Into Failure by Sidney Dekker
    • Release It! by Michael Nygard
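
The stop-or-continue question for a failed permissions lookup can be made concrete with a small sketch. This is only an illustration, not something shown in the talk: `check_permission`, `Decision`, the `cache` dict, and the `fail_open` flag are all hypothetical names. The idea is to answer from the last known result when the lookup fails (a "guess" in Helland's terms) and to flag the answer as degraded so the caller can apologize later if the guess was wrong.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Decision:
    allowed: bool
    degraded: bool  # True when the answer is a guess, not an authoritative lookup


def check_permission(
    user: str,
    lookup: Callable[[str], bool],   # hypothetical call to a permissions service
    cache: Dict[str, bool],          # last known answers, kept by the caller
    fail_open: bool = False,         # policy choice when there is nothing to guess from
) -> Decision:
    try:
        allowed = lookup(user)
        cache[user] = allowed  # remember the latest authoritative answer
        return Decision(allowed=allowed, degraded=False)
    except Exception:
        # The lookup failed: carry on with reduced functionality instead of stopping.
        if user in cache:
            return Decision(allowed=cache[user], degraded=True)
        # No memory to guess from: the policy flag decides between stop and continue.
        return Decision(allowed=fail_open, degraded=True)
```

With `fail_open=False` an unknown user is refused while the service is down (the "stop" choice); with `fail_open=True` the system continues and accepts the risk. Which default is right depends on what the permission protects.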

You can't predict and prepare for every failure

  • Instead, get good at fast detection and response (Chris Pinkham)

Chaos Architecture

  • 4 Layers:
    • Users
    • Application
    • Switching
    • Infrastructure

Chaos Engineering

  • Chaos engineers are responsible for creating 'fire drills' for users

    • The goal is to find the weaknesses in a system
  • Chaos engineering tools:

    • Game days
    • Simian Army (open source project from Netflix)
    • Chaos Toolkit
    • ChAP (Chaos Automation Platform)
    • Gremlin
  • Red Team tools:

    • SafeStack AVA
    • Infection Monkey
    • ChaoSlingr
    • AttackIQ
    • SafeBreach
  • A new trend: Blending the industrial view of safety with the software view

    • Todd Conklin
    • John Allspaw
  • Synoptic illegibility: parts of the process are invisible and cannot be written down, so you can't write a synopsis.

  • Hypothesis testing
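
Hypothesis testing in a chaos experiment usually follows a simple pattern: state what "healthy" means, inject a fault, and check whether the system is still healthy. Below is a minimal sketch of that loop, assuming the steady-state check, the fault, and the rollback are supplied as plain functions. `run_experiment` and the toy `system` dict are hypothetical and do not come from any of the tools listed above.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],  # returns True while the system looks healthy
    inject: Callable[[], None],        # introduces the fault
    rollback: Callable[[], None],      # removes the fault again
) -> str:
    """Return 'aborted', 'held', or 'weakness found'."""
    if not steady_state():
        return "aborted"  # already unhealthy: do not add a fault on top
    inject()
    try:
        # The hypothesis: the steady state still holds while the fault is active.
        return "held" if steady_state() else "weakness found"
    finally:
        rollback()  # always undo the fault, even if the check raises


# Toy system: healthy as long as at least one replica is up.
system = {"healthy_replicas": 2}


def at_least_one_replica() -> bool:
    return system["healthy_replicas"] >= 1


def kill_replica() -> None:
    system["healthy_replicas"] -= 1


def restore_replica() -> None:
    system["healthy_replicas"] += 1


result = run_experiment(at_least_one_replica, kill_replica, restore_replica)
```

With two replicas the hypothesis holds. Starting the same experiment with a single replica reports a weakness, which is the kind of finding a fire drill is meant to surface before a real outage does.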

AWS

  • Amazon Aurora DB Cluster Fault Injection Queries
    • You can crash the master or a replica
  • IAM Region Restriction
    • Simulate regional API outages by changing the list of permitted regions
  • Kubernetes
    • Gremlin attacks
    • Chaos Toolkit (open source)
  • CNCF Chaos Working Group
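
The IAM region restriction idea above can be sketched as a policy document built in code. The notes do not record which mechanism the talk used, so this is an assumption: the sketch relies on the `aws:RequestedRegion` global condition key and a blanket deny statement. The function name and the `Sid` are hypothetical. Shrinking the permitted list makes API calls to the removed region fail for whoever the policy is attached to, which approximates a regional API outage.

```python
import json
from typing import Iterable


def region_restriction_policy(permitted_regions: Iterable[str]) -> dict:
    """Build an IAM policy document that denies requests outside the permitted regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsidePermittedRegions",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    # The deny applies when the requested region matches none of these.
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(permitted_regions)
                    }
                },
            }
        ],
    }


# Normal operation: both regions permitted.
normal = region_restriction_policy(["us-east-1", "us-west-2"])

# Simulated outage of us-east-1: drop it from the permitted list.
outage = region_restriction_policy(["us-west-2"])

print(json.dumps(outage, indent=2))
```

A real policy would likely need exemptions for global services whose endpoints live in a single region; this sketch leaves those out, so try it only on a test role.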

Bottom Line

  • Expensive recovery is being replaced by low-cost, automated chaos engineering

Things to research:

  • What is a security certificate?
  • DB and application torture