Reliability - dennisholee/notes GitHub Wiki
Reliability - Design for something that goes wrong
Security - Design for an adversary trying to make things go wrong
Redundancy - Increases reliability but also increases the attack surface
To address component failures, the system design should incorporate redundancy and distinct failure domains so that you can limit the impact of failures by rerouting requests.
Defense in depth is the application of multiple, sometimes redundant, defense mechanisms. Distinct failure domains limit the “blast radius” of a failure and therefore also increase reliability.
Development efficiency and velocity
- How efficiently can developers iterate on new features?
- How efficiently can developers understand and modify or debug existing code?
Deployment velocity
- How long does it take from the time a feature is developed to the time this feature is available to users/customers?
Tactics, Techniques and Procedures (TTPs)
Strategies
Zero Trust Networking
Zero Trust Zero Touch
- Zero Trust Workforce: Authenticate users and continuously monitor and govern their access and privileges
- Zero Trust Workloads: Enforce controls across the entire application stack, especially connections between containers or hypervisors in the public cloud
- Zero Trust Data: Secure and manage data, categorize it, develop a data classification schema, and encrypt data at rest and in transit. Source: https://ironsphere.com/2020/06/zero-trust-plus-zero-touch-equals-exponential-benefits/
- Automation
- APIs
- Multi-party approval
Safe Proxies
- Provide a way to address new reliability and security requirements without requiring substantial changes to deployed systems
- Framework that allows authorized persons to access or modify the state of physical servers, virtual machines, or particular applications.
- Is a single point of entry between users or automation tools and the target system that enables the following:
  - Fine-grained access controls
  - Policy enforcement
  - Auditing of each action
  - Mitigation of human mistakes, for example by employing multi-party authorization
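As a rough illustration of the list above, the following Python sketch shows a proxy that sits between a caller and a target system and applies an access check, a multi-party approval rule, and an audit log. Everything here is an assumption made for the example: the `SafeProxy` class, the role-to-command policy format, and the `fake_server` target are hypothetical, not an existing framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class SafeProxy:
    # Hypothetical policy: role -> commands that role may run.
    allowed: Dict[str, Set[str]]
    # Hypothetical rule: commands that need a second person to approve.
    needs_approval: Set[str]
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def execute(
        self,
        user: str,
        role: str,
        command: str,
        target: Callable[[str], str],
        approver: Optional[str] = None,
    ) -> str:
        # Fine-grained access control: check the command against the role.
        if command not in self.allowed.get(role, set()):
            self.audit_log.append((user, command, "denied: role"))
            raise PermissionError(f"{role} may not run {command}")
        # Multi-party authorization: the approver must be someone else.
        if command in self.needs_approval and approver in (None, user):
            self.audit_log.append((user, command, "denied: approval"))
            raise PermissionError(f"{command} needs a second approver")
        # Audit the decision, then forward the command to the target.
        self.audit_log.append((user, command, "allowed"))
        return target(command)


def fake_server(command: str) -> str:
    """Stand-in for the real target system."""
    return f"ran {command}"


proxy = SafeProxy(
    allowed={"operator": {"status", "restart"}},
    needs_approval={"restart"},
)
status = proxy.execute("alice", "operator", "status", fake_server)
try:
    proxy.execute("alice", "operator", "restart", fake_server)
    unapproved_blocked = False
except PermissionError:
    unapproved_blocked = True
restart = proxy.execute("alice", "operator", "restart", fake_server, approver="bob")
```

Since every call goes through `execute`, new checks can be added in one place without changing the target system, which is the property the first bullet describes.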
Compartmentalizing Permissions
Restricting the Scope of Credentials
Multi-Party Authorization
Load Balancing
Principle of Least Privilege
Independent Encryption Layers for Sensitive Data
Load Shedding
- A maximum-connections limit is too imprecise a measure of how much load a server can handle
- Under heavy load, thread contention, context switching, garbage collection, and I/O contention become more pronounced and drive up latency
- The goal of load shedding is to keep latency low for the requests that the server decides to accept so that the service replies before the client times out
https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
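One way to picture the goal stated above is a shedder that admits a request only if it is likely to finish before the client's timeout. The sketch below is a simplified assumption, not the approach from the linked article: `LatencyShedder`, its fixed `avg_service_seconds`, and the batch-based latency estimate are hypothetical, and the class is not thread-safe.

```python
from dataclasses import dataclass


@dataclass
class LatencyShedder:
    workers: int                # hypothetical: number of parallel handlers
    avg_service_seconds: float  # hypothetical: measured mean service time
    in_flight: int = 0

    def estimated_latency(self) -> float:
        # Admitted requests are served `workers` at a time, so a new
        # request waits for the full batches ahead of it, then runs.
        batches_ahead = self.in_flight // self.workers
        return (batches_ahead + 1) * self.avg_service_seconds

    def try_admit(self, client_timeout_seconds: float) -> bool:
        """Admit only if the reply is expected before the client gives up."""
        if self.estimated_latency() >= client_timeout_seconds:
            return False  # shed: the reply would arrive too late to matter
        self.in_flight += 1
        return True

    def finish(self) -> None:
        """Call when an admitted request completes."""
        self.in_flight -= 1


shedder = LatencyShedder(workers=2, avg_service_seconds=0.1)
decisions = [shedder.try_admit(client_timeout_seconds=0.25) for _ in range(6)]
# The first four requests fit within the timeout; the last two are shed.

shedder.finish()
shedder.finish()
admitted_after_drain = shedder.try_admit(client_timeout_seconds=0.25)
```

Unlike a fixed connection cap, the decision here depends on the expected reply time relative to the client's timeout, so the requests that are accepted keep low latency.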
Incident Management
US Government's Incident Command System (ICS)
Disaster Recovery Testing Program (DiRT)
- Simulates various internal system failures