Reliability - dennisholee/notes GitHub Wiki

Reliability - Design for things going wrong
Security - Design for an adversary trying to make things go wrong

Redundancy - Increases reliability but also increases the attack surface

To address component failures, the system design should incorporate redundancy and distinct failure domains so that you can limit the impact of failures by rerouting requests.

Defense in depth is the application of multiple, sometimes redundant, defense mechanisms. Distinct failure domains limit the “blast radius” of a failure and therefore also increase reliability.
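The rerouting idea above can be sketched roughly as follows. This is a minimal illustration, not a production router: the `Replica` type, the `pick_replica` helper, and the zone names are all hypothetical, and health is a plain flag rather than a real health check.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    failure_domain: str  # hypothetical label, e.g. a zone or a rack
    healthy: bool = True


def pick_replica(replicas, preferred_domain):
    """Return a healthy replica, preferring the caller's own failure domain.

    If every replica in the preferred domain is down, the request is
    rerouted to a healthy replica in another domain.
    """
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy replica in any failure domain")
    for replica in healthy:
        if replica.failure_domain == preferred_domain:
            return replica
    # The preferred domain has failed: fall back to a different domain.
    return healthy[0]


replicas = [Replica("a1", "zone-a"), Replica("b1", "zone-b")]
first_choice = pick_replica(replicas, "zone-a")

# Simulate a failure confined to zone-a; requests are rerouted to zone-b.
replicas[0].healthy = False
fallback = pick_replica(replicas, "zone-a")
```

Because the two replicas sit in distinct failure domains, the failure of `zone-a` affects only the replica inside it, and the caller still gets an answer from `zone-b`.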

Development efficiency and velocity

How efficiently can developers iterate on new features?
How efficiently can developers understand and modify or debug existing code?

Deployment velocity

How long does it take from the time a feature is developed to the time this feature is available to users/customers?

Tactics, Techniques and Procedures (TTPs)

Strategies

Zero Trust Networking

Zero Trust Zero Touch

Zero Trust Workforce: Authenticate users and continuously monitor and govern their access and privileges
Zero Trust Workloads: Enforce controls across the entire application stack, especially connections between containers or hypervisors in the public cloud
Zero Trust Data: Secure and manage data, categorize it under a data classification schema, and encrypt data at rest and in transit
https://ironsphere.com/2020/06/zero-trust-plus-zero-touch-equals-exponential-benefits/
Automation
APIs
Multi-party approval

Safe Proxies

Provides a way to address new reliability and security requirements without requiring substantial changes to deployed systems
A framework that allows authorized persons to access or modify the state of physical servers, virtual machines, or particular applications
A single point of entry between users or automation tools and the target system that enables the following:
- Fine-grained access controls
- Policy enforcement
- Auditing of each action
- Mitigation of human mistakes, e.g. by employing multi-party authorization
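The properties listed above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `SafeProxy` class, the role and command names, and the string returned in place of actually forwarding the command to a target system.

```python
from dataclasses import dataclass, field


@dataclass
class SafeProxy:
    allowed: dict          # hypothetical policy: role -> set of permitted commands
    needs_approval: set    # commands that require a second person to approve
    audit_log: list = field(default_factory=list)

    def execute(self, user, role, command, approver=None):
        decision = self._decide(user, role, command, approver)
        # Every attempt is audited, whether it is allowed or denied.
        self.audit_log.append((user, command, approver, decision))
        if decision != "allowed":
            raise PermissionError(decision)
        # Stand-in for forwarding the command to the target system.
        return f"ran {command!r} on target"

    def _decide(self, user, role, command, approver):
        # Fine-grained access control: the role must permit this command.
        if command not in self.allowed.get(role, set()):
            return "denied: role not permitted"
        # Multi-party authorization: risky commands need a different approver.
        if command in self.needs_approval and approver in (None, user):
            return "denied: needs approval from a second person"
        return "allowed"


proxy = SafeProxy(
    allowed={"sre": {"status", "restart"}},
    needs_approval={"restart"},
)
status_result = proxy.execute("alice", "sre", "status")
restart_result = proxy.execute("alice", "sre", "restart", approver="bob")
```

Since callers reach the target only through `execute`, the policy can be tightened later by changing the proxy alone, without modifying the systems behind it.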

Compartmentalizing Permissions

Restricting the Scope of Credentials

Multi-Party Authorization

Load Balancing

Principle of Least Privilege

Independent Encryption Layers for Sensitive Data

Load Shedding

The maximum connections concept is too imprecise a measure of server load
Under heavy load, thread contention, context switching, garbage collection, and I/O contention become more pronounced
The goal of load shedding is to keep latency low for the requests that the server decides to accept, so that the service replies before the client times out

https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
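A minimal sketch of the load shedding goal described above, under stated assumptions: the `LoadShedder` class and its parameters are hypothetical, the in-flight limit is only a stand-in for whatever overload signal a real service measures, and the client timeout is assumed to be known to the server.

```python
import time


class LoadShedder:
    def __init__(self, max_in_flight, client_timeout_s):
        self.max_in_flight = max_in_flight        # assumed overload threshold
        self.client_timeout_s = client_timeout_s  # assumed known client timeout
        self.in_flight = 0

    def try_accept(self, enqueued_at, now=None):
        """Return True if the request should be processed, False if shed."""
        now = time.monotonic() if now is None else now
        # The client has already timed out, so any work would be wasted.
        if now - enqueued_at >= self.client_timeout_s:
            return False
        # The server is at capacity; reject early to keep latency low
        # for the requests that were already accepted.
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def done(self):
        """Call when an accepted request finishes."""
        self.in_flight -= 1


shedder = LoadShedder(max_in_flight=1, client_timeout_s=1.0)
accepted = shedder.try_accept(enqueued_at=0.0, now=0.1)   # capacity available
shed = shedder.try_accept(enqueued_at=0.0, now=0.2)       # at capacity, shed
```

Rejecting the second request immediately is the point of the technique: a fast rejection costs the server little, while accepting it would slow down the request that is already being served.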

Incident Management

US Government's Incident Command System (ICS)

Disaster Recovery Testing Program (DiRT)

Simulates various internal system failures