Reliability - dennisholee/notes GitHub Wiki

Reliability - Design for things going wrong

Security - Design for an adversary trying to make things go wrong

Redundancy - Increases reliability but also increases the attack surface

To address component failures, the system design should incorporate redundancy and distinct failure domains so that you can limit the impact of failures by rerouting requests.

Defense in depth is the application of multiple, sometimes redundant, defense mechanisms. Distinct failure domains limit the “blast radius” of a failure and therefore also increase reliability.
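As a rough illustration of rerouting across distinct failure domains, the sketch below tries at most one replica per domain and moves on when a domain fails. The `Replica` type, the `failure_domain` labels, and the `route_request` helper are all hypothetical names introduced for this example, and the assumption that replicas in one domain fail together is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set


@dataclass
class Replica:
    name: str
    failure_domain: str            # hypothetical label, e.g. a zone or rack
    handler: Callable[[str], str]  # stands in for a call to the real backend


def route_request(request: str, replicas: Sequence[Replica]) -> str:
    """Try one replica per failure domain until one of them answers."""
    failed_domains: Set[str] = set()
    errors: List[str] = []
    for replica in replicas:
        # Assumption: replicas in a domain that already failed share the
        # failure, so skip them and reroute to a different domain.
        if replica.failure_domain in failed_domains:
            continue
        try:
            return replica.handler(request)
        except ConnectionError as exc:
            failed_domains.add(replica.failure_domain)
            errors.append(f"{replica.name}: {exc}")
    raise RuntimeError("all failure domains unavailable: " + "; ".join(errors))


def healthy(request: str) -> str:
    return f"ok:{request}"


def broken(request: str) -> str:
    raise ConnectionError("zone outage")


replicas = [
    Replica("a1", "zone-a", broken),
    Replica("a2", "zone-a", broken),
    Replica("b1", "zone-b", healthy),
]
result = route_request("GET /status", replicas)
```

Here the outage in `zone-a` affects only the replicas in that domain; the request is served from `zone-b`, which is the "blast radius" limit described above.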

Development efficiency and velocity

  • How efficiently can developers iterate on new features?
  • How efficiently can developers understand and modify or debug existing code?

Deployment velocity

  • How long does it take from the time a feature is developed to the time this feature is available to users/customers?

Tactics, Techniques and Procedures (TTPs)

Strategies

Zero Trust Networking

Zero Trust Zero Touch

  • Zero Trust Workforce: Authenticate users and continuously monitor and govern their access and privileges

  • Zero Trust Workloads: Enforce controls across the entire application stack, especially connections between containers or hypervisors in the public cloud

  • Zero Trust Data: Secure and manage data, categorize it, develop a data classification schema, and encrypt data at rest and in transit

https://ironsphere.com/2020/06/zero-trust-plus-zero-touch-equals-exponential-benefits/

  • Automation

  • APIs

  • Multi-party approval

Safe Proxies

  • Provide a way to address new reliability and security requirements without requiring substantial changes to deployed systems
  • Framework that allows authorized persons to access or modify the state of physical servers, virtual machines, or particular applications.
  • Is a single point of entry between users or automation tools and the target system that enables the following:
    • Fine-grained access controls
    • Policy enforcement
    • Auditing of each action
    • Mitigation of human mistakes, for example through multi-party authorization
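The properties listed above can be sketched as a small in-process wrapper. This is only an illustration under assumptions: the `SafeProxy` class, its fields, and the policy format (a set of `(user, action)` pairs) are hypothetical, and a real safe proxy would sit in front of remote systems and authenticate callers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class SafeProxy:
    allowed: Set[Tuple[str, str]]          # (user, action) pairs permitted by policy
    approvers: Set[str]                    # people who may approve sensitive actions
    needs_approval: Set[str]               # actions requiring a second party
    targets: Dict[str, Callable[[], str]]  # stand-ins for target-system operations
    audit_log: List[str] = field(default_factory=list)

    def execute(self, user: str, action: str, approver: Optional[str] = None) -> str:
        # Fine-grained access control: each (user, action) pair is checked.
        if (user, action) not in self.allowed:
            self._deny(user, action, "caller not authorized")
        # Multi-party authorization: a different, recognized person must approve.
        if action in self.needs_approval:
            if approver is None or approver == user or approver not in self.approvers:
                self._deny(user, action, "valid second-party approval missing")
        result = self.targets[action]()
        # Every allowed action is audited, as is every denial in _deny().
        self.audit_log.append(f"ALLOW user={user} action={action} approver={approver}")
        return result

    def _deny(self, user: str, action: str, reason: str) -> None:
        self.audit_log.append(f"DENY user={user} action={action} reason={reason}")
        raise PermissionError(reason)


proxy = SafeProxy(
    allowed={("alice", "restart"), ("alice", "status")},
    approvers={"bob"},
    needs_approval={"restart"},
    targets={"restart": lambda: "restarted", "status": lambda: "running"},
)
status = proxy.execute("alice", "status")
restart = proxy.execute("alice", "restart", approver="bob")
```

Because every call goes through `execute`, new policy checks can be added in one place without changing the target operations themselves.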

Compartmentalizing Permissions

Restricting the Scope of Credentials

Multi-Party Authorization

Load Balancing

Principle of Least Privilege

Independent Encryption Layers for Sensitive Data

Load Shedding

  • A maximum-connections limit is too imprecise a measure of overload
  • Under heavy load, thread contention, context switching, garbage collection, and I/O contention become more pronounced and drive up latency
  • The goal of load shedding is to keep latency low for the requests that the server decides to accept so that the service replies before the client times out

https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
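One way to picture the goal stated above is a shedder that rejects a request when the time it has already waited plus the expected service time would exceed the client's timeout. This is a minimal sketch, not the approach from the linked article: the `LoadShedder` class, its parameters, and the exponential moving average used to estimate service time are all assumptions made for illustration.

```python
class LoadShedder:
    """Accept a request only if it can plausibly finish before the client times out."""

    def __init__(self, client_timeout_s: float,
                 initial_service_time_s: float = 0.0,
                 smoothing: float = 0.5) -> None:
        self.client_timeout_s = client_timeout_s              # assumed client-side timeout
        self.expected_service_time_s = initial_service_time_s
        self.smoothing = smoothing                            # weight given to new samples

    def should_accept(self, waited_s: float) -> bool:
        # Shed the request if waiting time plus expected work exceeds the timeout:
        # the client would give up before the reply arrives, so the work is wasted.
        return waited_s + self.expected_service_time_s < self.client_timeout_s

    def record(self, service_time_s: float) -> None:
        # Exponential moving average of observed service times.
        a = self.smoothing
        self.expected_service_time_s = (
            (1 - a) * self.expected_service_time_s + a * service_time_s
        )


shedder = LoadShedder(client_timeout_s=1.0, initial_service_time_s=0.2)
fresh_ok = shedder.should_accept(waited_s=0.1)  # 0.1 + 0.2 is under the timeout
stale_ok = shedder.should_accept(waited_s=0.9)  # 0.9 + 0.2 is over the timeout

shedder.record(2.0)  # the server slows down under load
overloaded_ok = shedder.should_accept(waited_s=0.0)
```

As observed service times rise under load, the shedder starts rejecting even fresh requests, which keeps latency low for the requests that are still accepted.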

Incident Management

US Government's Incident Command System (ICS)

Disaster Recovery Testing Program (DiRT)

  • Simulates various internal system failures