AWS - Concepts
Basics
Operational Excellence
Design Principles
The following are design principles for operational excellence in the cloud:
- Perform operations as code: In the cloud, you can apply the same engineering discipline that you use for application code to your entire environment. You can define your entire workload (applications, infrastructure, etc.) as code and update it with code. You can script your operations procedures and automate their process by launching them in response to events. By performing operations as code, you limit human error and create consistent responses to events (a sketch follows this list).
- Make frequent, small, reversible changes: Design workloads that are scalable and loosely coupled to permit components to be updated regularly. Automated deployment techniques together with smaller, incremental changes reduce the blast radius and allow for faster reversal when failures occur. This increases confidence to deliver beneficial changes to your workload while maintaining quality and adapting quickly to changes in market conditions.
- Refine operations procedures frequently: As you evolve your workloads, evolve your operations appropriately. As you use operations procedures, look for opportunities to improve them. Hold regular reviews and validate that all procedures are effective and that teams are familiar with them. Where gaps are identified, update procedures accordingly. Communicate procedural updates to all stakeholders and teams. Gamify your operations to share best practices and educate teams.
- Anticipate failure: Perform “pre-mortem” exercises to identify potential sources of failure so that they can be removed or mitigated. Test your failure scenarios and validate your understanding of their impact. Test your response procedures to ensure they are effective and that teams are familiar with their process. Set up regular game days to test workload and team responses to simulated events.
- Learn from all operational failures: Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.
- Use managed services: Reduce operational burden by using AWS managed services where possible. Build operational procedures around interactions with those services.
- Implement observability for actionable insights: Gain a comprehensive understanding of workload behavior, performance, reliability, cost, and health. Establish key performance indicators (KPIs) and leverage observability telemetry to make informed decisions and take prompt action when business outcomes are at risk. Proactively improve performance, reliability, and cost based on actionable observability data.
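As an illustration of the "Perform operations as code" principle, here is a minimal boto3 sketch that turns a routine operational task (nightly AMI backups) into code. The `Backup` tag key and the naming scheme are hypothetical.

```python
import datetime
import boto3

ec2 = boto3.client("ec2")

def backup_tagged_instances():
    """Create an AMI of every instance tagged Backup=true (a routine ops task expressed as code)."""
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d")
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:Backup", "Values": ["true"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            ec2.create_image(
                InstanceId=instance["InstanceId"],
                Name=f"backup-{instance['InstanceId']}-{stamp}",
                NoReboot=True,  # snapshot without stopping the instance
            )

if __name__ == "__main__":
    backup_tagged_instances()
```

Run on a schedule (for example from an EventBridge rule), a script like this replaces a manual runbook step with a repeatable, reviewable piece of code.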
Best Practices
- Organization
- Prepare
- Operate
- Evolve
Security
Design Principles
- Implement a strong identity foundation: Implement the principle of least privilege and enforce separation of duties with appropriate authorization for each interaction with your AWS resources. Centralize identity management, and aim to eliminate reliance on long-term static credentials.
- Maintain traceability: Monitor, alert, and audit actions and changes to your environment in real time. Integrate log and metric collection with systems to automatically investigate and take action.
- Apply security at all layers: Apply a defense in depth approach with multiple security controls. Apply to all layers (for example, edge of network, VPC, load balancing, every instance and compute service, operating system, application, and code).
- Automate security best practices: Automated software-based security mechanisms improve your ability to securely scale more rapidly and cost-effectively. Create secure architectures, including the implementation of controls that are defined and managed as code in version-controlled templates (a sketch follows this list).
- Protect data in transit and at rest: Classify your data into sensitivity levels and use mechanisms, such as encryption, tokenization, and access control where appropriate.
- Keep people away from data: Use mechanisms and tools to reduce or eliminate the need for direct access or manual processing of data. This reduces the risk of mishandling or modification and human error when handling sensitive data.
- Prepare for security events: Prepare for an incident by having incident management and investigation policy and processes that align to your organizational requirements. Run incident response simulations and use tools with automation to increase your speed for detection, investigation, and recovery.
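A minimal sketch of "security as code": a script that enforces one baseline control (S3 Block Public Access) across every bucket in the account. The control is a standard S3 API; which buckets it touches depends on the account.

```python
import boto3

s3 = boto3.client("s3")

# Enforce a baseline control: block public access on every bucket in the account.
for bucket in s3.list_buckets()["Buckets"]:
    s3.put_public_access_block(
        Bucket=bucket["Name"],
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
```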
Best Practices
Reliability
Design Principles
There are five design principles for reliability in the cloud:
- Automatically recover from failure: By monitoring a workload for key performance indicators (KPIs), you can start automation when a threshold is breached. These KPIs should be a measure of business value, not of the technical aspects of the operation of the service. This provides for automatic notification and tracking of failures, and for automated recovery processes that work around or repair the failure. With more sophisticated automation, it’s possible to anticipate and remediate failures before they occur (a monitoring/alarm sketch follows this list).
- Test recovery procedures: In an on-premises environment, testing is often conducted to prove that the workload works in a particular scenario. Testing is not typically used to validate recovery strategies. In the cloud, you can test how your workload fails, and you can validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that led to failures before. This approach exposes failure pathways that you can test and fix before a real failure scenario occurs, thus reducing risk.
- Scale horizontally to increase aggregate workload availability: Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall workload. Distribute requests across multiple, smaller resources to verify that they don’t share a common point of failure.
- Stop guessing capacity: A common cause of failure in on-premises workloads is resource saturation, when the demands placed on a workload exceed the capacity of that workload (this is often the objective of denial of service attacks). In the cloud, you can monitor demand and workload utilization, and automate the addition or removal of resources to maintain the more efficient level to satisfy demand without over- or under-provisioning. There are still limits, but some quotas can be controlled and others can be managed (see Manage Service Quotas and Constraints).
- Manage change in automation: Changes to your infrastructure should be made using automation. The changes that must be managed include changes to the automation, which then can be tracked and reviewed.
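A sketch of the "automatically recover from failure" idea: alarm on a business-level KPI and notify an SNS topic that triggers recovery automation. The namespace, metric name, threshold, and topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on a KPI (here: checkout latency published as a custom metric)
# and notify an SNS topic that drives the recovery automation.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-latency-high",
    Namespace="MyShop/KPIs",                 # hypothetical custom namespace
    MetricName="CheckoutLatencyMs",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-recovery"],  # placeholder topic ARN
)
```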
Best Practices
Cost optimization
Design Principles
- Implement Cloud Financial Management: To achieve financial success and accelerate business value realization in the cloud, invest in Cloud Financial Management and Cost Optimization. Your organization should dedicate time and resources to build capability in this new domain of technology and usage management. Similar to your Security or Operational Excellence capability, you need to build capability through knowledge building, programs, resources, and processes to become a cost-efficient organization.
- Adopt a consumption model: Pay only for the computing resources that you require and increase or decrease usage depending on business requirements, not by using elaborate forecasting. For example, development and test environments are typically only used for eight hours a day during the work week. You can stop these resources when they are not in use for a potential cost savings of 75% (40 hours versus 168 hours); a scheduling sketch follows this list.
- Measure overall efficiency: Measure the business output of the workload and the costs associated with delivering it. Use this measure to know the gains you make from increasing output and reducing costs.
- Stop spending money on undifferentiated heavy lifting: AWS does the heavy lifting of data center operations like racking, stacking, and powering servers. It also removes the operational burden of managing operating systems and applications with managed services. This permits you to focus on your customers and business projects rather than on IT infrastructure.
- Analyze and attribute expenditure: The cloud makes it simple to accurately identify the usage and cost of systems, which then permits transparent attribution of IT costs to individual workload owners. This helps measure return on investment (ROI) and gives workload owners an opportunity to optimize their resources and reduce costs.
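A sketch of the consumption model for dev/test environments: stop instances outside working hours. The `Environment` tag values are assumptions; the 75% figure comes from running 40 of the 168 hours in a week.

```python
import boto3

ec2 = boto3.client("ec2")

# Dev/test machines are needed ~40 of the 168 hours in a week, so stopping
# them outside working hours saves roughly 75% of their compute cost.
def stop_dev_instances():
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev", "test"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
```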
Best Practices
Performance efficiency
Design Principles
- Democratize advanced technologies: Make advanced technology implementation smoother for your team by delegating complex tasks to your cloud vendor. Rather than asking your IT team to learn about hosting and running a new technology, consider consuming the technology as a service. For example, NoSQL databases, media transcoding, and machine learning are all technologies that require specialized expertise. In the cloud, these technologies become services that your team can consume, permitting your team to focus on product development rather than resource provisioning and management.
- Go global in minutes: Deploying your workload in multiple AWS Regions around the world permits you to provide lower latency and a better experience for your customers at minimal cost (a brief multi-Region sketch follows this list).
- Use serverless architectures: Serverless architectures remove the need for you to run and maintain physical servers for traditional compute activities. For example, serverless storage services can act as static websites (removing the need for web servers) and event services can host code. This removes the operational burden of managing physical servers, and can lower transactional costs because managed services operate at cloud scale.
- Experiment more often: With virtual and automatable resources, you can quickly carry out comparative testing using different types of instances, storage, or configurations.
- Consider mechanical sympathy: Understand how cloud services are consumed and always use the technology approach that aligns with your workload goals. For example, consider data access patterns when you select database or storage approaches.
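"Go global in minutes" largely comes down to the fact that every Region is reachable through the same APIs; only the Region-scoped client changes. The Region list below is arbitrary.

```python
import boto3

# The same provisioning routine can be pointed at any Region simply by
# creating a Region-scoped client.
def deploy_everywhere(regions=("us-east-1", "eu-west-1", "ap-southeast-1")):
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        zones = ec2.describe_availability_zones()["AvailabilityZones"]
        print(region, "->", [z["ZoneName"] for z in zones])
        # ...launch the Region-specific copy of the stack here...
```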
Best Practices
Sustainability
Design Principles
- Understand your impact: Measure the impact of your cloud workload and model the future impact of your workload. Include all sources of impact, including impacts resulting from customer use of your products, and impacts resulting from their eventual decommissioning and retirement. Compare the productive output with the total impact of your cloud workloads by reviewing the resources and emissions required per unit of work. Use this data to establish key performance indicators (KPIs), evaluate ways to improve productivity while reducing impact, and estimate the impact of proposed changes over time.
- Establish sustainability goals: For each cloud workload, establish long-term sustainability goals such as reducing the compute and storage resources required per transaction. Model the return on investment of sustainability improvements for existing workloads, and give owners the resources they must invest in sustainability goals. Plan for growth, and architect your workloads so that growth results in reduced impact intensity measured against an appropriate unit, such as per user or per transaction. Goals help you support the wider sustainability goals of your business or organization, identify regressions, and prioritize areas of potential improvement.
- Maximize utilization: Right-size workloads and implement efficient design to verify high utilization and maximize the energy efficiency of the underlying hardware. Two hosts running at 30% utilization are less efficient than one host running at 60% due to baseline power consumption per host. At the same time, reduce or minimize idle resources, processing, and storage to reduce the total energy required to power your workload.
- Anticipate and adopt new, more efficient hardware and software offerings: Support the upstream improvements your partners and suppliers make to help you reduce the impact of your cloud workloads. Continually monitor and evaluate new, more efficient hardware and software offerings. Design for flexibility to permit the rapid adoption of new efficient technologies.
- Use managed services: Sharing services across a broad customer base helps maximize resource utilization, which reduces the amount of infrastructure needed to support cloud workloads. For example, customers can share the impact of common data center components like power and networking by migrating workloads to the AWS Cloud and adopting managed services, such as AWS Fargate for serverless containers, where AWS operates at scale and is responsible for their efficient operation. Use managed services that can help minimize your impact, such as automatically moving infrequently accessed data to cold storage with Amazon S3 Lifecycle configurations (a sketch follows this list) or Amazon EC2 Auto Scaling to adjust capacity to meet demand.
- Reduce the downstream impact of your cloud workloads: Reduce the amount of energy or resources required to use your services. Reduce the need for customers to upgrade their devices to use your services. Test using device farms to understand expected impact and test with customers to understand the actual impact from using your services.
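A sketch of the S3 Lifecycle idea mentioned above: transition objects to colder storage classes as they age. The bucket name, prefix, and day counts are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Move infrequently accessed objects to colder (and less cost/energy intensive) storage.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-objects",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```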
Best Practices
- AWS Well-Architected
- Practicing Continuous Integration and Continuous Delivery on AWS
- Implementing Microservices on AWS
- Serverless Applications Lens - AWS Well-Architected Framework
- Running Containerized Microservices on AWS
- Best practices arranged by migration phase
- Best practices arranged by pillars
- Cloud provides greater flexibility in managing resources and cost
- Minimal upfront investment, as the customer does not have to purchase any physical infrastructure
- Provides just-in-time infrastructure
- No long-term contracts or commitments
- Rich automation: infrastructure becomes scriptable using APIs and shell
- Automatic scaling based on load: scale out (adding more resources of the same size), scale in (removing resources), scale up (increasing the size of a resource), scale down (decreasing the size of a resource)
- Increased agility in the software development lifecycle
- Benefits of HA (high availability) and disaster recovery
- Cloud provides a scalable architecture: infrastructure that can expand and contract depending on the load
- Cloud infrastructure can scale easily, either horizontally or vertically
- Provides virtually unlimited scalability
- Horizontal scaling: scale out (increasing the number of web servers or nodes), scale in (decreasing the number of web servers or nodes)
- Vertical scaling: scale up (increasing the processing capacity/memory/resources of a server), scale down (decreasing the processing capacity/memory/resources of a server); both approaches are sketched in code below this list
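A boto3 sketch of the two approaches, assuming a hypothetical Auto Scaling group `web-asg` and a placeholder instance ID: horizontal scaling changes how many instances run, vertical scaling changes how big a single instance is.

```python
import boto3

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

# Horizontal scaling (scale out/in): change the NUMBER of instances.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="web-asg",   # hypothetical group
    DesiredCapacity=6,
)

# Vertical scaling (scale up/down): change the SIZE of one instance.
# (The instance must be stopped before its type can be changed.)
instance_id = "i-0123456789abcdef0"   # placeholder
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "m5.xlarge"})
ec2.start_instances(InstanceIds=[instance_id])
```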
- Cloud offers many building blocks from which to construct a system
- The cloud may not have exactly the same services, components, or software as non-cloud infrastructure; the application architecture has to embrace cloud-native solutions in order to maximize the cloud's benefits
- Think about failure while designing the product; over time the product becomes resilient to failure
- Avoid single points of failure (e.g., hosting the web app and the DB on the same instance)
- To mitigate a single point of failure, use a load-balanced environment
Even in this scenario, having multiple web servers connect to a single database server still leaves a single point of failure.
To mitigate this, use Amazon RDS database instances along with an Elastic Load Balancer, where redundancy and managed scaling are built in, avoiding the single point of failure (a minimal sketch follows).
Leverage redundancy in software, web servers, DB nodes, and network resources to avoid single points of failure.
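A sketch of removing the database single point of failure with a Multi-AZ RDS instance; the identifier, instance class, and credentials are placeholders, and real deployments should pull credentials from a secrets store.

```python
import boto3

rds = boto3.client("rds")

# A Multi-AZ RDS instance keeps a synchronous standby in another
# Availability Zone, removing the database as a single point of failure.
rds.create_db_instance(
    DBInstanceIdentifier="shop-db",          # hypothetical identifier
    Engine="mysql",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me-please",   # use Secrets Manager in practice
    MultiAZ=True,                            # automatic failover to the standby
)
```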
The ability of the cloud to scale resources to match the demand.
Two ways of scaling (both sketched in code below this list):
- Scaling at a fixed time interval (scheduled scaling)
- Scaling on demand, based on metrics: when a metric reaches a certain threshold, resources are added to meet the demand
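Both modes map to Auto Scaling API calls; a sketch assuming a group named `web-asg`:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scheduled scaling: fixed time interval (cron expression, UTC).
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",
    ScheduledActionName="business-hours-scale-out",
    Recurrence="0 8 * * MON-FRI",
    DesiredCapacity=6,
)

# Demand-based scaling: target-tracking on average CPU utilization.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 60.0,
    },
)
```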
Decoupling or loose coupling refers to a design principle concerned with minimizing dependencies between components in order to improve the scalability of applications.
- Loose coupling enables the components of an application to scale independently (see the SQS example further below)
It is about decreasing latency and increasing throughput. It is also about how important it is to utilize cloud resources efficiently.
- Get to know all the services, and select the appropriate services for the use case to maximize efficiency and performance.
Security responsibility is shared between the customer and Amazon. Amazon is responsible for security OF the cloud (the physical and network infrastructure); the customer is responsible for security IN the cloud, such as account and user management.
For IaaS (e.g., EC2): the customer also manages the guest OS, patching, applications, and data.
For PaaS (e.g., RDS, Elastic Beanstalk): AWS manages the OS and platform; the customer manages the data and access configuration.
For SaaS-style (abstracted) services (e.g., S3, DynamoDB): the customer is mainly responsible for the data, access policies, and client-side configuration.
EC2 Classic (old)
Latest EC2
Block-level storage (EBS) provides file systems for data storage.
- Snapshots are stored in S3 incrementally, and snapshots can be used to restore the data in new Regions/Availability Zones (see the sketch below)
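Assuming this section refers to EBS snapshots, here is a sketch of taking a snapshot and copying it to another Region so the volume can be restored there; the volume ID and Regions are placeholders.

```python
import boto3

ec2_useast1 = boto3.client("ec2", region_name="us-east-1")
ec2_euwest1 = boto3.client("ec2", region_name="eu-west-1")

# Snapshots are incremental and stored in S3 behind the scenes; copying one to
# another Region lets you restore the volume there.
snapshot = ec2_useast1.create_snapshot(
    VolumeId="vol-0123456789abcdef0",        # placeholder volume
    Description="nightly backup",
)
ec2_useast1.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

ec2_euwest1.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId=snapshot["SnapshotId"],
    Description="cross-region copy for DR",
)
```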
Issues in manual scaling:
automatic scaling:
Auto Scaling depends on three main components:
1. Launch configuration or launch template (what to launch): specifies the AMI, the EC2 instance configuration, security groups, storage, etc.
2. Auto Scaling group (where and how many to launch): defines the subnets/Availability Zones and the minimum, maximum, and desired number of instances.
3. Scaling policy (when to launch): defines the monitoring thresholds that trigger launching (or terminating) instances.
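A sketch of the first two components with boto3 (the names, AMI ID, security group, and subnets are placeholders); a scaling policy like the target-tracking one sketched earlier can then be attached to the group.

```python
import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

# 1. Launch template -- what to launch
ec2.create_launch_template(
    LaunchTemplateName="web-template",           # hypothetical name
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",       # placeholder AMI
        "InstanceType": "t3.micro",
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    },
)

# 2. Auto Scaling group -- where and how many to launch
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # placeholder subnets
)
# 3. A scaling policy (when to launch) is then attached to the group.
```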
It allows the user to monitor resource utilization, performance, network traffic, and load, and to set alarm notifications.
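Assuming this refers to Amazon CloudWatch, here is a sketch of publishing a custom metric that can then be graphed or alarmed on; the namespace and metric name are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a custom application metric that CloudWatch can graph and alarm on.
cloudwatch.put_metric_data(
    Namespace="MyShop/KPIs",                 # hypothetical namespace
    MetricData=[
        {"MetricName": "OrdersPlaced", "Value": 42, "Unit": "Count"},
    ],
)
```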
It automatically provisions resources: it takes care of capacity planning, load balancing, auto scaling, and application health monitoring.
It is a provisioning engine to automate infrastructure needs; the difference from Beanstalk is that the user can perform more granular configuration in OpsWorks.
A scripted way of automating deployments, using a template file (in JSON or YAML format) that specifies the components/resources needed.
- Use cases include replicating a dev environment to QA or staging, etc.
It is used to help configure and launch the required resources from an existing stack template (see the sketch below).
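A sketch of launching a stack from a template with boto3; the template is kept inline as a dict for brevity, whereas real templates normally live in version control as JSON or YAML files.

```python
import json
import boto3

cloudformation = boto3.client("cloudformation")

# A minimal template: a single S3 bucket resource.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "StaticAssetsBucket": {"Type": "AWS::S3::Bucket"},
    },
}

cloudformation.create_stack(
    StackName="staging-copy-of-dev",          # e.g. replicating dev into staging
    TemplateBody=json.dumps(template),
)
```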
It is a deployment service; it coordinates deployments to EC2 instances.
Typical setup, but not scalable.
A cache within each app server instance is also not an ideal solution.
ElastiCache supports two types of caching engines:
- Memcached
- Redis
- Write-through pattern
Pros
Increases the cache hit rate, as all the data is kept in the cache; data is updated in the cache regardless of demand.
Cons
Requires more storage, as all the data is kept in memory.
- Lazy loading (cache-aside)
Pros
Only the data that is actually needed is kept in memory, so the memory requirement is lower.
Cons
Higher cache miss rate, which lowers performance (each miss requires a trip to the database). A lazy-loading sketch follows.
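A sketch of the lazy-loading (cache-aside) pattern against an ElastiCache for Redis endpoint, using the redis-py client; the endpoint, key scheme, TTL, and `load_from_database` helper are hypothetical.

```python
import json
import redis  # ElastiCache for Redis is protocol-compatible with redis-py

cache = redis.Redis(host="my-cache.abc123.use1.cache.amazonaws.com", port=6379)  # placeholder endpoint

def load_from_database(product_id: str) -> dict:
    # Placeholder for the real database read (e.g. RDS or DynamoDB).
    return {"ProductId": product_id, "name": "example"}

def get_product(product_id: str) -> dict:
    """Lazy loading: read from the cache, fall back to the database on a miss."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                       # cache hit
    item = load_from_database(product_id)               # cache miss
    cache.setex(key, 300, json.dumps(item))             # populate with a 5-minute TTL
    return item
```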
CloudFront caches resources at edge locations close to users; when a request comes in, it is routed to the lowest-latency edge location, which serves the resource (fetching it from the regional/origin location if it is not already cached).
Objects stored in S3 are highly available and durable. S3 historically followed an "eventual consistency" model: whenever an object changed, there was a delay in propagating the change to all replicas, so the storage could return an object even after a delete request had been made. (Since December 2020, S3 provides strong read-after-write consistency.)
So it is best suited to objects that do not change much, such as archives, videos, and images.
The maximum object size is 5 TB, and there is no limit on the number of objects stored. Objects can be accessed via a REST API (see the sketch below).
An extension of S3, for data that is retrieved infrequently.
Data is transitioned from S3 to Glacier when it is ready to be archived.
Example: store the videos and high-quality images in S3 and store the thumbnails in RRS (Reduced Redundancy Storage).
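A sketch of writing an object and handing out time-limited REST access to it via a presigned URL; the bucket name and key are placeholders.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-media-bucket"                   # hypothetical bucket

# Objects are written and read through S3's REST API (wrapped here by boto3).
with open("cat.jpg", "rb") as f:
    s3.put_object(Bucket=bucket, Key="images/cat.jpg", Body=f)

# A presigned URL grants time-limited HTTPS access to a single object.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": "images/cat.jpg"},
    ExpiresIn=3600,
)
print(url)
```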
- hosting static websites (https://www.linkedin.com/learning/aws-essential-training-for-architects/use-s3-for-web-application-hosting?autoSkip=true&resume=false)
- static file storage
- Versioning
- Caching
- Throttling
- Scaling
- Security
- Authentication & authorization
- Monitoring
- Functions as the unit of scale
- Abstracts the runtime
- The function
- some custom code/script that performs business logic
- Event Sources
- a trigger that executes the function, e.g., trigger the function when an object is added to an S3 bucket (when a bucket event occurs); see the sketch below
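A minimal Lambda function whose event source is an S3 bucket notification; the handler simply logs each newly created object.

```python
# Runs whenever an object is created in the bucket that notifies this function.
def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object s3://{bucket}/{key}")
    return {"processed": len(event["Records"])}
```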
study & expand
DynamoDB - NoSQL, Schema-less, scalable database service with low latency, high performance, high throughput
- Data is stored on SSDs
- Data is automatically replicated across multiple Availability Zones
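A sketch of basic reads and writes with the boto3 resource API, assuming a hypothetical `Users` table with partition key `UserId`:

```python
import boto3

# Table "Users" with partition key "UserId" is assumed to exist.
table = boto3.resource("dynamodb").Table("Users")

table.put_item(Item={"UserId": "u-123", "name": "Alice", "plan": "pro"})

response = table.get_item(Key={"UserId": "u-123"})
print(response.get("Item"))
```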
It is a reliable, durable, highly scalable distributed system for passing messages between components.
Used to build loosely coupled systems (minimizing the dependencies).
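A sketch of the decoupling: the producer and the consumer only share a queue, not direct knowledge of each other (the queue name and message body are placeholders).

```python
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="orders")["QueueUrl"]  # or get_queue_url for an existing queue

# Producer: the web tier only knows about the queue, not the worker.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"orderId": "o-42"}')

# Consumer: a separate worker polls, processes, then deletes the message.
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for msg in messages.get("Messages", []):
    print("processing", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```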
Used to configure and coordinate the tasks in a given workflow.
Example: Tightly Coupled E-Commerce System
- Push notifications rather than pull
- Posting to a topic causes a message to send immediately
- SNS lets us push notifications, whereas SQS requires applications to poll constantly (a pull approach)
- Delivery modes include email and text message (SMS)
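A sketch of the push model with boto3; the topic name and email address are placeholders.

```python
import boto3

sns = boto3.client("sns")
topic_arn = sns.create_topic(Name="order-events")["TopicArn"]

# Subscribers (email, SMS, SQS queues, Lambda functions, HTTP endpoints)
# receive the message as soon as it is published -- push, not poll.
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="ops@example.com")  # placeholder address
sns.publish(TopicArn=topic_arn, Subject="Order shipped", Message="Order o-42 left the warehouse")
```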
- Simple Monthly Calculator: a tool to analyze services and usage and provide cost metrics
- Detailed billing reports by account/service/tag, at monthly, daily, or hourly granularity
- Cost Explorer: a UI for interactive cost reports (a programmatic sketch follows this list)
- Billing alarms: use CloudWatch and SNS to get billing notifications whenever a threshold is reached
- Create budgets
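Cost Explorer also has an API; a sketch that pulls one month's unblended cost per service (the dates are placeholders).

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer API

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)
for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```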
- Load Balancers
- EC2 Instances for the application/api deployments
- S3 buckets for storage needs
- Lambda for serverless computing
- DynamoDB for nosql database requirements
- RDS for database needs
- CloudWatch for monitoring resources and alert systems
- CloudFront for CDN solutions
Fault tolerance refers to the ability of a system (computer, network, cloud cluster, etc.) to continue operating without interruption when one or more of its components fail.
It is about how well the system is able to withstand the load when one or more of its components fail.
The objective of creating a fault-tolerant system is to prevent disruptions arising from a single point of failure, ensuring the high availability and business continuity of mission-critical applications or systems.
It is about avoiding loss of service by ensuring that enough resources are available to serve the load.
High availability refers to a system’s ability to avoid loss of service by minimizing downtime. It’s expressed in terms of a system’s uptime, as a percentage of total running time. Five nines, or 99.999% uptime, is considered the “holy grail” of availability.
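The "nines" translate directly into an allowed-downtime budget; a quick calculation:

```python
# Allowed downtime per (365-day) year for common availability targets.
for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_minutes = (1 - availability) * 365 * 24 * 60
    print(f"{availability:.3%} uptime -> {downtime_minutes:.1f} minutes of downtime per year")
```

Five nines works out to roughly five minutes of downtime per year.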
Frequently asked questions topics
- AWS Well-Architected helps cloud architects build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads.