AWS DevOps - keshavbaweja-git/guides GitHub Wiki

CloudWatch

  • CloudWatch Logs aren't a valid target in CloudWatch Event Rules.
  • An ASG target tracking policy does not offer average memory utilization as a predefined metric, because EC2 does not publish memory metrics to CloudWatch by default. Predefined metrics: average CPU utilization, average network in/out, and ALB request count per target.
  • Billing Alarm
    1. Enable Billing Alerts, to monitor AWS estimated charges
    2. Create billing alarm using metric EstimatedCharges
    3. Configure billing alarm to publish a notification to SNS when EstimatedCharges is above threshold
    4. Create email subscriptions against the SNS topic
  • CloudWatch metric retention schedule
    • Resolution < 60 seconds: 3 hours
    • Resolution of 60 seconds (1 minute): 15 days
    • Resolution of 5 minutes: 63 days
    • Resolution of 1 hour: 15 months (455 days)
  • CloudWatch metric minimum resolution = 1 second
  • CloudWatch metrics can't be deleted by users, they expire based on retention schedule
  • Use CloudWatch Unified Agent to stream logs from EC2 instances to CloudWatch Logs
  • CloudWatch supports AWS Console Sign-in as an event source. This source supports
    • Console sign in events
    • AWS API calls via CloudTrail
  • To visualize and alarm based on failed login attempts to EC2 instances
    • Install CloudWatch Unified agent on EC2 instances
    • Stream logs to CloudWatch Logs
    • Create a metric filter based on log events
    • Visualize and create an alarm based on metric
  • AWS Lambda doesn’t provide a built-in metric for memory usage but you can set up a CloudWatch metric filter. You can set up a custom CloudWatch metric filter using a pattern that includes 'REPORT', 'MAX', 'Memory' and 'Used:'.
  • Cross-region functionality is now built into CloudWatch dashboards automatically.
  • You can add cross-account functionality to your CloudWatch console. Cross-account functionality is integrated with AWS Organizations, to help you efficiently build your cross-account dashboards. Each sharing account must set up sharing with a monitoring account or an organization ID.
  • CloudWatch service-linked roles
    • AWSServiceRoleForCloudWatchEvents - EC2 actions
    • AWSServiceRoleForCloudWatchAlarms_ActionSSM - SSM OpsCenter actions
    • AWSServiceRoleForCloudWatchAlarms_ActionSSMIncidents - SSM Incident Manager actions
    • AWSServiceRoleForCloudWatchCrossAccount - cross-account, cross-region access
  • CW Logs can be streamed to Kinesis in a different account via a subscription filter.
  • Monitoring based on logging: CloudWatch Agent => CloudWatch Log Group, Stream and Events => Metric Filter => Alarm => SNS => Email
  • When metric data points are produced at a low frequency, configure CloudWatch alarms to ignore missing data points. This avoids the INSUFFICIENT_DATA alarm state.
  • You can use subscriptions to get access to a real-time feed of log events from CloudWatch Logs and have it delivered to other services such as an Amazon Kinesis stream, an Amazon Kinesis Data Firehose stream, Amazon OpenSearch/ES service or AWS Lambda for custom processing, analysis, or loading to other systems. When log events are sent to the receiving service, they are Base64 encoded and compressed with the gzip format. To begin subscribing to log events, create the receiving resource, such as a Kinesis stream, where the events will be delivered. A subscription filter defines the filter pattern to use for filtering which log events get delivered to your AWS resource, as well as information about where to send matching log events to. Each Log Group can have up to two subscription filters associated with it.
  • ECS, when used as an event source in a CloudWatch Event Rule, offers the following event types
    • ECS Task State Change
    • ECS Container Instance State Change
  • Permissions can be applied to CloudWatch default event bus to receive events from another AWS Account, or another AWS Organization, or all AWS accounts.
  • Start CloudWatch Agent on Windows env - amazon-cloudwatch-agent-ctl.ps1 -m ec2 -a start
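
The subscription delivery format and the Lambda memory pattern described above can be combined in a short sketch. It assumes a Lambda subscription target, where the encoded payload arrives under event["awslogs"]["data"], and it assumes REPORT lines carry a "Max Memory Used: <n> MB" field. The function names are illustrative only.

```python
import base64
import gzip
import json
import re

# Matches the "Max Memory Used: <n> MB" field of a Lambda REPORT log line.
MAX_MEMORY_RE = re.compile(r"Max Memory Used: (\d+) MB")


def decode_subscription_payload(data: str) -> dict:
    """Reverse the Base64 + gzip encoding applied to a subscription delivery."""
    return json.loads(gzip.decompress(base64.b64decode(data)))


def max_memory_used_mb(payload: dict) -> list:
    """Collect the Max Memory Used value from every REPORT line in the payload."""
    values = []
    for log_event in payload.get("logEvents", []):
        message = log_event["message"]
        if message.startswith("REPORT"):
            match = MAX_MEMORY_RE.search(message)
            if match:
                values.append(int(match.group(1)))
    return values


def handler(event, context):
    """Lambda entry point for a CloudWatch Logs subscription (assumed event shape)."""
    payload = decode_subscription_payload(event["awslogs"]["data"])
    return max_memory_used_mb(payload)
```

The extracted values could then be published as a custom metric. A metric filter on the log group reaches the same result without any code.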

EBS

  • Automate snapshot creation of EBS volume => CloudWatch event rule with event source: Schedule, event target: EC2 CreateSnapshot API call with EBS Volume Id as parameter.
  • How to encrypt an attached EBS volume?
    • Option 1 - Create encrypted destination volume. Attach destination volume to the EC2 instance which has un-encrypted source volume attached. Bulk copy files from source volume to destination volume.
    • Option 2 - Create a snapshot of the un-encrypted volume. The created snapshot is un-encrypted. Copy the snapshot with encryption settings specified to create an encrypted snapshot. Create a new encrypted volume from the encrypted snapshot and attach it to the instance.
  • EBS Volume Initialization: Empty EBS volumes receive their maximum performance the moment that they are created and do not require initialization (formerly known as pre-warming). For volumes that were created from snapshots, the storage blocks must be pulled down from Amazon S3 and written to the volume before you can access them. This preliminary action takes time and can cause a significant increase in the latency of I/O operations the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume.
  • EBS snapshots are stored in S3.

Misc

  • There are service limits on AWS Config Aggregator.
  • AWS Support Plans - Basic, Developer, Business, Enterprise
  • An SSL certificate can be attached to a CloudFront distribution. CloudFront distributions support wildcard certificates.
  • Access logging is an optional feature of ELB which is disabled by default. After you enable access logging, ELB captures the logs and stores them in the S3 bucket that you specify.
  • You can use Amazon SNS to send notification messages to one or more HTTP or HTTPS endpoints. When you subscribe an endpoint to a topic, you can publish a notification to the topic and Amazon SNS sends an HTTP POST request delivering the contents of the notification to the subscribed endpoint. When you subscribe the endpoint, you choose whether Amazon SNS uses HTTP or HTTPS to send the POST request to the endpoint.
  • SNS supports SNI, Basic and Digest Access Authentication for HTTPS subscriber endpoints
  • Enable MFA delete on a versioned S3 bucket to provide an additional layer of protection.
  • Control access to an S3 bucket by defining a bucket resource policy that limits access to specified IAM roles.
  • By default, all AWS accounts are limited to five (5) Elastic IP addresses per Region. If you think your architecture warrants additional Elastic IP addresses, you can request a quota increase directly from the AWS Service Quotas console.
  • IAM Access Advisor - enables retrieval of service-last-accessed information for OU via console and APIs.
  • Dedicated Host - Provides visibility of the number of sockets and physical cores. Allows you to consistently deploy your instances to the same physical server over time. Provides additional visibility and control over how instances are placed on a physical server. Support BYOL.
  • The AWS Lambda console provides monitoring graphs for Invocations, Duration, Error count and success rate (%), Throttles, IteratorAge and DeadLetterErrors.
  • AWS Macie uses ML to analyze, classify and protect data in S3. Data sources for Macie -
    • AWS CloudTrail
    • Amazon S3 Bucket
  • QuickSight uses ML to analyze data from various relational data stores and S3 to produce dashboards. YAML is not a supported format.
  • AWS Personal Health Dashboard is not integrated with SNS notifications. Use a CloudWatch event rule with an event source of AWS Health (PHD) events and a target of SNS.
  • Use AWS and user defined tags as cost allocation tags for cost allocation in Cost Explorer dashboard and reports.
  • AWS Directory Service AD Connector redirects directory requests to on-premises Microsoft AD without caching any information in the cloud.
  • By default, AWS Lambda limits the total concurrent executions across all functions in a region to 1000.
  • ELB CW metrics for LB
    • ActiveConnectionCount
    • ConsumedLCUs (Load balancer capacity units)
    • ProcessedBytes - Number of bytes processed by load balancer for IPv4 and IPv6
    • Note NetworkIn is an EC2 metric not ELB metric
  • Amazon EC2 instances have three different virtual network adapters, VIF, Intel 82599 VF, and Elastic Network Adapter (ENA).
  • 10 Gbps between instances: Cluster Placement Group + Enhanced networking instance type
  • 25 Gbps between instances: Cluster Placement Group + Enhanced networking and ENA compatible instance type.
  • On-premises SAML 2.0 compliant identity provider (like Microsoft Active Directory Federation Service) + AWS SSO endpoint => Federate On-premises users into AWS management console. https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_enable-console-saml.html
  • CloudFormation stack deletion - some resources must be empty or unassociated before they can be deleted. E.g. an S3 bucket must be empty, and an EC2 security group must not be associated with any EC2 instance.
  • Security groups can't be attached to NLBs.
  • aws:PrincipalTag - Tags that exist on the user or role making the call.
  • iam:ResourceTag - Tags that exist on an IAM resource.
  • StringEquals: { "ec2:ResourceTag/environment": "production" }
  • StartConfigRulesEvaluation - AWS Config API that can be invoked at the rate of 1/min
  • S3 bucket resource policy to deny upload of un-encrypted objects: "s3:x-amz-server-side-encryption": "aws:kms"
  • S3 bucket resource policy to deny access of objects over non-secure channel: "aws:SecureTransport": "false"
  • Unlike an identity-based policy, a resource-based policy specifies who (which principal) can access that resource.
  • DR Strategies
    • Backup & restore, Pilot Light, Warm Standby, Active-Active.
    • Warm Standby: creates a fully functional second environment. However, the second environment uses minimum number of servers to reduce cost and does not serve traffic until failover.
  • AWS::EC2::PlacementGroup
    • Cluster
    • Partition
    • Spread
  • Route 53 routing policies
    • Weighted for Blue/Green
    • Weighted for Canary
    • Latency for DR
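
The two S3 bucket policy conditions listed above (s3:x-amz-server-side-encryption and aws:SecureTransport) could be combined into one bucket policy along these lines. This is a minimal sketch: the bucket name, the Sid values, and the helper name build_bucket_policy are placeholders, and the first statement assumes SSE-KMS is the only encryption mode you want to allow.

```python
import json


def build_bucket_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies unencrypted uploads and non-TLS access."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny PutObject unless the request asks for SSE-KMS encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            },
            {
                # Deny every S3 action that arrives over plain HTTP.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_bucket_policy("example-bucket"), indent=2))
```

Both statements use Deny, so they take precedence over any Allow granted elsewhere.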

RDS

  • Aurora endpoint types
    • Cluster
    • Reader
    • Custom
    • Instance
  • RDS Aurora Postgres offers lower replication lag than RDS Postgres
  • RDS Aurora Postgres offers 15 read replicas, RDS Postgres offers 5 read replicas
  • RDS Multi-AZ DB instance - Amazon RDS automatically creates a Primary DB instance and synchronously replicates data to a standby instance in different AZ.
  • RDS automated snapshots cannot be shared with other accounts or regions
  • Creating a copy of an RDS automated snapshot creates a manual snapshot which can be shared with other accounts and regions.
  • AWS::RDS::DBInstance.StorageEncrypted (boolean, default is false)
  • Amazon Aurora global databases
    • Span multiple AWS Regions
    • Enable low latency global reads
    • Provide fast recovery from region outage
    • Consists of one primary AWS Region where data is mastered and up to 5 read-only secondary regions
    • AWS replicates data to secondary AWS Regions over dedicated infrastructure with latency under a second.
    • Use cluster storage volume and not the database engine for replication
    • Failover types
      • Managed planned failover for DR testing
      • Unplanned failover in case of a region outage
    • During a managed planned failover, the chosen secondary DB cluster does not inherit the configuration options of the primary cluster. Ensure the following configuration options are kept the same in the chosen secondary cluster
      • Parameter group
      • CloudWatch Events and Alarms
      • Integration with other AWS services

Step Functions

  • Error handling in Step Functions - define Retry, which is an array of Retriers. A Retrier is defined as
    • ErrorEquals (an array of error names)
    • IntervalSeconds (seconds to wait before first retry attempt)
    • MaxAttempts
    • BackoffRate
  • Step Functions is designed to run workflows that have a finite duration and number of steps. Executions have a maximum duration of 1 year and a maximum of 25,000 events.
  • Step Functions best practices
    • Use timeouts to avoid stuck executions
    • Use Amazon S3 ARNs instead of passing large payloads
    • Avoid reaching history quota
    • Handle Lambda service exceptions
    • Avoid latency when polling for activity tasks
      • Have at least 100 open polls per activity ARN at each point in time
      • Implement pollers as separate threads in activity worker implementation
    • Choosing Standard or Express Workflows
    • Amazon CloudWatch Logs resource policy size restrictions
  • Activities are an AWS Step Functions feature that enables you to have a task in your state machine where the work is performed by a worker that can be hosted on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), mobile devices—basically anywhere.
  • In AWS Step Functions, activities are a way to associate code running somewhere (known as an activity worker) with a specific task in a state machine. You can create an activity using the Step Functions console, or by calling CreateActivity. This provides an Amazon Resource Name (ARN) for your task state.
  • An activity worker can be any application that can make an HTTP connection, hosted anywhere. When Step Functions reaches an activity task state, the workflow waits for an activity worker to poll for a task. An activity worker polls Step Functions by using GetActivityTask, and sending the ARN for the related activity. GetActivityTask returns a response including input (a string of JSON input for the task) and a taskToken (a unique identifier for the task). After the activity worker completes its work, it can provide a report of its success or failure by using SendTaskSuccess or SendTaskFailure. These two calls use the taskToken provided by GetActivityTask to associate the result with that task.
  • AWS Step Functions offers Standard Workflows as the default workflow type, with the option to choose Express Workflows. You can choose Standard Workflows when you need long-running, durable, and auditable workflows, or Express Workflows for high-volume, event processing workloads. Your state machine executions will behave differently, depending on which Type you select. The Type you choose cannot be changed after your state machine has been created.
  • CloudWatch Logs resource policies are limited to 5120 characters. When CloudWatch Logs detects that a policy approaches this size limit, it automatically enables log groups that start with /aws/vendedlogs/. When you create a state machine with logging enabled, Step Functions must update your CloudWatch Logs resource policy with the log group you specify. To avoid reaching the CloudWatch Logs resource policy size limit, prefix your CloudWatch Logs log group names with /aws/vendedlogs/. When you create a log group in the Step Functions console, the log group names are prefixed with /aws/vendedlogs/states. For more information, see Enabling Logging from Certain AWS Services.
  • When a Step Functions state reports an error, the default course of action for AWS Step Functions is to fail the execution entirely. Retries happen only when the state defines a Retry field, and error transitions only when it defines a Catch field.
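
To make the Retrier fields listed above concrete, here is a small sketch of the wait times they imply. It assumes each wait is the previous wait multiplied by BackoffRate, starting from IntervalSeconds, and that unspecified fields fall back to IntervalSeconds=1, MaxAttempts=3 and BackoffRate=2.0. The helper name retry_delays is illustrative.

```python
def retry_delays(retrier: dict) -> list:
    """Return the wait, in seconds, before each retry attempt of one Retrier."""
    interval = retrier.get("IntervalSeconds", 1)
    max_attempts = retrier.get("MaxAttempts", 3)
    backoff_rate = retrier.get("BackoffRate", 2.0)
    # Attempt n (0-based) waits IntervalSeconds * BackoffRate ** n seconds.
    return [interval * backoff_rate ** attempt for attempt in range(max_attempts)]


# A Retrier as it would appear inside a state's Retry array.
timeout_retrier = {
    "ErrorEquals": ["States.Timeout"],
    "IntervalSeconds": 3,
    "MaxAttempts": 2,
    "BackoffRate": 1.5,
}

if __name__ == "__main__":
    print(retry_delays(timeout_retrier))
```

With the values above, the first retry waits 3 seconds and the second waits 4.5 seconds. After that the error propagates to a Catch field, if one is defined.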

Secrets Manager

  • Secrets Manager quotas by region
    • Secrets: 40,000
    • Versions of a secret: ~100
    • Staging Labels attached across all versions of a secret: 20
    • Versions attached to a label at the same time: 1
    • Length of a secret value: 65,536 bytes
    • Length of resource-based policy: 20,480 characters
    • DescribeSecret + GetSecretValue: 5,000/sec
  • Replicate a Secrets Manager secret across regions
    • Secrets Manager rotates a secret in the source AWS Region
    • CloudTrail receives a log event with eventName: "RotationSucceeded"
    • CloudTrail log event triggers a CloudWatch rule with a Lambda function as target
    • Lambda performs PutSecretValue in destination AWS Region
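
The replication steps above might look roughly like this inside the Lambda function. This is a sketch under several assumptions: the rule delivers the CloudTrail record under event["detail"], the rotated secret's ID is available at detail["additionalEventData"]["SecretId"], the secret holds a string rather than binary data, and a secret with the same name already exists in the destination Region. The two clients are passed in so the logic can be exercised without AWS. In a real function they would be boto3 Secrets Manager clients for the source and destination Regions.

```python
def replicate_rotated_secret(event: dict, source_client, destination_client) -> str:
    """Copy the current value of a rotated secret to the destination Region."""
    detail = event["detail"]
    if detail.get("eventName") != "RotationSucceeded":
        raise ValueError("not a rotation-succeeded event")

    # Assumed location of the secret identifier inside the CloudTrail record.
    secret_id = detail["additionalEventData"]["SecretId"]

    # Read the freshly rotated value from the source Region.
    current = source_client.get_secret_value(SecretId=secret_id)

    # Write it as a new version of the same-named secret in the destination Region.
    destination_client.put_secret_value(
        SecretId=current["Name"],
        SecretString=current["SecretString"],
    )
    return current["Name"]
```

Addressing the destination secret by name rather than ARN matters here, because the ARN embeds the source Region.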

SAM

  • SAM Deployment Preference - Type
    • Canary - Traffic is shifted in two increments
    • Linear - Traffic is shifted in equal increments with equal intervals
    • All-at-once
  • SAM Deployment Preference - Alarms
    • A list of alarms that are monitored. During deployment, if any of these alarms is triggered deployment is rolled back.
  • SAM Deployment Preference - Hooks
    • PostTraffic - Lambda function that is invoked post traffic shifting. Can be used to run integration tests. Must return Success/Failure to CodeDeploy. CodeDeploy invokes hook function asynchronously.
    • PreTraffic - Lambda function to test new Lambda version before any traffic has been shifted. If it returns Success, SAM deployment proceeds with traffic shifting.
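
The difference between the Canary and Linear deployment preference types above can be sketched as a traffic schedule. The sketch assumes the first increment is shifted at minute zero and that percentages are whole numbers. The function names are illustrative and not part of SAM or CodeDeploy.

```python
def canary_schedule(percent: int, interval_minutes: int) -> list:
    """Canary: shift `percent` first, then the remainder after one interval."""
    return [(0, percent), (interval_minutes, 100)]


def linear_schedule(percent: int, interval_minutes: int) -> list:
    """Linear: shift `percent` more at every interval until 100% is reached."""
    schedule = []
    shifted = 0
    minute = 0
    while shifted < 100:
        shifted = min(100, shifted + percent)
        schedule.append((minute, shifted))
        minute += interval_minutes
    return schedule


if __name__ == "__main__":
    # Each tuple is (minute offset, cumulative % of traffic on the new version).
    print(canary_schedule(10, 5))
    print(linear_schedule(10, 1))
```

Under these assumptions, a 10 percent / 5 minute canary has two steps, while a 10 percent / 1 minute linear shift has ten.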

X-Ray

  • The AWS X-Ray daemon is a software application that listens for traffic on UDP port 2000, gathers raw segment data, and relays it to the AWS X-Ray API. The daemon works in conjunction with the AWS X-Ray SDKs and must be running so that data sent by the SDKs can reach the X-Ray service.
  • On AWS Lambda and AWS Elastic Beanstalk, use those services' integration with X-Ray to run the daemon. Lambda runs the daemon automatically any time a function is invoked for a sampled request. On Elastic Beanstalk, use the XRayEnabled configuration option to run the daemon on the instances in your environment.
  • To run the X-Ray daemon locally, on-premises, or on other AWS services, download it, run it, and then give it permission to upload segment documents to X-Ray.
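
The X-Ray SDKs normally handle delivery to the daemon, but the exchange can be sketched by hand. The sketch below assumes a daemon listening on 127.0.0.1:2000 and a datagram made of a one-line JSON header, a newline, and the segment document. The function names are illustrative.

```python
import json
import os
import socket
import time

DAEMON_ADDRESS = ("127.0.0.1", 2000)  # assumed local daemon on the default UDP port
HEADER = '{"format": "json", "version": 1}'


def new_trace_id(now=None) -> str:
    """Build a trace ID: version, 8 hex digits of epoch time, 24 random hex digits."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def build_datagram(name: str, start_time: float, end_time: float) -> bytes:
    """Serialize a minimal completed segment behind the daemon header line."""
    segment = {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        "trace_id": new_trace_id(start_time),
        "start_time": start_time,
        "end_time": end_time,
    }
    return (HEADER + "\n" + json.dumps(segment)).encode("utf-8")


def send_segment(datagram: bytes, address=DAEMON_ADDRESS) -> None:
    """Fire-and-forget UDP send; the daemon relays the segment to the X-Ray API."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, address)
```

Because the transport is UDP, send_segment does not report whether a daemon actually received the data.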

Kinesis

  • Kinesis Data Firehose has Lambda blueprint for Syslog->Json.
  • Kinesis Data Firehose targets: S3, RedShift, Elasticsearch, Splunk
  • You should always use the Kinesis Producer Library (KPL) when writing code for Kinesis where possible, due to its performance benefits, monitoring, and asynchronous operation. Also check the GetShardIterator, CreateStream and DescribeStream service limits. Use the PutRecords operation to publish multiple records in a batch for higher throughput.
  • Records being read slowly off Kinesis Stream - this could be caused if maxRecords value for GetRecords call is set low. Other reason could be inefficient/slow consumer logic that processes records batch.
  • Records skipped while being processed by a consumer - processRecords calls have unhandled exceptions.
  • Amazon Kinesis stores your data for up to 24 hours by default. You can raise the data retention period to up to 7 days by enabling extended data retention, or up to 365 days by enabling long-term data retention, using the console, the CLI or the API.
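
Batching with PutRecords, mentioned above, needs the records split into request-sized groups first. The sketch below assumes the limit of 500 records per PutRecords request and ignores the separate cap on total request size. The helper name batch_records is illustrative.

```python
from typing import Iterable, Iterator, List

MAX_RECORDS_PER_PUT = 500  # assumed PutRecords limit on records per request


def batch_records(records: Iterable[dict],
                  batch_size: int = MAX_RECORDS_PER_PUT) -> Iterator[List[dict]]:
    """Yield consecutive groups of at most `batch_size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield batch


if __name__ == "__main__":
    records = [{"Data": b"payload", "PartitionKey": str(i)} for i in range(1200)]
    print([len(batch) for batch in batch_records(records)])
```

Each batch would then be passed as the Records argument of one PutRecords call. A production producer would also retry the individual records that the response reports as failed.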

AWS SSO + Microsoft AD

  • Ensure the number of AWS SSO permission sets is less than 500 and you have no more than 1500 AD groups.
  • AWS Organizations and the AWS Managed Microsoft AD must be in the same account and the same region.
  • You can use the User Principal Name (UPN) or the DOMAIN\UserName format to authenticate with AD, but you can't use the UPN format if you have two-step verification and Context-aware verification enabled.

Elastic Beanstalk

  • Elastic Beanstalk environment configuration precedence during environment creation -
    1. Settings applied directly on Console/CLI/API
    2. Settings from saved configuration file, if specified under .elasticbeanstalk/saved_configs
    3. Configuration file under .ebextensions folder
    4. Default values
  • Elastic Beanstalk environment types
    • Load-balanced, scalable
    • Single instance
    • Worker environment - SQS Queue + Worker Daemon
  • Elastic Beanstalk Deployment preferences
    • Ignore health check
    • Command timeout
    • Healthy threshold
  • Elastic Beanstalk commands: You can use the commands key to execute commands on the EC2 instance. The commands run before the application and web server are set up and the application version file is extracted. The specified commands run as the root user, and are processed in alphabetical order by name. By default, commands run in the root directory. To run commands from another directory, use the cwd option. To troubleshoot issues with your commands, you can find their output in instance logs.
  • Elastic Beanstalk services: You can use the services key to define which services should be started or stopped when the instance is launched. The services key also allows you to specify dependencies on sources, packages, and files so that if a restart is needed due to files being installed, Elastic Beanstalk takes care of the service restart.
  • Elastic Beanstalk container commands: Container commands run after the application and web server have been set up and the application version archive has been extracted, but before the application version is deployed. The specified commands run as the root user, and are processed in alphabetical order by name. Container commands are run from the staging directory, where your source code is extracted prior to being deployed to the application server. Any changes you make to your source code in the staging directory with a container command will be included when the source is deployed to its final location. You can use leader_only to only run the command on a single instance, or configure a test to only run the command when a test command evaluates to true. Leader-only container commands are only executed during environment creation and deployments, while other commands and server customization operations are performed every time an instance is provisioned or updated. Leader-only container commands are not executed due to Launch Configuration changes, such as a change in the AMI Id or instance type.
  • Elastic Beanstalk configurations are stored as .ebextensions/abc.config
  • Elastic Beanstalk saved configurations are stored under .elasticbeanstalk/saved_configs/
  • Elastic Beanstalk managed platform updates - minor and patch version updates, with the following characteristics
    • Automatically upgrade to the latest version of a platform during a scheduled maintenance window.
    • Application remains in service with no reduction in capacity
    • Available on both single-instance and load-balanced environments.
  • EB Instance deployment workflow
    1. Download application => Run commands in config file => Extract application => Run prebuild hooks in .platform/hooks/prebuild/*
    2. Configure app and proxy: Buildfile => Configure proxy overrides: .platform/nginx/* => container_commands in config file => Run predeploy hooks: .platform/hooks/predeploy/*
    3. Deploy/flip: Procfile => Proxy overrides take effect => Run postdeploy hooks: .platform/hooks/postdeploy/*
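
The configuration precedence listed at the start of this section can be modelled as a layered lookup. In the sketch below, a ChainMap searches the four sources in precedence order, so the first source that defines an option wins. The option names and values are made up for illustration.

```python
from collections import ChainMap


def effective_settings(direct: dict, saved_config: dict,
                       ebextensions: dict, defaults: dict) -> dict:
    """Resolve each option from the highest-precedence source that defines it."""
    # ChainMap looks keys up left to right, which mirrors the precedence list.
    return dict(ChainMap(direct, saved_config, ebextensions, defaults))


if __name__ == "__main__":
    resolved = effective_settings(
        direct={"InstanceType": "t3.large"},
        saved_config={"InstanceType": "t3.medium", "MinSize": "2"},
        ebextensions={"MinSize": "1", "MaxSize": "6"},
        defaults={"InstanceType": "t3.micro", "MinSize": "1", "MaxSize": "4"},
    )
    print(resolved)
```

Here InstanceType comes from the direct setting, MinSize from the saved configuration, and MaxSize from the .ebextensions file.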

ASG

  • The health status of an Auto Scaling instance is either healthy or unhealthy. All instances in your Auto Scaling group start in the healthy state. Instances are assumed to be healthy unless Amazon EC2 Auto Scaling receives notification that they are unhealthy. This notification can come from one or more of the following sources: Amazon EC2, Elastic Load Balancing (ELB), or a custom health check. After Amazon EC2 Auto Scaling marks an instance as unhealthy, it is scheduled for replacement. If you do not want unhealthy instances to be replaced, you can suspend the ReplaceUnhealthy process for any individual Auto Scaling group. For details, see Suspending and resuming a process for an Auto Scaling group. To provide enough time for new instances to be ready to start serving application traffic without being terminated due to failed health checks, set the health check grace period of the group to match the expected startup period of your application. For more information, see Health check grace period.
  • The AWS::AutoScaling::AutoScalingGroup resource uses the UpdatePolicy attribute to define how an Auto Scaling group resource is updated when the AWS CloudFormation stack is updated. If you don't have the correct settings configured for the UpdatePolicy attribute, your rolling update can produce unexpected results.
  • AutoScalingRollingUpdate policy controls how AWS CloudFormation handles rolling updates for an Auto Scaling group. This common approach keeps the same Auto Scaling group, and then replaces the old instances based on the parameters that you set.
  • AutoScalingReplacingUpdate policy enables you to specify whether AWS CloudFormation replaces an Auto Scaling group with a new one or replaces only the instances in the Auto Scaling group.
  • ASG lifecycle hooks - Pending:Wait, Pending:Proceed, Terminating:Wait, Terminating:Proceed
  • Default ASG termination policy selects an instance for termination as below -
    • Balance no. of instances in AZs, select an instance that is not protected from scale in from AZ with highest no. of instances.
    • Terminate the instance that was launched with oldest Launch Template or Launch Configuration
    • Terminate the instance closest to next billing hour
  • ASG Process: AddToLoadBalancer, when this process is suspended, Amazon EC2 Auto Scaling launches instances but does not add them to the load balancer target group or Classic Load Balancer. When you resume the AddToLoadBalancer process, it resumes adding instances to the load balancer when they are launched. However, it does not add the instances that were launched while this process was suspended. You must register those instances manually.
  • AutoScaling cool down period is a configurable setting that helps ensure that AutoScaling doesn't launch or terminate additional instances before the previous scaling activity takes effect.
  • AWS::AutoScaling::AutoScalingGroup.NotificationConfigurations is an array of NotificationConfiguration with attributes as below. NotificationConfiguration specifies the events that the Amazon EC2 Auto Scaling group sends notifications for.
    • NotificationTypes
      • autoscaling:EC2_INSTANCE_LAUNCH
      • autoscaling:EC2_INSTANCE_LAUNCH_ERROR
      • autoscaling:EC2_INSTANCE_TERMINATE
      • autoscaling:EC2_INSTANCE_TERMINATE_ERROR
      • autoscaling:TEST_NOTIFICATION
    • TopicARN
  • AWS::AutoScaling::AutoScalingGroup.LifecycleHookSpecificationList is an array of LifecycleHookSpecification. A lifecycle hook specifies actions to perform when Amazon EC2 Auto Scaling launches or terminates instances.
    • LifecycleTransition
      • autoscaling:EC2_INSTANCE_LAUNCHING
      • autoscaling:EC2_INSTANCE_TERMINATING
    • NotificationTargetARN
  • AWS AutoScaling supports following event types as a CloudWatch event source
    • Instance launch and terminate
      • "EC2 Instance Launch Successful",
      • "EC2 Instance Terminate Successful",
      • "EC2 Instance Launch Unsuccessful",
      • "EC2 Instance Terminate Unsuccessful",
      • "EC2 Instance-launch Lifecycle Action",
      • "EC2 Instance-terminate Lifecycle Action"
    • Instance refresh
      • "EC2 Auto Scaling Instance Refresh Checkpoint Reached",
      • "EC2 Auto Scaling Instance Refresh Started",
      • "EC2 Auto Scaling Instance Refresh Succeeded",
      • "EC2 Auto Scaling Instance Refresh Failed",
      • "EC2 Auto Scaling Instance Refresh Cancelled"
  • Refreshing instances in AWS AutoScaling Group
    • UpdatePolicy in CloudFormation stack
    • CodeDeploy
    • StartInstanceRefresh API call built in to AutoScaling service
  • ASG LaunchConfiguration is immutable. Clone to create a new version.
  • ASG LaunchTemplate is versioned, LaunchConfiguration is not.
  • ASG LaunchTemplate lets you specify a percentage mix of On-Demand and Spot instances; LaunchConfiguration does not.
  • Scaling policy types
    • Dynamic scaling policy
      • Target tracking
      • Step scaling (scale in multiple steps based on an Active CW Alarm)
      • Simple (scale based on an Active CW Alarm)
    • Predictive scaling policy: Scales based on predicted usage. Provide the metric to be monitored and the target utilization as inputs to the policy.
    • Scheduled action: Provide Desired/Min/Max capacity with a once-off or recurring schedule.
  • Warm pool
    • Warm pool instance state: Stopped | Running
    • Warm pool size
    • Max prepared capacity: ASG Max | Set number of instances
  • An instance refresh allows you to trigger a rolling replacement of instances in the Auto Scaling group with a new group of instances. You can optionally add checkpoints to replace instances in phases and perform verifications on your instances at specific points.
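
The default termination policy steps listed earlier in this section can be approximated in a few lines. This is a simplified sketch, not the service's actual algorithm: the Instance fields are hypothetical, launch_version stands in for the age of the launch template or launch configuration, and ties are broken by name instead of at random.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    instance_id: str
    az: str
    launch_version: int           # hypothetical: lower means older template/config
    minutes_to_billing_hour: int
    protected: bool = False       # instance scale-in protection


def pick_instance_to_terminate(instances: List[Instance]) -> Optional[Instance]:
    """Pick a scale-in victim following the three default-policy steps."""
    unprotected = [i for i in instances if not i.protected]
    if not unprotected:
        return None  # every instance is protected from scale in

    # Step 1: the AZ with the most instances that has an unprotected instance.
    az_counts = Counter(i.az for i in instances)
    eligible_azs = sorted({i.az for i in unprotected})
    target_az = max(eligible_azs, key=lambda az: az_counts[az])
    candidates = [i for i in unprotected if i.az == target_az]

    # Steps 2 and 3: oldest launch version, then closest to the next billing hour.
    return min(
        candidates,
        key=lambda i: (i.launch_version, i.minutes_to_billing_hour, i.instance_id),
    )
```

Returning None when every instance is protected mirrors the behaviour described in the scale-in protection section, where desired capacity drops but nothing can be terminated.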

ASG - Instance scale in protection

  • To control whether an Auto Scaling group can terminate a particular instance when scaling in, use instance scale-in protection. You can enable the instance scale-in protection setting on an Auto Scaling group or on an individual Auto Scaling instance. When the Auto Scaling group launches an instance, it inherits the instance scale-in protection setting of the Auto Scaling group. You can change the instance scale-in protection setting for an Auto Scaling group or an Auto Scaling instance at any time. Instance scale-in protection starts when the instance state is InService. If you detach an instance that is protected from scale-in, its instance scale-in protection setting is lost. When you attach the instance to the group again, it inherits the current instance scale-in protection setting of the group. If all instances in an Auto Scaling group are protected from termination during scale in, and a scale-in event occurs, its desired capacity is decremented. However, the Auto Scaling group can't terminate the required number of instances until their instance scale-in protection settings are disabled.

ASG - Standby

  • You can put an instance that is in the InService state into the Standby state, update or troubleshoot the instance, and then return the instance to service. Instances that are on standby are still part of the Auto Scaling group, but they do not actively handle load balancer traffic. This feature helps you stop and start the instances or reboot them without worrying about Amazon EC2 Auto Scaling terminating the instances as part of its health checks or during scale-in events. By default, the value that you specified as your desired capacity is decremented when you put an instance on standby. This prevents the launch of an additional instance while you have this instance on standby. Alternatively, you can specify that your desired capacity is not decremented. If you specify this option, the Auto Scaling group launches an instance to replace the one on standby. The intention is to help you maintain capacity for your application while one or more instances are on standby.

SQS

  • SQS message size limit is 256 KB.
  • You can use the Amazon SQS Extended Client Library for Java to manage Amazon SQS messages using Amazon S3 only with the AWS SDK for Java. You can't do this with the AWS CLI, the Amazon SQS console, the Amazon SQS HTTP API, or any of the other AWS SDKs. This is especially useful for storing and consuming messages up to 2 GB. You can use the Amazon SQS Extended Client Library for Java to do the following:
    • Specify whether messages are always stored in Amazon S3 or only when the size of a message exceeds 256 KB.
    • Send a message that references a single message object stored in an S3 bucket.
    • Retrieve the message object from an S3 bucket.
    • Delete the message object from an S3 bucket
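
The Extended Client Library itself is Java-only, but the pattern behind it can be illustrated in a few lines. The sketch below is not the library's API or wire format: a plain dict stands in for the S3 bucket, the envelope keys "inline" and "pointer" are invented, and the 256 KB threshold is taken from the text above.

```python
import uuid

SIZE_THRESHOLD_BYTES = 256 * 1024  # threshold from the notes above


def prepare_message(body: str, object_store: dict,
                    always_use_store: bool = False) -> dict:
    """Return an envelope holding the body inline or a pointer to the stored body."""
    if always_use_store or len(body.encode("utf-8")) > SIZE_THRESHOLD_BYTES:
        key = str(uuid.uuid4())
        object_store[key] = body       # stand-in for uploading the object to S3
        return {"pointer": key}
    return {"inline": body}


def resolve_message(envelope: dict, object_store: dict,
                    delete_after_read: bool = True) -> str:
    """Return the original body, fetching and optionally deleting the stored object."""
    if "inline" in envelope:
        return envelope["inline"]
    key = envelope["pointer"]
    body = object_store[key]           # stand-in for retrieving the object from S3
    if delete_after_read:
        del object_store[key]          # stand-in for deleting the object from S3
    return body
```

The queue only ever carries the small envelope, which is why the payload size is bounded by the object store rather than by the queue.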

Jumbo frames

  • The maximum transmission unit (MTU) of a network connection is the size, in bytes, of the largest permissible packet that can be passed over the connection. The larger the MTU of a connection, the more data that can be passed in a single packet. Ethernet packets consist of the frame, or the actual data you are sending, and the network overhead information that surrounds it.
  • Ethernet frames can come in different formats, and the most common format is the standard Ethernet v2 frame format. It supports 1500 MTU, which is the largest Ethernet packet size supported over most of the internet. The maximum supported MTU for an instance depends on its instance type. All Amazon EC2 instance types support 1500 MTU, and many current instance sizes support 9001 MTU, or Jumbo frames.
  • Jumbo frames allow more than 1500 bytes of data by increasing the payload size per packet, and thus increasing the percentage of the packet that is not packet overhead. Fewer packets are needed to send the same amount of usable data. However, traffic is limited to a maximum MTU of 1500 in the following cases:
    • Traffic over an internet gateway
    • Traffic over an inter-region VPC peering connection
    • Traffic over VPN connections
    • Traffic outside of a given AWS Region for EC2-Classic
  • If packets are over 1500 bytes, they are fragmented, or they are dropped if the Don't Fragment flag is set in the IP header. Jumbo frames should be used with caution for internet-bound traffic or any traffic that leaves a VPC. Packets are fragmented by intermediate systems, which slows down this traffic. To use jumbo frames inside a VPC and not slow traffic that's bound for outside the VPC, you can configure the MTU size by route, or use multiple elastic network interfaces with different MTU sizes and different routes.
  • For instances that are collocated inside a cluster placement group, Jumbo frames help to achieve the maximum network throughput possible, and they are recommended in this case.
  • You can use Jumbo frames for traffic between your VPCs and your on-premises networks over AWS Direct Connect.
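
To make the "fewer packets, less overhead" claim concrete, here is a rough calculation. It assumes 40 bytes of per-packet header overhead (a 20-byte IPv4 header plus a 20-byte TCP header with no options); that figure is an assumption for illustration, not something stated in these notes.

```python
import math

# Assumption: 20-byte IPv4 header + 20-byte TCP header, no options.
HEADER_BYTES = 40


def payload_efficiency(mtu: int, header_bytes: int = HEADER_BYTES) -> float:
    """Fraction of each full-size packet that carries usable data."""
    return (mtu - header_bytes) / mtu


def packets_needed(data_bytes: int, mtu: int, header_bytes: int = HEADER_BYTES) -> int:
    """Number of packets required to carry data_bytes of payload."""
    return math.ceil(data_bytes / (mtu - header_bytes))


for mtu in (1500, 9001):
    print(mtu, round(payload_efficiency(mtu), 4), packets_needed(1_000_000, mtu))
```

Under these assumptions, sending 1,000,000 bytes takes 685 packets at MTU 1500 and 112 packets at MTU 9001.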

DynamoDB

  • A DynamoDB global secondary index (GSI) lets you define an index with a partition key different from the one defined on the base table. A GSI has its own throughput settings and does not share throughput with the base table.
  • DynamoDB Global Table is a fully managed multi-region, multi-master table.
  • DAX is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement.
  • No more than two processes should read from the same DynamoDB Streams shard at the same time; additional readers can cause throttling.
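
As a sketch of the GSI point, the request parameters below follow the shape of the DynamoDB CreateTable API. The table, attribute, and index names and the capacity values are hypothetical.

```python
# Request parameters in the shape accepted by the DynamoDB CreateTable API.
# Table, attribute, and index names are hypothetical.
create_table_params = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "order_id", "AttributeType": "S"},
        {"AttributeName": "customer_id", "AttributeType": "S"},
    ],
    # The base table is partitioned by order_id.
    "KeySchema": [{"AttributeName": "order_id", "KeyType": "HASH"}],
    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "customer-index",
            # The GSI uses a different partition key and its own throughput.
            "KeySchema": [{"AttributeName": "customer_id", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
        }
    ],
}


def partition_key(key_schema):
    """Return the HASH (partition) key attribute name from a key schema."""
    return next(k["AttributeName"] for k in key_schema if k["KeyType"] == "HASH")
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("dynamodb").create_table(**create_table_params)`.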

CloudFormation best practices

  • Planning and organizing
    • Organize your stacks by lifecycle and ownership
    • Use cross-stack references to export shared resources
    • Use IAM to control access
    • Reuse templates to replicate stacks in multiple environments
    • Verify quotas for all resource types
    • Use modules to reuse resource configurations
  • Managing stacks
    • Manage all stack resources through AWS CloudFormation
    • Create change sets before updating your stacks
    • Use stack policies
    • Use AWS CloudTrail to log AWS CloudFormation calls
    • Use code reviews and revision controls to manage your templates
    • Update your Amazon EC2 Linux instances regularly
  • Stack Policy - When you create a stack, all update actions are allowed on all resources. By default, anyone with stack update permissions can update all of the resources in the stack. You can prevent stack resources from being unintentionally updated or deleted during a stack update by using a stack policy. A stack policy is a JSON document that defines the update actions that can be performed on designated resources. After you set a stack policy, all of the resources in the stack are protected by default. To allow updates on specific resources, you specify an explicit Allow statement for those resources in your stack policy. You can define only one stack policy per stack, but you can protect multiple resources within a single policy. A stack policy applies to all AWS CloudFormation users who attempt to update the stack. You can't associate different stack policies with different users. A stack policy applies only during stack updates. It doesn't provide access controls like an AWS Identity and Access Management (IAM) policy.
  • Use the aws cloudformation create-stack command with the --stack-policy-body option to type in a modified policy or the --stack-policy-url option to specify a file containing the policy.
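
A minimal sketch of building such a policy document follows. It allows all updates and then explicitly denies updates to selected resources; the logical resource ID "ProductionDatabase" and the helper name are hypothetical.

```python
import json


def build_stack_policy(protected_logical_ids):
    """Allow all updates, then explicitly deny updates to the protected resources."""
    statements = [
        {"Effect": "Allow", "Action": "Update:*", "Principal": "*", "Resource": "*"}
    ]
    for logical_id in protected_logical_ids:
        statements.append(
            {
                "Effect": "Deny",
                "Action": "Update:*",
                "Principal": "*",
                "Resource": f"LogicalResourceId/{logical_id}",
            }
        )
    return {"Statement": statements}


# "ProductionDatabase" is a hypothetical logical resource ID.
policy_body = json.dumps(build_stack_policy(["ProductionDatabase"]), indent=2)
print(policy_body)
```

The printed JSON is the kind of document that could be supplied through the --stack-policy-body option.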

AWS Data Pipeline

  • AWS Data Pipeline data stores
    • DynamoDB
    • RDS
    • S3
    • Redshift
  • AWS Data Pipeline compute services
    • EC2
    • EMR

Amazon Inspector

  • Amazon Inspector is an AWS tool to perform security assessments of managed instances using AWS managed rules packages. To configure Inspector
    • Install SSM agent
    • Instance role needs permission to execute "Run Command"
    • Install Inspector agent using SSM Run Command
  • Rules packages
    • Network reachability
    • Common vulnerabilities and exposures (CVEs)
    • Center for Internet Security (CIS) benchmarks
    • Security best practices for Amazon Inspector
      • Disable root login over SSH
      • Disable password authentication over SSH
      • Support SSHv2 only
      • Enable Address Space Layout Randomization
      • Enable Data Execution Prevention
  • Assessment template
    • Rules package
    • Duration
    • SNS topic for notifications
    • Tags to apply on findings
  • Assessment report: can be generated via Console or API call GetAssessmentReport
  • CloudFormation resources
    • AWS::Inspector::AssessmentTarget
    • AWS::Inspector::AssessmentTemplate
    • AWS::Inspector::ResourceGroup
  • When Security Hub and Inspector are both enabled in an account, findings from an Inspector run are automatically sent to Security Hub.

AWS Systems Manager

  • IAM Action required to allow execution of SSM documents (i.e. RunCommand) - ssm:SendCommand
  • Patch Manager supports custom patch baselines, each of which can have multiple patch groups associated with it. These patch groups can then be individually scheduled for patching by creating maintenance windows.
  • On-Premises Managed Instances are prefixed with mi- in SSM Fleet Manager
  • SSM Hybrid Activation - register On-Premises server with SSM
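
As an illustration of the ssm:SendCommand permission, the sketch below builds an identity-based IAM policy document. The choice of the AWS-RunShellScript document and the wildcard ARNs are assumptions for the example; a real policy would normally be scoped more tightly.

```python
import json

# Hypothetical identity-based policy; the document name and wildcard ARNs are illustrative.
send_command_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ssm:SendCommand",
            "Resource": [
                # The SSM document that may be run.
                "arn:aws:ssm:*:*:document/AWS-RunShellScript",
                # The EC2 instances it may be run against.
                "arn:aws:ec2:*:*:instance/*",
            ],
        }
    ],
}

print(json.dumps(send_command_policy, indent=2))
```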

ASG lifecycle actions

  • ASG lifecycle hooks - Instances can remain in wait state for a finite period of time. Default is one hour. You can adjust how long the timeout period lasts in the following ways:
    • Set the heartbeat timeout for the lifecycle hook when you create the lifecycle hook. With the put-lifecycle-hook command, use the --heartbeat-timeout option.
    • Continue to the next state if you finish before the timeout period ends, using the complete-lifecycle-action (CONTINUE|ABANDON) command.
    • Postpone the end of the timeout period by recording a heartbeat, using the record-lifecycle-action-heartbeat command. This extends the timeout period by the timeout value specified when you created the lifecycle hook. For example, if the timeout value is one hour, and you call this command after 30 minutes, the instance remains in a wait state for an additional hour, or a total of 90 minutes.
    • The maximum amount of time that you can keep an instance in a wait state is 48 hours or 100 times the heartbeat timeout, whichever is smaller.
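
The timeout arithmetic above can be checked with a short sketch. The function names are hypothetical; the 48-hour cap and the factor of 100 come from the last bullet.

```python
MAX_WAIT_CAP_SECONDS = 48 * 3600  # 48-hour ceiling
MAX_HEARTBEAT_MULTIPLIER = 100    # or 100 times the heartbeat timeout


def max_wait_seconds(heartbeat_timeout: int) -> int:
    """Longest time an instance can stay in a wait state, in seconds."""
    return min(MAX_WAIT_CAP_SECONDS, MAX_HEARTBEAT_MULTIPLIER * heartbeat_timeout)


def total_wait_after_heartbeat(heartbeat_timeout: int, elapsed: int) -> int:
    """Total wait when one heartbeat is recorded `elapsed` seconds into the wait.

    Recording a heartbeat restarts the timeout, so the instance waits
    `elapsed` seconds plus one more full heartbeat timeout (subject to the cap).
    """
    return min(elapsed + heartbeat_timeout, max_wait_seconds(heartbeat_timeout))


print(max_wait_seconds(3600))                      # one-hour heartbeat timeout
print(total_wait_after_heartbeat(3600, 30 * 60))   # heartbeat recorded after 30 minutes
```

With a one-hour heartbeat timeout the cap is the 48-hour limit (172,800 seconds), and a heartbeat after 30 minutes gives a total of 5,400 seconds, the 90 minutes from the example above. With a five-minute timeout the 100x rule is the tighter bound (30,000 seconds).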

AWS Server Migration Service

  • AWS Server Migration Service (AWS SMS) automates the migration of your on-premises VMware vSphere, Microsoft Hyper-V/System Center Virtual Machine Manager (SCVMM), and Azure virtual machines to the AWS Cloud. AWS SMS incrementally replicates your server VMs as cloud-hosted Amazon Machine Images (AMIs) ready for deployment on Amazon EC2. Working with AMIs, you can easily test and update your cloud-based images before deploying them in production.
  • Use of AWS SMS is limited as follows:
    • 50 concurrent VM migrations per account, unless a customer requests a limit increase.
    • 90 days of service usage per VM (not per account), beginning with the initial replication of a VM. We terminate an ongoing replication after 90 days unless a customer requests a limit increase.
    • 50 concurrent application migrations per account, with a limit of 10 groups and 50 servers in each application.
  • AWS Server Migration Service supports the automated migration of multi-server application stacks from your on-premises data center to Amazon EC2. Where server migration is accomplished by replicating a single server as an Amazon Machine Image (AMI), application migration replicates all of the servers in an application as AMIs and generates an AWS CloudFormation template to launch them in a coordinated fashion. Applications can be further subdivided into groups that allow you to launch tiers of servers in a defined order. After your servers are organized into applications and launch groups, you can specify a replication frequency, provide configuration scripts, and configure a target VPC in which to launch them. When you launch an application, AWS SMS configures it based on the generated template.

CloudFormation update rollback failed

  • When a CloudFormation stack reaches UPDATE_ROLLBACK_FAILED, this means that the stack was attempting an UPDATE operation, the operation failed, and CloudFormation began a rollback. An issue occurred that stopped CloudFormation from returning to the previous “good” state during the rollback. As a result, the stack can’t update and can’t roll back, thus it assumes this half-way state. The API then stops any further actions on the stack other than ContinueUpdateRollback and DeleteStack.
  • When a stack is in the UPDATE_ROLLBACK_FAILED state, the ContinueUpdateRollback API operation simply retries the rollback. It provides no fix for the underlying issue. For example, if the rollback failed because of an account limit and you don’t address that limit, ContinueUpdateRollback will retry the rollback and fail once more. You still need to fix the underlying problem separately and, in a break from best practices, outside of AWS CloudFormation.
  • To address issues with CloudFormation stacks that have entered UPDATE_ROLLBACK_FAILED state you have three options:
    1. Delete the stack. If the deletion fails for any reason, you can then use the DeleteStack API operation with the RetainResources option listing resources that failed deletion.
    2. Make underlying account changes manually/outside the scope of the stack to re-synchronize the stack with the expectation and then perform ContinueUpdateRollback.
    3. If you cannot fix the issues with particular stack resources, use ContinueUpdateRollback along with the ResourcesToSkip option. CloudFormation will mark the skipped resources as UPDATE_COMPLETE and continue with the rest of the rollback. Because the actual state of a skipped resource may then differ from what the stack records, reconcile it before the next update.
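
The API options named above map to request parameters roughly as follows. This is a sketch that only builds the keyword arguments; the helper names, the stack name, and the logical resource IDs are hypothetical.

```python
def continue_rollback_params(stack_name, resources_to_skip=None):
    """Build keyword arguments for the ContinueUpdateRollback API call."""
    params = {"StackName": stack_name}
    if resources_to_skip:
        # Logical IDs of resources that CloudFormation should skip.
        params["ResourcesToSkip"] = list(resources_to_skip)
    return params


def delete_stack_params(stack_name, retain_resources=None):
    """Build keyword arguments for the DeleteStack API call."""
    params = {"StackName": stack_name}
    if retain_resources:
        # Logical IDs of resources to leave in place when deletion failed.
        params["RetainResources"] = list(retain_resources)
    return params


# "my-stack" and "FailedQueue" are hypothetical names.
print(continue_rollback_params("my-stack", ["FailedQueue"]))
print(delete_stack_params("my-stack", ["FailedQueue"]))
```

With boto3, these dictionaries could be passed to `continue_update_rollback(**params)` and `delete_stack(**params)` on a CloudFormation client.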

Multi account structure

  • AWS Control Tower helps set up and govern a new, secure, multi-account AWS environment.
  • AWS Organizations with Service Control Policies provide guardrails in account administration.

CloudFormation

  • CloudFormation intrinsic functions - You can use intrinsic functions only in specific parts of a template. Currently, you can use intrinsic functions in resource attributes, outputs, metadata attributes, and update policy attributes. You can also use intrinsic functions to conditionally create stack resources.
    • Fn::GetAtt returns the value of an attribute from a resource in the template.
    • Ref returns the value of the specified parameter or resource.
    • Fn::FindInMap returns the value corresponding to keys in a two-level map that's declared in the Mappings section.
    • Fn::GetAZs returns an array that lists Availability Zones for a specified region in alphabetical order.
    • Fn::ImportValue returns the value of an output exported by another stack.
    • Outputs.Export.Name specifies a name for an Output that can be imported in another stack.
  • AWS::StepFunctions::StateMachine - Step Functions
  • AWS::CloudFormation::Stack - CloudFormation Nested Stacks
  • Update behaviors of stack resources
    • Update with no interruption
    • Update with some interruption
    • Replacement
  • You can use CloudFormation StackSet to deploy a stack in different AWS accounts under one OU.
  • validate-template: CLI command to validate CF template
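
The intrinsic functions listed above can be seen together in a small template. The sketch below expresses it as a Python dictionary and serializes it to JSON; the parameter, mapping, AMI ID, and export names are placeholders, and the mapping covers only one Region for brevity.

```python
import json

# Hypothetical template: mapping values, AMI ID, and export names are placeholders.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {"InstanceTypeParam": {"Type": "String", "Default": "t3.micro"}},
    "Mappings": {"RegionMap": {"us-east-1": {"AMI": "ami-00000000000000000"}}},
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                # Ref returns the parameter value.
                "InstanceType": {"Ref": "InstanceTypeParam"},
                # Fn::FindInMap looks up a value in the two-level RegionMap.
                "ImageId": {"Fn::FindInMap": ["RegionMap", {"Ref": "AWS::Region"}, "AMI"]},
                # Fn::ImportValue reads an output exported by another stack.
                "SubnetId": {"Fn::ImportValue": "shared-network-PublicSubnetId"},
            },
        }
    },
    "Outputs": {
        "WebServerPublicDns": {
            # Fn::GetAtt returns an attribute of a resource in this template.
            "Value": {"Fn::GetAtt": ["WebServer", "PublicDnsName"]},
            # Export.Name makes the output importable from other stacks.
            "Export": {"Name": "web-server-PublicDns"},
        }
    },
}

template_json = json.dumps(template, indent=2)
```

Written to a file, the JSON could be checked with the validate-template command mentioned above before it is deployed.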

AWS Config

  • Enable automatic logging of WAF web ACLs by using AWS Config. https://aws.amazon.com/blogs/security/enable-automatic-logging-of-web-acls-by-using-aws-config/
  • Configuration of AWS resource changed => Polled and identified by AWS Config => Custom config rule, event is normalized, persisted to S3 and notified to SNS => Lambda => Compliant/Non-compliant => list of non-compliant resources. CW schedule event => Lambda => Query config rules for list of non-compliant resources => SNS notification.
  • SNS notifications produced by AWS Config
    • ComplianceChangeNotification
    • ConfigurationItemChangeNotification
    • ConfigRulesEvaluationStarted
    • ConfigurationSnapshotDeliveryCompleted: deliver-config-snapshot
    • ConfigurationHistoryDeliveryCompleted: every six hours, if there are changes. One per resource type

AWS License Manager

  • AWS License Manager streamlines the process of onboarding software licenses to AWS. License Manager supports tracking any software that is licensed based on virtual cores (vCPUs), physical cores, sockets, or number of machines. This includes a variety of software products from Microsoft, IBM, SAP, Oracle, and other vendors. With AWS License Manager, you can centrally track licenses and enforce limits across multiple Regions, by maintaining a count of all the checked out entitlements. License Manager also tracks the end-user identity and the underlying resource identifier, if available, associated with each check out, along with the check-out time. This time-series data can be tracked to the ISV through CloudWatch metrics and events. ISVs can use this data for analytics, auditing, and other similar purposes.
  • License Manager is integrated with Amazon RDS, allowing you to monitor your Oracle license usage on Amazon RDS. Using License Manager along with AWS Systems Manager, you can manage licenses on physical or virtual servers hosted outside of AWS. You can use License Manager with AWS Organizations to manage all of your organizational accounts centrally.
  • CW Rule - Event Source: Service, Event Type: AWS API Call via CloudTrail

GuardDuty

  • Amazon GuardDuty is a threat detection service that continuously monitors your AWS accounts and workloads for malicious activity and delivers detailed security findings for visibility and remediation. It analyzes the following data sources:
    • CloudTrail Log events
    • CloudTrail Management events
    • CloudTrail S3 data events
    • VPC Flow Logs
    • DNS Logs
  • Finding types
    • EC2 finding types
    • IAM finding types
    • S3 finding types
  • GuardDuty sends all of the findings it generates to Security Hub. The findings are sent to Security Hub using the AWS Security Finding Format (ASFF). Findings are sent to Security Hub usually within 5 minutes.

CloudFormation update policy

  • Use the UpdatePolicy attribute to specify how AWS CloudFormation handles updates to the AWS::AppStream::Fleet, AWS::AutoScaling::AutoScalingGroup, AWS::ElastiCache::ReplicationGroup, AWS::OpenSearchService::Domain, AWS::Elasticsearch::Domain, or AWS::Lambda::Alias resources.
  • UpdatePolicy.PauseTime - The amount of time that CloudFormation pauses after making a change to a batch of instances to give those instances time to start software applications. For example, you might need to specify PauseTime when scaling up the number of instances in an Auto Scaling group.
  • With StackSets, it is possible to override parameter values on existing stack instances or on stack instances being created.
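
To show where UpdatePolicy and PauseTime sit in a template, here is a fragment of an Auto Scaling group resource written as a Python dictionary. It is deliberately incomplete (no launch template or subnets), and all values are placeholders.

```python
# Hypothetical, incomplete Auto Scaling group resource fragment; values are placeholders.
asg_resource = {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {"MinSize": "2", "MaxSize": "4"},
    # UpdatePolicy is a resource attribute, a sibling of Properties.
    "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
            "MinInstancesInService": 1,  # keep at least one instance serving
            "MaxBatchSize": 1,           # replace one instance per batch
            "PauseTime": "PT5M",         # ISO 8601 duration: pause 5 minutes per batch
        }
    },
}
```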

Service Catalog

  • AWS Service Catalog allows administrators to publish products and grant IAM users privileges to launch the products without granting those users the ability to launch the underlying services. By contrast, to launch a CloudFormation stack directly, the user needs privileges to launch all of the underlying infrastructure in the stack.

Trusted Advisor

  • Trusted Advisor check reference - https://docs.aws.amazon.com/awssupport/latest/user/trusted-advisor-check-reference.html
  • Performance
    • Provisioned IOPS (SSD) volumes in the Amazon Elastic Block Store (Amazon EBS) are designed to deliver the expected performance only when they are attached to an EBS-optimized instance.
    • To optimize throughput performance, you should ensure that the maximum throughput of an Amazon EC2 instance is greater than the aggregate maximum throughput of the attached EBS volumes. This check computes the total EBS volume throughput for each five-minute period in the preceding day (based on Coordinated Universal Time (UTC)) for each EBS-optimized instance and alerts you if usage in more than half of those periods was greater than 95% of the maximum throughput of the EC2 instance.
    • Checks for cases where data transfer from Amazon Simple Storage Service (Amazon S3) buckets could be accelerated by using Amazon CloudFront, the AWS global content delivery service. When you configure CloudFront to deliver your content, requests for your content are automatically routed to the nearest edge location where content is cached. This routing allows content to be delivered to your users with the best possible performance. A high ratio of data transferred out compared to the data stored in the bucket indicates that you could benefit from using Amazon CloudFront to deliver the data.
    • Checks the HTTP request headers that CloudFront currently receives from the client and forwards to your origin server. Some headers, such as date, or user-agent, significantly reduce the cache hit ratio (the proportion of requests that are served from a CloudFront edge cache). This increases the load on your origin and reduces performance, because CloudFront must forward more requests to your origin.
  • Trusted Advisor provides real-time guidance on
    • Cost Optimization
    • Performance
    • Security
    • Fault Tolerance
    • Service Limits
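
The alert rule of the EBS throughput check described under Performance can be sketched as a small function. This is an interpretation of the rule as worded above, not Trusted Advisor's actual implementation; the function and parameter names are hypothetical.

```python
def ebs_throughput_alert(period_throughputs, instance_max_throughput, usage_ratio=0.95):
    """Return True if more than half of the periods exceeded
    usage_ratio (95% by default) of the instance's maximum throughput."""
    if not period_throughputs:
        return False
    threshold = usage_ratio * instance_max_throughput
    busy_periods = sum(1 for t in period_throughputs if t > threshold)
    return busy_periods > len(period_throughputs) / 2


# One day contains 288 five-minute periods.
busy_day = [100] * 145 + [10] * 143
quiet_day = [100] * 144 + [10] * 144
print(ebs_throughput_alert(busy_day, 100), ebs_throughput_alert(quiet_day, 100))
```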

S3

  • S3 bucket configuration
    • CORS
    • Event notification
    • Lifecycle rules
    • Location (partition, region)
    • Access logging
    • Object lock
    • Object versioning
    • Policy and ACL
    • Replication
    • RequestPayment
    • Tagging
  • S3 Object Tagging and ACL
    • s3:ExistingObjectTag/<tag-key> – Use this condition key to verify that an existing object tag has the specific tag key and value.
    • s3:RequestObjectTagKeys – Use this condition key to restrict the tag keys that you want to allow on objects. This is useful when adding tags to objects using the PutObjectTagging, PutObject, and POST Object requests.
    • s3:RequestObjectTag/<tag-key> – Use this condition key to restrict the tag keys and values that you want to allow on objects. This is useful when adding tags to objects using the PutObjectTagging, PutObject, and POST Object requests.
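
A sketch of how these condition keys appear in a bucket policy follows. The bucket name, role ARN, tag keys, and tag values are hypothetical.

```python
import json

# Hypothetical bucket name, role ARN, tag keys, and tag values.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow reads only for objects that already carry classification=public.
            "Sid": "ReadObjectsTaggedPublic",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
            "Condition": {
                "StringEquals": {"s3:ExistingObjectTag/classification": "public"}
            },
        },
        {
            # Allow tagging only when every requested tag key is in the allowed set.
            "Sid": "RestrictAllowedTagKeys",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/Tagger"},
            "Action": "s3:PutObjectTagging",
            "Resource": "arn:aws:s3:::example-bucket/*",
            "Condition": {
                "ForAllValues:StringLike": {
                    "s3:RequestObjectTagKeys": ["Owner", "CreationDate"]
                }
            },
        },
    ],
}

print(json.dumps(bucket_policy, indent=2))
```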

CloudWatch events

Service | Event Type
--- | ---
GuardDuty | GuardDuty Finding
Health | AWS Health Event (issues or notifications) for different services
KMS | KMS CMK Rotation, Deletion, Imported Key Material Expiration
Trusted Advisor | Check Item Refresh Notification (OK, INFO, WARN, ERROR)
CodeCommit | Repo State Change, Comment on Commit/PR, PR state change, Approval Rule Template Change
CodeDeploy | State Change: CodeDeploy Deployment State-change notification
CodeDeploy | State Change: CodeDeploy Instance State-change notification
CodePipeline | CodePipeline Pipeline Execution State Change
CodePipeline | CodePipeline Stage Execution State Change
CodePipeline | CodePipeline Action Execution State Change
Systems Manager | Run Command: EC2 Command Status-change Notification
Systems Manager | Run Command: EC2 Command Invocation Status-change Notification
Systems Manager | Maintenance Window
Systems Manager | Automation: EC2 Automation Step Status-change Notification
Systems Manager | Automation: EC2 Automation Execution Status-change Notification
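
A rule that reacts to one of these event types is defined by an event pattern. The sketch below shows a pattern for failed CodePipeline executions, together with a deliberately simplified matcher so the behaviour can be tried locally. The matcher is an illustration only; real event patterns support more operators, and the sample event and pipeline name are hypothetical.

```python
# Event pattern for failed pipeline executions; filtering on detail.state is an illustrative choice.
pattern = {
    "source": ["aws.codepipeline"],
    "detail-type": ["CodePipeline Pipeline Execution State Change"],
    "detail": {"state": ["FAILED"]},
}


def matches(pattern, event):
    """Simplified matcher: every pattern field must exist in the event and the
    event value must equal one of the listed values. Nested dictionaries are
    matched recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical, trimmed-down sample event.
sample_event = {
    "source": "aws.codepipeline",
    "detail-type": "CodePipeline Pipeline Execution State Change",
    "detail": {"pipeline": "my-pipeline", "state": "FAILED"},
}
print(matches(pattern, sample_event))
```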

OpsWorks Stacks

  • OpsWorks Stacks: Layer
    • Each layer can have a set of recipes assigned to each lifecycle event: Setup, Configure, Deploy, Undeploy, and Shutdown.
  • OpsWorks Stacks: Instance types
    • 24/7 instances are started manually and run until you stop them.
    • Time-based instances are run by AWS OpsWorks Stacks on a specified daily and weekly schedule. They allow your stack to automatically adjust the number of instances to accommodate predictable usage patterns.
    • Load-based instances are automatically started and stopped by AWS OpsWorks Stacks, based on specified load metrics, such as CPU utilization.
  • OpsWorks Stacks: Use externally hosted cookbooks in Chef Supermarket
    • In your cookbook, declare a dependency on another cookbook
    • Install and initialize Berkshelf, use Berkshelf to download and package dependencies

OpsWorks for Chef Automate

  • The "pivotal" user is considered the superuser in Chef and has full permissions. It is not a member of any organization, including the default organization that is used in AWS OpsWorks for Chef Automate. Add the "pivotal" user to the default organization by running knife opc.