02 | Simple Storage Service S3 - devian-al/AWS-Solutions-Architect-Prep GitHub Wiki
Simple Storage Service
S3 provides developers and IT teams with secure, durable, and highly-scalable object storage. Object storage, as opposed to block storage, is a general term that refers to data composed of three things:
- the data that you want to store
- an expandable amount of metadata
- a unique identifier so that the data can be retrieved
This makes it a perfect candidate to host files or directories and a poor candidate to host databases or operating systems, which are better suited to block storage.
Data uploaded into S3 is spread across multiple devices and facilities. Individual files uploaded into S3 have an upper bound of 5TB per file, and the number of files that can be uploaded is virtually limitless. S3 buckets, which contain all files, are named in a universal namespace, so bucket names must be globally unique. All successful uploads return an HTTP 200 response.
S3 Key Details
Objects
- Objects (regular files or directories) are stored in S3 with a key, value, version ID, and metadata. They can also contain torrents and sub-resources for access control lists, which are basically permissions for the object itself.
- The data consistency model for S3 ensures immediate read access for new objects after the initial PUT request. These new objects are introduced into AWS for the first time and thus do not need to be updated anywhere, so they are available immediately.
- Since December 2020, the data consistency model for S3 also ensures immediate read access for PUTs and DELETEs of already existing objects.
- Amazon guarantees 99.999999999% (or 11 9s) durability for all S3 storage classes except its Reduced Redundancy Storage class.
- S3 comes with the following main features:
- tiered storage and pricing variability
- lifecycle management to expire older content
- versioning for version control
- encryption for privacy
- MFA deletes to prevent accidental or malicious removal of content
- access control lists & bucket policies to secure the data
S3 charges by:
- storage size
- number of requests
- storage management pricing (known as tiers)
- data transfer pricing (objects leaving/entering AWS via the internet)
- transfer acceleration (an optional speed increase for moving objects via CloudFront)
- cross region replication (more HA than offered by default)
Bucket Policies
- Bucket policies secure data at the bucket level while access control lists secure data at the more granular object level.
- By default, all newly created buckets are private.
- S3 can be configured to create access logs which can be shipped into another bucket in the current account or even a separate account all together. This makes it easy to monitor who accesses what inside S3.
- There are 3 different ways to share S3 buckets across AWS accounts:
  - For programmatic access only, use IAM & Bucket Policies to share entire buckets
  - For programmatic access only, use ACLs & Bucket Policies to share objects
  - For access via the console & the terminal, use cross-account IAM roles
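As an illustration of the bucket-policy option above, a minimal cross-account policy can be sketched as a JSON document. The account ID `111122223333` and bucket name `example-bucket` are placeholders, not values from this wiki:

```python
import json

# Sketch of a bucket policy granting another AWS account read-only access.
# Account ID and bucket name below are illustrative placeholders.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CrossAccountRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-bucket",      # bucket-level actions (ListBucket)
                "arn:aws:s3:::example-bucket/*",    # object-level actions (GetObject)
            ],
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```

Note the distinction between the bucket ARN and the `/*` object ARN: bucket policies can scope permissions to either level, which is what makes them a bucket-wide control compared to per-object ACLs.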
- S3 is a great candidate for static website hosting.
- When you enable static website hosting for S3, you need both an index.html file and an error.html file.
- Static website hosting creates a website endpoint that can be accessed via the internet.
- When you upload new files and have versioning enabled, they will not inherit the properties of the previous version.
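The website endpoint mentioned above follows a predictable naming pattern. A small sketch, with the caveat that some Regions use a dot (`s3-website.<region>`) instead of the dash form shown here:

```python
def website_endpoint(bucket: str, region: str) -> str:
    """Build the S3 static-website endpoint for a bucket.

    Most Regions use the s3-website-<region> (dash) form; some newer Regions
    use s3-website.<region> (dot). Only the dash form is sketched here.
    """
    return f"http://{bucket}.s3-website-{region}.amazonaws.com"

print(website_endpoint("my-site", "us-east-1"))
# http://my-site.s3-website-us-east-1.amazonaws.com
```

This endpoint is distinct from the bucket's REST API endpoint, which is why static hosting has to be explicitly enabled.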
S3 Storage Classes:
S3 Standard
- 99.99% availability and 11 9s durability. Data in this class is stored redundantly across multiple devices in multiple facilities and is designed to withstand the failure of 2 concurrent data centers.
S3 Infrequently Accessed (IA)
- For data that is needed less often, but when it is needed the data should be available quickly. The storage fee is cheaper, but you are charged for retrieval.
S3 One Zone Infrequently Accessed (an improvement of the legacy RRS / Reduced Redundancy Storage)
- For when you want the lower costs of IA, but do not require high availability. This is even cheaper because of the lack of HA.
S3 Intelligent Tiering
- Uses built-in ML/AI to determine the most cost-effective storage class and then automatically moves your data to the appropriate tier. It does this without operational overhead or performance impact.
S3 Glacier
- low-cost storage class for data archiving. This class is for pure storage purposes where retrieval isn’t needed often at all. Retrieval times range from minutes to hours. There are differing retrieval methods depending on how acceptable the default retrieval times are for you:
- Expedited: 1 - 5 minutes, but this option is the most expensive.
- Standard: 3 - 5 hours to restore.
- Bulk: 5 - 12 hours. This option has the lowest cost and is good for a large set of data.
The Expedited duration listed above could possibly be longer during rare situations of unusually high demand across all of AWS. If it is absolutely critical to have quick access to your Glacier data under all circumstances, you must purchase Provisioned Capacity. Provisioned Capacity guarantees that Expedited retrievals always work within the time constraints of 1 to 5 minutes.
S3 Glacier Deep Archive - The lowest-cost S3 storage class, where retrieval can take 12 hours.
S3 Encryption
S3 data can be encrypted both in transit and at rest.
- Encryption In Transit: when the traffic passing between one endpoint and another is indecipherable. Anyone eavesdropping between server A and server B won't be able to make sense of the information passing by. Encryption in transit for S3 is achieved by SSL/TLS.
- Encryption At Rest: when the immobile data sitting inside S3 is encrypted. If someone breaks into a server, they still won't be able to access encrypted info within that server. Encryption at rest can be done either on the server side or the client side. Server-side is when S3 encrypts your data as it is being written to disk and decrypts it when you access it. Client-side is when you personally encrypt the object on your own and then upload it into S3 afterwards.
You can encrypt on the AWS-supported server side in the following ways:
- S3 Managed Keys / SSE-S3 (server-side encryption with S3-managed keys): when Amazon manages the encryption and decryption keys for you automatically. In this scenario, you concede a little control to Amazon in exchange for ease of use.
- AWS Key Management Service / SSE-KMS: when Amazon and you both manage the encryption and decryption keys together.
- Server-Side Encryption w/ customer-provided keys / SSE-C: when you give Amazon your own keys that you manage. In this scenario, you concede ease of use in exchange for more control.
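The three server-side modes are selected per request via `x-amz-server-side-encryption*` headers on the PUT. A sketch of the header sets (the KMS key ARN is a placeholder, and SSE-C additionally requires the base64-encoded key and its MD5 digest, elided here):

```python
# SSE-S3: S3 manages the keys; you only declare the algorithm.
sse_s3 = {"x-amz-server-side-encryption": "AES256"}

# SSE-KMS: you name a KMS key (ARN below is a made-up placeholder).
sse_kms = {
    "x-amz-server-side-encryption": "aws:kms",
    "x-amz-server-side-encryption-aws-kms-key-id":
        "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
}

# SSE-C: you ship your own key with every request.
sse_c = {
    "x-amz-server-side-encryption-customer-algorithm": "AES256",
    # "x-amz-server-side-encryption-customer-key": "<base64-encoded key>",
    # "x-amz-server-side-encryption-customer-key-MD5": "<base64 MD5 of key>",
}

print(sse_s3, sse_kms["x-amz-server-side-encryption"], sse_c)
```

The trade-off surfaces directly in the headers: SSE-S3 is one static header, while SSE-C forces you to transmit and track key material yourself on every call.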
S3 Versioning
- When versioning is enabled, S3 stores all versions of an object including all writes and even deletes.
- It is a great feature for implicitly backing up content and for easy rollbacks in case of human error.
- It can be thought of as analogous to Git.
- Once versioning is enabled on a bucket, it cannot be disabled - only suspended.
- Versioning integrates w/ lifecycle rules so you can set rules to expire or migrate data based on their version.
- Versioning also has MFA delete capability to provide an additional layer of security.
S3 Lifecycle Management
- Automates the moving of objects between the different storage tiers.
- Can be used in conjunction with versioning.
- Lifecycle rules can be applied to both current and previous versions of an object.
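A lifecycle configuration combining the points above (tier transitions plus a rule on noncurrent versions) can be sketched as a JSON document; the rule ID, prefix, and day counts are illustrative, not from this wiki:

```python
import json

# Sketch of a lifecycle configuration: transition current versions of objects
# under logs/ to Standard-IA after 30 days and Glacier after 90 days, and
# expire noncurrent (old) versions after 365 days. All names/days are made up.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Works together with versioning: clean up superseded versions.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
        }
    ]
}

print(json.dumps(lifecycle, indent=2))
```

The `NoncurrentVersionExpiration` key is what ties lifecycle management to versioning: current and previous versions get separate rules.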
S3 Cross Region Replication
- Cross region replication only works if versioning is enabled.
- When cross region replication is enabled, no pre-existing data is transferred. Only new uploads into the original bucket are replicated. All subsequent updates are replicated.
- When you replicate the contents of one bucket to another, you can actually change the ownership of the content if you want. You can also change the storage tier of the new bucket with the replicated content.
- When files are deleted in the original bucket (via a delete marker as versioning prevents true deletions), those deletes are not replicated.
- See the Cross Region Replication overview for what is and isn't replicated, such as encrypted objects, deletes, items in Glacier, etc.
S3 Transfer Acceleration
- Transfer acceleration makes use of the CloudFront network by sending or receiving data at CDN points of presence (called edge locations) rather than slower uploads or downloads at the origin.
- Transfer Acceleration cannot be disabled, and can only be suspended.
- This is accomplished by uploading to a distinct URL for the edge location instead of the bucket itself. This is then transferred over the AWS network backbone at a much faster speed.
- You can test transfer acceleration speed directly in comparison to regular uploads.
S3 Event Notifications
The Amazon S3 notification feature enables you to receive and send notifications when certain events happen in your bucket. To enable notifications, you must first configure the events you want Amazon S3 to publish (new object added, old object deleted, etc.) and the destinations where you want Amazon S3 to send the event notifications. Amazon S3 supports the following destinations where it can publish events:
Amazon SNS
Amazon SQS
AWS Lambda
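A notification configuration targeting one of those destinations (Lambda, in this sketch) is itself a JSON document attached to the bucket. The function ARN and suffix filter are placeholders:

```python
import json

# Sketch of an S3 event notification configuration: invoke a Lambda function
# whenever an object ending in .log is created. The ARN is a placeholder.
notification = {
    "LambdaFunctionConfigurations": [
        {
            "Id": "process-new-objects",
            "LambdaFunctionArn":
                "arn:aws:lambda:us-east-1:111122223333:function:process-logs",
            "Events": ["s3:ObjectCreated:*"],   # fire on any create (PUT, POST, copy, multipart)
            "Filter": {
                "Key": {"FilterRules": [{"Name": "suffix", "Value": ".log"}]}
            },
        }
    ]
}

print(json.dumps(notification, indent=2))
```

SNS and SQS destinations follow the same shape under `TopicConfigurations` and `QueueConfigurations` respectively.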
S3 and Elasticsearch
- If you are using S3 to store log files, Elasticsearch provides full search capabilities for logs and can be used to search through data stored in an S3 bucket.
- You can integrate your Elasticsearch domain with S3 and Lambda.
In this setup, any new logs received by S3 will trigger an event notification to Lambda, which in turn will then run your application code on the new log data. After your code finishes processing, the data will be streamed into your Elasticsearch domain and be available for observation.
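The Lambda side of this setup can be sketched as a minimal handler. The event shape below is the standard S3 notification format (heavily abbreviated); the streaming-to-Elasticsearch step is left as a comment since it depends on your domain:

```python
def handler(event, context):
    """Minimal sketch of a Lambda handler for S3 event notifications.

    Extracts the bucket and key of each new object. A real handler would then
    fetch the log object and stream the processed data into the Elasticsearch
    domain; that part is omitted here.
    """
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        processed.append((bucket, key))
    return processed

# Abbreviated shape of the event S3 delivers to Lambda:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "log-bucket"},
                "object": {"key": "app/2020-12-01.log"}}}
    ]
}
print(handler(sample_event, None))  # [('log-bucket', 'app/2020-12-01.log')]
```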
Maximizing S3 Read/Write Performance:
- If the request rate for reading and writing objects to S3 is extremely high, you can use sequential date-based naming for your prefixes to improve performance.
- Earlier versions of the AWS Docs instead suggested using hash keys or random strings to prefix the object's name. In such cases, the partitions used to store the objects will be better distributed and therefore allow better read/write performance on your objects.
- If your S3 data is receiving a high number of GET requests from users, you should consider using Amazon CloudFront for performance optimization.
- By integrating CloudFront with S3, you can distribute content via CloudFront's cache to your users for lower latency and a higher data transfer rate. This also has the added bonus of sending fewer direct requests to S3, which reduces costs.
- For example, suppose that you have a few objects that are very popular. CloudFront fetches those objects from S3 and caches them. CloudFront can then serve future requests for the objects from its cache, reducing the total number of GET requests it sends to Amazon S3.
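The older hash-key guidance mentioned above can be sketched in a few lines; the prefix length of 4 is an arbitrary choice for illustration:

```python
import hashlib

def hashed_key(key: str, prefix_len: int = 4) -> str:
    """Prefix an object key with a short hash so keys spread across partitions.

    This mirrors the older AWS prefix-randomization guidance; since S3's
    per-prefix request rates now scale automatically, this is rarely needed.
    """
    digest = hashlib.md5(key.encode()).hexdigest()[:prefix_len]
    return f"{digest}/{key}"

print(hashed_key("2020-12-01/server1.log"))
```

Sequentially named keys (like date prefixes) used to land on the same partition; a leading hash scatters them, which was the whole point of the old advice.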
S3 Server Access Logging
- Server access logging provides detailed records for the requests that are made to a bucket.
- Server access logs are useful for many applications.
For example, access log information can be useful in security and access audits. It can also help you learn about your customer base and better understand your Amazon S3 bill.
- By default, logging is disabled.
- When logging is enabled, logs are saved to a bucket in the same AWS Region as the source bucket.
- Each access log record provides details about a single access request, such as the requester, bucket name, request time, request action, response status, and an error code, if relevant.
- It works in the following way:
- S3 periodically collects access log records of the bucket you want to monitor
- S3 then consolidates those records into log files
- S3 finally uploads the log files to your secondary monitoring bucket as log objects
S3 Multipart Upload
- Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order.
- Multipart uploads are recommended for files over 100 MB and are the only way to upload files over 5 GB. Multipart upload achieves this by uploading your data in parallel to boost efficiency.
- If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object.
- Possible reasons for why you would want to use Multipart upload:
- Multipart upload delivers the ability to begin an upload before you know the final object size.
- Multipart upload delivers improved throughput.
- Multipart upload delivers the ability to pause and resume object uploads.
- Multipart upload delivers quick recovery from network issues.
- You can use an AWS SDK to upload an object in parts. Alternatively, you can perform the same action via the AWS CLI.
- You can also parallelize downloads from S3 using byte-range fetches. If there's a failure during the download, the failure is localized just to the specific byte range and not the whole object.
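The part-splitting described above is just byte arithmetic, and the same ranges double as `Range` headers for byte-range fetches on download. A sketch (the 100 MB default is illustrative; real S3 parts must be between 5 MiB and 5 GiB, except the last, with at most 10,000 parts):

```python
def part_ranges(object_size: int, part_size: int = 100 * 1024 * 1024):
    """Split an object into inclusive (start, end) byte ranges.

    Used for multipart upload part boundaries, or as Range headers for
    parallel byte-range GETs. Sizes here are illustrative defaults.
    """
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1  # inclusive end byte
        ranges.append((start, end))
        start = end + 1
    return ranges

# A 250 MB object with 100 MB parts yields three parts (100 + 100 + 50 MB):
print(len(part_ranges(250 * 1024 * 1024)))  # 3
```

Because each range is independent, a failed part (or failed byte-range fetch) can be retried alone, which is exactly the recovery property listed above.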
S3 Pre-signed URLs
- All S3 objects are private by default; however, the owner of a private bucket with private objects can optionally share those objects without having to change the permissions of the bucket to be public.
- This is done by creating a pre-signed URL. Using your own security credentials, you can grant time-limited permission to download or view your private S3 objects.
- When you create a pre-signed URL for your S3 object, you must do the following:
- Provide your security credentials.
- Specify a bucket.
- Specify an object key.
- Specify the HTTP method (GET to download the object).
- Specify the expiration date and time.
The pre-signed URLs are valid only for the specified duration and anyone who receives the pre-signed URL within that duration can then access the object.
- The following diagram highlights how Pre-signed URLs work:
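Conceptually, the flow is: take the request details and an expiry, sign them with your secret credentials, and append the signature to the URL. The sketch below is a simplified illustration only; real S3 pre-signed URLs use the full Signature Version 4 process and are normally generated by an SDK (e.g. boto3's `generate_presigned_url`):

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

def presign(bucket: str, key: str, secret_key: str, expires_in: int = 3600) -> str:
    """Simplified, conceptual pre-signing sketch (NOT real AWS SigV4).

    Illustrates the idea only: an expiry timestamp plus an HMAC over the
    request details, appended to the object URL as query parameters.
    """
    expires = int(time.time()) + expires_in
    string_to_sign = f"GET\n/{bucket}/{key}\n{expires}"
    signature = hmac.new(
        secret_key.encode(), string_to_sign.encode(), hashlib.sha256
    ).hexdigest()
    query = urlencode({"Expires": expires, "Signature": signature})
    return f"https://{bucket}.s3.amazonaws.com/{key}?{query}"

print(presign("example-bucket", "report.pdf", "SECRET"))
```

Because the signature covers the expiry, tampering with either the timestamp or the key invalidates the URL, which is what makes the time limit enforceable.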
S3 Select
- S3 Select is an Amazon S3 feature designed to pull out only the data you need from an object, which can dramatically improve the performance and reduce the cost of applications that need to access data in S3.
- Most applications have to retrieve the entire object and then filter out only the required data for further analysis. S3 Select enables applications to offload the heavy lifting of filtering and accessing data inside objects to the Amazon S3 service.
- As an example, let's imagine you're a developer at a large retailer and you need to analyze the weekly sales data from a single store, but the data for all 200 stores is saved in a new GZIP-ed CSV every day.
  - Without S3 Select, you would need to download, decompress and process the entire CSV to get the data you needed.
  - With S3 Select, you can use a simple SQL expression to return only the data from the store you're interested in, instead of retrieving the entire object.
- By reducing the volume of data that has to be loaded and processed by your applications, S3 Select can improve the performance of most applications that frequently access data from S3 by up to 400% because you're dealing with significantly less data.
- You can also use S3 Select for Glacier.
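To make the retailer example concrete, here is the kind of SQL expression S3 Select accepts, alongside a local simulation of the server-side filtering it performs. The store ID, column names, and sample rows are all made up for illustration:

```python
import csv
import io

# The kind of expression you would pass to S3 Select (e.g. via an SDK call);
# "s3object" is the standard table alias, store 42 and the columns are made up.
expression = "SELECT * FROM s3object s WHERE s.store_id = '42'"

# Locally simulating what S3 Select does server-side: filter rows inside the
# service so only matching data crosses the wire.
sample_csv = "store_id,week,sales\n42,1,1000\n17,1,800\n42,2,1200\n"
rows = [r for r in csv.DictReader(io.StringIO(sample_csv))
        if r["store_id"] == "42"]
print(rows)  # only the two rows for store 42
```

The saving is proportional to selectivity: here two of three rows match, but in the 200-store scenario above roughly 0.5% of the object would be returned.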
S3 Access Points
- Named network endpoints attached to buckets that can be used to perform operations such as `GetObject` and `PutObject`
- Simplifies data access for any AWS service or customer application that stores data in S3
- Each AP has distinct permissions and network controls that S3 applies for any request made via the AP
- Each AP enforces customized access point policy that works in conjunction with the bucket policy attached to the bucket
- Can be configured to accept requests only from a VPC to restrict S3 data access to a private network
- Custom Block Public Access settings can be configured
- An S3 Multi-Region Access Point can be used to provide a global endpoint that applications can use to fulfill requests from S3 buckets located in multiple AWS Regions
- S3 Multi-Region APs use AWS Global Accelerator
Networking
Virtual hosted-style access
- Amazon S3 routes any virtual hosted-style requests to the US East (N. Virginia) Region by default if you use the endpoint s3.amazonaws.com instead of the region-specific endpoint.
- Format:
- http://bucket.s3.amazonaws.com
- http://bucket.s3-aws-region.amazonaws.com
Path-style access
- With a path-style URL, the endpoint you use must match the Region in which the bucket resides.
- Format:
- US East (N. Virginia) Region endpoint, http://s3.amazonaws.com/bucket
- Region-specific endpoint, http://s3-aws-region.amazonaws.com/bucket
- Customize S3 URLs with CNAMEs – the bucket name must be the same as the CNAME.
- Amazon S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. It takes advantage of Amazon CloudFront’s globally distributed edge locations.
- Transfer Acceleration cannot be disabled, and can only be suspended.
- Transfer Acceleration URL is: bucket.s3-accelerate.amazonaws.com
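The two addressing styles above differ only in where the bucket name goes (hostname vs. path). A sketch of both formats, following the URL patterns listed in this section:

```python
def virtual_hosted_url(bucket: str, key: str, region: str = None) -> str:
    """Virtual hosted-style: the bucket name is part of the hostname."""
    host = (f"{bucket}.s3.amazonaws.com" if region is None
            else f"{bucket}.s3-{region}.amazonaws.com")
    return f"http://{host}/{key}"

def path_style_url(bucket: str, key: str, region: str = None) -> str:
    """Path-style: the bucket name is the first path segment."""
    host = ("s3.amazonaws.com" if region is None
            else f"s3-{region}.amazonaws.com")
    return f"http://{host}/{bucket}/{key}"

print(virtual_hosted_url("example-bucket", "a.txt", "eu-west-1"))
print(path_style_url("example-bucket", "a.txt"))
```

The hostname placement is why virtual hosted-style access works with CNAMEs and Transfer Acceleration (`bucket.s3-accelerate.amazonaws.com`): the bucket is resolvable from DNS alone.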
Security
- Policies contain the following:
  - Resources – buckets and objects
  - Actions – set of operations
  - Effect – can be either Allow or Deny. You need to explicitly grant Allow to a resource.
  - Principal – the account, service, or user who is allowed access to the actions and resources in the statement.
- Resource Based Policies
- Bucket Policies
- Provides centralized access control to buckets and objects based on a variety of conditions, including S3 operations, requesters, resources, and aspects of the request (e.g., IP address).
- Can either add or deny permissions across all (or a subset) of objects within a bucket.
- IAM users need additional permissions from root account to perform bucket operations.
- Bucket policies are limited to 20 KB in size.
- Access Control Lists
- A list of grants identifying grantee and permission granted.
- ACLs use an S3-specific XML schema.
- You can grant permissions only to other AWS accounts, not to users in your account.
- You cannot grant conditional permissions, nor explicitly deny permissions.
- Object ACLs are limited to 100 granted permissions per ACL.
- The only recommended use case for the bucket ACL is to grant write permissions to the S3 Log Delivery group
- User Policies
- AWS IAM (see AWS Security and Identity Services)
- IAM User Access Keys
- Temporary Security Credentials