AWS Storage


IOPS - I/O Operations per Second

Elastic Block Store

High performance network block storage.

EBS volumes are virtual hard drives (block-level storage volumes) that can be attached to EC2 instances. Data is stored in small blocks, thus, changes are applied only to the changed blocks, and no full re-upload is needed, as in object storage. Multiple EBS volumes can be attached to one instance. An EBS volume is an AZ-level resource, thus, the EC2 instance must be in the same AZ to attach it; data is automatically replicated within its AZ. EC2 launch configuration contains the Delete on Termination option, which removes the attached EBS volume - by default, set to true for the root volume, and false for additional volumes. Capacity is provisioned (pay for the chosen size).

Elastic Volume allows changing volume type and size without detaching.

An encrypted EBS volume implies encrypted data transfer between the volume and the instance. If the account has encryption by default enabled, or if the volume is created from an encrypted snapshot, the new volume is also encrypted (an unencrypted volume cannot be created).

As of early 2020, EBS Multi-Attach allows a volume to be connected to up to 16 EC2 instances at the same time. Requires a cluster-aware file system to handle concurrent write operations. Some restrictions apply:

  • cannot be a boot volume
  • EC2 instances must be in the same AZ
  • EC2 instances must be based on Nitro system instance types

Volume types

Only SSD options are available as root volumes.

| Disk type | Option | Family | Max IOPS/volume | Description |
|-----------|--------|--------|-----------------|-------------|
| SSD | general purpose | gp | 16,000 | balances price and performance; 2 generations are available: gp2 has IOPS and throughput linked to the size, while gp3 can scale each independently |
| SSD | provisioned IOPS | io | 64,000 | highest performance, great for apps that need more than 16k IOPS, database workloads and critical business apps; supports EBS Multi-Attach - attach the same volume to multiple instances (up to 16) in the same AZ; over 32k IOPS is available only on Nitro-based EC2; io2 is the latest generation, more durable and offers more IOPS per GB than io1 |
| HDD | throughput optimized | st | 500 | frequently accessed, throughput-intensive workloads; use cases are big data, data warehousing, etc. |
| HDD | cold | sc | 250 | lowest cost, less frequently accessed data |

SSD is generally good for transactional workloads involving frequent read/write operations. HDD is good for streaming workloads that require optimized throughput.

io2 Block Express is a SAN (Storage Area Network) offering. Offers highest performance (up to 256,000 IOPS) and sub-millisecond latency.

EBS volume types summary doc.


Snapshots

Volume backup.

AWS allows taking incremental snapshots of EBS volumes that serve as a backup in case the volume gets corrupted. Charged per GB per month. EBS Fast Snapshot Restore is available for an extra cost. A snapshot is a region-level resource - a volume can be created from a snapshot in any AZ within the region. A snapshot can also be copied over to any other region. The snapshot encryption setting is the same as the volume encryption setting.

To encrypt an unencrypted volume, create a snapshot, then encrypt the snapshot by copying it with encryption set to true, and finally create a volume from the encrypted snapshot. It is also possible to enable encryption while creating a volume from an unencrypted snapshot.
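
A minimal CLI sketch of this flow (volume/snapshot IDs and region are placeholders):

# Snapshot the unencrypted volume
$ aws ec2 create-snapshot --volume-id vol-0123456789abcdef0

# Copy the snapshot with encryption enabled
$ aws ec2 copy-snapshot --source-region eu-west-1 \
    --source-snapshot-id snap-0123456789abcdef0 --encrypted

# A volume created from an encrypted snapshot is encrypted automatically
$ aws ec2 create-volume --snapshot-id snap-0fedcba9876543210 \
    --availability-zone eu-west-1a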

EBS Snapshot Archive feature allows moving snapshot to archive tier, which is 75% cheaper. Restore takes between 24 and 72 hours.

FSR (Fast Snapshot Restore) forces full initialization of a snapshot, so that volumes created from it have no latency on first use; comes at a significant cost though.


Recycle bin

Recycle Bin can be set up to retain deleted snapshots instead of removing them immediately, allowing recovery from accidental deletion. Applies to all matching resources in the region; can also be configured for AMIs. Retention can be set from 1 day to 1 year.

Elastic File System

Fully managed POSIX NFS file system, designed for Linux workloads (regional resource).

How it works

Provides shared access to data - multiple instances can access (read/write) EFS at the same time. Automatically scales up or down. Provides configurable data lifecycle rules. Can be accessed through Direct Connect as well. Data is stored across multiple AZs within a region. Payment is based on the amount of data stored (elastic). Use cases are content management, web serving, WordPress.

A Security Group is attached to EFS to control incoming connections - enable ingress on the NFS port (2049).
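
For example, assuming the EFS mount target uses one security group and the EC2 instances another (IDs below are placeholders), the rule could be added with:

$ aws ec2 authorize-security-group-ingress --group-id sg-0efs000000000000a \
    --protocol tcp --port 2049 --source-group sg-0ec2000000000000b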

Storage classes:

  • standard - frequently accessed files
  • infrequent access - lower price to store, but with a retrieval fee

Lifecycle management can move data between classes based on N days of last access.

Performance options:

  • general purpose (default)
  • max I/O - scales for high level of throughput, parallel access; best use for BigData, media processing

Throughput options:

  • bursting (default) - manages spikes
  • provisioned - consistent high throughput, client requests throughput value regardless of storage size

Availability options:

  • regional - multi-AZ, great for prod
  • one zone - great for dev, backup is enabled by default

Access point

Application-specific entry point into EFS. Enables access management. Can enforce user identity, different file system root and subdirectories for clients. IAM policy can be used to allow application to use a specific access point.
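
A sketch of creating an access point that enforces a POSIX identity and a custom root directory (file system ID and values are placeholders):

$ aws efs create-access-point --file-system-id fs-0123456789abcdef0 \
    --posix-user Uid=1000,Gid=1000 \
    --root-directory Path=/app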


Manual attachment

  1. Install amazon-efs-utils package
  2. Create directory
  3. Mount using command given in AWS Console (by DNS)
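
The commands behind these steps look roughly as follows (file system ID and mount point are placeholders; assumes Amazon Linux):

$ sudo yum install -y amazon-efs-utils
$ sudo mkdir /mnt/efs
$ sudo mount -t efs -o tls fs-0123456789abcdef0:/ /mnt/efs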

FSx

For Windows - fully managed native Windows file system. Supports the SMB protocol, Windows NTFS, ACLs, user quotas, and integrates with Active Directory. Built on SSD. Can be accessed from on-prem, and can be configured as Multi-AZ. Data can also be backed up to S3.

For Lustre (Linux and cluster) - parallel distributed file system for large-scale computing. Fits well for machine learning, HPC, video processing, and so on. Also integrates with S3 - bucket can be read as a file system, while write requests can update bucket objects.

In both options capacity is provisioned beforehand.

File system deployment options:

  • Scratch - temporary storage, data is not replicated; offers high burst (6x faster); good for short-term processing
  • Persistent - long-term storage, data is replicated within same AZ; good for long-term processing, sensitive data

FSx gateway

Provides native access to FSx for Windows File Server with local caching of frequently accessed data. Deployed on the customer side (on-prem). Useful for group file sharing and home directories.

Simple Storage Service

General purpose object storage. Provides 99.95% - 99.99% availability, and 11 9's durability.

Use cases:

  • infrastructure hosting video, photo, music
  • data backup and storage for other services (EBS snapshots, AMI templates); EBS snapshots are saved to S3 by default
  • static web site hosting
  • app installers that clients can download

Bucket names must be unique across all of Amazon. They cannot contain uppercase letters or underscores, cannot be formatted as an IP address, must be 3-63 characters long, and must start with a lowercase letter or digit.

Each object has a unique key, which is a full path to the object. Object consists of data and metadata (custom metadata can be added when object is being stored). Max size of a single object is 5TB; files over 5GB must be uploaded via Multi-Part Upload (recommended for over 100MB).

By default all buckets are private. Available in a single region, chosen at the time of creation. S3 offers strong consistency for PUT (new or overwrite) and DELETE requests - subsequent read or list requests immediately receive the latest versions.

S3 Transfer Acceleration enables fast, easy and secure file transfers over long distances. Uses CloudFront - first, data arrives at an edge location, then takes an optimized network path to reach S3. Additional charges apply. Only the bucket owner can enable Transfer Acceleration on the bucket. The bucket name must be DNS compliant and must not contain periods. As soon as it is enabled, clients can use the accelerate endpoint - <bucket_name>.s3-accelerate.amazonaws.com.

S3 Byte-Range Fetches parallelize GET requests by requesting specific byte ranges. If a specific range fails, only that range needs to be requested again. Speeds up downloads. Can also be used to get a specific part of a large file, for example the head of the file.
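
A sketch of a single byte-range request via the CLI (bucket and key names are placeholders; fetches the first MB):

$ aws s3api get-object --bucket my-bucket --key big-file.bin \
    --range bytes=0-1048575 part-1.bin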

S3 Select (also S3 Glacier Select) allows the client to retrieve less data by performing server-side filtering. Uses SQL; aggregations are not available. Results in faster client-side application performance and lower costs (less traffic from S3).
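
A sketch of a server-side filtered read (bucket, key, and column names are placeholders):

$ aws s3api select-object-content --bucket my-bucket --key data.csv \
    --expression "SELECT s.name FROM S3Object s WHERE CAST(s.age AS INT) > 30" \
    --expression-type SQL \
    --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
    --output-serialization '{"CSV": {}}' \
    result.csv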

sync command can be used to synchronize directories between buckets. Recursively copies files and directories (only if they contain one or more files) from source bucket to destination bucket. Also updates files, if they have different timestamps. In versioned bucket only current version is considered.
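
For example (bucket names are placeholders):

$ aws s3 sync s3://source-bucket s3://destination-bucket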

Glacier Vault Lock adopts a WORM (Write Once Read Many) model. With a lock policy in place, once data is written it cannot be modified or deleted (nor can the policy itself).

S3 Object Lock has a similar functionality (versioning must be enabled).

Object retention settings:

  • Retention Period - lock object for a specified amount of time
  • Legal Hold - no expiration date

Modes:

  • Governance - users can't modify or delete an object version or alter its lock unless they have special permissions
  • Compliance - a locked object cannot be deleted or altered by any user including root; the retention mode cannot be changed and the retention period cannot be shortened

Users can define custom (user-defined) object metadata (key/value pairs). Keys must begin with x-amz-meta-. Object tags are also key/value pairs; mostly used for fine-grained permissions or analytics purposes. Neither mechanism can be used for searching.

Storage classes

Docs

| Name | Description | AZs | Min storage duration charge (days) | Min billable object size |
|------|-------------|-----|------------------------------------|--------------------------|
| Standard (default) | frequently accessed data | >=3 | - | - |
| Standard-IA | long-lived, infrequently accessed data - disaster recovery data, backups | >=3 | 30 | 128KB |
| One Zone-IA | long-lived, infrequently accessed, non-critical data | 1 | 30 | 128KB |
| Glacier Instant Retrieval | rarely accessed data (once per quarter) that still needs immediate access | >=3 | 90 | 128KB |
| Glacier Flexible Retrieval | formerly Amazon Glacier, check retrieval options below - archive data, long-term backup | >=3 | 90 | 40KB |
| Glacier Deep Archive | the least expensive option, even longer retrieval time than Glacier | >=3 | 180 | 40KB |
| Intelligent-Tiering | transitions objects between Standard and Standard-IA classes; upon activation can also move to archive and deep archive tiers | >=3 | 30 | - |

Objects must be stored in Standard or Standard-IA for a minimum 30 days before transition to Standard-IA or One Zone-IA.

All classes except Standard and Intelligent-Tiering have an associated retrieval cost per GB in addition to the storage cost based on object size. Intelligent-Tiering charges a transition and monitoring fee per 1,000 objects per month.

Intelligent-Tiering tiers:

| Name | Configuration | Object age (not accessed for x days) |
|------|---------------|--------------------------------------|
| Frequent Access | automatic | default |
| Infrequent Access | automatic | 30 |
| Archive Instant Access | automatic | 90 |
| Archive Access | optional | configurable from 90 to 700+ |
| Deep Archive Access | optional | configurable from 180 to 700+ |

Glacier

S3 Glacier stores archives (the base unit of Glacier, up to 40TB each) in vaults (containers). A vault can be created in the AWS Console and requires a name and region to be specified (the name must be unique within a region). However, files cannot be uploaded or downloaded through the Console - it must be done programmatically (CLI, SDK...) or by lifecycle rules. To upload data, first, an archive is created in a vault, then data is uploaded to this archive (the last step is to complete the upload). An account can have multiple vaults in a region, and each vault can have multiple archives. An archive gets an auto-generated ID and can also get an optional description at upload time. Retrieval is done through a request. The client pays each time data is accessed.
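
A sketch of the programmatic flow (vault name and file are placeholders; "-" means the current account):

$ aws glacier create-vault --account-id - --vault-name my-vault
$ aws glacier upload-archive --account-id - --vault-name my-vault \
    --archive-description "monthly backup" --body archive.zip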

Glacier Flexible Retrieval data retrieval has 3 options, which differ in retrieval speeds (make a request, wait for completion notification, download data):

  • expedited - 1-5 minutes, less than 250 MB
  • standard - 3-5 hours
  • bulk - 5-12 hours, petabytes, large amount of data

Glacier Deep Archive data retrieval:

  • standard - within 12 hours
  • bulk - within 48 hours

Data stored in Glacier is immutable. Query is possible across the data without retrieval.


Lifecycle

Lifecycle configuration rules transition objects between classes or delete them. Standard is the default storage class. Lifecycle transitioning does not work on objects that are less than 128KB. Rules can be applied to a specific path prefix or object tags.

  • transition actions move objects to another class (based on creation time or based on usage, which is provided only by Intelligent-Tiering)
  • expiration actions delete objects based on creation time
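
A sketch of a lifecycle configuration combining both action types (bucket name, prefix, and day counts are arbitrary); it could be applied with aws s3api put-bucket-lifecycle-configuration --bucket <bucket_name> --lifecycle-configuration file://lifecycle.json:

{
    "Rules": [
        {
            "ID": "archive-then-expire-logs",
            "Filter": { "Prefix": "logs/" },
            "Status": "Enabled",
            "Transitions": [
                { "Days": 30, "StorageClass": "STANDARD_IA" },
                { "Days": 90, "StorageClass": "GLACIER" }
            ],
            "Expiration": { "Days": 365 }
        }
    ]
}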

When versioning is enabled, an expiration action marks the current version as deleted (adds a delete marker). From that date it becomes a non-current version. When the non-current version expiration date comes, the object is completely deleted. The expiration number of days must be higher than the corresponding transition days.

S3 Analytics (paid feature) helps to determine when to transition objects from Standard to Standard-IA. Report is updated daily. Can be used to set up or improve lifecycle rules.


Versioning

Preserves and stores every version of an object. Can be enabled only at the bucket level. A delete action in the default view creates a delete marker object as the latest version, while the original object still remains. Deleting the delete marker effectively restores the object. Overwriting an object creates a new object with its own unique version ID. Deleting a version in the versions view permanently deletes the corresponding object.

Once versioning is enabled, it can only be suspended (meaning older versions are still kept in the bucket, but can be explicitly deleted). New uploads get the null version ID and overwrite the previous null version, if any.

Non-versioned object has a versionID equal to null.

The MFA delete setting requires MFA authentication for changing versioning settings and for permanently deleting object versions. Can be enabled through CLI, SDK or API only by the bucket owner (root account). Versioning must be enabled before enabling MFA delete.
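
A sketch of both operations via the CLI (bucket name, MFA device ARN and code are placeholders):

$ aws s3api put-bucket-versioning --bucket my-bucket \
    --versioning-configuration Status=Enabled

# MFA delete must be enabled with root credentials
$ aws s3api put-bucket-versioning --bucket my-bucket \
    --versioning-configuration Status=Enabled,MFADelete=Enabled \
    --mfa "arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456"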


Encryption

S3 exposes HTTP and HTTPS endpoints for uploading files. Encryption in transit uses HTTPS endpoint that includes SSL/TLS.

Encryption at rest options (SSE - Server Side Encryption):

| Name | Encryption key | Description |
|------|----------------|-------------|
| SSE-S3 | S3 managed | AWS managed, AES-256 type, enabled by default |
| SSE-KMS | KMS managed | provides fine-grained access to the key and an audit trail; on the downside, requests are subject to KMS API limits |
| SSE-C | Customer managed | key is created and managed by the customer outside of AWS and must be provided in every request header - both upload and download; not available in the AWS Console, must use the HTTPS endpoint |
| Client side | Customer managed | objects are encrypted by the client before being sent to S3 |

AWS encryption can be set on a particular object or on a whole bucket (as the default option). Objects that existed before bucket encryption was enabled will not be encrypted retroactively, because encryption is applied when an object is written - pre-existing objects would need to be encrypted individually.

The x-amz-server-side-encryption header is used to specify the encryption type and key for HTTP requests - either AES256 (S3 managed key) or aws:kms values are accepted.

CMK - Customer Master Key.

Bucket encryption can be enforced by the bucket setting, or by a bucket policy. The latter does it by denying any PUT request that does not include the x-amz-server-side-encryption header. The condition below, used in a Deny statement, enforces using the HTTPS endpoint:

"Condition" : {
  "Bool": {
    "aws:SecureTransport": "false"
  }
}
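
A sketch of the corresponding deny statement for the encryption header (bucket name is a placeholder):

{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-bucket/*",
    "Condition": {
        "Null": {
            "s3:x-amz-server-side-encryption": "true"
        }
    }
}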

Bucket policies are evaluated before Default Encryption.


Replication

Versioning must be enabled in both source and destination buckets. Replication is asynchronous; it can be same-region or cross-region, and even across different accounts. Only new objects are replicated once the setting is enabled. Delete markers can optionally be replicated as well; version deletions are not replicated. Replication chaining (1 to 2 to 3) is not possible.
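
A minimal replication configuration sketch (role ARN and bucket names are placeholders); it could be applied with aws s3api put-bucket-replication --bucket source-bucket --replication-configuration file://replication.json:

{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [
        {
            "ID": "replicate-all",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},
            "DeleteMarkerReplication": { "Status": "Disabled" },
            "Destination": { "Bucket": "arn:aws:s3:::destination-bucket" }
        }
    ]
}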

Use cases:

  • same region - log aggregation, live replication between prod and test environments
  • different region - compliance, low latency access

If cross-region replication is enabled and source bucket is encrypted using SSE-S3 or SSE-KMS, the replica bucket will use the same encryption (if source has no encryption, replica can still enable its own).


Access permissions

A folder is also an object in S3 and thus requires the same permissions as normal objects.

Bucket (and contained object) permissions can be granted by identity (IAM policy), resource (bucket) policy or ACLs. A bucket policy specifies allowed actions for principals at the bucket level (allows multi-account access setup). ACLs can be specified at the bucket or object level. Least-privileged access is granted in the event of conflicts.

Grant anonymous access to an object:

  • grant read permissions to everyone using the object's ACL
  • add a bucket policy statement granting everyone the GetObject permission on the object

Block Public Access setting is used to explicitly deny any public permissions. Can be set on bucket or account level.

Pre-signed URLs are limited in time and can be used to grant access to authorized users. Can be generated using the S3 Console, SDK or CLI. Default expiration is 1 hour; can be configured up to 12 hours with the Console, and up to 168 hours with the CLI. Users given the pre-signed URL inherit the permissions of the user who created it. Useful when the list of users that need access changes frequently, or to grant temporary access. Can be used both for upload and download.

$ aws s3 presign s3://<bucket_name>/<path_to_object> \
    --expires-in <seconds> --region <region>

# In case there are errors while generating URL
$ aws configure set default.s3.signature_version s3v4

S3 Access Points

Access points provide more granularity from a security standpoint and simplify overall security management. Each Access Point has its own DNS (Internet or VPC origin) and permission policy (similar to a bucket policy).

VPC origin provides internal access without going through the internet. To enable it, a VPC Endpoint needs to be configured (Gateway or Interface Endpoint), and the VPC Endpoint policy should allow both the S3 bucket and the Access Point.
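
A sketch of creating a VPC-restricted access point (account ID, names and VPC ID are placeholders):

$ aws s3control create-access-point --account-id 123456789012 \
    --name analytics-ap --bucket my-bucket \
    --vpc-configuration VpcId=vpc-0123456789abcdef0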

S3 Object Lambda can be configured to modify objects just before the requester receives them, e.g. when analytics needs redacted data. The Lambda function requires an S3 Access Point, while itself sitting behind an S3 Object Lambda Access Point (client -> S3 Object Lambda Access Point -> Lambda -> S3 Access Point -> bucket).


Website

S3 can host static websites. Exposed URL is either <bucket_name>.s3-website-<aws_region>.amazonaws.com or <bucket_name>.s3-website.<aws_region>.amazonaws.com
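
A sketch of enabling static website hosting via the CLI (bucket and document names are placeholders):

$ aws s3 website s3://my-bucket/ --index-document index.html \
    --error-document error.html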

If the bucket doesn't have public permissions, a 403 error is returned. To enable public access, disable the Block Public Access settings and add a bucket policy granting the GetObject permission to all principals.

If a client makes a cross-origin request to an S3 bucket, CORS headers need to be enabled. Specific origins or all origins (*) can be allowed. To allow other buckets or clients to access this bucket via CORS requests, insert the following CORS policy in the configuration of the bucket that contains the requested resource (specify the origin URL without a trailing slash):

[
    {
        "AllowedHeaders": [
            "Authorization"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "<origin_url>"
        ],
        "ExposeHeaders": [],
        "MaxAgeSeconds": 3000
    }
]

More on CORS in AWS


Buckets management

Tagging can be used to categorize S3 objects for fine-grained permission control (e.g. grant a user "read-only" permissions on objects with specific tags), to mark objects for other AWS services (e.g. tag an EBS volume to create automated snapshots using DLM), or to specify objects that lifecycle rules are applied to. Objects inside a bucket do not inherit bucket tags.

  • tags can be added to new or existing objects
  • up to 10 tags per object
  • up to 128 unicode characters for a key
  • up to 256 unicode characters for a value
  • both keys and values are case sensitive

Logging and Monitoring

The server access logging setting sets up another S3 bucket to log all actions on the original bucket. Log format example. Can later be analyzed by 3rd-party tools or Athena. Enabling this setting automatically updates the log bucket's ACL to grant write permissions to the original bucket's log delivery group. The target bucket must be in the same region. Do not set the original bucket as the target bucket, as it will create a logging loop.

CloudWatch collected metrics:

  • daily storage metrics for buckets - BucketSizeBytes, NumberOfObjects (free, once per day)
  • request metrics - f.e. GetRequests, PostRequests, AllRequests, etc (not free)

Event Notifications

Notifies about specific actions being performed on objects. Object name filtering (prefix and suffix) is supported. Targets can be SQS, SNS or Lambda. Requires a resource (access) policy on the destination side that authorizes S3 to deliver notifications. Now also supports EventBridge.

If 2 writes are performed on the same non-versioned file, it is possible that only one event notification will be generated. Enable versioning to ensure every write generates an event notification.
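
A sketch of a notification configuration (queue ARN and filter values are placeholders); it could be applied with aws s3api put-bucket-notification-configuration --bucket <bucket_name> --notification-configuration file://notification.json:

{
    "QueueConfigurations": [
        {
            "QueueArn": "arn:aws:sqs:eu-west-1:123456789012:uploads-queue",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {
                "Key": {
                    "FilterRules": [
                        { "Name": "prefix", "Value": "images/" },
                        { "Name": "suffix", "Value": ".jpg" }
                    ]
                }
            }
        }
    ]
}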


Costs

  • storage cost - charged hourly (per GB); Glacier Deep Archive is the cheapest option
  • API cost for operation on files - charged per request (read & write); write is 10 times more expensive
  • data transfer outside AWS region - charged per GB

In general, bucket owner pays both for storage and for data transfer associated with the bucket. Requester Pays option makes the requester pay the costs for data transfer. The requester must be authenticated in AWS.


Athena

Serverless query service for analyzing data in S3 (built on Presto). Uses standard SQL for queries. Charges for the amount of data scanned. Files can be in CSV, JSON, ORC, Avro and Parquet.

Before running queries, a result location must be set up, which is another S3 bucket. A database and a table must also be created/configured.
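
A sketch of running a query via the CLI (database, table and bucket names are placeholders):

$ aws athena start-query-execution \
    --query-string "SELECT status, COUNT(*) FROM access_logs GROUP BY status" \
    --query-execution-context Database=my_database \
    --result-configuration OutputLocation=s3://my-athena-results/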

Snow

Portable devices used to collect and process data at the edge and/or migrate data into or out of AWS.

Edge computing means running EC2 instances and Lambda functions (using IoT Greengrass) on the device. Snowcone and Snowball can be used both for data migration and as a local compute and storage device, for example, in remote locations without internet access. 1- or 3-year terms are available for edge computing.

Data can only be transferred to S3, Standard class (thus, devices have S3 compatible storage). OpsHub is software installed on-prem to manage Snow Family devices. To order a device, create a job in the AWS Console. Return shipment is tracked via SNS, text messages, or in the Console.

Snowcone features 2 CPUs, 4 GB of memory, and 8 TB of usable storage. Can be physically sent back to AWS or used to transfer data via DataSync - gather data from remote location, bring to data center, then transfer data to AWS. Recommended for up to 24TB data migrations.

Snowball features multiple options and is recommended for up to PBs of data migrations. Up to 15 nodes can be set together (cluster) to increase total storage size.

  • storage optimized (80 TB, 40 vCPUs, 80 GiB) - large-scale data migration
  • compute optimized (42 TB, 52 vCPUs, 208 GiB) - machine learning, video analysis, analytics

Snowmobile - a bigger-scale Snowball: a shipping container, on a truck! (up to 100 petabytes of data)

Storage gateway

Provides a connection between on-premises storage and AWS storage solutions (an on-premises VM acts as a gate between AWS and on-prem).

All types of gateways assume an on-prem server installation. A hardware appliance is also available - order a physical server from Amazon. An EC2 instance can also be set up as a gateway.

File gateway

Files are stored in S3 (one-to-one representation) and are asynchronously updated. Configured buckets are accessible using the NFS and SMB protocols. Supports Standard, Standard-IA, and One Zone-IA storage classes.

Bucket access is granted through IAM roles used by File Gateway. Can be integrated with Active Directory to manage on-prem authentication.

Provides local latency and cache. Could be used to extend on-prem NFS.


Volume gateway

Provides block storage using the iSCSI protocol, acting as a virtual hard disk (backed by EBS snapshots).

  • Cached - recently accessed data resides on-prem, complete copy on AWS
  • Stored - complete copy on premises, sending incremental backups to AWS

Tape gateway

Data is backed up to Glacier (Virtual Tape Library) using tape-based processes (and an iSCSI interface).


Other services

DataSync

Automated data transfer service.

Integrates with S3, EFS, FSx for Windows File Server. Synchronization runs on schedule - hourly, daily or weekly. Charges per GB transferred. Uses AWS custom protocol and optimizations.

For an on-prem to AWS setup, a DataSync agent is deployed on a VM in the local network, which connects to the on-prem NFS or SMB server.

Can also be used to synchronize data between EFSs in different regions - agent is deployed in source network, while DataSync endpoint is set up at destination.


Transfer Family

Fully managed service for transferring data into or out of S3 or EFS using FTP-family protocols (FTP, FTPS, and SFTP).

Authentication can be set up using existing authentication systems, like LDAP, AD, Cognito, or another custom solution. Users and credentials can also be stored and managed directly within the service. Access to S3 or EFS is done via an IAM role assumed by the transfer service.

Charges per provisioned endpoint per hour + data transfers in GB.


Backup

Managed service for central management and automated backups across AWS services. Supports cross-region and cross-account backups.

Provides PITR (Point In Time Recovery), on-demand or scheduled backups, tag-based policies. Backup Plan includes all properties such as frequency, backup window, transition to cold storage, retention settings.

Supported services:

  • FSx
  • EFS
  • DynamoDB
  • EC2
  • EBS
  • RDS + Aurora
  • Storage Gateway (volume type)