# AWS Storage
IOPS - I/O Operations per Second
EBS is high performance network block storage. EBS volumes are virtual hard drives (block-level storage volumes) that can be attached to EC2 instances. Data is stored in small blocks, thus changes are applied only to a changed block, and no full re-upload is needed, as in object storage. Multiple EBS volumes can be attached to one instance. An EBS volume is an AZ-level resource, thus the EC2 instance must be in the same AZ to be able to attach it. Data is replicated within the AZ to protect from component failure. EC2 launch configuration contains a Delete on Termination option, which removes the attached EBS volume when the instance is terminated - by default, set to true for the root volume and false for additional volumes. Capacity is provisioned (pay for chosen size).
Elastic Volumes allow changing volume type and size without detaching.
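A minimal CLI sketch of an Elastic Volumes change, assuming a hypothetical volume ID and target values; after growing a volume the file system still has to be extended from within the instance:

```sh
# Grow the volume and switch it to gp3 while it stays attached (hypothetical ID/values)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 200 --volume-type gp3 --iops 4000
# Track the modification progress (optimizing -> completed)
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0
```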
An encrypted EBS volume implies encrypted data transfer between the volume and the instance. If the account has encryption set by default, or if the volume is created from an encrypted snapshot, the new volume is also encrypted (an unencrypted volume cannot be created).
As of early 2020, an EBS volume can be attached to up to 16 EC2 instances at the same time (EBS Multi-Attach). Requires a cluster-aware file system to handle concurrent write operations. Some restrictions apply:
- cannot be a boot volume
- EC2 instances must be in the same AZ
- EC2 instances must be based on Nitro system instance types
Only SSD options are available as root volumes.
Disk type | Option | Family | Max IOPS/Volume | Description |
---|---|---|---|---|
SSD | general purpose | gp | 16,000 | balances price and performance; 2 generations are available: gp2 has IOPS and throughput linked to the volume size, while gp3 can scale each independently |
SSD | provisioned IOPS | io | 64,000 | highest performance, great for apps that need more than 16k IOPS, database workloads and critical business apps; supports EBS Multi-Attach - attach the same volume to multiple instances (up to 16) in the same AZ; over 32k IOPS are available only on Nitro EC2; io2 is the latest generation, more durable and offers more IOPS per GB than io1 |
HDD | throughput optimized | st | 500 | frequently accessed, throughput-intensive workloads; use cases are Big Data, Data Warehousing, etc |
HDD | cold | sc | 250 | lowest cost, less frequently accessed data |
SSD is generally good for transactional workload involving frequent read/write operations. HDD is good for streaming workloads that require optimized throughput.
io2 Block Express is a SAN (Storage Area Network) offering. Offers highest performance (up to 256,000 IOPS) and sub-millisecond latency.
Volume backup. AWS allows taking incremental snapshots of EBS volumes, which serve as a backup if the EBS volume gets corrupted. Charged per GB per month. EBS Fast Snapshot Restore is available for an extra cost.
A snapshot is a region-level resource - a volume from a snapshot can be created in any AZ within the region. A snapshot can also be copied over to any other region. The snapshot encryption setting is the same as the volume encryption setting. To encrypt an unencrypted volume, create a snapshot, then encrypt the snapshot by copying it with encryption set to true, and finally create a volume from the encrypted snapshot. It is also possible to set encryption while creating a volume from an unencrypted snapshot.
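A hedged CLI sketch of the snapshot-copy approach, assuming hypothetical volume/snapshot IDs and a KMS key alias:

```sh
# Snapshot the unencrypted volume
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "pre-encryption backup"
# Copy the snapshot with encryption enabled (optionally picking a KMS key)
aws ec2 copy-snapshot --source-region eu-west-1 --source-snapshot-id snap-0123456789abcdef0 \
    --encrypted --kms-key-id alias/ebs-key
# Create an encrypted volume from the encrypted copy in the desired AZ
aws ec2 create-volume --snapshot-id snap-0fedcba9876543210 --availability-zone eu-west-1a
```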
The EBS Snapshot Archive feature allows moving a snapshot to the archive tier, which is 75% cheaper. Restore takes between 24 and 72 hours.
FSR (Fast Snapshot Restore) forces full initialization of the snapshot to remove latency on first use; it comes at a high cost though.
Recycle Bin can be set up to not remove volumes immediately, allowing recovery from accidental deletion. Applies globally and can also be configured for AMIs. Retention can be set from 1 day to 1 year.
EFS is a fully managed POSIX NFS file system, designed for Linux workloads (regional resource).
Provides shared access to data - multiple instances can access (read/write)
EFS
at the same time. Automatically scales up or down. Provides configurable
lifecycle data rules. Can be accessed through Direct Connect
as well. Data is
stored across multiple AZs within a region. Payment is based on the amount of
data stored (elastic). Use cases are content management, web serving,
Wordpress.
A Security Group is attached to EFS to control incoming connections. Enable ingress on the NFS port (2049), as sketched below.
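A sketch with a hypothetical security group ID and VPC CIDR:

```sh
# Allow NFS (TCP 2049) from the VPC to the security group attached to the EFS mount targets
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 2049 --cidr 10.0.0.0/16
```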
Storage classes:
- standard - frequently accessed files
- infrequent access - lower storage price, but with a retrieval fee
Lifecycle management can move data between classes based on N days of last access.
Performance options:
- general purpose (default)
- max I/O - scales for higher levels of throughput and parallel access; best for Big Data, media processing
Throughput options:
- bursting (default) - manages spikes
- provisioned - consistent high throughput, client requests throughput value regardless of storage size
Availability options:
- regional - multi-AZ, great for prod
- one zone - great for dev, backup is enabled by default
An EFS Access Point is an application-specific entry point into EFS. Enables access management. Can enforce a user identity and a different file system root and subdirectories for clients. An IAM policy can be used to allow an application to use a specific access point.
- Install the amazon-efs-utils package
- Create a directory
- Mount using the command given in AWS Console (by DNS), as sketched below
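A minimal sketch for Amazon Linux, assuming a hypothetical file system ID:

```sh
# Install the EFS mount helper, create a mount point and mount the file system (with TLS)
sudo yum install -y amazon-efs-utils
sudo mkdir -p /mnt/efs
sudo mount -t efs -o tls fs-0123456789abcdef0:/ /mnt/efs
```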
FSx for Windows File Server - fully managed native Windows file system. Supports the SMB protocol, Windows NTFS, ACLs, user quotas, and integrates with Active Directory. Built on SSD. Can be accessed from on-prem, and can be configured as Multi-AZ. Data can also be backed up to S3.
FSx for Lustre (Linux and cluster) - parallel distributed file system for large-scale computing. Fits well for machine learning, HPC, video processing, and so on. Also integrates with S3 - a bucket can be read as a file system, while write requests can update bucket objects. In both options capacity is provisioned beforehand.
File system deployment options:
- Scratch - temporary storage, data is not replicated; offers high burst (x6 faster); good for short term processing
- Persistent - long-term storage, data is replicated within same AZ; good for long-term processing, sensitive data
Amazon FSx File Gateway provides native access to FSx for Windows File Server with local caching of frequently accessed data. Deployed on the customer side (on-prem). Useful for group file sharing and home directories.
S3 is general purpose object storage. Provides 99.5% - 99.99% availability (depending on storage class) and 11 9's durability.
Use cases:
- infrastructure hosting video, photo, music
- data backup and storage for other services (EBS snapshots, AMI templates); EBS snapshots are saved to S3 by default
- static web site hosting
- app installers that clients can download
Bucket name must be unique throughout Amazon. Can not contain uppercase letters and underscores, can not be an IP address, must be 3-63 characters long and must start with a letter or digit.
Each object has a unique key, which is the full path to the object. An object consists of data and metadata (custom metadata can be added when the object is being stored). Max size of a single object is 5TB; files over 5GB must be uploaded via Multi-Part Upload (recommended for files over 100MB).
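The AWS CLI performs Multi-Part Upload automatically above a configurable threshold; a sketch with hypothetical bucket and file names:

```sh
# Upload files larger than 100MB in 50MB parts
aws configure set default.s3.multipart_threshold 100MB
aws configure set default.s3.multipart_chunksize 50MB
aws s3 cp backup.tar.gz s3://my-example-bucket/backups/backup.tar.gz
```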
By default all buckets are private. Available in a single region, chosen at the
time of creation. S3
offers strong consistency for PUT (new or overwrite)
and DELETE requests - subsequent read or list requests immediately receive
the latest versions.
S3 Transfer Acceleration enables fast, easy and secure file transfers over long distances. Uses CloudFront - data first arrives at an edge location, then travels over an optimized network path to reach S3. Additional charges apply. Only the bucket owner can enable Transfer Acceleration on the bucket. The bucket name must be DNS compliant and must not contain periods. As soon as it is enabled, clients can use the accelerate endpoint - <bucket_name>.s3-accelerate.amazonaws.com.
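A sketch of enabling and using the accelerate endpoint from the CLI (hypothetical bucket name):

```sh
# Enable Transfer Acceleration on the bucket
aws s3api put-bucket-accelerate-configuration --bucket my-example-bucket \
    --accelerate-configuration Status=Enabled
# Make the CLI send transfers through the accelerate endpoint
aws configure set default.s3.use_accelerate_endpoint true
aws s3 cp video.mp4 s3://my-example-bucket/
```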
S3 Byte-Range Fetches parallelize GET requests by requesting specific byte ranges. If a specific range fails, only that range needs to be requested again. Speeds up downloads. Can also be used to get a specific part of a large file, for example the head of the file.
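A sketch with a hypothetical bucket and key - fetch only the first 1 MB of an object:

```sh
# Download bytes 0-1048575 (e.g. the file header) into a local file
aws s3api get-object --bucket my-example-bucket --key logs/big-file.bin \
    --range bytes=0-1048575 first-megabyte.bin
```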
S3 Select (also S3 Glacier Select) allows the client to retrieve less data by performing server-side filtering. Uses SQL; aggregations are not available. Results in faster client-side application performance and lower costs (less traffic from S3).
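A hedged example of server-side filtering over a CSV object (hypothetical bucket, key and columns):

```sh
# Return only matching rows/columns instead of downloading the whole object
aws s3api select-object-content \
    --bucket my-example-bucket \
    --key data/users.csv \
    --expression "SELECT s.name, s.email FROM S3Object s WHERE s.country = 'DE'" \
    --expression-type SQL \
    --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"}' \
    --output-serialization '{"CSV": {}}' \
    filtered-users.csv
```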
The sync command can be used to synchronize directories between buckets. Recursively copies files and directories (only if they contain one or more files) from the source bucket to the destination bucket. Also updates files if they have different timestamps. In a versioned bucket only the current version is considered.
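A sketch with hypothetical bucket names:

```sh
# Copy new and changed objects from source to destination
aws s3 sync s3://source-example-bucket s3://destination-example-bucket
# --delete additionally removes destination objects that no longer exist in the source
aws s3 sync s3://source-example-bucket s3://destination-example-bucket --delete
```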
Glacier Vault Lock adopts a WORM (Write Once Read Many) model. With a lock policy in place, once data is written it cannot be modified or deleted (nor can the policy itself). S3 Object Lock has similar functionality (versioning must be enabled).
Object retention settings:
- Retention Period - lock object for a specified amount of time
- Legal Hold - no expiration date
Modes:
- Governance - users can't modify or delete an object version or alter its lock unless they have special permissions
- Compliance - locked object can not be deleted or altered by any user including root, retention mode can not be changed and retention period can not be shortened
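A minimal sketch of applying Object Lock settings to an object (hypothetical bucket/key; the bucket must have Object Lock and versioning enabled):

```sh
# Governance-mode retention until a fixed date
aws s3api put-object-retention --bucket my-example-bucket --key reports/q1.pdf \
    --retention '{"Mode": "GOVERNANCE", "RetainUntilDate": "2026-01-01T00:00:00Z"}'
# Legal hold with no expiration date
aws s3api put-object-legal-hold --bucket my-example-bucket --key reports/q1.pdf \
    --legal-hold Status=ON
```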
Users can define custom (user-defined) object metadata (key/value pairs). Keys must begin with x-amz-meta-. Object tags are also key/value pairs; mostly used for fine-grained permissions or analytics purposes. Neither mechanism can be used for searching.
Name | Description | AZs | Min storage duration charge (days) | Min billable object size |
---|---|---|---|---|
Standard (default) | frequently accessed data | >=3 | - | - |
Standard-IA | long-lived, infrequently accessed data - disaster recovery data, backups | >=3 | 30 | 128KB |
One Zone-IA | long-lived, infrequently accessed, non-critical data | 1 | 30 | 128KB |
Glacier Instant Retrieval | rarely accessed data (once per quarter) that still needs immediate access | >=3 | 90 | 40KB |
Glacier Flexible Retrieval | formerly Amazon Glacier, check retrieval options below - archive data, long-term backup | >=3 | 90 | 40KB |
Glacier Deep Archive | the least expensive option, even longer retrieval time than Glacier | >=3 | 180 | 40KB |
Intelligent-Tiering | transitions objects between Standard and Standard-IA classes; upon activation can also move to archive and deep archive tiers | >=3 | 30 | - |
Objects must be stored in Standard or Standard-IA for a minimum 30 days before transition to Standard-IA or One Zone-IA.
All classes except Standard
and Intelligent-Tiering
have associated cost
for retrieval per GB in addition to storing cost based on object size.
Intelligent-Tiering charges a monitoring and automation fee per 1,000 objects per month.
Intelligent-Tiering tiers:
Name | Configuration | Object age (not accessed for x days) |
---|---|---|
Frequent Access | Automatic | default |
Infrequent Access | Automatic | 30 |
Archive Instant Access | Automatic | 90 |
Archive Access | Optional | configurable from 90 to 700+ |
Deep Archive Access | Optional | configurable from 180 to 700+ |
S3 Glacier stores archives (the base unit of Glacier, up to 40TB) in vaults (containers). A vault can be created in the AWS Console and requires a name and region to be specified (the name must be unique within a region). However, files cannot be uploaded or downloaded through the Console - it must be done programmatically (CLI, SDK...) or by life-cycle rules. To upload data, first an archive is created in a vault, then data is uploaded to this archive (the last step is to complete the upload). An account can have multiple vaults per region, and each vault can have multiple archives. An archive gets an auto-generated ID and can also get an optional description at upload time. Retrieval is done through a request. The client pays each time data is accessed.
Glacier Flexible Retrieval has 3 data retrieval options, which differ in retrieval speed (make a request, wait for the completion notification, download data):
- expedited - 1-5 minutes, less than 250 MB
- standard - 3-5 hours
- bulk - 5-12 hours, petabytes, large amount of data
Glacier Deep Archive data retrieval:
- standard - within 12 hours
- bulk - within 48 hours
Data stored in Glacier is immutable. Queries are possible across the data without retrieval.
Lifecycle configuration rules transition objects between classes or delete them. Standard is the default storage class. Lifecycle transitioning does not work on objects that are less than 128KB. Rules can be applied to a specific path prefix or object tags.
- transition actions move objects to another class (based on creation time, or based on usage, which is provided only by Intelligent-Tiering)
- expiration actions delete objects based on creation time
When versioning is enabled, expiration action marks a current version as deleted. From that date it becomes non-current version. When non-current version expiration date comes, the object is completely deleted. Expiration number of days must be higher than corresponding transition days.
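A hedged sketch of a lifecycle configuration (hypothetical bucket, prefix and day counts):

```sh
# Move logs/ to Standard-IA after 30 days, to Glacier after 90, and delete after a year
aws s3api put-bucket-lifecycle-configuration --bucket my-example-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "archive-logs",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [
          {"Days": 30, "StorageClass": "STANDARD_IA"},
          {"Days": 90, "StorageClass": "GLACIER"}
        ],
        "Expiration": {"Days": 365}
      }]
    }'
```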
S3 Analytics (paid feature) helps to determine when to transition objects from Standard to Standard-IA. The report is updated daily. Can be used to set up or improve lifecycle rules.
Versioning preserves and stores every version of objects. Can be enabled only at the bucket level. A delete action in the default view creates a delete marker object as the latest version, while the original object still remains. Deleting the delete marker effectively restores the object. Overwriting an object creates a new object with its own unique versionID. Deleting a version in the version view deletes the corresponding object.
Once versioning is enabled it can only be suspended (meaning older versions are still kept in the bucket, but can be explicitly deleted). New uploads overwrite latest version.
A non-versioned object has a versionID equal to null.
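A short sketch (hypothetical bucket name) of enabling versioning and listing versions:

```sh
# Turn versioning on for the bucket
aws s3api put-bucket-versioning --bucket my-example-bucket \
    --versioning-configuration Status=Enabled
# Inspect stored versions and delete markers under a prefix
aws s3api list-object-versions --bucket my-example-bucket --prefix docs/
```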
The MFA delete setting requires MFA authentication for changing versioning settings and for permanently deleting object versions. Can be enabled through CLI, SDK or API only by the bucket owner (root account). Versioning must be enabled before enabling MFA delete.
S3 exposes HTTP and HTTPS endpoints for uploading files. Encryption in transit uses the HTTPS endpoint, which includes SSL/TLS.
Encryption at rest options (SSE - Server Side Encryption):
Name | Encryption Key | Description |
---|---|---|
SSE-S3 | S3 managed | AWS managed, AES-256 type, enabled by default |
SSE-KMS | KMS managed | provides fine-grained access to the key and an audit trail; on the downside, requests are subject to KMS API limits |
SSE-C | Customer managed | key is created and managed by the customer outside of AWS and must be provided in every request header (save or download); not available in AWS Console, must use HTTPS endpoint |
Client side | Customer managed | objects are encrypted by the client before being sent to S3 |
AWS encryption can be set on a particular object or whole bucket (as default
option). Objects that existed before bucket encryption was enabled will not be
encrypted, because AWS encrypts data before sending it to S3
- pre-existing
objects would need to be encrypted individually.
The x-amz-server-side-encryption header is used to specify the encryption type and key for HTTP requests - either AES256 (S3 managed key) or aws:kms values are accepted.
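With the CLI the header is set via the --sse options; a sketch with hypothetical bucket and key names:

```sh
# SSE-S3 (AES256) encryption for the uploaded object
aws s3 cp report.csv s3://my-example-bucket/ --sse AES256
# SSE-KMS with a specific key instead of the S3 managed key
aws s3 cp report.csv s3://my-example-bucket/ --sse aws:kms --sse-kms-key-id alias/my-app-key
```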
CMK - Customer Master Key.
Bucket encryption can be enforced by a bucket setting or by a bucket policy. The latter does it by denying any PUT requests that do not include the x-amz-server-side-encryption header. The condition below can be used to enforce using the HTTPS endpoint:
"Condition" : {
"Bool": {
"aws:SecureTransport": "false"
}
}
Bucket policies are evaluated before Default Encryption.
Versioning must be enabled in source and destination buckets. Replication is asynchronous, can be same or different region, and even in different accounts. Only new objects are replicated once setting is enabled. Delete markers can be optionally replicated as well, version deletion is not replicated. Replication chaining (1 to 2 to 3) is not possible.
Use cases:
- same region - log aggregation, live replication between prod and test environments
- different region - compliance, low latency access
If cross-region replication is enabled and source bucket is encrypted using SSE-S3 or SSE-KMS, the replica bucket will use the same encryption (if source has no encryption, replica can still enable its own).
A folder is also an object in S3, thus it requires the same permissions as normal objects. Bucket (and object) permissions can be granted by an identity (IAM) policy, a resource (bucket) policy or ACLs. A bucket policy specifies allowed actions for principals at the bucket level (allows multi-account access setup). ACLs can be specified at bucket or object level. Least privileged access is granted in the event of conflicts.
Grant anonymous access to an object:
- grant read permissions to everyone using the object's ACL
- add a bucket policy statement granting everyone the GetObject permission on the object
Block Public Access
setting is used to explicitly deny any public
permissions. Can be set on bucket or account level.
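A sketch with a hypothetical bucket name:

```sh
# Block all four categories of public access at the bucket level
aws s3api put-public-access-block --bucket my-example-bucket \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```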
Pre-signed URLs are limited in time and can be used to grant access to authorized users. Can be generated using the S3 Console, SDK or CLI. Default expiration is 1 hour; it can be configured up to 12 hours with the Console and up to 168 hours with the CLI. Users given the pre-signed URL inherit the permissions of the user who created it. Useful when the list of users that need access changes frequently, or to grant temporary access. Can be used both for upload and download.
$ aws s3 presign s3://<bucket_name>/<path_to_object> --expires-in <seconds> --region <region>
# In case there are errors while generating URL
$ aws configure set default.s3.signature_version s3v4
Access points provide more granularity from security stand point and simplify overall security management. Each Access Point has its own DNS (Internet or VPC origin) and permission policy (similar to bucket policy).
VPC origin provides internal access without going through the internet. To enable it VPC Endpoint needs to be configured (Gateway or Interface Endpoint), and VPC Endpoint policy should have both S3 bucket and Access Point permissions.
S3 Object Lambda can be configured to modify objects just before the requester receives them, f.e. when analytics needs redacted data. The Lambda requires an S3 Access Point, while itself being behind an S3 Object Lambda Access Point (client -> S3 Object Lambda Access Point -> Lambda -> S3 Access Point -> bucket).
S3 can host static websites. The exposed URL is either <bucket_name>.s3-website-<aws_region>.amazonaws.com or <bucket_name>.s3-website.<aws_region>.amazonaws.com. If the bucket doesn't have public permissions, a 403 error is returned. To enable public access, disable the Block Public Access settings and add a bucket policy with the GetObject permission for all principals.
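A minimal sketch (hypothetical bucket name; policy.json is assumed to contain an Allow statement for s3:GetObject with Principal "*"):

```sh
# Enable website hosting on the bucket
aws s3 website s3://my-example-bucket/ --index-document index.html --error-document error.html
# Attach the public-read bucket policy
aws s3api put-bucket-policy --bucket my-example-bucket --policy file://policy.json
```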
If a client does a cross-origin request on an S3 bucket, CORS headers need to be enabled. Specific or all origins (*) can be allowed. Insert the following CORS policy in the configuration of the bucket that contains the requested resource to allow other buckets or clients to access this bucket via CORS requests (specify the Origin URL without a slash at the end):
[
{
"AllowedHeaders": [
"Authorization"
],
"AllowedMethods": [
"GET"
],
"AllowedOrigins": [
"<origin_url>"
],
"ExposeHeaders": [],
"MaxAgeSeconds": 3000
}
]
More on CORS in AWS
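The same rules can be applied from the CLI; note that s3api expects them wrapped in a CORSRules key (hypothetical bucket and origin):

```sh
aws s3api put-bucket-cors --bucket my-example-bucket --cors-configuration \
    '{"CORSRules": [{"AllowedHeaders": ["Authorization"], "AllowedMethods": ["GET"],
      "AllowedOrigins": ["https://www.example.com"], "ExposeHeaders": [], "MaxAgeSeconds": 3000}]}'
```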
Tagging can be used to categorize S3 objects for fine-grained permission control (f.e. grant a user "read-only" permissions on objects with specific tags), to mark objects for other AWS services (f.e. tag an EBS volume to create automated snapshots using DLM), or to specify objects that life-cycle rules are applied to; objects inside a bucket do not inherit bucket tags.
- tags can be added to new or existing objects
- up to 10 tags per object
- up to 128 unicode characters for key
- up to 256 unicode characters for value
- both keys and values are case sensitive
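A sketch of replacing an object's tag set (hypothetical bucket, key and tags):

```sh
aws s3api put-object-tagging --bucket my-example-bucket --key photos/cat.jpg \
    --tagging '{"TagSet": [{"Key": "project", "Value": "alpha"}, {"Key": "classification", "Value": "public"}]}'
```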
Server access logging settings set up another S3 bucket to log all actions in the original bucket. Log format example. Can be later analyzed by 3rd party tools or Athena. Enabling this setting automatically updates the log bucket ACLs to include the original bucket's log delivery group with write permissions. The target bucket must be in the same region. Do not set the original bucket as the target bucket, as it would create a logging loop.
CloudWatch collected metrics:
- daily storage metrics for buckets - BucketSizeBytes, NumberOfObjects (free, once per day)
- request metrics - f.e. GetRequests, PostRequests, AllRequests, etc (not free)
S3 Event Notifications notify about specific actions being performed on objects. Object name filtering (prefix and suffix) is supported. Targets can be SQS, SNS or Lambda. Requires a resource (access) policy on the destination side that authorizes S3 to deliver notifications. Now also supports EventBridge.
If 2 writes are performed on the same non-versioned file, it is possible that only one event notification will be generated. Enable versioning to ensure every write generates an event notification.
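A hedged example of wiring object-created events to an SQS queue (hypothetical bucket and queue ARN; the queue policy must already allow S3 to send messages):

```sh
aws s3api put-bucket-notification-configuration --bucket my-example-bucket \
    --notification-configuration '{
      "QueueConfigurations": [{
        "QueueArn": "arn:aws:sqs:eu-west-1:123456789012:uploads-queue",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "uploads/"}]}}
      }]
    }'
```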
- storage cost - charged hourly (per GB); Glacier Deep Archive is the cheapest option
- API cost for operations on files - charged per request (read & write); write is 10 times more expensive
- data transfer outside the AWS region - charged per GB

In general, the bucket owner pays both for storage and for data transfer associated with the bucket. The Requester Pays option makes the requester pay the costs for data transfer. The requester must be authenticated in AWS.
Athena is a serverless query service for analyzing data in S3 (built on Presto). Uses standard SQL for queries. Charges for the amount of data scanned. Files can be in CSV, JSON, ORC, Avro or Parquet format.
Before running queries, a result location must be set up, which is another S3 bucket. Also a database and a table must be created/configured.
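A sketch of running a query from the CLI, assuming a hypothetical database, table and results bucket:

```sh
# Start the query; the response contains a QueryExecutionId
aws athena start-query-execution \
    --query-string "SELECT status, COUNT(*) FROM access_logs GROUP BY status" \
    --query-execution-context Database=weblogs \
    --result-configuration OutputLocation=s3://my-athena-results-bucket/
# Poll for the results once the execution finishes
aws athena get-query-results --query-execution-id <query_execution_id>
```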
The Snow Family are portable devices used to collect and process data at the edge and/or migrate data into or out of AWS. Edge computing means running EC2 and Lambda on the device (using IoT Greengrass).
Snowcone and Snowball can be used both for data migration and as local compute and storage devices, for example in remote locations without internet access. A 1 or 3 year term is available for edge computing. Data can only be transferred to S3, Standard class (thus, devices have S3-compatible storage). OpsHub is software installed on-prem to manage Snow Family devices. To order a device, create a job in the AWS Console. Back shipment is tracked via SNS, text messages or in the Console.
Snowcone features 2 CPUs, 4 GB of memory, and 8 TB of usable storage. Can be physically sent back to AWS or used to transfer data via DataSync - gather data at the remote location, bring the device to the data center, then transfer the data to AWS. Recommended for data migrations of up to 24TB.
Snowball features multiple options and is recommended for data migrations of up to PBs. Up to 15 nodes can be set up together (cluster) to increase total storage size.
- storage optimized (80 TB, 40 vCPUs, 80 GiB memory) - large scale data migration
- compute optimized (42 TB, 52 vCPUs, 208 GiB memory) - machine learning, video analysis, analytics
Snowmobile - a bigger scale Snowball: a shipping container on a truck! (up to 100 petabytes of data)
Storage Gateway provides a connection between on-premises storage and AWS storage solutions (an on-premises VM acts as a gate between AWS and on-prem). All types of gateways assume an on-prem server installation. A hardware appliance is also available - order a physical server from Amazon. EC2 can also be set up as a gateway.
File Gateway: files are stored in S3 (one-to-one representation) and are asynchronously updated. Configured buckets are accessible using NFS and SMB protocols. Supports Standard, Standard-IA, and One Zone-IA storage classes. Bucket access is granted through IAM roles used by the File Gateway. Can be integrated with Active Directory to manage on-prem authentication. Provides local latency and cache. Could be used to extend on-prem NFS.
Volume Gateway provides block storage using the iSCSI protocol, acting as a virtual hard disk (backed by EBS snapshots).
- Cached - recently accessed data resides on-prem, complete copy on AWS
- Stored - complete copy on premises, sending incremental backups to AWS
Tape Gateway: data is backed up to Glacier (Virtual Tape Library) using tape-based processes (and the iSCSI interface).
DataSync is an automated data transfer service. Integrates with S3, EFS, and FSx for Windows File Server. Synchronization runs on a schedule - hourly, daily or weekly. Charges per GB transferred. Uses an AWS custom protocol and optimizations.
For an on-prem to AWS setup it uses a DataSync agent deployed on a VM in the local network, which connects to the on-prem NFS or SMB server. Can also be used to synchronize data between EFS file systems in different regions - the agent is deployed in the source network, while the DataSync endpoint is set up at the destination.
AWS Transfer Family is a fully-managed service for transferring data into or out of S3 or EFS using the FTP protocol family (supports FTP, FTPS, and SFTP). Authentication can be set up using existing authentication systems, like LDAP, AD, Cognito, or another custom solution. Users and credentials can also be stored and managed directly within the service. Access to S3 or EFS is done via an IAM role assumed by the transfer service.
Charges per provisioned endpoint per hour + data transfers in GB.
AWS Backup is a managed service for central management and automated backups across AWS services. Supports cross-region and cross-account backups. Provides PITR (Point In Time Recovery), on-demand or scheduled backups, and tag-based policies. A Backup Plan includes all properties such as frequency, backup window, transition to cold storage, and retention settings.
Supported services:
- FSx
- EFS
- DynamoDB
- EC2
- EBS
- RDS + Aurora
- Storage Gateway (volume type)