AWS S3 interfaces
https://stackoverflow.com/questions/4663016/faster-s3-bucket-duplication
aws s3 cp
https://stackoverflow.com/questions/26326408/difference-between-s3cmd-boto-and-aws-cli
# wildcards? awkwardly supported (only via --recursive with --exclude/--include)
aws s3 cp sta_cam139D_2017-11-02 s3://alex-chute-data/ --recursive --exclude "*" --include "*05h*" --include "*06h*" --include "*17h*" --include "*18h*"
# metadata? yes
aws s3 cp --metadata profile_id=3748 tau_C0AE_2017_11_16_8_0.7z s3://alex-admin/
-> No way to keep the directory structure intact when copying a folder with subfolders (it results in a flat list of files...), with either aws s3 cp or aws s3 sync (a workaround sketch follows the links below)
https://serverfault.com/questions/682708/copy-directory-structure-intact-to-aws-s3-bucket
https://github.com/aws/aws-cli/issues/2069
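A minimal sketch of the usual workaround, the same idea as the aws s3 mv workaround further down: repeat the source directory name in the destination prefix so the hierarchy is preserved (bucket and directory names are placeholders):
aws s3 cp --recursive directory_to_upload s3://cainthus-dummy-test/directory_to_upload
aws s3 sync directory_to_upload s3://cainthus-dummy-test/directory_to_upload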
aws s3 mv
- does remove the local file after it has been uploaded (https://stackoverflow.com/questions/32854429/move-files-to-s3-then-remove-source-files-after-completion)
- does keep the directory structure
- does upload the contents of the directory, not the directory itself (but there is a workaround, see below)
- does not delete the emptied local folders at the end
- cannot throttle bandwidth: https://serverfault.com/questions/741752/limiting-bandwidth-of-the-of-amazon-aws-s3-uploads
aws s3 mv --recursive 2 s3://cainthus-dummy-test/
move: 2/3697/37D7/2018-02-23/13/2018-02-23T13:21:03.077095+00:00.jpg to s3://cainthus-dummy-test/3697/37D7/2018-02-23/13/2018-02-23T13:21:03.077095+00:00.jpg
move: 2/3697/37D7/2018-02-23/14/2018-02-23T14:06:08.921653+00:00.jpg to s3://cainthus-dummy-test/3697/37D7/2018-02-23/14/2018-02-23T14:06:08.921653+00:00.jpg
move: 2/3697/37D7/2018-02-23/14/2018-02-23T14:36:29.679298+00:00.jpg to s3://cainthus-dummy-test/3697/37D7/2018-02-23/14/2018-02-23T14:36:29.679298+00:00.jpg
move: 2/3697/37D7/2018-02-23/13/2018-02-23T13:51:08.413945+00:00.jpg to s3://cainthus-dummy-test/3697/37D7/2018-02-23/13/2018-02-23T13:51:08.413945+00:00.jpg
move: 2/3697/37D7/2018-02-23/15/2018-02-23T15:06:29.687193+00:00.jpg to s3://cainthus-dummy-test/3697/37D7/2018-02-23/15/2018-02-23T15:06:29.687193+00:00.jpg
The expected result would have been
s3://cainthus-dummy-test/2/3697/37D7/2018-02-23/13/2018-02-23T13:21:03.077095+00:00.jpg
instead of
s3://cainthus-dummy-test/3697/37D7/2018-02-23/13/2018-02-23T13:21:03.077095+00:00.jpg
same result for `aws s3 mv --recursive ../data-collector/3 s3://cainthus-dummy-test/`
# It is always the contents of the directory that is synced, not the directory, so when source:path is a directory, it's the contents of source:path that are copied, not the directory name and contents.
# workaround:
aws s3 mv --recursive directory_to_upload s3://cainthus-dummy-test/directory_to_upload
Increased speed?
Tune the AWS CLI S3 configuration values as per http://docs.aws.amazon.com/cli/latest/topic/s3-config.html.
The settings below increased an S3 sync speed by at least 8x! (They can also be set from the command line, see the sketch after the example.)
Example:
$ more ~/.aws/config
[default]
aws_access_key_id=foo
aws_secret_access_key=bar
s3 =
  max_concurrent_requests = 100
  max_queue_size = 30000
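The same values can also be set without editing the file, via aws configure set (a minimal sketch; the numbers are simply the ones from the example above, not a recommendation):
aws configure set default.s3.max_concurrent_requests 100
aws configure set default.s3.max_queue_size 30000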
s3cmd
- Latest commit 14f7c2e 2018-03-01
- multipart natively, ability to set AWS S3 metadata, i.e. for a profile-id metadata field: --add-header x-amz-meta-profile-id:3733
- https://community.exoscale.ch/documentation/storage/metadata/
- https://blog.miguelangelnieto.net/posts/multipart_uploads_to_s3.html
- delete files after upload? (a workaround sketch follows this list)
  - not with s3cmd put, https://github.com/s3tools/s3cmd/issues/262
  - not with s3cmd sync --delete-after, cf. https://github.com/s3tools/s3cmd/issues/958
  - not with sudo pip install s3cmd
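Since s3cmd itself cannot delete the local files after a successful upload, a common shell-level workaround is to chain the upload with a local delete; a minimal sketch (file and bucket names are placeholders, && ensures the delete only runs when s3cmd exits successfully):
s3cmd put file s3://logix.cz-test/ && rm file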
# wildcard? Yes
s3cmd put file-* s3://logix.cz-test/
# metadata? yes
s3cmd put file s3://logix.cz-test/ --add-header x-amz-meta-profile-id:3733
# limit upload bandwidth? Yes
s3cmd put file-* s3://logix.cz-test/ --limit-rate=90k
--limit-rate=LIMITRATE
Limit the upload or download speed to amount bytes per
second. Amount may be expressed in bytes, kilobytes
with the k suffix, or megabytes with the m suffix
s3cmd put {} s3://alex-chute-data/ --add-header='Cache-Control: public, max-age=31536000'
s3cmd sync --delete-after 2 s3://dummy-test/
# preserves the directory hierarchy
# does not delete the local files afterwards (cf. issue 958 above)
s4cmd = multithreaded s3cmd
- https://github.com/bloomreach/s4cmd
- Latest commit f5f5ff0 on Feb 8, 2017
- "Inspired by s3cmd, It strives to be compatible with the most common usage scenarios for s3cmd. It does not offer exact drop-in compatibility, due to a number of corner cases where different behavior seems preferable, or for bugfixes."
- a Python 2/3 script in 1500 lines using boto3:
pip install s4cmd
s4cmd ls [path]
List all contents of a directory.
-r/--recursive: recursively display all contents including subdirectories under the given path.
-d/--show-directory: show the directory entry instead of its content.
s4cmd put [source] [target]
Upload local files up to S3.
-r/--recursive: also upload directories recursively.
-s/--sync-check: check md5 hash to avoid uploading the same content.
-f/--force: override existing file instead of showing error message.
-n/--dry-run: emulate the operation without real upload.
s4cmd get [source] [target]
Download files from S3 to local filesystem.
-r/--recursive: also download directories recursively.
-s/--sync-check: check md5 hash to avoid downloading the same content.
-f/--force: override existing file instead of showing error message.
-n/--dry-run: emulate the operation without real download.
s4cmd dsync [source dir] [target dir]
Synchronize the contents of two directories. The directory can either be local or remote, but currently, it doesn't support two local directories.
-r/--recursive: also sync directories recursively.
-s/--sync-check: check md5 hash to avoid syncing the same content.
-f/--force: override existing file instead of showing error message.
-n/--dry-run: emulate the operation without real sync.
--delete-removed: delete files not in source directory.
s4cmd sync [source] [target]
(Obsolete, use dsync instead) Synchronize the contents of two directories. The directory can either be local or remote, but currently, it doesn't support two local directories. This command simply invokes get/put/mv commands.
-r/--recursive: also sync directories recursively.
-s/--sync-check: check md5 hash to avoid syncing the same content.
-f/--force: override existing file instead of showing error message.
-n/--dry-run: emulate the operation without real sync.
--delete-removed: delete files not in source directory. Only works when syncing local directory to s3 directory.
s4cmd cp [source] [target]
Copy a file or a directory from an S3 location to another.
-r/--recursive: also copy directories recursively.
-s/--sync-check: check md5 hash to avoid copying the same content.
-f/--force: override existing file instead of showing error message.
-n/--dry-run: emulate the operation without real copy.
s4cmd mv [source] [target]
Move a file or a directory from an S3 location to another.
-r/--recursive: also move directories recursively.
-s/--sync-check: check md5 hash to avoid moving the same content.
-f/--force: override existing file instead of showing error message.
-n/--dry-run: emulate the operation without real move.
s4cmd del [path]
Delete files or directories on S3.
-r/--recursive: also delete directories recursively.
-n/--dry-run: emulate the operation without real delete.
s4cmd du [path]
Get the size of the given directory.
Available parameters:
-r/--recursive: also add sizes of sub-directories recursively.
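A minimal usage sketch combining the flags above, a recursive upload into an explicit key prefix that skips files whose md5 already matches (bucket and directory names are placeholders):
s4cmd put -r -s directory_to_upload s3://cainthus-dummy-test/directory_to_upload/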
rclone
- https://github.com/ncw/rclone
- '"rsync for cloud storage" - Google Drive, Amazon Drive, S3, Dropbox, Backblaze B2, One Drive, Swift, Hubic, Cloudfiles, Google Cloud Storage, Yandex Files'
- https://rclone.org/docs/
rclone uses a system of subcommands. For example
rclone ls remote:path # lists a remote
rclone copy /local/path remote:path # copies /local/path to the remote
rclone sync /local/path remote:path # syncs /local/path to the remote
The main rclone commands with most used first
rclone config - Enter an interactive configuration session.
rclone copy - Copy files from source to dest, skipping already copied
rclone sync - Make source and dest identical, modifying destination only.
rclone move - Move files from source to dest.
rclone delete - Remove the contents of path.
rclone purge - Remove the path and all of its contents.
rclone mkdir - Make the path if it doesn't already exist.
rclone rmdir - Remove the path.
rclone rmdirs - Remove any empty directories under the path.
rclone check - Checks the files in the source and destination match.
rclone ls - List all the objects in the path with size and path.
rclone lsd - List all directories/containers/buckets in the path.
rclone lsl - List all the objects in the path with modification time, size and path.
rclone md5sum - Produces an md5sum file for all the objects in the path.
rclone sha1sum - Produces an sha1sum file for all the objects in the path.
rclone size - Returns the total size and number of objects in remote:path.
rclone version - Show the version number.
rclone cleanup - Clean up the remote if possible
rclone dedupe - Interactively find duplicate files and delete/rename them.
rclone authorize - Remote authorization.
rclone cat - Concatenates any files and sends them to stdout.
rclone copyto - Copy files from source to dest, skipping already copied
rclone genautocomplete - Output shell completion scripts for rclone.
rclone gendocs - Output markdown docs for rclone to the directory supplied.
rclone listremotes - List all the remotes in the config file.
rclone mount - Mount the remote as a mountpoint. EXPERIMENTAL
rclone moveto - Move file or directory from source to dest.
rclone obscure - Obscure password for use in the rclone.conf
rclone cryptcheck - Checks the integrity of a crypted remote.
rclone sync makes the DESTINATION match the SOURCE and therefore deletes files on DESTINATION if not in SOURCE.
https://forum.rclone.org/t/does-rclone-sync-only-delete-files-at-destination-and-never-source/785/3
# wildcards? No, use the --include/--exclude filter flags instead
workaround: https://forum.rclone.org/t/how-to-use-wildcards/529 (a sketch follows)
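A minimal sketch of that workaround, using rclone's filter flags instead of shell wildcards (remote name, bucket and pattern are placeholders; the remote must first be configured as an S3 backend with rclone config):
rclone copy --include "*05h*" sta_cam139D_2017-11-02 remote:alex-chute-data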
# metadata? not totally
https://forum.rclone.org/t/rclone-support-copying-object-source-metadata-and-acls/441
# limit upload bandwidth? Yes, can be scheduled + be set through an ENV VAR
--bwlimit BwTimetable Bandwidth limit in kBytes/s, or use suffix b|k|M|G or a full timetable.
--bwlimit 1M
--bwlimit "10:00,512 16:00,64 23:00,off"
export RCLONE_BWLIMIT=50k
https://github.com/ncw/rclone/issues/1227
https://github.com/ncw/rclone/blob/b2a4ea9304644c6ed2a7f541562ecb93b79153c9/docs/content/docs.md#config-file
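Putting it together for a throttled S3 upload, a hedged sketch (remote name, bucket and directory are placeholders):
rclone sync sta_cam139D_2017-11-02 remote:alex-chute-data/sta_cam139D_2017-11-02 --bwlimit 512k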
mc
- https://github.com/minio/mc
- multipart, no metadata
ls List files and folders.
mb Make a bucket or a folder.
cat Display file and object contents.
pipe Redirect STDIN to an object or file or STDOUT.
share Generate URL for sharing.
cp Copy files and objects.
mirror Mirror buckets and folders.
find Finds files which match the given set of parameters.
diff List objects with size difference or missing between two folders or buckets.
rm Remove files and objects.
events Manage object notifications.
watch Watch for file and object events.
policy Manage anonymous access to objects.
session Manage saved sessions for cp command.
config Manage mc configuration file.
update Check for a new software update.
version Print version info.
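A hedged usage sketch, assuming an S3 host alias has already been registered with mc config (alias, keys, bucket and directory names are placeholders; the exact config subcommand may differ between mc versions):
mc config host add s3 https://s3.amazonaws.com ACCESS_KEY SECRET_KEY
mc mirror sta_cam139D_2017-11-02 s3/alex-chute-data/sta_cam139D_2017-11-02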
s3-parallel-put
- https://github.com/mishudark/s3-parallel-put
- "s3-parallel-put speeds the uploading of many small keys to Amazon AWS S3 by executing multiple PUTs in parallel."
- Latest commit 10ae16b on Jul 27, 2017
- https://chris-lamb.co.uk/posts/uploading-large-number-files-to-amazon-s3
- a Python2 (only) script using boto
s3-parallel-put --bucket=BUCKET --prefix=PREFIX SOURCE
Keys are computed by combining PREFIX with the path of the file, starting from SOURCE. Values are file contents.
Options:
-h, --help - show help message
S3 options:
--bucket=BUCKET - set bucket
--bucket_region=BUCKET_REGION - set bucket region if not in us-east-1 (default new bucket region)
--host=HOST - set AWS host name
--secure and --insecure control whether a secure connection is used
Source options:
--walk=MODE - set walk mode (filesystem or tar)
--exclude=PATTERN - exclude files matching PATTERN
--include=PATTERN - don't exclude files matching PATTERN
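A minimal invocation sketch based on the usage line above (bucket, prefix and source directory are placeholders):
s3-parallel-put --bucket=alex-chute-data --prefix=sta_cam139D_2017-11-02 sta_cam139D_2017-11-02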
s3-cli
- drop-in replacement for s3cmd, written in Node: sudo npm install -g s3-cli
- Latest commit 7697fd8 on Apr 21, 2017
- Supports a subset of s3cmd's commands and parameters, including put, get, del, ls, sync, cp, mv
- When syncing directories, instead of uploading one file at a time, it uploads many files in parallel, resulting in more bandwidth.
- Uses multipart uploads for large files and uploads each part in parallel.
- Retries on failure
put
Uploads a file to S3. Assumes the target filename to be the same as the source filename (if none specified)
s3-cli put /path/to/file s3://bucket/key/on/s3
s3-cli put /path/to/source-file s3://bucket/target-file
Options:
--acl-public or -P - Store objects with ACL allowing read for anyone.
--default-mime-type - Default MIME-type for stored objects. Application default is binary/octet-stream.
--no-guess-mime-type - Don't guess MIME-type and use the default type instead.
--add-header=NAME:VALUE - Add a given HTTP header to the upload request. Can be used multiple times. For instance set 'Expires' or 'Cache-Control' headers (or both) using this options if you like.
--region=REGION-NAME - Specify the region (defaults to us-east-1)
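A hedged sketch combining the put options above (file, bucket and header value are placeholders):
s3-cli put --acl-public --add-header='Cache-Control: max-age=31536000' /path/to/file s3://bucket/key/on/s3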
get
Downloads a file from S3.
s3-cli get s3://bucket/key/on/s3 /path/to/file
del
Deletes an object or a directory on S3.
s3-cli del [--recursive] s3://bucket/key/on/s3/
ls
Lists S3 objects.
s3-cli ls [--recursive] s3://mybucketname/this/is/the/key/
sync
Sync a local directory to S3
s3-cli sync [--delete-removed] /path/to/folder/ s3://bucket/key/on/s3/
Supports the same options as put.
Sync a directory on S3 to disk
s3-cli sync [--delete-removed] s3://bucket/key/on/s3/ /path/to/folder/
cp
Copy an object which is already on S3.
s3-cli cp s3://sourcebucket/source/key s3://destbucket/dest/key
mv
Move an object which is already on S3.
s3-cli mv s3://sourcebucket/source/key s3://destbucket/dest/key
s3s3mirror
- https://github.com/cobbzilla/s3s3mirror
- Latest commit 116688e on Feb 25, 2017
- "Mirror one S3 bucket to another S3 bucket, or to/from the local filesystem."
- requires Java 6 or Java 7 or docker container at https://hub.docker.com/r/pmoust/s3s3mirror/
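No invocation example is given in these notes; a hedged sketch, assuming the wrapper script is the s3s3mirror.sh shipped in the repository (bucket names are placeholders):
s3s3mirror.sh source-bucket destination-bucket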
s3funnel
- https://github.com/sstoiana/s3funnel
- "multithreaded command line tool for Amazon's Simple Storage Service (S3)"
- Latest commit fe53718 on Feb 29, 2012
- https://aws.amazon.com/code/s3funnel-multi-threaded-command-line-tool-for-s3/
$ s3funnel --help
Usage: s3funnel BUCKET OPERATION [OPTIONS] [FILE]...
s3funnel is a multithreaded tool for performing operations on Amazon's S3.
Key Operations:
DELETE Delete key from the bucket
GET Get key from the bucket
PUT Put file into the bucket (key is the basename of the path)
Bucket Operations:
CREATE Create a new bucket
DROP Delete an existing bucket (must be empty)
LIST List keys in the bucket. If no bucket is given, buckets will be listed.
Options:
-h, --help show this help message and exit
-a AWS_KEY, --aws_key=AWS_KEY
Overrides AWS_ACCESS_KEY_ID environment variable
-s AWS_SECRET_KEY, --aws_secret_key=AWS_SECRET_KEY
Overrides AWS_SECRET_ACCESS_KEY environment variable
-t N, --threads=N Number of threads to use [default: 1]
-T SECONDS, --timeout=SECONDS
Socket timeout time, 0 is never [default: 0]
--insecure Don't use secure (https) connection
--list-marker=KEY (`list` only) Start key for list operation
--list-prefix=STRING (`list` only) Limit results to a specific prefix
--list-delimiter=CHAR
(`list` only) Treat value as a delimiter for
hierarchical listing
--put-acl=ACL (`put` only) Set the ACL permission for each file
[default: public-read]
--put-full-path (`put` only) Use the full given path as the key name,
instead of just the basename
--put-only-new (`put` only) Only PUT keys which don't already exist
in the bucket with the same md5 digest
--put-header=HEADERS (`put` only) Add the specified header to the request
--source-bucket=SOURCE_BUCKET
(`copy` only) Source bucket for files
-i FILE, --input=FILE
Read one file per line from a FILE manifest
-v, --verbose Enable verbose output. Use twice to enable debug
output
--version Output version information and exit
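A hedged invocation sketch based on the help above (bucket and file names are placeholders; the operation is written lowercase here, as in the option descriptions, though the help lists it as PUT):
s3funnel alex-chute-data put --threads=8 --put-full-path --put-only-new sta_cam139D_2017-11-02/*.jpg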
s3cmd-modification
- https://github.com/pcorliss/s3cmd-modification
- https://github.com/pearltrees/s3cmd-modification
- https://github.com/pcorliss/s3cmd-modification/pull/2
- "Modification of the s3cmd by s3tools.org to add parallel downloads and uploads. Forked as of revision 437."
- Latest commit 9a65f78 on Aug 28, 2010