AWS S3 interfaces

https://serverfault.com/questions/386910/which-is-the-fastest-way-to-copy-400g-of-files-from-an-ec2-elastic-block-store-v

https://stackoverflow.com/questions/4663016/faster-s3-bucket-duplication

aws s3 cp

https://stackoverflow.com/questions/26326408/difference-between-s3cmd-boto-and-aws-cli

https://serverfault.com/questions/737507/how-to-upload-a-large-file-using-aws-commandline-when-connection-may-be-unreliab

# wildcards? awkwardly supported, via --exclude/--include filters
aws s3 cp sta_cam139D_2017-11-02 s3://alex-chute-data/ --recursive --exclude "*" --include "*05h*"  --include "*06h*"  --include "*17h*" --include "*18h*"

# metadata? Yes
aws s3 cp --metadata profile_id=3748 tau_C0AE_2017_11_16_8_0.7z  s3://alex-admin/

-> No way to keep the top-level directory intact when copying a folder with subfolders: both aws s3 cp and aws s3 sync copy the *contents* of the source directory, so the directory name itself is dropped (see the workaround sketch below). https://serverfault.com/questions/682708/copy-directory-structure-intact-to-aws-s3-bucket

https://github.com/aws/aws-cli/issues/2069
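
A minimal sketch of the workaround (same idea as the aws s3 mv note below): repeat the top-level directory name in the destination prefix yourself. Directory and bucket names here are illustrative.

# my_dir/ and its subfolders end up under s3://my-bucket/my_dir/...
aws s3 cp my_dir s3://my-bucket/my_dir/ --recursive
# same idea with sync
aws s3 sync my_dir s3://my-bucket/my_dir/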

aws s3 mv

aws s3 mv --recursive 2 s3://cainthus-dummy-test/
move: 2/3697/37D7/2018-02-23/13/2018-02-23T13:21:03.077095+00:00.jpg to s3://cainthus-dummy-test/3697/37D7/2018-02-23/13/2018-02-23T13:21:03.077095+00:00.jpg
move: 2/3697/37D7/2018-02-23/14/2018-02-23T14:06:08.921653+00:00.jpg to s3://cainthus-dummy-test/3697/37D7/2018-02-23/14/2018-02-23T14:06:08.921653+00:00.jpg
move: 2/3697/37D7/2018-02-23/14/2018-02-23T14:36:29.679298+00:00.jpg to s3://cainthus-dummy-test/3697/37D7/2018-02-23/14/2018-02-23T14:36:29.679298+00:00.jpg
move: 2/3697/37D7/2018-02-23/13/2018-02-23T13:51:08.413945+00:00.jpg to s3://cainthus-dummy-test/3697/37D7/2018-02-23/13/2018-02-23T13:51:08.413945+00:00.jpg
move: 2/3697/37D7/2018-02-23/15/2018-02-23T15:06:29.687193+00:00.jpg to s3://cainthus-dummy-test/3697/37D7/2018-02-23/15/2018-02-23T15:06:29.687193+00:00.jpg

expected result would have been 
s3://cainthus-dummy-test/2/3697/37D7/2018-02-23/13/2018-02-23T13:21:03.077095+00:00.jpg
instead of
s3://cainthus-dummy-test/3697/37D7/2018-02-23/13/2018-02-23T13:21:03.077095+00:00.jpg

same result for `aws s3 mv --recursive ../data-collector/3 s3://cainthus-dummy-test/`
# It is always the contents of the directory that are copied, never the directory itself: when the source path is a directory, its contents are copied, not the directory name plus its contents.
# workaround: repeat the directory name in the destination prefix
aws s3 mv --recursive directory_to_upload s3://cainthus-dummy-test/directory_to_upload

increased speed?

from https://serverfault.com/questions/386910/which-is-the-fastest-way-to-copy-400g-of-files-from-an-ec2-elastic-block-store-v

Tune AWS CLI S3 Configuration values as per http://docs.aws.amazon.com/cli/latest/topic/s3-config.html.

According to that answer, the settings below increased S3 sync speed by at least 8x.

Example:

$ more ~/.aws/config
[default]
aws_access_key_id=foo
aws_secret_access_key=bar
s3 =
   max_concurrent_requests = 100
   max_queue_size = 30000
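
The same values can also be written into ~/.aws/config from the command line; a sketch, assuming the default profile:

# persist the s3 tuning values for the default profile
aws configure set default.s3.max_concurrent_requests 100
aws configure set default.s3.max_queue_size 30000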

s3cmd

sudo pip install s3cmd

# wildcard? Yes
s3cmd put file-* s3://logix.cz-test/

# metadata? yes
s3cmd put file s3://logix.cz-test/ --add-header x-amz-meta-profile-id:3733 

# limit upload bandwidth? Yes    
s3cmd put file-* s3://logix.cz-test/ --limit-rate=90k
 
      --limit-rate=LIMITRATE
                            Limit the upload or download speed to amount bytes per
                            second.  Amount may be expressed in bytes, kilobytes
                            with the k suffix, or megabytes with the m suffix


s3cmd put {} s3://alex-chute-data/ --add-header='Cache-Control: public, max-age=31536000'
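
The {} above is presumably filled in by something like find -exec; a hedged sketch of that pattern (the file glob and bucket are illustrative):

# upload every .jpg under the current directory, one s3cmd call per file
find . -name '*.jpg' -exec s3cmd put {} s3://alex-chute-data/ --add-header='Cache-Control: public, max-age=31536000' \;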

s3cmd sync --delete-after 2 s3://dummy-test/
# preserves the directory hierarchy: the top-level directory 2 is kept as a key prefix
# did not delete anything afterwards: --delete-after only controls *when* deletions happen; without --delete-removed nothing is deleted

s4cmd = multithreaded s3cmd

  • https://github.com/bloomreach/s4cmd
  • Latest commit f5f5ff0 on Feb 8, 2017
  • "Inspired by s3cmd, it strives to be compatible with the most common usage scenarios for s3cmd. It does not offer exact drop-in compatibility, due to a number of corner cases where different behavior seems preferable, or for bugfixes."
  • a Python 2/3 script of ~1500 lines using boto3: pip install s4cmd
s4cmd ls [path]

List all contents of a directory.

    -r/--recursive: recursively display all contents including subdirectories under the given path.
    -d/--show-directory: show the directory entry instead of its content.

s4cmd put [source] [target]

Upload local files up to S3.

    -r/--recursive: also upload directories recursively.
    -s/--sync-check: check md5 hash to avoid uploading the same content.
    -f/--force: override existing file instead of showing error message.
    -n/--dry-run: emulate the operation without real upload.
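
A usage sketch combining only the flags listed above (directory and bucket names are made up):

# dry-run first, then upload the directory recursively, skipping files whose md5 already matches
s4cmd put -r -s -n local_dir s3://my-bucket/local_dir/
s4cmd put -r -s local_dir s3://my-bucket/local_dir/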

s4cmd get [source] [target]

Download files from S3 to local filesystem.

    -r/--recursive: also download directories recursively.
    -s/--sync-check: check md5 hash to avoid downloading the same content.
    -f/--force: override existing file instead of showing error message.
    -n/--dry-run: emulate the operation without real download.

s4cmd dsync [source dir] [target dir]

Synchronize the contents of two directories. The directory can either be local or remote, but currently, it doesn't support two local directories.

    -r/--recursive: also sync directories recursively.
    -s/--sync-check: check md5 hash to avoid syncing the same content.
    -f/--force: override existing file instead of showing error message.
    -n/--dry-run: emulate the operation without real sync.
    --delete-removed: delete files not in source directory.
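
A local-to-S3 dsync sketch using only the flags listed above (paths are placeholders):

# mirror local_dir into the bucket prefix and delete remote files that no longer exist locally
s4cmd dsync -r --delete-removed local_dir s3://my-bucket/local_dir/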

s4cmd sync [source] [target]

(Obsolete, use dsync instead) Synchronize the contents of two directories. The directory can either be local or remote, but currently, it doesn't support two local directories. This command simply invokes the get/put/mv commands.

    -r/--recursive: also sync directories recursively.
    -s/--sync-check: check md5 hash to avoid syncing the same content.
    -f/--force: override existing file instead of showing error message.
    -n/--dry-run: emulate the operation without real sync.
    --delete-removed: delete files not in source directory. Only works when syncing local directory to s3 directory.

s4cmd cp [source] [target]

Copy a file or a directory from a S3 location to another.

    -r/--recursive: also copy directories recursively.
    -s/--sync-check: check md5 hash to avoid copying the same content.
    -f/--force: override existing file instead of showing error message.
    -n/--dry-run: emulate the operation without real copy.

s4cmd mv [source] [target]

Move a file or a directory from a S3 location to another.

    -r/--recursive: also move directories recursively.
    -s/--sync-check: check md5 hash to avoid moving the same content.
    -f/--force: override existing file instead of showing error message.
    -n/--dry-run: emulate the operation without real move.

s4cmd del [path]

Delete files or directories on S3.

    -r/--recursive: also delete directories recursively.
    -n/--dry-run: emulate the operation without real delete.

s4cmd du [path]

Get the size of the given directory.

Available parameters:

    -r/--recursive: also add sizes of sub-directories recursively.

rclone

rclone uses a system of subcommands. For example:

rclone ls remote:path # lists a remote path
rclone copy /local/path remote:path # copies /local/path to the remote
rclone sync /local/path remote:path # syncs /local/path to the remote

The main rclone commands, with the most used first:

    rclone config - Enter an interactive configuration session.
    rclone copy - Copy files from source to dest, skipping already copied
    rclone sync - Make source and dest identical, modifying destination only.
    rclone move - Move files from source to dest.
    rclone delete - Remove the contents of path.
    rclone purge - Remove the path and all of its contents.
    rclone mkdir - Make the path if it doesn't already exist.
    rclone rmdir - Remove the path.
    rclone rmdirs - Remove any empty directories under the path.
    rclone check - Checks the files in the source and destination match.
    rclone ls - List all the objects in the path with size and path.
    rclone lsd - List all directories/containers/buckets in the path.
    rclone lsl - List all the objects path with modification time, size and path.
    rclone md5sum - Produces an md5sum file for all the objects in the path.
    rclone sha1sum - Produces an sha1sum file for all the objects in the path.
    rclone size - Returns the total size and number of objects in remote:path.
    rclone version - Show the version number.
    rclone cleanup - Clean up the remote if possible
    rclone dedupe - Interactively find duplicate files delete/rename them.
    rclone authorize - Remote authorization.
    rclone cat - Concatenates any files and sends them to stdout.
    rclone copyto - Copy files from source to dest, skipping already copied
    rclone genautocomplete - Output shell completion scripts for rclone.
    rclone gendocs - Output markdown docs for rclone to the directory supplied.
    rclone listremotes - List all the remotes in the config file.
    rclone mount - Mount the remote as a mountpoint. EXPERIMENTAL
    rclone moveto - Move file or directory from source to dest.
    rclone obscure - Obscure password for use in the rclone.conf
    rclone cryptcheck - Checks the integrity of a crypted remote.
rclone sync makes the DESTINATION match the SOURCE and therefore deletes files on DESTINATION if not in SOURCE.
https://forum.rclone.org/t/does-rclone-sync-only-delete-files-at-destination-and-never-source/785/3

# wildcards? No (rclone does not expand wildcards in paths)
# workaround: https://forum.rclone.org/t/how-to-use-wildcards/529
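
In practice the workaround is rclone's filter flags; a hedged sketch (paths and patterns are illustrative):

# copy only files whose names contain 05h or 06h
rclone copy /local/path remote:bucket/path --include "*05h*" --include "*06h*"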

# metadata? not fully supported
https://forum.rclone.org/t/rclone-support-copying-object-source-metadata-and-acls/441 

# limit upload bandwidth? Yes; the limit can follow a schedule and can also be set through an environment variable
--bwlimit BwTimetable                 Bandwidth limit in kBytes/s, or use suffix b|k|M|G or a full timetable.
--bwlimit 1M
--bwlimit "10:00,512 16:00,64 23:00,off"

export RCLONE_BWLIMIT=50k
https://github.com/ncw/rclone/issues/1227
https://github.com/ncw/rclone/blob/b2a4ea9304644c6ed2a7f541562ecb93b79153c9/docs/content/docs.md#config-file
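
For reference, a sketch of what an S3 remote stanza in rclone.conf might look like (remote name and credentials are dummies):

[s3remote]
type = s3
env_auth = false
access_key_id = foo
secret_access_key = bar
region = us-east-1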

mc

https://github.com/minio/mc : supports multipart uploads, but no custom metadata

ls       List files and folders.
mb       Make a bucket or a folder.
cat      Display file and object contents.
pipe     Redirect STDIN to an object or file or STDOUT.
share    Generate URL for sharing.
cp       Copy files and objects.
mirror   Mirror buckets and folders.
find     Finds files which match the given set of parameters.
diff     List objects with size difference or missing between two folders or buckets.
rm       Remove files and objects.
events   Manage object notifications.
watch    Watch for file and object events.
policy   Manage anonymous access to objects.
session  Manage saved sessions for cp command.
config   Manage mc configuration file.
update   Check for a new software update.
version  Print version info.
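
A usage sketch, assuming the mc config host add syntax of that era for registering an S3 endpoint (alias, credentials, bucket and paths are placeholders):

# register the S3 endpoint under the alias "s3"
mc config host add s3 https://s3.amazonaws.com ACCESS_KEY SECRET_KEY
# mirror a local folder into a bucket prefix
mc mirror local_dir s3/my-bucket/local_dir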

s3-parallel-put

s3-parallel-put --bucket=BUCKET --prefix=PREFIX SOURCE

Keys are computed by combining PREFIX with the path of the file, starting from SOURCE. Values are file contents.
Options:

    -h, --help - show help message

S3 options:

    --bucket=BUCKET - set bucket
    --bucket_region=BUCKET_REGION - set bucket region if not in us-east-1 (default new bucket region)
    --host=HOST - set AWS host name
    --secure and --insecure control whether a secure connection is used

Source options:

    --walk=MODE - set walk mode (filesystem or tar)
    --exclude=PATTERN - exclude files matching PATTERN
    --include=PATTERN - don't exclude files matching PATTERN
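
An invocation sketch using only the options listed above (bucket, prefix and source path are made up):

# keys become backups/2018/<path of each file relative to ./data>
s3-parallel-put --bucket=my-bucket --prefix=backups/2018 --exclude='*.tmp' ./data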

s3-cli

  • https://github.com/andrewrk/node-s3-cli

  • drop-in replacement for s3cmd, written in Node.js: sudo npm install -g s3-cli

  • Latest commit 7697fd8 on Apr 21, 2017

  • Supports a subset of s3cmd's commands and parameters including put, get, del, ls, sync, cp, mv

  • When syncing directories, instead of uploading one file at a time, it uploads many files in parallel resulting in more bandwidth.

  • Uses multipart uploads for large files and uploads each part in parallel.

  • Retries on failure

put
Uploads a file to S3. Assumes the target filename to be the same as the source filename (if none specified)

s3-cli put /path/to/file s3://bucket/key/on/s3
s3-cli put /path/to/source-file s3://bucket/target-file

Options:
    --acl-public or -P - Store objects with ACL allowing read for anyone.
    --default-mime-type - Default MIME-type for stored objects. Application default is binary/octet-stream.
    --no-guess-mime-type - Don't guess MIME-type and use the default type instead.
    --add-header=NAME:VALUE - Add a given HTTP header to the upload request. Can be used multiple times. For instance, set 'Expires' or 'Cache-Control' headers (or both) using this option.
    --region=REGION-NAME - Specify the region (defaults to us-east-1)


get
Downloads a file from S3.

s3-cli get s3://bucket/key/on/s3 /path/to/file


del
Deletes an object or a directory on S3.

s3-cli del [--recursive] s3://bucket/key/on/s3/


ls
Lists S3 objects.

s3-cli ls [--recursive] s3://mybucketname/this/is/the/key/


sync
Sync a local directory to S3

s3-cli sync [--delete-removed] /path/to/folder/ s3://bucket/key/on/s3/

Supports the same options as put.

Sync a directory on S3 to disk

s3-cli sync [--delete-removed] s3://bucket/key/on/s3/ /path/to/folder/
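
Since sync takes the same options as put, a hedged sketch combining them (paths, bucket and header value are illustrative):

# push a local folder with a public ACL and a Cache-Control header, deleting remote files that were removed locally
s3-cli sync --delete-removed --acl-public --add-header='Cache-Control: max-age=86400' /path/to/folder/ s3://bucket/key/on/s3/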


cp
Copy an object which is already on S3.

s3-cli cp s3://sourcebucket/source/key s3://destbucket/dest/key


mv
Move an object which is already on S3.

s3-cli mv s3://sourcebucket/source/key s3://destbucket/dest/key

s3s3mirror

s3funnel

https://github.com/sstoiana/s3funnel : "multithreaded command line tool for Amazon's Simple Storage Service (S3)". Latest commit fe53718 on Feb 29, 2012.

https://aws.amazon.com/code/s3funnel-multi-threaded-command-line-tool-for-s3/

$ s3funnel --help
Usage: s3funnel BUCKET OPERATION [OPTIONS] [FILE]...

s3funnel is a multithreaded tool for performing operations on Amazon's S3.

Key Operations:
    DELETE Delete key from the bucket
    GET    Get key from the bucket
    PUT    Put file into the bucket (key is the basename of the path)

Bucket Operations:
    CREATE Create a new bucket
    DROP   Delete an existing bucket (must be empty)
    LIST   List keys in the bucket. If no bucket is given, buckets will be listed.


Options:
  -h, --help        show this help message and exit
  -a AWS_KEY, --aws_key=AWS_KEY
            Overrides AWS_ACCESS_KEY_ID environment variable
  -s AWS_SECRET_KEY, --aws_secret_key=AWS_SECRET_KEY
            Overrides AWS_SECRET_ACCESS_KEY environment variable
  -t N, --threads=N     Number of threads to use [default: 1]
  -T SECONDS, --timeout=SECONDS
            Socket timeout time, 0 is never [default: 0]
  --insecure        Don't use secure (https) connection
  --list-marker=KEY     (`list` only) Start key for list operation
  --list-prefix=STRING  (`list` only) Limit results to a specific prefix
  --list-delimiter=CHAR
            (`list` only) Treat value as a delimiter for
            hierarchical listing
  --put-acl=ACL     (`put` only) Set the ACL permission for each file
            [default: public-read]
  --put-full-path       (`put` only) Use the full given path as the key name,
            instead of just the basename
  --put-only-new    (`put` only) Only PUT keys which don't already exist
            in the bucket with the same md5 digest
  --put-header=HEADERS  (`put` only) Add the specified header to the request
  --source-bucket=SOURCE_BUCKET
            (`copy` only) Source bucket for files
  -i FILE, --input=FILE
            Read one file per line from a FILE manifest
  -v, --verbose     Enable verbose output. Use twice to enable debug
            output
  --version         Output version information and exit
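
A usage sketch built only from the help text above (bucket and file names are placeholders):

# upload files with 8 threads, keeping the relative path as the key name
s3funnel my-bucket put -t 8 --put-full-path data/a.jpg data/b.jpg
# list keys under a prefix
s3funnel my-bucket list --list-prefix=data/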

s3cmd-modification

https://github.com/pcorliss/s3cmd-modification
https://github.com/pearltrees/s3cmd-modification
https://github.com/pcorliss/s3cmd-modification/pull/2
"Modification of the s3cmd by s3tools.org to add parallel downloads and uploads. Forked as of revision 437." Latest commit 9a65f78 on Aug 28, 2010.