Configuring direct uploads and downloads from S3

Introduction

Recently we have had problems on the beta server with the addition of the new Salmon modules, which either take many GB of inputs (e.g. job 64270, which took 16 fq.gz files at 1.5 to 2 GB each plus a 19 GB index, around 48 GB of inputs in total) or produce outputs on the same scale (e.g. the generation of that 19 GB index).

This is a problem for the GP head node because, with the existing implementation, input files are transferred client -> head node -> S3 -> compute node, and outputs go compute node -> S3 -> head node (and then possibly -> client).

The head node therefore shuffles the data twice inbound, plus another one or two times outbound, which causes its CPU to spike and performance to suffer.

To address this, we have added the ability for the GP client to upload and download files directly to and from AWS S3 using presigned URLs. In this way the GP head node can be removed from the data flow path (helping performance), and up/downloads can use the AWS S3 machines, which have much more bandwidth (also helping performance), with the added benefit of reducing the size of the local disk required for the head node (saving $$).

In addition, we modified the AWSBatchJobRunner (and GenePatternAnalysisTask) so that when a server is configured for S3 direct uploads and downloads, it also changes the handling of URLs passed to file parameters. Normally the GenePattern head node downloads the URL to a file and then passes all file parameters the same way (NFS or S3). In the new setup the head node does not download the file; instead, the AWSBatchJobRunner replaces the step that would pull the file from S3 with one that uses wget to retrieve the URL. The upshot is that the compute node (and not the head node) now downloads the URL, and the shuffle of the resulting file to/from S3 is eliminated on both the head and compute nodes.
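As a simplified illustration (the actual commands are generated by the job runner; the S3 path and URL below are made up):

```bash
# Before: the head node downloaded the URL, staged the file in S3, and the
# compute node then had to pull it back down before running the module:
aws s3 cp s3://moduleiotest/gp-dev-ami/users/example/input.fq.gz input.fq.gz

# After: the compute node retrieves the original URL directly:
wget -O input.fq.gz "https://example.org/data/input.fq.gz"
```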

The remainder of this doc explains the limitations and configuration requirements to enable this feature on a GenePattern server.

Limitations

  • the GenePattern server needs to be using S3 to share data with compute nodes
  • file uploads go directly to S3 only through the files pane and the drop target on the GP home page

Inputs uploaded to file parameters from the Python client still use the old mechanism (for now).

Configuration

Enable CORS on S3 bucket

You must enable CORS on the S3 bucket used for this purpose (https://docs.aws.amazon.com/AmazonS3/latest/user-guide/add-cors-configuration.html). This is necessary because the request will originate from a beta.genepattern.org URL (or some other GP server URL), not from an S3 URL.
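For example, a minimal rule set as entered in the S3 console's CORS editor (the AllowedOrigins value is an example; use your own server's origin, and note that exposing ETag is typically required so that browser-based multipart uploads can read each part's ETag):

```json
[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET", "PUT", "POST"],
        "AllowedOrigins": ["https://beta.genepattern.org"],
        "ExposeHeaders": ["ETag"]
    }
]
```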

Install python boto3 package on GP head node

The scripts that make calls to generate presigned URLs for S3 resources are implemented partly as bash scripts (where simple) and partly as python scripts (for the multipart uploads, where JSON manipulation is required and python makes that easier).

See https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-python3-boto3/ for instructions on installing boto3 for python3 on an EC2 instance.

On a related note, */resources/wrapper_scripts/aws_batch/init-aws-cli-env.sh needs to be set up properly to ensure that the AWS CLI is available, that the correct credentials are the default (unless on an AWS node with an IAM role delegated to it), and that a python with boto3 is the first python on the path (e.g. by sourcing a conda env if needed). (On beta it uses "source /home/gpserver/genepattern_py_venv/env/bin/activate".)
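A sketch of that script's responsibilities (this is illustrative, not the shipped file; the PATH entry and profile name are examples, and only the activate line is taken from the beta deployment):

```bash
#!/bin/bash
# Illustrative sketch of what init-aws-cli-env.sh needs to accomplish.

# 1. Make sure the AWS CLI is on the PATH.
export PATH="/usr/local/bin:$PATH"

# 2. Make the correct credentials the default, unless this node has an
#    IAM role delegated to it (in which case no profile is needed).
export AWS_PROFILE="genepattern"   # example profile name

# 3. Ensure the first python on the PATH has boto3 installed, e.g. by
#    sourcing a virtualenv or conda env (this is what beta does):
source /home/gpserver/genepattern_py_venv/env/bin/activate
```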

Copy gp-awsbatch-0.4.5-snapshot.29.jar (or later)

The implementation first became available in the gp-jobrunner-awsbatch codebase with the snapshot.16 release (S3 up/down) and the snapshot.29 release (URL parameter handling). Later versions are assumed to include these updates as well.

Ensure that gp-awsbatch-0.4.5-snapshot.29.jar is in /Tomcat/webapps/gp/WEB-INF/lib and that older versions of the jar are removed or renamed so that they no longer end in ".jar", to prevent classpath oddities.

Also copy the wrapper scripts for the same version to /resources/wrapper_scripts/aws_batch. The scripts can be copied from beta or from git at https://github.com/genepattern/genepattern-server/tree/develop/gp-jobrunner-awsbatch/src/main/scripts. The scripts that must be present for this feature are:

  • presignUpload.sh
  • presignUpload.py
  • completeUpload.sh
  • init-aws-cli-env.sh (this one is needed for any AWS batch use)

While it is not in a separate script (it was too simple to need one), the S3 direct downloads also call "aws s3 presign" to generate presigned download URLs.
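For example (the bucket and key are illustrative):

```bash
# Generate a presigned download URL that expires after one hour (3600 seconds)
aws s3 presign s3://moduleiotest/gp-dev-ami/users/example/result.txt --expires-in 3600
```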

Set properties in custom_config.yaml

  • direct_external_upload_trigger_size: 1
  • "aws-s3-root": "s3://moduleiotest/gp-dev-ami"
  • aws-batch-script-dir: "/opt/gpbeta/gp_home/resources/wrapper_scripts/aws_batch/"
  • upload.aws.s3.presigning.script: "presignUpload.sh"
  • external.file.manager.class: "org.genepattern.server.executor.awsbatch.AWSS3ExternalFileManager"

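Put together, the additions look roughly like this (a sketch; on a typical GenePattern server these keys belong in the default.properties section of the config file, but verify the placement against your existing custom_config.yaml):

```yaml
# Sketch only; merge these keys into your existing config rather than replacing it.
default.properties:
    direct_external_upload_trigger_size: 1
    "aws-s3-root": "s3://moduleiotest/gp-dev-ami"
    aws-batch-script-dir: "/opt/gpbeta/gp_home/resources/wrapper_scripts/aws_batch/"
    upload.aws.s3.presigning.script: "presignUpload.sh"
    external.file.manager.class: "org.genepattern.server.executor.awsbatch.AWSS3ExternalFileManager"
```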
The details of these properties are discussed below.

Common Configuration Properties

"aws-s3-root": "s3://moduleiotest/gp-dev-ami"

The URL of the bucket, plus a directory within it, that will be used as the root location within S3. This value was already defined for the AWSBatch Job Runner, but for the direct up/downloads it must be moved to the general part of the config file rather than buried in the job runner config (where only the job runner can access it).

aws-batch-script-dir: "/opt/gpbeta/gp_home/resources/wrapper_scripts/aws_batch/"

The path (on the head node) to the directory holding the scripts. As with aws-s3-root, this value was already defined for the AWSBatch Job Runner, but for the direct up/downloads it must be moved to the general part of the config file rather than buried in the job runner config (where only the job runner can access it).

Upload-specific Configuration Properties

direct_external_upload_trigger_size: 1

Units: bytes

Files dropped on the drop target that are smaller than this size use the old multi-part upload system (based on ResumableJS). To use only the old system, set this to -1. To upload all files directly to S3, set this to 0 or 1. To direct-upload only large files, enter a large threshold.

For the first implementation you must enter an integer number of bytes (i.e. "10 MB" will not work; you must use "10485760" as the value).
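For example, to direct-upload only files larger than 10 MB:

```yaml
direct_external_upload_trigger_size: 10485760   # 10 MB, expressed as an integer byte count
```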

upload.aws.s3.presigning.script: "presignUpload.sh"

The name of the script (found in the aws-batch-script-dir) used for generating upload URLs. It will be given an input like this:

```json
{
    "bucket": "moduleiotest",
    "path": "some/path/filename.txt",
    "numParts": 2,
    "contentType": "text/plain"
}
```

It must return the JSON object from the AWS create-multipart-upload command, with the addition of an array of presigned URLs in a property called "presignedUrls". For example:

```json
{
  "ResponseMetadata": {
    "RequestId": "D6F8C3DC81EA644B",
    "HostId": "8EewffBTa2sUKoRrghlRQi1rnMN47Se6H635nYQ5bpoOesUYP9p3Wq7aNbzBMitZGO8h0aU/QHo=",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amz-id-2": "8EewffBTa2sUKoRrghlRQi1rnMN47Se6H635nYQ5bpoOesUYP9p3Wq7aNbzBMitZGO8h0aU/QHo=",
      "x-amz-request-id": "D6F8C3DC81EA644B",
      "date": "Thu, 28 Jan 2021 22:23:57 GMT",
      "x-amz-abort-date": "Wed, 03 Feb 2021 00:00:00 GMT",
      "x-amz-abort-rule-id": "drop incomplete multipart uploads",
      "transfer-encoding": "chunked",
      "server": "AmazonS3"
    },
    "RetryAttempts": 0
  },
  "AbortDate": "2021-02-03T00:00:00+00:00",
  "AbortRuleId": "drop incomplete multipart uploads",
  "Bucket": "gp-temp-test-bucket",
  "Key": "tedslaptop/Users/liefeld/Documents/workspace-gpserver/.metadata/.plugins/org.eclipse.wst.server.core/users/ted/uploads/foofoo/GPserver.b309.bin",
  "UploadId": "3uWlUIqTNxeXkHE49hNJBC4L0cCwg7bLd2cyhLmgjWgtnm9z.TJzJx1MdJ0iwFotFPUgleu0mgPkxfPnbM4APQ05GiqOPLOUl7VoyEKjorm6S02J9cgILRHV_3p.jUdV",
  "presignedUrls": [
    "https://gp-temp-test-bucket.s3.amazonaws.com/tedslaptop/Users/liefeld/Documents/workspace-gpserver/.metadata/.plugins/org.eclipse.wst.server.core/users/ted/uploads/foofoo/GPserver.b309.bin?uploadId=3uWlUIqTNxeXkHE49hNJBC4L0cCwg7bLd2cyhLmgjWgtnm9z.TJzJx1MdJ0iwFotFPUgleu0mgPkxfPnbM4APQ05GiqOPLOUl7VoyEKjorm6S02J9cgILRHV_3p.jUdV&partNumber=1&AWSAccessKeyId=AKIA2OLTZKPMVD2WZ2UW&Signature=Zjwpp7wCEDyVGQIpisJoAdk5BxA%3D&Expires=1611876236",
    "https://gp-temp-test-bucket.s3.amazonaws.com/tedslaptop/Users/liefeld/Documents/workspace-gpserver/.metadata/.plugins/org.eclipse.wst.server.core/users/ted/uploads/foofoo/GPserver.b309.bin?uploadId=3uWlUIqTNxeXkHE49hNJBC4L0cCwg7bLd2cyhLmgjWgtnm9z.TJzJx1MdJ0iwFotFPUgleu0mgPkxfPnbM4APQ05GiqOPLOUl7VoyEKjorm6S02J9cgILRHV_3p.jUdV&partNumber=2&AWSAccessKeyId=AKIA2OLTZKPMVD2WZ2UW&Signature=KCOcIP4u4EFqi64OG2iixHC%2BBn4%3D&Expires=1611876236"
  ]
}
```
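For reference, here is a minimal python sketch of what such a presigning script might do with boto3 (this is an illustration, not the shipped presignUpload.py; it assumes boto3 credentials are already configured and that the request JSON arrives on stdin):

```python
import json
import sys

import boto3


def presign_multipart_upload(bucket, key, num_parts, content_type, expires=3600):
    s3 = boto3.client("s3")
    # Start the multipart upload; the response carries the UploadId that
    # every subsequent part upload and the final completion must reference.
    response = s3.create_multipart_upload(
        Bucket=bucket, Key=key, ContentType=content_type
    )
    upload_id = response["UploadId"]
    # Add one presigned PUT URL per part (part numbers are 1-based).
    response["presignedUrls"] = [
        s3.generate_presigned_url(
            "upload_part",
            Params={
                "Bucket": bucket,
                "Key": key,
                "UploadId": upload_id,
                "PartNumber": part,
            },
            ExpiresIn=expires,
        )
        for part in range(1, num_parts + 1)
    ]
    return response


if __name__ == "__main__":
    # Read {"bucket": ..., "path": ..., "numParts": ..., "contentType": ...}
    # from stdin and write the augmented response JSON to stdout.
    req = json.load(sys.stdin)
    result = presign_multipart_upload(
        req["bucket"], req["path"], req["numParts"], req["contentType"]
    )
    # default=str makes the datetime fields (e.g. AbortDate) serializable.
    print(json.dumps(result, default=str))
```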

Download-specific Configuration Properties

external.file.manager.class: "org.genepattern.server.executor.awsbatch.AWSS3ExternalFileManager"

Previously this behavior used the key download.aws.s3.downloader.class: "org.genepattern.server.executor.awsbatch.AWSS3ExternalFileManager". That key is now obsolete, but it is still necessary until the next release (after JobRunner v29 and GenePattern server b322).
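Until then, it is safest to set both keys to the same class:

```yaml
external.file.manager.class: "org.genepattern.server.executor.awsbatch.AWSS3ExternalFileManager"
download.aws.s3.downloader.class: "org.genepattern.server.executor.awsbatch.AWSS3ExternalFileManager"
```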

This defines the class that will handle the downloads, uploads, and URL retrievals. If it is not set, the old (direct) download servlet will still do the work. If it is set, the HTTP connection to the GP server is handed off to this class instead. The class named here (AWSS3ExternalFileManager) generates a presigned URL to S3 and then issues a redirect (301) to that URL.

When a job completes, the AWSBatchJobRunner class, instead of syncing the output files from S3 to the local drive, writes a file called ".non.retrieved.output.files.json", which GenePatternAnalysisTask reads and uses to create the database records for the output files.

NOTE: An important thing to be aware of is that when the redirect happens, the presigned AWS URL carries all the necessary authentication in its query parameters. Some HTTP clients (Python's urllib, the Java Apache HTTP classes) will also send the GenePattern basic auth header on this redirected call, which makes S3 return a 400 error saying to use only a single authentication method. Other clients (any web browser, Java's native URLConnection class) do not pass the basic auth header on to S3 and do not experience this problem.
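One workaround for affected clients is to stop at the redirect and then fetch the presigned URL without the auth header. A minimal python sketch (the server URL, file path, and credentials are placeholders):

```python
import base64
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib surface the 3xx response as an HTTPError
    # instead of silently following it (and re-sending our auth header).
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


gp_url = "https://beta.genepattern.org/gp/jobResults/64270/output.txt"  # placeholder
auth = "Basic " + base64.b64encode(b"username:password").decode()       # placeholder

opener = urllib.request.build_opener(NoRedirect)
request = urllib.request.Request(gp_url, headers={"Authorization": auth})
try:
    # Non-redirect case: the server streams the file to us directly.
    body = opener.open(request).read()
except urllib.error.HTTPError as err:
    if err.code not in (301, 302, 303, 307, 308):
        raise
    # Follow the redirect manually, WITHOUT the GenePattern auth header;
    # the presigned S3 URL carries its own credentials in the query string.
    with urllib.request.urlopen(err.headers["Location"]) as s3_response:
        body = s3_response.read()

with open("output.txt", "wb") as out:
    out.write(body)
```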
