AWS Batch Integration

GenePattern on AWS is designed to make cloud computing effortless for bioinformatics developers. Analysis jobs run in Docker containers on Amazon's compute and data storage resources. Jobs are scheduled via the AWS Batch system. The entire GenePattern ecosystem is fully supported on the AWS cloud infrastructure.

AWS Batch

AWS Batch is Amazon's service for running hundreds of thousands of compute jobs on EC2 or Spot Instances without having to install or manage third-party or open source batch processing software. It is integrated with AWS data stores such as S3 and DynamoDB. Each job runs in a Docker container on AWS compute infrastructure.

Docker

"A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: OS, code, runtime, system tools, system libraries, settings. ... containerized software will always run the same, regardless of the environment."

Links

HOWTO: configure AWS S3 storage

The GenePattern Server uses an S3 bucket for intermediate data storage. Data files are copied into S3 before AWS Batch job submission. They are copied from S3 into the Docker container before the module command line runs. When the command completes, output files, as well as any libraries installed in the course of the job run, are copied from the Docker container into S3. Finally, they are copied from S3 to the GenePattern Server head node.

Copies to/from S3 are done with the AWS CLI (Command Line Interface) 'sync' command, which only copies files that have changed, so when the contents of a directory are unchanged the cost is minimal. Also, since all copying is done within the AWS data center, copy operations are very fast. Initial benchmarking saw average speeds of ~45 GB/second, with peak speeds occasionally reaching 100 GB/s.
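
The end-to-end flow described above is roughly the following (a sketch only: the local paths are placeholders, s3://gpbeta is the example bucket used later on this page, and the server's wrapper scripts perform these steps automatically):

# 1. On the head node: copy input files into S3 before AWS Batch job submission
#    (all paths below are placeholders for illustration)
aws s3 sync /opt/gpbeta/jobs/101 s3://gpbeta/opt/gpbeta/jobs/101

# 2. Inside the Docker container: copy the inputs down from S3 before running the module command line
aws s3 sync s3://gpbeta/opt/gpbeta/jobs/101 /opt/gpbeta/jobs/101

# 3. Inside the container, after the command completes: copy outputs (and any installed libraries) back into S3
aws s3 sync /opt/gpbeta/jobs/101 s3://gpbeta/opt/gpbeta/jobs/101

# 4. On the head node: copy the results from S3 back down to the GenePattern Server
aws s3 sync s3://gpbeta/opt/gpbeta/jobs/101 /opt/gpbeta/jobs/101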

The aws-s3-root configuration parameter determines where the files are placed in S3. This root prefix is applied uniformly to all local file paths on the GenePattern head node. Files are copied into S3 according to this template:

  • aws-s3-root=s3://<s3-bucket>[/<s3-prefix>]
  • S3Uri=<aws-s3-root><fq-file-path>
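
For example, assuming aws-s3-root is set to s3://gpbeta (the example value used below) and a purely hypothetical input file at /opt/gpbeta/jobResults/101/all_aml_train.gct on the head node, the mapping would be:

  • aws-s3-root=s3://gpbeta
  • fq-file-path=/opt/gpbeta/jobResults/101/all_aml_train.gct (hypothetical path)
  • S3Uri=s3://gpbeta/opt/gpbeta/jobResults/101/all_aml_train.gct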

AWS S3 template:

aws s3 sync <LocalPath> <S3Uri> \
  [--exclude <exclude_pattern>] [--include <include_pattern>] [--profile <aws-profile>]

See the AWS CLI Command Reference for a detailed description of the S3Uri and LocalPath arguments.

S3Uri: represents the location of an S3 object, prefix, or bucket. This must be written in the form s3://mybucket/mykey, where mybucket is the specified S3 bucket and mykey is the specified S3 key. The path argument must begin with s3:// in order to denote that the path argument refers to an S3 object. Note that prefixes are separated by forward slashes. For example, if the S3 object myobject had the prefix myprefix, the S3 key would be myprefix/myobject, and if the object was in the bucket mybucket, the S3Uri would be s3://mybucket/myprefix/myobject.

GenePattern wrapper template:

# Upload directory
#   set localPath to the fully qualified path to the directory
aws s3 sync <localPath> <aws-s3-root><localPath> \
    [--profile <aws_profile>]
# Upload file
#   set localPath to the fully qualified path to the parent directory of
#   the file to upload
aws s3 sync <localPath> <aws-s3-root><localPath> \
    --exclude "*" \
    --include "<filename>" \
    [--profile <aws_profile>]
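
For example, a single-file upload filled in with the example bucket s3://gpbeta and purely hypothetical local path, filename, and profile would look like:

# hypothetical values for illustration only
aws s3 sync /opt/gpbeta/jobResults/101 s3://gpbeta/opt/gpbeta/jobResults/101 \
    --exclude "*" \
    --include "all_aml_train.gct" \
    --profile genepattern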

Configuration parameters

aws-s3-root
    The root prefix to apply uniformly to all local file paths on the GenePattern head node.
    Example config, aws-s3-root: s3://gpbeta

job.docker.image
    The Docker image in which to run the batch job. This corresponds to the IMAGE[:TAG|@DIGEST] option of the docker run command. The property was added to support the AWS Batch integration, but is intended to be used more generally for Docker-enabled GenePattern instances.
    Example config, job.docker.image: genepattern/docker-java17:0.12
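
As a minimal sketch, the two parameters might be set together in config_custom.yaml like this (the enclosing default.properties block is an assumption for illustration; the property names and example values come from the table above):

default.properties:
    # root S3 prefix applied to local paths on the head node (example value from above)
    aws-s3-root: "s3://gpbeta"
    # default Docker image for batch jobs; individual modules can override it
    job.docker.image: "genepattern/docker-java17:0.12"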

S3 Links

HOWTO: configure a module for Singularity (for example Carbonate on GP@IU)

  1. Start with a module that is already working in Docker on AWS Batch (e.g. NMFConsensus_v5).
  2. Determine which container to use (e.g. genepattern/docker-r-2-5).
  3. Download or select

HOWTO: configure a module for AWS Batch integration

  1. Declare a runtime environment (e.g. python/3.6).
  2. Create or select a container (e.g. genepattern/docker-python36).
  3. Create or select an AWS Batch job definition (e.g. Python36_Generic).
  4. Create or select an 'executor.properties' entry in the config file.
  5. Associate a module (by name or full LSID) with an executor.props entry in the config file.

Edit config_custom.yaml to associate a module (by name, LSID, or LSID:version) with a particular Docker image. Example module: txt2odf

name                  txt2odf
commandLine           <python_3.6> ...
runtime environment   python/3.6
docker image          genepattern/docker-python36
job definition        Python36_Generic

Create or select an 'executor.properties' entry in the config file. The entry name (here, python/3.6) is what a module's executor.props setting refers to; the entry selects the AWS Batch job definition and defines the value substituted for <python_3.6> in the module command line.

executor.properties: {
    ...
    "python/3.6": { 
        job.virtualQueue: "python/3.6",
        aws-batch-job-definition-name: "Python36_Generic",
        "python_3.6": "python",
    },
    ...
}

Associate a module (by name or full LSID) with an executor.props entry in the config file

module.properties:
    ...
    "txt2odf":
        executor.props: "python/3.6"
    ...

Alternatively, you can directly associate a module with a job definition, e.g.

module.properties:
    ...
    "txt2odf":
        aws-batch-job-definition-name: "Python36_Generic"
    ...
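
Since job.docker.image is an ordinary configuration parameter (see the table above), it can presumably also be set per module in the same way; a sketch, assuming module.properties entries accept it like any other property:

module.properties:
    ...
    "txt2odf":
        job.docker.image: "genepattern/docker-python36"
    ...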

Known Issues:

  • Version management has not yet been fully implemented for the Docker containers.
  • It should be easier to specify a particular version, or range of versions, by module name in the config_custom.yaml file.
  • The executor.props mechanism should be kept from becoming too complicated.
  • The JSON representations of the AWS Batch job definitions are not presently backed up.
  • Next step: use versioned runtime environments, so that a specific known-good environment can be pinned to a specific 'production' release of a module.