# AWS Batch Configuration

AWS Batch consists of elements that are similar in both name and function to GenePattern concepts. In what follows, I am always referring to the AWS concept unless I specifically name it as a GenePattern object: a "Job" is an AWS Job, while a "GenePattern Job" is a GenePattern Job (which will have delegated its execution to an AWS Job).

Please familiarize yourself with AWS Batch and GenePattern before continuing. Familiarity is assumed below.

## The Flow

### On AWS

GenePattern Jobs are submitted as AWS Jobs to an AWS Job Queue via a single standardized AWS Job Definition. The Job Queue sends the job to one of the Compute Environments. The Compute Environment (CE) starts the Job on an EC2 Instance in an EC2 Autoscale Group (that was auto-created for the CE).
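For reference, submitting a job against the shared definition and queue looks roughly like the following boto3 sketch. The queue and definition names match those documented below; the job name and parameter values are placeholders, not what the GenePattern server actually sends.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Submit a GenePattern job as an AWS Batch Job against the standardized
# "S3ModuleWrapper" job definition on the "gp-cloud-default" queue.
# Placeholder values only; the real parameters are set by the server.
response = batch.submit_job(
    jobName="GenePattern_Job_12345",
    jobQueue="gp-cloud-default",
    jobDefinition="S3ModuleWrapper",
    parameters={
        "s3_root": "s3://example-genepattern-bucket",
        "working_dir": "jobResults/12345",
    },
)
print(response["jobId"])
```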

### GenePattern's custom bit

The AWS Job runs a GenePattern docker-in-docker container (from the Job Definition) in Docker on the EC2 Instance. This container creates the necessary directories for the job on the instance and syncs the job's files from S3 to the instance. It then launches the module container with a sleep command (so it cannot run forever), followed by an exec of the module command line. While it waits for either the module to finish or the sleep to time out, it syncs the stdout and stderr files back to S3 so they will be available in the case of a crash. Once the module finishes or times out, it syncs the job directory back to S3, tells Docker to clean up the image, deletes the job inputs and outputs from the EC2 node, and exits, causing the AWS Job to finish.
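The real logic lives in the `copyFromS3ThenRun.sh` script inside the `genepattern/dind` image; the following Python sketch only illustrates the sequence described above, and every path, name, and command in it is hypothetical.

```python
import subprocess
import time

def run_module(s3_root, job_dir, module_image, module_cmd, timeout_secs):
    """Illustrative sketch of the wrapper sequence; not the real script."""
    # 1. Create the job directory on the instance and pull the inputs from S3.
    subprocess.run(["mkdir", "-p", job_dir], check=True)
    subprocess.run(["aws", "s3", "sync", f"{s3_root}/{job_dir}", job_dir], check=True)

    # 2. Start the module container with a sleep as its main process, so the
    #    container has a hard lifetime limit and cannot run forever.
    subprocess.run(
        ["docker", "run", "-d", "--name", "gp_module",
         "-v", f"{job_dir}:{job_dir}", module_image, "sleep", str(timeout_secs)],
        check=True,
    )

    # 3. Exec the module command line inside that container, capturing its
    #    output in stdout/stderr files in the job directory.
    proc = subprocess.Popen(
        ["docker", "exec", "gp_module", "bash", "-c",
         f"cd {job_dir} && {module_cmd} > stdout.txt 2> stderr.txt"]
    )

    # 4. While the module runs (or until the sleep times out), push stdout and
    #    stderr back to S3 so they survive a crash of the node.
    while proc.poll() is None:
        subprocess.run(["aws", "s3", "cp", f"{job_dir}/stdout.txt", f"{s3_root}/{job_dir}/stdout.txt"])
        subprocess.run(["aws", "s3", "cp", f"{job_dir}/stderr.txt", f"{s3_root}/{job_dir}/stderr.txt"])
        time.sleep(60)

    # 5. Sync the whole job directory back to S3, clean up the container and
    #    image, and delete the local inputs/outputs before exiting.
    subprocess.run(["aws", "s3", "sync", job_dir, f"{s3_root}/{job_dir}"], check=True)
    subprocess.run(["docker", "rm", "-f", "gp_module"])
    subprocess.run(["docker", "rmi", module_image])
    subprocess.run(["rm", "-rf", job_dir])
```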

### Back on AWS

AWS sees the GenePattern container finish, updates the CloudWatch logs for the job, and then updates Batch to show the job is done.
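The final status and log stream can be read back through the standard APIs; a minimal boto3 sketch (the job id is a placeholder):

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")
logs = boto3.client("logs", region_name="us-east-1")

# Look up the finished AWS Batch job (placeholder job id).
job = batch.describe_jobs(jobs=["example-job-id"])["jobs"][0]
print(job["status"])  # e.g. SUCCEEDED or FAILED

# Batch container jobs log to the /aws/batch/job CloudWatch log group; the
# stream name is reported on the job's container section.
stream = job["container"]["logStreamName"]
events = logs.get_log_events(logGroupName="/aws/batch/job", logStreamName=stream)
for event in events["events"]:
    print(event["message"])
```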

## AWS Batch Configuration

### Job Definition

The significant parts of the AWS Batch Job Definition are the container image `genepattern/dind:0.7` (or a later version), the command (`/usr/local/bin/copyFromS3ThenRun.sh`), and the volume mounts (`/var/run` to `RUN` and `/local` to `LOCAL`).

{ "jobDefinitionName": "S3ModuleWrapper", "jobDefinitionArn": "arn:aws:batch:us-east-1:718039241689:job-definition/S3ModuleWrapper:18", "revision": 18, "status": "ACTIVE", "type": "container", "parameters": { "exe1": "-u", "s3_root": "noSuchBucket", "inputFileDirectory": "job_1X", "working_dir": "job1_X", "taskLib": "src" }, "retryStrategy": { "attempts": 1 }, "containerProperties": { "image": "genepattern/dind:0.7", "vcpus": 2, "memory": 300, "command": [ "/usr/local/bin/copyFromS3ThenRun.sh" ], "jobRoleArn": "arn:aws:iam::718039241689:role/BATCH-EFS-ROLE", "volumes": [ { "host": { "sourcePath": "/var/run" }, "name": "RUN" }, { "host": { "sourcePath": "/local" }, "name": "LOCAL" } ], "environment": [ { "name": "GP_JOB_CONTAINER_DONT_USE_CACHE", "value": "TRUE" } ], "mountPoints": [ { "containerPath": "/var/run", "readOnly": false, "sourceVolume": "RUN" }, { "containerPath": "/local", "readOnly": false, "sourceVolume": "LOCAL" } ], "ulimits": [], "resourceRequirements": [] } }

### Job Queues

GenePattern Cloud uses the Job Queue called "gp-cloud-default". It has two Compute Environments attached: a small one backed by a Reserved Instance that is always up and running, and a second that uses Spot Instances and defaults to zero vCPUs.

The small one is first in the order.
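That ordering corresponds roughly to the following boto3 sketch (the CE names are the ones documented in the next section):

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")

# The persistent CE is first in the order, so Batch prefers it; the elastic
# Spot CE is only used when the persistent CE is saturated.
batch.create_job_queue(
    jobQueueName="gp-cloud-default",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "small-30gb-base-6"},
        {"order": 2, "computeEnvironment": "large-elastic-genepattern-6"},
    ],
)
```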

### Compute Environments

The current names are "small-30gb-base-6" and "large-elastic-genepattern-6", henceforth called the "persistent" and "elastic" CEs, respectively.

The persistent CE has min/desired/max vCPU of 2/4/6. The elastic CE has min/desired/max vCPU of 0/0/256.
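These vCPU limits can be inspected or adjusted through the Batch API; a minimal boto3 sketch using the values above:

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Persistent CE: always keeps at least one small node running.
batch.update_compute_environment(
    computeEnvironment="small-30gb-base-6",
    computeResources={"minvCpus": 2, "desiredvCpus": 4, "maxvCpus": 6},
)

# Elastic CE: scales from zero up to 256 vCPUs on Spot capacity.
batch.update_compute_environment(
    computeEnvironment="large-elastic-genepattern-6",
    computeResources={"minvCpus": 0, "desiredvCpus": 0, "maxvCpus": 256},
)
```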

The intended behavior is that jobs initially go to the persistent CE. If its node is busy, it starts up to two more nodes, and then overflows to the elastic CE as more jobs come in.

Originally the persistent CE was set to min/desired/max of 2/2/2, but this had an undesirable side effect: as the node aged, Docker eventually overflowed the hard drive and the node became unable to run further jobs. Since those jobs failed instantly, they never overflowed to the elastic CE because only one job was ever (briefly) running. To work around this we would manually kill the compute node; the replacement would usually be fine for a week or two, and then the cycle would repeat.

To fix this, we changed the Batch CE to 2/4/6 and manually edited the autoscale group to set the "Max Instance Lifetime" to its minimum value of 604800 seconds (one week). Note that the autoscale group will not kill/replace an old node unless it can first start a replacement without interrupting service, so the group's "max" instance count must be 3 or greater with a min of 1. This also requires switching the autoscale group from a Launch Configuration to a Launch Template, so we manually created a launch template with the same settings as the auto-generated Launch Configuration. Using a launch template additionally lets us mix EC2 Reserved and Spot Instances in the autoscale group (and thus the CE), so we configure it such that the first instance is the RI (and thus cannot be interrupted) by setting the on-demand base parameter to "1" and the on-demand percentage to "34%" (since we cannot specify 33.33333).
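A sketch of those manual autoscale-group changes, expressed against the EC2 Auto Scaling API with boto3; the group name and launch template ID are placeholders, since the real group is the one Batch auto-created for the persistent CE:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Placeholder names: the real autoscale group and launch template are the
# ones belonging to the persistent CE.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="example-persistent-ce-asg",
    MinSize=1,
    MaxSize=3,
    # Minimum allowed value: one week, so nodes are recycled before Docker
    # fills the disk.
    MaxInstanceLifetime=604800,
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0123456789abcdef0",
                "Version": "$Latest",
            }
        },
        "InstancesDistribution": {
            # First instance is on-demand (covered by the RI); the rest
            # come from Spot capacity.
            "OnDemandBaseCapacity": 1,
            "OnDemandPercentageAboveBaseCapacity": 34,
        },
    },
)
```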

With these settings, as each persistent compute node reaches a week of age it is automatically recycled (a replacement node is started first), and at any time one of the nodes is the RI while the remainder are Spot Instances.

The elastic CE is just configured normally as a Spot CE, and no manual reconfiguration of its autoscale group or launch configuration is needed.