AWS Batch AMI Creation - genepattern/genepattern-server GitHub Wiki
The GenePattern use of AWS Batch is not ideally suited to the generic "ECS-Optimized Amazon Linux AMI". That AMI launches with an 8 GB root volume and a 22 GB EBS volume, and Docker containers are launched with a maximum base image size of 10 GB. This means that, on any one compute node, the sum of all job input and output files must fit on the 8 GB root volume, and the sum of all containers and images must fit in the 22 GB EBS volume (not counting overhead).
These disk space limitations are almost workable if compute nodes are stopped after each job, but the current configuration keeps one compute node running 24/7 to reduce latency. That node can end up running hundreds of jobs and can easily run out of disk space. Also, for some modules (e.g. anything RNA-seq or NGS), the 10 GB cap on intermediate files inside the container can be an issue.
To deal with this, a few approaches are being pursued in parallel (mentioned here for informational purposes):
1. The DinD (Docker-in-Docker) outer container cleans up after executing the module: it removes any stopped containers from the compute node's Docker daemon, and removes job inputs/outputs after they have been synced to S3.
2. Cron-style AWS Lambda jobs attempt to shut down and restart the hot compute node that is kept up 24/7. These run once right before the nightly tests, again after the nightly tests complete, and also around 12 noon PST. If a job is executing when the restart cron runs, it aborts, leaving the node up and unchanged until its next scheduled restart.
3. We increase the disk space on the AMI used for compute nodes, and also the Docker base size (for files internal to the container, such as job intermediate files).
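The cleanup in approach 1 can be sketched roughly as follows. This is an illustrative shell function, not the actual DinD wrapper code; the job directory and S3 destination are placeholders you would substitute.

```shell
# Illustrative sketch of the per-job cleanup in approach 1 -- NOT the
# actual DinD wrapper code. Paths and bucket names are placeholders.
cleanup_after_job() {
  job_dir="$1"     # local working dir holding the job's inputs/outputs
  s3_dest="$2"     # e.g. s3://some-genepattern-bucket/jobs/1234 (placeholder)

  # push results to S3 before deleting anything locally
  aws s3 sync "$job_dir" "$s3_dest"

  # remove stopped containers from the compute node's docker daemon
  docker ps -aq --filter status=exited | xargs -r docker rm

  # free the local disk used by the job's inputs/outputs
  rm -rf "$job_dir"
}
```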
Configuration of the AMI for #3 above is the subject of the rest of this page.
For the instructions below, I will assume you know how to launch EC2 instances from the AWS console, save an instance as a new AMI, and configure AWS Batch queues and compute environments. See the AWS documentation to learn how to perform these activities.
We start by manually launching an EC2 Instance that we will configure and then save as our new compute node AMI.
Begin with the current "Amazon ECS-Optimized Amazon Linux AMI". To find it, paste the string "Amazon ECS-Optimized Amazon Linux AMI" into the search box when picking an AMI to launch. You should be presented with the latest version of the ECS AMI, which we will use as the starting point.
Proceed through the launch dialog. In the Storage tab, add extra space for both the root and the EBS volumes associated with your instance.
- Root: /dev/xvda, 80 GB <- used for input and output files
- EBS: /dev/xvdcz, 100 GB <- used for Docker containers and images
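If you prefer launching from the CLI rather than the console, the same storage layout can be expressed as a block-device-mapping file passed to `aws ec2 run-instances --block-device-mappings file://mapping.json` (a sketch; the AMI ID and other run-instances parameters are omitted, and gp2 is an assumed volume type):

```json
[
  { "DeviceName": "/dev/xvda",
    "Ebs": { "VolumeSize": 80, "VolumeType": "gp2" } },
  { "DeviceName": "/dev/xvdcz",
    "Ebs": { "VolumeSize": 100, "VolumeType": "gp2" } }
]
```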
Continue to launch the AMI in a security group and with a key that you can use to access it later.
ssh into the instance (as ec2-user) that you just launched, then:
>> sudo vi /etc/sysconfig/docker
In this file, add "--storage-opt dm.basesize=30G" to the OPTIONS line. Also (optionally) increase the number of file handles a container is allowed to use (in case a module opens a lot of files). At the time of this writing, the resulting line looks like this:

`OPTIONS="--default-ulimit nofile=2048:8192 --storage-opt dm.basesize=30G"`
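If you rebuild the AMI often, the edit can be scripted instead of done in vi. A sketch, assuming the file has a single `OPTIONS=` line as on the stock ECS AMI (shown here against a copy of the file; on the instance you would run the sed with sudo against /etc/sysconfig/docker):

```shell
# Replace the OPTIONS line in a docker sysconfig file with the expanded
# flags. Idempotent: skips the edit if dm.basesize is already present.
add_docker_opts() {
  file="$1"
  if ! grep -q 'dm.basesize' "$file"; then
    sed -i 's|^OPTIONS=.*|OPTIONS="--default-ulimit nofile=2048:8192 --storage-opt dm.basesize=30G"|' "$file"
  fi
}
```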
Stop and restart the Docker daemon (dockerd). Use "ps -ef | grep dockerd" to get the command line and process ID, use "kill -9 <pid>" to stop the daemon, and then restart it with "sudo <dockerd command line from ps -ef>".
Verify that the new base image size is set properly. Run "docker system info" and confirm that it reports "Base Device Size: 32.21 GB" and "Data Space Total: 87.07 GB". Note that Docker reports base-10 GB, not base-2, so these numbers will differ slightly from the exact values you entered when launching the instance and editing the file.
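The reported base device size follows directly from the base-2 to base-10 conversion; for the 30G value this is plain arithmetic, runnable anywhere:

```shell
# 30 GiB (what you typed, base 2) expressed in the base-10 GB that
# "docker system info" reports: 30 * 1024^3 bytes / 10^9
bytes=$((30 * 1024 * 1024 * 1024))
awk -v b="$bytes" 'BEGIN { printf "%.2f GB\n", b / 1e9 }'
# prints "32.21 GB"
```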
From the AWS Console, select the EC2 Instance you have been working in and then choose the "create AMI" option. Provide a sensible name for the AMI (e.g. BatchComputeNode_101118 was the one I saved on Oct 11, 2018).
In AWS Batch, you will now need to create a new compute environment (CE) that uses this AMI. Since you cannot edit the AMI of an existing CE, or clone a CE, open a second browser window so that you can easily copy the values from the old CE (e.g. 30gb-base-3-expanded-local). Select the new AMI; otherwise use the same values.
Still in Batch, update the queue (e.g. gpbeta-default) to add this new CE and remove any old one(s) that are no longer desired. Make sure to set the desired vCPU count to 0 for the old CE, then "disable" and (optionally) "delete" it from the console.
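The same swap can also be done from the AWS CLI with `aws batch update-job-queue` and `aws batch update-compute-environment`. A sketch, with placeholder queue/CE names; adjust the compute-environment order to match your queue's existing settings:

```shell
# Sketch of the queue/CE swap using the AWS CLI. Names are placeholders.
swap_compute_env() {
  queue="$1"; new_ce="$2"; old_ce="$3"

  # point the queue at the new CE
  aws batch update-job-queue --job-queue "$queue" \
    --compute-environment-order order=1,computeEnvironment="$new_ce"

  # scale the old CE down to zero, then disable it
  aws batch update-compute-environment --compute-environment "$old_ce" \
    --compute-resources desiredvCpus=0
  aws batch update-compute-environment --compute-environment "$old_ce" \
    --state DISABLED
}
```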
In the AWS Console go to CloudWatch and then the "Rules" tab.
For the following three rules, update the compute environment ID.
- cron_for_batch_compute_restart
- cron_for_batch_restart_2
- retest_of_cron_restart
(ASIDE: It's a PITA that you cannot copy or rename a rule once it is created.)
- Select the rule, and "edit"
- In the "targets" section open the "Configure Input" and select "Constant (JSON)" if it is not already
- Update the computeEnvironmentId to the new CE name
- Save
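The constant input is a small JSON object; the value below is a placeholder (use the exact name of your new CE — the Lambda reads the `computeEnvironmentId` key):

```json
{ "computeEnvironmentId": "your-new-compute-environment-name" }
```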
After the scheduled cron times, check the age of the running compute node instance in the AWS console to confirm that the node was indeed restarted. If it was not, check the CloudWatch logs for the Lambdas to see whether the ID was entered incorrectly (note: cutting and pasting the ID from the console can pick up leading or trailing spaces) or whether the restart was skipped because a job was running when it was triggered.
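A quick way to check a pasted ID for stray whitespace before dropping it into the rule (plain shell, safe to run anywhere; the ID here is a placeholder):

```shell
# A pasted CE ID with invisible leading/trailing spaces. Piping through
# xargs (with no command) trims surrounding whitespace from the input.
pasted=" my-compute-env  "
trimmed=$(printf '%s' "$pasted" | xargs)
printf '[%s] -> [%s]\n' "$pasted" "$trimmed"
# prints "[ my-compute-env  ] -> [my-compute-env]"
```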
References:
- https://docs.aws.amazon.com/batch/latest/userguide/compute_resource_AMIs.html
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html