AWS - VertebrateResequencing/vr-pipe GitHub Wiki

Amazon Web Services (AWS) is Amazon's "Cloud", where you can run VRPipe using their compute nodes (EC2 instances), storage and databases. You pay only for the resources you use, and you can scale up to as many nodes as you need, which makes it useful for running VRPipe pipelines quickly when you do not have the local computing infrastructure to handle the demand.

The following set of guides takes you through every step needed to run VRPipe in Amazon's Cloud. If you're already familiar with AWS and know how to do certain things, you can skip the guides that are not VRPipe-specific. If you're already familiar with VRPipe, you can skip the guides that are not AWS-specific.

There are 4 special requirements to keep in mind for any successful VRPipe deployment (in the cloud or on local resources):

  • You need a MySQL database server that is both readable and writable from all nodes in your cluster.
  • You must have a shared filesystem, so that all nodes in your cluster can read and write the same files. This is where your input and output data, and certain VRPipe directories, should be stored.
  • It must be possible to ssh from any node in your cluster to any other node, without requiring user interaction.
  • For large clusters (hundreds of nodes or more), you should have a job scheduling system installed that is supported by VRPipe. Currently this means either LSF or SGE.
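
The shared filesystem and ssh requirements can be checked from any node before going further. The following is a minimal Python sketch and is not part of VRPipe: the function names are hypothetical, and the host names and shared path you pass in are your own. It relies on ssh's standard `BatchMode=yes` option, which makes ssh fail rather than prompt for a password or passphrase.

```python
import os
import subprocess
import tempfile


def ssh_check_command(host, timeout=5):
    """Build an ssh command that fails instead of prompting for input."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never ask for a password or passphrase
        "-o", f"ConnectTimeout={timeout}",  # give up quickly on unreachable hosts
        host,
        "true",                             # trivial remote command
    ]


def can_ssh_noninteractively(host, timeout=5):
    """Return True if `host` accepts an ssh login without user interaction."""
    try:
        result = subprocess.run(
            ssh_check_command(host, timeout),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def is_writable_dir(path):
    """Return True if a file can be created, written and read back under `path`."""
    try:
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"vrpipe-preflight")
            handle.flush()
            handle.seek(0)
            return handle.read() == b"vrpipe-preflight"
    except OSError:
        return False
```

Running `can_ssh_noninteractively` on each node against every other node covers the ssh requirement. Note that `is_writable_dir` only shows that the current node can write to the path; to confirm the filesystem is really shared, write a file from one node and read it from another.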

Getting Started with AWS

  1. Create an AWS account (not VRPipe-specific)
  2. Launch an EC2 instance with an unconfigured VRPipe AMI

Single Node Installation

VRPipe is mainly used to run "embarrassingly parallel" pipelines, where the more nodes in your cluster you have, the sooner your work completes. However, if you're just testing out VRPipe, or if you only plan on running very small/fast pipelines on very small datasets, you could run VRPipe on a single EC2 instance (eg. one with many cores and lots of memory).

In this situation most of the special requirements do not apply and you only need to worry about the database server. For very limited and light usage, you may not even need a MySQL server, instead using the simpler (zero-configuration) SQLite to handle VRPipe's database needs.

Assuming you're on an EC2 instance launched from an unconfigured VRPipe AMI:

  1. Set up RDS (not VRPipe-specific; optional but recommended). If you are using SQLite or a locally installed MySQL instead, skip this step.
  2. Configure VRPipe with either the database details from step 1 or SQLite, and with the "local" scheduler.
  3. Install VRPipe (not AWS-specific)
  4. Create an AMI (not VRPipe-specific) of your running instance with a new name like 'my_single_instance_configured_vrpipe', so that if this instance gets terminated you can immediately launch another one that restores you to the point where you had completed steps 1-3.

Now you can transfer some input data to eg. the home directory on the launched instance and then use VRPipe normally. Just be careful to transfer any result files somewhere else before terminating the instance, or your data will be lost.
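
Since terminating the instance destroys anything left on it, it can be worth confirming that a copy of the results is complete before doing so. The sketch below is one possible way to do that and is not part of VRPipe: it assumes the copy is reachable as a directory (for example a mounted volume), and the function names are hypothetical. It compares SHA-256 checksums of every file under the results directory against the copy.

```python
import hashlib
import os


def checksum_manifest(root):
    """Map each file's path (relative to `root`) to its SHA-256 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full_path, "rb") as handle:
                # Read in 1 MiB chunks so large result files do not fill memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(full_path, root)] = digest.hexdigest()
    return manifest


def missing_or_changed(source_manifest, copy_manifest):
    """List relative paths that are absent from the copy or differ in content."""
    return sorted(
        path
        for path, digest in source_manifest.items()
        if copy_manifest.get(path) != digest
    )
```

If `missing_or_changed(checksum_manifest(results_dir), checksum_manifest(copy_dir))` returns an empty list, every result file exists in the copy with identical content.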

Multi-Node (Cluster) Installation

Before using VRPipe on a cluster you must satisfy the special requirements listed near the top of this page. If you don't know how to satisfy these requirements and launch and manage your own EC2 cluster (eg. using StarCluster or CloudFormation), follow these 4 guides in order:

  1. Set up RDS (not VRPipe-specific)
  2. Create a Gluster shared filesystem (not VRPipe-specific) (no need to carry out the last 2 steps of the last section, since you can use the instance in the next guide)
  3. Alter your SSH configuration (not VRPipe-specific) (continue to use this instance in the next guide)
  4. Configure SGE (not VRPipe-specific)

Starting with an EC2 instance running an AMI that satisfies the special requirements (eg. the instance you ended up on when you completed step 4 above):

  1. Configure VRPipe using your MySQL database details, and the "LSF" or "SGE" scheduler (or the "sge_ec2" scheduler if you followed guide 4 above)
  2. Install VRPipe on your shared filesystem (not AWS-specific)

(Since VRPipe was configured and installed on your shared filesystem, you don't need to create another AMI.)

Now you can transfer some input data to your shared filesystem and then use VRPipe normally.

Note that by default Amazon limits you (and therefore VRPipe) to having a maximum of 20 running instances in each availability zone at any one time. You must fill out a form to ask for the limit to be raised.
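
Because VRPipe pipelines are embarrassingly parallel, the instance limit translates fairly directly into wall-clock time. The following sketch gives a rough, idealised estimate of that effect. It assumes equal-length jobs, one job per slot, and no scheduling or data-transfer overhead; the function name and the example numbers are illustrative and do not come from VRPipe.

```python
import math


def estimated_wall_time(num_jobs, minutes_per_job, instances, slots_per_instance=1):
    """Idealised wall-clock minutes when jobs run in full 'waves' across all slots."""
    if num_jobs < 0 or minutes_per_job < 0:
        raise ValueError("num_jobs and minutes_per_job must not be negative")
    if instances < 1 or slots_per_instance < 1:
        raise ValueError("instances and slots_per_instance must be at least 1")
    total_slots = instances * slots_per_instance
    waves = math.ceil(num_jobs / total_slots)  # a partly filled final wave still costs a full job time
    return waves * minutes_per_job


# 1000 jobs of 30 minutes each, 4 job slots per instance:
print(estimated_wall_time(1000, 30, instances=20, slots_per_instance=4))   # 390
print(estimated_wall_time(1000, 30, instances=100, slots_per_instance=4))  # 90
```

Under these assumptions, raising the limit from 20 to 100 instances cuts the estimated run time from 390 to 90 minutes, which is the kind of comparison that can help decide whether requesting a higher limit is worthwhile.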