Running on EC2 - adarob/eXpress-d GitHub Wiki

A guide to running eXpress-D on Amazon's Elastic Compute Cloud (EC2).

Launch a Cluster

Set up an EC2 account on the Amazon Web Services (AWS) website.

The express-d/ec2-scripts directory contains a copy of Spark's EC2 scripts, which launch a cluster and set up Spark and Hadoop HDFS on it. These scripts are described in detail in Running Spark on EC2; what follows in this wiki page is a summary of the key points.

To launch a fresh cluster, do:

$ cd express-d/ec2-scripts
$ ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> -a ami-0b6d2662 launch <cluster-name>

Where ...

  • <keypair> is the name of your EC2 keypair (i.e., the keypair filename without the .pem suffix).
  • <key-file> is the private keypair file for the keypair.
  • <num-slaves> is the number of slave instances to launch.
  • <cluster-name> is the name to give to your cluster. This will be shown on the Spark WebUI.

and ...

  • -a ami-0b6d2662 specifies a pre-built machine image that includes Spark, Hadoop, and eXpress-D sources. This image is used to launch each slave instance.

Another useful option is -t <instance-type>, which lets you specify the type of instance to launch (see Amazon's EC2 Instance Types and pricing pages).
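For example, to launch the same cluster on a larger instance type (m1.xlarge is used here purely as an illustration; substitute whatever type suits your workload and budget):

$ ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> -t m1.xlarge -a ami-0b6d2662 launch <cluster-name>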

The spark-ec2 script will take a while to complete setup. If the setup process is interrupted at any point (e.g., you accidentally close a terminal window while the process is running), you can resume setup by passing the --resume flag to the command above.
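For example, re-running the launch command with --resume picks up where the interrupted run left off:

$ ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> -a ami-0b6d2662 launch <cluster-name> --resume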

spark-ec2 can also be used to stop, restart, or destroy a cluster (see spark-ec2 --help for the full list of actions).
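For example (action names and required flags can vary between versions of the scripts, so confirm with ./spark-ec2 --help first):

$ ./spark-ec2 stop <cluster-name>                               # stop the cluster's instances
$ ./spark-ec2 -k <keypair> -i <key-file> start <cluster-name>   # restart a stopped cluster
$ ./spark-ec2 destroy <cluster-name>                            # terminate the cluster permanently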

Setup eXpress-D on the Cluster

Once spark-ec2 has finished cluster setup, SSH into the master:

$ ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>

To set up eXpress-D on the master, follow the steps on the Setting Up and Running eXpress-D page. The bin/build scripts will detect the cluster's slave instances and copy the express-d directory to each one.

After the express-d/config/config.py file has been configured and eXpress-D and Spark have been packaged, eXpress-D is ready to be run (but read the next section before actually running).

Using Provided (Simulated) Datasets

There is a small, simulated dataset in /root/sample-datasets to play with. Before running eXpress-D, however, these files must be loaded into HDFS or Amazon S3. To load the targets and alignments files into HDFS, run:

$ /root/bin/hadoop dfs -put <path/to/targets.pb> /targets.pb
$ /root/bin/hadoop dfs -put <path/to/hits.pb> /hits.pb
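To confirm the files landed in HDFS, list the root directory (same hadoop binary path as above; adjust it if it differs on your image):

$ /root/bin/hadoop dfs -ls /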

HDFS is sufficient for most EC2 use cases. If a dataset will be accessed multiple times by different clusters running eXpress-D, it may be more convenient to keep it on Amazon S3, since storing sporadically accessed data on S3 is cheaper than keeping it in an HDFS running on a managed EC2 cluster. However, loading data into Amazon S3 is tricky; when we did so for the eXpress-D paper, it involved using Spark to load an RDD/dataset from HDFS and then calling RDD#saveAsTextFile("s3n://...") on it.
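The paper's approach went through Spark itself, as described above. A simpler sketch, assuming the hadoop client on the master can reach S3 and that you substitute your own bucket name and AWS credentials (placeholders below), is to copy a file across filesystems with the hadoop shell:

$ /root/bin/hadoop dfs -cp /hits.pb s3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@<bucket>/hits.pb

Alternatively, the credentials can be set in Hadoop's core-site.xml (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey) instead of being embedded in the URL.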

Running eXpress-D

After the datasets have been loaded into HDFS, make sure that config.py contains the HDFS path for each file before calling bin/run. In config.py:

...
EXPRESS_RUNTIME_LOCAL_OPTS = [
    OptionSet("hits-file-path", ["%s:9000/hits.1M.pb" % SPARK_CLUSTER_URL]),
    OptionSet("targets-file-path", [%s:9000/targets.pb" % SPARK_CLUSTER_URL]),
    ...