Configure VRPipe - VertebrateResequencing/vr-pipe GitHub Wiki

Before VRPipe can be used it must be configured. This process tells VRPipe about essential things like the database connection details and where to store its own working files. Completing the process generates a SiteConfig.pm perl module which will be installed along with all the other VRPipe modules when you install it.

(If you are on a shared system where multiple different teams may want to use VRPipe independently, by installing to your own area that is only in your own team's PERL5LIB, you can have a VRPipe configuration that is unique to your team, such that VRPipe will connect to your team's database and not anyone else's.)

Another time you may need to configure VRPipe is when you upgrade to a new version that has new options to set. The IMPORTANT_NOTES file in the VRPipe repository will point out if there are new options to consider. The procedure for updating your configuration is identical to the first-time configuration explained below, except that you will find your previous configuration choices already filled out as defaults, which you can keep by just hitting return.

Before starting you should already have completed the Install VRPipe Dependencies guide. Some example command lines given below assume you are using our public unconfigured VRPipe AMI on an Amazon EC2 instance; adjust them (eg. change directory and file locations) as appropriate for your working environment.

  1. (optional) Create a directory where various VRPipe-related things will go, which you can use as the parent directory for some of the paths you will give in your answers later:
    mkdir $SOFTWAREDIR/VRPipe
  2. Go to the root directory of your clone of the VRPipe git repository (where the README file is).
    cd $SRCDIR/vr-pipe/
  3. If you have used VRPipe in production before, note which version you're currently using:
    grep dist_version Build.PL
  4. Update to the latest stable version:
    git checkout master; git pull
    If you have used VRPipe in production before and so have a production database already, be sure to follow all the upgrade advice given in the IMPORTANT_NOTES file, checking the instructions for all versions since the version you noted in step 3.
  5. Run the Build.PL script to generate (or update) your SiteConfig.pm configuration file:
    perl Build.PL
    The script will interactively ask you a series of questions which you should answer in full; see the following section for advice on each question. Always provide absolute paths when asked for a directory or file. Before asking questions it may point out that you are missing some pre-requisite CPAN modules; install those first with ./Build installdeps before continuing. If you have configured VRPipe before it will ask if you want to go through setup again; answer y if the IMPORTANT_NOTES indicated there are new questions to answer since your last version of VRPipe.
  6. Make sure your test database has been created. Eg. connect to your database using the mysql client and do create database [name];, where [name] is the name you configured during step 5 for your testing database.
  7. Test that VRPipe works with this configuration by running the test suite.
    ./Build test
    Currently there are 2 issues with the suite: the first time you ever run it against a new database it will throw up a big error and then fail on certain tests; to avoid this, just let it get as far as DataSource.t, kill it with ctrl-c and then try again. As part of the whole suite, the Living.t test may fail, but should work when run by itself with ./Build test --test_files t/VRPipe/Living.t. Some tests can be slow (~20mins), especially if you only have a few cores to run tests on (eg. you're using the local scheduler). However if things seem to get stuck for long periods of time, check the log file in the logging directory you chose in step 5 for errors.

Build.PL questions

When you run perl Build.PL in step 5 above you will be asked a series of questions. Most of them should be self-explanatory and easy to answer, but this section goes into detail to explain each of them and how best to answer them.

Passwords you enter will be encrypted; where should your encryption key be stored? (it is up to you to properly secure this file)

Depending on your database type, you may need to enter a password in a subsequent question. Rather than store your database password in plain text in the SiteConfig.pm file, VRPipe will encrypt it with a key that will be saved to the file you answer with here. This file should be made readable by yourself and others that you want to be able to use your installation of VRPipe, but should not be readable by anyone else. For example, you could answer /home/ec2-user/software/VRPipe/.siteconfig_key or /shared/software/VRPipe/.siteconfig_key if on a shared disc, then after configuration is complete alter the permissions of that file (if it exists) as appropriate, eg. chmod o-r /home/ec2-user/software/VRPipe/.siteconfig_key.
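As a concrete sketch of locking the key file down afterwards (the directory and filename here are just the examples above, not fixed by VRPipe; a throwaway directory stands in for your real location):

```shell
# Hedged sketch: restrict the encryption key file to yourself and your group.
keydir=$(mktemp -d)                 # stand-in for e.g. /shared/software/VRPipe
keyfile="$keydir/.siteconfig_key"
touch "$keyfile"
chmod 640 "$keyfile"                # owner rw, group read, others nothing
stat -c %a "$keyfile"               # prints 640
```

On a shared disc you would typically want group read (640 as above) so teammates can use the installation; use 600 instead if only you should be able to run VRPipe.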

Where should schema files be stored to enable schema upgrades?

When you first create a production VRPipe database (after installation of VRPipe), the database schema is stored in a certain directory structure. If a future VRPipe upgrade involves a change to the database schema, a supplied script is able to use the files in that directory to figure out how to automatically upgrade your existing database to the new schema. So your answer to this question needs to be a directory that you will keep long-term; do not delete it! An example answer is /home/ec2-user/software/VRPipe/schema_upgrades or /shared/software/VRPipe/schema_upgrades if on a shared disc.

What DBMS should be used for production?

'DBMS' stands for 'DataBase Management System'. VRPipe is designed to be independent of the DBMS, and can potentially work with any DBMS supported by the perl module DBIx::Class, as long as a supporting VRPipe::Persistent::Converter:: module has been written. These are relatively trivial to write, though currently only those for MySQL and SQLite exist. MySQL 5.5 is what we use and is the most well tested in production. SQLite is easier (and cheaper, in the cloud) to use and manage, though you would have to arrange your own way of backing up the database file; however, due to the heavy write concurrency from potentially thousands of different processes at once, SQLite is not currently recommended as the DBMS for VRPipe. It could be sufficient for small deployments, where only a few CPU cores are running jobs at any one time.

Depending on the DBMS you choose, the next questions you are asked vary. With sqlite, you are just asked for the database 'name', and your answer should be the path to a file on disc where the database will be stored, eg. /home/ec2-user/software/VRPipe/database_production.sqlite. Make sure this file is never deleted, or VRPipe will forget all the work it has done in the past, potentially leading to the repetition of work and the definite loss of many of VRPipe's most useful features.
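Since with SQLite the whole database lives in that one file, it is worth arranging a regular backup of it. A minimal sketch, assuming the sqlite3 command-line client is installed (the database path is the example above; the backup location is hypothetical):

```shell
# Hedged sketch: back up the SQLite database file using sqlite3's .backup
# command, which takes a consistent copy even if the server is mid-write
# (a plain cp could capture a half-written state).
db=/home/ec2-user/software/VRPipe/database_production.sqlite
backupdir=/home/ec2-user/software/VRPipe/backups   # hypothetical location
mkdir -p "$backupdir"
sqlite3 "$db" ".backup '$backupdir/$(date +%F).sqlite'"
```

Run from cron, this gives you dated copies you can restore by simply replacing the database file while vrpipe-server is stopped.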

For other DBMSs (currently just mysql) you will be asked for the name, host, username and password needed to connect to the database. These are the same details you would normally need to interact with your database via a database client such as mysql or phpmyadmin et al. If you have everything except the name because you haven't yet created a database for VRPipe, call it something like vrpipe_production and then use your usual method to create an empty database with that name after completing configuration. Note that the user needs permission to create and drop database tables, in addition to all the usual permissions needed for reading and writing. In MySQL it may be most convenient if the user has 'SUPER' privileges, though for day-to-day running only select,insert,update,delete,lock tables,execute,create temporary tables are required (dropping and creating tables is needed by the vrpipe-db_deploy and vrpipe-db_upgrade scripts, which are only used after installing a new version of VRPipe).
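A sketch of the corresponding MySQL setup, run as an administrative user (the database name, user name, password and host wildcard are all examples; adjust to your site's conventions):

```shell
# Hedged sketch: create the production database and grant the day-to-day
# privileges listed above, plus CREATE and DROP for the vrpipe-db_deploy
# and vrpipe-db_upgrade scripts.
mysql -u root -p <<'SQL'
CREATE DATABASE IF NOT EXISTS vrpipe_production;
GRANT SELECT, INSERT, UPDATE, DELETE, LOCK TABLES, EXECUTE,
      CREATE TEMPORARY TABLES, CREATE, DROP
  ON vrpipe_production.* TO 'vrpipe'@'%' IDENTIFIED BY 'changeme';
SQL
```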

What DBMS should be used for testing?

The same notes apply as for the production DBMS above. You should use the same DBMS for testing as you will use for production, so that the tests properly reflect how production will behave. However be certain that the database 'name' is different for testing, eg. with SQLite use /home/ec2-user/software/VRPipe/database_testing.sqlite and with MySQL use vrpipe_testing. Other details can be the same.

When a test script starts it drops all tables, so that it begins with a fresh VRPipe database. That is why you can't use the same database for production and for tests, and it would also cause problems if multiple people in your team run tests at the same time (eg. they're developing and testing 2 new pipelines). To solve this you could use an environment variable to specify the database name, eg. ENV{vrpipe_testing_db}. Each person would then include something like export vrpipe_testing_db=vrpipe_testing_myusername in their ~/.bash_profile. So long as a database with that name has been created, when they run tests VRPipe will connect to their own personal database and not interfere with what anyone else is doing.
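Concretely, each person's ~/.bash_profile could derive the name from their username rather than hard-coding it (the vrpipe_testing_ prefix matches the example above; the mysql invocation in the comment is illustrative):

```shell
# Hedged sketch: a per-user testing database name, derived from the
# current username so no two team members collide.
export vrpipe_testing_db="vrpipe_testing_$(whoami)"
echo "$vrpipe_testing_db"
# The database itself must still be created once, e.g.:
#   mysql -u youruser -p -e "CREATE DATABASE $vrpipe_testing_db"
```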

What job scheduler should be used for production?

A job scheduler (aka distributed resource manager) manages the compute resources in your cluster, queuing and running jobs on nodes that have sufficient available resources (CPU and memory), as they become available. VRPipe gets its jobs run by submitting them to your job scheduler. If you have LSF or SGE configured on your local cluster, choose LSF or SGE as appropriate.

If you're just testing and trying out VRPipe for the first time and don't want to commit to using up cluster resources, or if you're running in the cloud where you don't want to spend extra money running more than one instance, you can use the local scheduler. This scheduler is very basic, running jobs directly on the local CPUs on a 'when possible' basis. It holds no state and no queue, so it is possible for a Submission with high CPU or memory requirements to 'pend' for ages (or even forever, if the requirements exceed the capability of the local machine). This scheduler is thus primarily intended for testing purposes, though it could be used in production on a powerful enough machine. When using the local scheduler it is recommended to start the VRPipe server with --max_submissions set to 1 less than the number of cores in your machine.
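On Linux you can derive that recommended value from the core count (the vrpipe-server invocation in the final line is only indicative; check its --help for the exact flags it accepts):

```shell
# Hedged sketch: compute --max_submissions as one less than the core
# count, as recommended for the local scheduler.
cores=$(nproc)                # number of available CPU cores (Linux)
max_subs=$(( cores - 1 ))
echo "would run: vrpipe-server --max_submissions $max_subs start"
```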

When using Amazon's EC2 cloud you can use one of the ec2 schedulers. ec2 is a basic scheduler that avoids the need for a third party job scheduler. Instead, VRPipe itself monitors how much CPU and memory is available on instances running the same AMI, then runs jobs via SSH on those with sufficient available resources. It load-balances according to the number of jobs that need running, launching and terminating instances as necessary. Because of the way it must constantly poll your running instances, this scheduler may not scale well beyond 10s of instances.

If you're planning on running 100s+ of instances we recommend using the sge_ec2 scheduler instead. This combines the normal SGE scheduler with the ec2 scheduler's load-balancing feature. You must have SGE installed, and you should note that it alters the configuration of your SGE installation, changing things like your complex attributes, parallel environment, queues and groups (so the user that runs vrpipe-server must have permission to do these things).

Both ec2 schedulers assume that all instances running the same AMI as the instance vrpipe-server is running on are under its full control, and that it can therefore terminate them at any time; do not manually launch an instance using that AMI yourself. Choosing either of these schedulers will result in additional questions being asked. You will be asked for your AWS access and secret keys; if you don't know these, follow the AWS IAM guide. When asked "What is the url for your ec2 region?", use a url similar to the default, changing only the eu-west-1 part to match the region your VRPipe instance is in (eg. US West Oregon is us-west-2, US West N. California is us-west-1 and US East N. Virginia is us-east-1). When asked "What percentage of the on-demand price should spot bids be placed at?", answering 0 (or over 100) turns off spot requests, meaning you pay the on-demand (highest) prices for your instances, but you get them faster and keep them for as long as they're needed. Answering 100 means you bid at the on-demand price but pay the current spot price, which could be much lower; as demand rises you pay the higher price, until demand is so high that it exceeds the on-demand price and Amazon terminates the instance. Answering some lower percentage ensures you never pay as much as the on-demand price, but increases the chance that your instances get terminated, and risks you never being able to launch instances if the spot price is always higher than your desired percentage of the on-demand price. Answering 0 to the follow-up question ("If the resulting spot bid price is lower than the lowest successful bid of the past 24hrs [...]") avoids this risk by bidding the lowest successful spot price in that situation instead. The remaining ec2-related questions should be self-evident.
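To make the percentage question concrete, here is a worked example with made-up figures (neither price comes from Amazon; current prices vary by region and instance type):

```shell
# Hedged sketch with hypothetical numbers: if the on-demand price were
# 45 cents/hr and you answered 60 to the percentage question, bids
# would be placed at 60% of 45 = 27 cents/hr.
ondemand_cents=45
percentage=60
bid_cents=$(( ondemand_cents * percentage / 100 ))
echo "spot bid: ${bid_cents} cents/hr"    # prints: spot bid: 27 cents/hr
```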

If using the sge_ec2 scheduler you will also be asked for the absolute path to your SGE config file. This is the file that was used to configure and install SGE on your EC2 instance. If you didn't use a config file to automate that process you can copy $SGE_ROOT/util/install_modules/inst_template.conf, alter the various variables to match your SGE installation, and then use the new file path as your answer to this question. If you're following our other EC2-related guides and are using our public VRPipe AMI, the answer to this question is /shared/software/VRPipe/sge/install.conf.

What job scheduler should be used for testing?

Typically you should use the same scheduler that you use for production, for a proper test. However, once initial testing is over you may prefer to use the local scheduler for testing, so that tests do not have to wait to get their jobs scheduled on a job scheduler kept busy by production work (on the other hand, tests will be very slow, since your local machine won't have many cores compared to your cluster). To make it easy to change your mind about this answer, consider making it an environment variable, eg. set the answer here to ENV{vrpipe_testing_scheduler}. Then in your ~/.bash_profile put export vrpipe_testing_scheduler=local. Just change the 'local' to your production scheduler and source ~/.bash_profile any time you want to change your mind.

What directory should production logs, job STDOUT/ERR and temp files be stored in?

VRPipe stores various internal working files, including log and pid files, in the directory you supply here. The directory must be writable by all nodes, ie. it must be on a shared filesystem. If you're using the local scheduler this restriction doesn't apply, so you could answer something like /home/ec2-user/software/VRPipe/production. Otherwise your answer might be something like /shared/software/VRPipe/production.

What directory should testing logs, job STDOUT/ERR and temp files be stored in?

See the notes for the previous question. The testing directory should be different from the production directory, and if you will have multiple different people running tests at the same time, it should be different for each person. This can be done with environment variables. eg. set the answer here to ENV{vrpipe_testing_logging_directory}. Now in your ~/.bash_profile put export vrpipe_testing_logging_directory=/shared/software/VRPipe/testing_myusername. If you're using the local scheduler by yourself you can just use a direct answer like /home/ec2-user/software/VRPipe/testing.

What port will the VRPipe interface be accessible on, when accessing your production database?

VRPipe has both browser and cmd-line interfaces, which communicate with vrpipe-server via the port you specify here. The default port is randomly chosen for you, or you can pick a port that you know isn't used by anything else you run. [hit return to accept the default]

What port will the VRPipe interface be accessible on, when accessing your testing database?

Like the previous question, but for the interface to your testing database. By default the port is 1 greater than the default production port. If you have multiple users and therefore multiple testing databases, use an environment variable for the answer, eg. ENV{vrpipe_testing_interface_port}. Now in your ~/.bash_profile put export vrpipe_testing_interface_port=xxxxx, where xxxxx is an unused port number unique to you.

What port should the production redis-server listen on?

VRPipe launches a redis-server and connects to it on the port you specify here. The default port is 1 less than the default production interface port, or you can pick a port that you know isn't used by anything else you run. [hit return to accept the default]

What port should the testing redis-server listen on?

As for the testing interface port, either use the default for single-user installations, or use an environment variable like ENV{vrpipe_testing_redis_port}.

When the VRPipe server runs, what should its file creation mask (umask) be?

When VRPipe is up and running, all files created will be owned by the user that starts vrpipe-server, and file permissions will be based on that user's default file creation permissions. You can set the umask that will be used here. Eg. typically 0 would mean files are readable and writable by everyone, 2 would be like 0 but not writable by other, and 7 would be like 2 but also unreadable by other.
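You can check what a given 'other' digit will do to new files before committing to an answer. A quick sketch in a throwaway directory (assumes GNU stat, as found on Linux):

```shell
# Hedged sketch: new files are created with mode 666 masked by the umask,
# so the final umask digit controls what 'other' users can do.
dir=$(mktemp -d)
umask 0002                    # answer '2': not writable by other
touch "$dir/example"
stat -c %a "$dir/example"     # prints 664 (rw user/group, read-only other)
```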

When the VRPipe server runs, what user id should it run as?

If vrpipe-server is run by root, it can switch the user that owns the daemonized server process to the user id (id -u username) you specify here. By default the user id is your current user id. If not initially run by root, this option has no effect. [hit return to accept the default]

When the VRPipe server needs to email users, what domain can be used to form valid email addresses with their usernames?

When a user creates a setup with vrpipe-setup, it is associated with their unix user name. When certain events happen to that setup (eg. it completes), VRPipe will send them an email, assuming they have an email address corresponding to their unix user name and the domain you supply here. So enter the email domain your users can be contacted on.

When the VRPipe server encounters problems, what user should be emailed?

For certain problems that VRPipe encounters, it will email an administrative user. Enter their email username here. (If their email address was [email protected], you would have answered test.com for the previous question and just joe for this question.)

When VRPipe executes commands with exec(), what shell should be used?

Typically you should just go with the default answer here, which will usually be /bin/bash. This question is asked because the default shell on Ubuntu is dash, which does not behave the way that Perl's exec() and VRPipe expect.

When VRPipe connects to a node to run a command, what shell script can be sourced to provide all the environment variables needed for VRPipe (and any needed 3rd party software) to function?

Provide the absolute path to your shell login script, which must be readable from all nodes in the cluster. The default is probably correct.