Basic Usage - VertebrateResequencing/vr-pipe GitHub Wiki

This guide provides a beginning-to-end overview of using VRPipe.

When considering what you need to do to use VRPipe, there are three main roles: the admin, the (step & pipeline) creator, and the end-user. One person might fill all three roles, as in a single-user environment. But if other people in your team have taken up the admin and creator roles and you know you're just an end-user, you can concern yourself with just the end-user section and skip the others.

The critical steps in using VRPipe are:

  1. Have the vrpipe-server running (admin)
  2. (rarely) Create steps and pipelines for brand-new workflows (creator)
  3. Use vrpipe-setup to define what work you want done (end-user)
  4. (optional) Check current status of work with vrpipe-status (end-user)

The Admin

As the administrator of a VRPipe installation, you should have followed the other guides on this wiki to get VRPipe downloaded, configured, tested and installed correctly.

While there may be many different end-users who use your installation to work on their data, all of the result files created by VRPipe will belong to the user who runs vrpipe-server. So this user must have appropriate permissions to read everyone's input data and to write to some output location that all the end-users have permission to read from. There is a vrpipe-permissions script that lets you easily change permissions on created files after the fact, so the "must be able to read input data" requirement is the more important one. The user must also be able to submit jobs to your job scheduler, if you're using one. All the work you do when administering VRPipe should be done as this user, who should also have a sensible umask and default group. You must have access to, and regularly check, email sent to this user.

The main thing you do as the administrator is start the vrpipe-server:
vrpipe-server --farm my_farm --max_submissions 1000 start
The server runs in the background as a daemonized process. You should run it on the machine that your end-users are going to use to run other vrpipe commands. It is light on CPU but may need around 1GB of memory. The --farm option takes any name you like; the only thing you need to be careful of is to always use the same name when starting the server on the same farm (aka cluster). The option exists because it is possible to have multiple vrpipe-servers running on multiple different, unconnected clusters, all using the same database. --max_submissions is optional (it defaults to infinity), but can be used to limit how much work VRPipe will try to submit to your cluster at once. If you wanted VRPipe to only ever use half the resources of your 2000-core cluster, you might set this value to 1000. When the server starts it prints out a URL that end-users can use to monitor the status of their setups. Provide them with this URL. Assuming you don't change VRPipe's configured interface port or the machine you start the server on, the URL does not change, so they can bookmark it.

If there are any problems with the server itself, or with certain things that end-users might not be able to cope with themselves, you should get an email. Problems are also written to the log file in your configured logging directory. Note that you will have to arrange to rotate that log file yourself; VRPipe will just append to it forever (though if you're not suffering from any major issues, the file won't grow that large).

You can gracefully shut the server down if necessary with vrpipe-server stop. Sometimes processes get stuck; check ps xjf for vrpipe-server processes and kill them manually if necessary (but wait a minute for the graceful shutdown to occur first). If you're about to do an upgrade, especially one that affects the database (schema change), also either wait for any vrpipe-handler jobs in your job scheduler to complete, or kill them using your job scheduler's kill command. With no handlers and no server running, there should be nothing writing to your database (which makes it safe to upgrade, or take a mysqldump if desired).

It is your job to keep VRPipe up-to-date. This early in its development there are frequent significant improvements and critical bug-fixes that you need to keep on top of. Unless IMPORTANT_NOTES says otherwise, it is strongly recommended you upgrade to the latest versions of VRPipe (on the stable master branch) as soon as possible. See the installation guide for guidance on how to upgrade.

The Creator

VRPipe (currently) comes with some built-in steps and pipelines related to processing biological sequencing data. If you want to use VRPipe to do something else, you'll have to create your own steps and pipelines.

For example, let's imagine your team works on files of type .aaa. You have developed a command line tool 'foo' that processes .aaa files (taking a -p option), does something useful, and creates .bbb files as the useful result. Then you've developed another tool 'bar' that takes .bbb files, analyses them, and creates a .png graph. Now imagine that each member of your team gets 1000 new .aaa files each month to process.

It is possible for each user to manually run this series of commands:

  1. foo -p 19 file1.aaa > file1.bbb (repeated 999 more times for the other input files)
  2. bar file1.bbb -o file1.png (repeated 999 more times for the other input files)
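Scripting that manual repetition is easy enough, but the loop still has to be babysat, rerun on failures, and repeated for every new batch. As a sketch of what each user would otherwise be doing (foo and bar are the hypothetical example tools, not real commands, so here we only print the command lines rather than execute them):

```shell
# Print the command lines a user would otherwise have to run by hand.
# 'foo' and 'bar' are the made-up example tools, so we just echo.
for i in 1 2 3; do    # imagine this running up to 1000
    echo "foo -p 19 file${i}.aaa > file${i}.bbb"
    echo "bar file${i}.bbb -o file${i}.png"
done
```

None of this handles failed commands, cluster submission, or newly arriving files, which is exactly what VRPipe adds.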

However, using VRPipe in this situation gives you lots of benefits, including making sure that all those commands get run successfully, and automating the process so that an end-user only has to run vrpipe-setup once and then not worry about it.

To enable this you can create some new steps and a pipeline. A step typically corresponds to a single command line, so you'd do:
vrpipe-create_step '$foo $foo_options $foo_input > $foo_output' (and then interactively answer some questions, giving this step the name 'foo')
vrpipe-create_step '$bar $bar_input -o $bar_png' (and then interactively answer some questions, giving this step the name 'bar')

Now VRPipe has steps for running foo and bar (it 'knows' about them), and you won't have to re-create these steps again in the future: the above is a one-time operation.

To create a pipeline:
vrpipe-create_pipeline
This will interactively ask you some questions, including what steps are in the pipeline - you'd answer 'foo' followed by 'bar'. When asked, give it a name like 'foo_and_bar'. Again, this is a one-time operation. VRPipe now knows what order to run the steps in, and how the data flows between them.

Now end-users can use vrpipe-setup which will interactively ask them what pipeline they want to run (they'd answer 'foo_and_bar'), what their input data was (user fred might answer /home/fred/aaa_files.fofn, which contains a list of his first 1000 .aaa files that need processing), and what the foo_options were (fred might say '-p 19'). And then VRPipe takes care of the rest automatically, running the necessary 2000 command lines. If Fred then later gets another 1000 .aaa files and needs to process them with -p 19, he just appends the 1000 new .aaa file paths to /home/fred/aaa_files.fofn; he doesn't have to run any vrpipe commands again. If he wants to do a new setup with -p 32, he'd go through the vrpipe-setup process again. At no point did Fred need to know how foo or bar are normally used on the command line. As the creator you just needed to have told him that the foo_and_bar pipeline exists and that he should use it, setting foo_options to -p xx as appropriate.
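A fofn ("file of file names") like Fred's is just a plain text file with one absolute input path per line. A minimal sketch of how one might build such a file (the directory and file names here are invented for illustration, not part of VRPipe):

```shell
# Build a fofn: a plain text file listing one absolute path per line.
# All names below are invented for this example.
demo=$(mktemp -d)                              # throw-away directory
touch "$demo/sample1.aaa" "$demo/sample2.aaa"  # two pretend .aaa input files
ls "$demo"/*.aaa > "$demo/aaa_files.fofn"      # absolute paths, one per line
cat "$demo/aaa_files.fofn"
```

The resulting file is what you would supply as the datasource during vrpipe-setup.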

The above way of creating steps works for simple, ideal cases like the example, where it is assumed that the 'foo' and 'bar' executables are well-behaved: if their output files are not correctly and fully written out to disc, they exit with a non-zero status. To have output file validation for badly-behaved executables, to alter behavior based on file metadata, or to do map-reduce type work, you need to instead write a Perl module in the VRPipe::Steps:: namespace. How to do that is beyond the scope of this guide (though it is recommended for important production pipelines, because in module form you can have a test script for the step and keep it under version control). Until a Step-writing guide is written, see the existing Step modules for examples.

As a general point, if you do start coding with VRPipe, any Perl script you write that wants to use some VRPipe code should typically "use VRPipe;". After that most things should work without further "use" statements. To choose the deployment (which database you connect to, production or testing, where production is the default if unspecified) you can "import" the one you want, e.g. to work with your testing database:
use VRPipe 'testing';
or on the command-line:
$ perl -MVRPipe=testing -e '...'

The End-User

If you find yourself in a situation where you need to run the same series of command lines on multiple files (changing only the file names in the command lines), you're in a situation where VRPipe will benefit you a lot. A 'creator' in your team should have created a VRPipe 'pipeline' that does what you need (it defines what the command lines should be, their order, and how they pass data between each other) and told you its name.

To get started you need to create a 'PipelineSetup', or 'setup' for short. A setup is basically your way of telling VRPipe that you want to pass a given set of data through a given pipeline, with a given set of options to each command ('step') in the pipeline. Once you've told VRPipe what you want, after a short delay it starts carrying out your wishes in the background, and sends you an email when it is all done.

You create a new setup by:
vrpipe-setup
This interactively asks you a series of questions that you should answer. Just be aware that any time you provide a file path, you should make it an absolute path. How to answer the questions should largely be self-explanatory, but see the vrpipe-setup questions section below for guidance.
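If you're unsure of a file's absolute path, a quick way to obtain one before pasting it into vrpipe-setup is readlink -f (GNU coreutils; on systems without it, the realpath command does the same job). The file name below is invented for the example:

```shell
# Convert a relative path into the absolute path vrpipe-setup expects.
cd "$(mktemp -d)"           # scratch directory, just for the example
touch my_input.fofn         # a hypothetical input list
readlink -f my_input.fofn   # prints the file's absolute path
```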

Having run that, you can now just wait to receive an email telling you that it has either finished or run into problems.

Before then you can also check up on the status of your setup(s) by running vrpipe-status. That command (like all vrpipe-* commands) has a --help (or -h) option that shows help text and describes other options you may find useful. Always take the time to read about the extra options on vrpipe-* commands. The VRPipe admin should also have given you a URL that you can view in your browser to show you the status of your setups. This is actually faster than using vrpipe-status, so you may like to bookmark that URL and visit it often.

Setup Completion

You're able to use VRPipe to do things like run the same set of input files through the same pipeline, but with slightly different options to one of the steps. VRPipe can run these multiple similar setups at the same time without risk of one setup overwriting and corrupting the output of another, because the output files of each setup go to their own unique location within a special directory structure under the output_root you specified during vrpipe-setup. But this special directory structure, and the potentially meaningless naming of output files, makes it impractical to just look in your output_root directory to find your output files. Instead, after your setup completes, use vrpipe-output (it has lots of possible options: check the --help) to create a set of symlinks in a directory structure and with a naming scheme that makes more sense to you.

One of VRPipe's main features is the idea of on-going automation. Imagine you created a setup with a datasource of a fofn that listed 1000 input files, that went through a pipeline called 'foo_and_bar', with the foo_options of '-p 19'. This setup ran and then completed. Now imagine that a month later you get 1000 new input files that you also want to process with the foo_and_bar pipeline with the same '-p 19' option. Instead of creating a new setup by running vrpipe-setup again, you simply append the 1000 new input file paths to the end of your existing fofn. Without running any vrpipe commands, VRPipe will notice the change to your datasource and automatically start processing the new files. You'll receive an email when all 2000 input files have completed the pipeline.
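Extending the datasource is then nothing more than appending lines to the fofn. A sketch of the idea (the /data/*.aaa paths and the fofn location are invented stand-ins, not real VRPipe files):

```shell
# Appending new inputs to an existing fofn datasource.
# All paths are invented; $fofn stands in for something like
# /home/fred/aaa_files.fofn.
fofn=$(mktemp)
printf '%s\n' /data/old1.aaa /data/old2.aaa > "$fofn"   # the original inputs
# A month later, simply append the new batch; no vrpipe commands needed,
# as VRPipe notices the datasource change by itself:
printf '%s\n' /data/new1.aaa /data/new2.aaa >> "$fofn"
wc -l < "$fofn"                                         # old + new inputs
```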

However, if you know that one of your setups is well and truly complete - that you won't be getting any new input data for it - then you should run vrpipe-setup --setup [id] --deactivate so that VRPipe stops checking to see if your datasource has been updated. Never --delete a setup that produced any useful output; that option is intended only for setups that were created in error.

Another useful feature to note is vrpipe-setup's --based_on option. Imagine in the previous scenario you wanted to create a setup identical to the one you already made, except with foo_options set to '-p 32'. By using --based_on you can quickly duplicate the previous setup and change only what you need to change, hitting return to accept the defaults for all the other questions.

Finally, if you've been using some file that VRPipe made but have since forgotten which setup made it, or want to know how it was made, or want to search for files that have certain metadata etc., vrpipe-fileinfo with various options will be helpful.

Chaining Setups and Moving Files

Sometimes you may find that you want the output of one setup (for, say, pipeline X) to be the input of another setup (for, say, pipeline Y). You could have a single pipeline Z with all the steps of X and Y, but that might be less flexible if you need to do the steps of Y independently, or if Y also needs the output of yet another setup. You can effectively chain setups together by using the special 'vrpipe' datasource when defining your Y setups. Since VRPipe knows what steps in which setups created what files, it is then able to handle dependency situations automatically. If you deleted a file produced by X before Y had a chance to use it as input, VRPipe would automatically recreate the missing file by rerunning the appropriate steps of X.

For this and other reasons, VRPipe must always know about the status on disc of its input and output files. You should not use normal unix commands like 'rm' or 'mv' on VRPipe-produced files. Instead use vrpipe-rm or vrpipe-mv. When you use vrpipe-mv to move a file produced by X, Y will proceed, using the input file at its new location.

When Things Go Wrong

For normal errors caused by failing jobs in a pipeline, and for most other issues, you should receive an email that either specifies the problem, or asks you to investigate (and fix) with vrpipe-submissions. Using vrpipe-status will also point out issues in case you're not receiving emails.

In the rare event that vrpipe-status and vrpipe-submissions do not reveal the problem, you can investigate further by getting the admin to look at the logs. For problems with a particular setup, use vrpipe-logs to see what has been happening with it.

Sometimes it can happen that a step is failing because the output file of a previous step is corrupt. VRPipe only ever automatically retries the current step, so it will keep failing. vrpipe-submissions only gives you an easy way to force the reattempt of a failed submission (i.e. one for the current step), so it isn't helpful either. In this case you should use vrpipe-elements, which provides an easy way to 'go back' and reset steps that VRPipe thought completed successfully. Just be careful not to redo any step that outputs files that might be used by other running jobs in your setup, or by other setups.

vrpipe-setup Questions

When you run vrpipe-setup to define a new setup you are asked a series of questions. While most will be self-explanatory, some need some discussion.

...coming soon...