Home - VertebrateResequencing/vr-pipe GitHub Wiki

VRPipe is a new pipeline management system. It is still under development, but it is already used in production: it did the bulk of the data processing for the 1000 Genomes Project, and it continues to automate the running of software for even larger-scale sequencing projects at the Sanger Institute.

A pipeline management system lets you define a series of commands you wish to run (the 'pipeline'; each command typically corresponds to a 'step' in VRPipe's parlance). You can then put a data set through that pipeline, and the system ensures that the data is passed to each command in turn and that the commands run successfully in the correct order. The main benefit arises when you have many input data files and need to run the same set of command lines on each of them, with only the file paths differing. The pipeline management system can run these independent series of commands in parallel, and can potentially complete the work on 1000 input files in the time it would take to work on 1 (assuming you have a cluster with at least 1000 CPUs).
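The idea can be illustrated with a minimal Python sketch. This is not VRPipe's own API (VRPipe has its own Step and Pipeline definitions); the step functions `align`, `sort_bam` and `index`, and the helpers `run_pipeline` and `run_all`, are hypothetical names invented for this example. Each step only rewrites a file name, standing in for a real command line.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical steps: each takes the previous step's output path and
# returns the path of the file it would have produced.
def align(path):
    return path + ".bam"

def sort_bam(path):
    return path + ".sorted"

def index(path):
    return path + ".bai"

# The 'pipeline' is just the ordered list of steps.
PIPELINE = [align, sort_bam, index]

def run_pipeline(input_path, steps=PIPELINE):
    """Run every step in order for one input, feeding each output to the next step."""
    current = input_path
    for step in steps:
        current = step(current)
    return current

def run_all(input_paths, max_workers=4):
    """Process independent inputs in parallel; step order within each input is kept."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_pipeline, input_paths))

if __name__ == "__main__":
    print(run_all(["sample1.fastq", "sample2.fastq"]))
```

A real system would submit each step to a cluster scheduler rather than a local thread pool, but the shape is the same: a fixed order of steps per input, and independence between inputs.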

Features

Easy to define Steps and Pipelines
Optimal memory reservation for jobs
Batching of jobs (even from different pipelines) based on their compute requirements
Automatic job retries on job failure
Quick and easy access to job errors for diagnosis
Quick and easy failed job resubmission (i.e. after you've fixed the problem)
Detailed monitoring of current status (e.g. how much of a pipeline has been completed so far)
Email notification on pipeline success or failure
Recorded job statistics (run time, memory usage)
Recorded history, such that given an output file VRPipe can tell you exactly how that file was made
Searchable output files by metadata
Automation for ongoing projects where new input files arrive over time
Automatic handling of discovered mistakes in the input data, 'withdrawing' bad output files already created from the bad inputs and redoing any work necessary (for example, if a pipeline merged some good inputs and some bad inputs, the merge would get repeated with just the good inputs)
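The last feature, withdrawing outputs built from bad inputs and redoing the affected work, can be sketched as follows. This is an illustrative model only, not how VRPipe implements it; the `merge` function and the `MergeTracker` class are hypothetical names made up for the example.

```python
def merge(inputs):
    """Stand-in for a real merge step: the output is named after its sorted inputs."""
    return "merged(" + ",".join(sorted(inputs)) + ")"

class MergeTracker:
    """Remembers which inputs produced the current merged output (hypothetical)."""

    def __init__(self, inputs):
        self.inputs = set(inputs)
        self.withdrawn_outputs = []      # outputs invalidated by bad inputs
        self.output = merge(self.inputs)

    def withdraw_input(self, bad_input):
        """Mark an input as bad, withdraw the old output, and redo the merge."""
        if bad_input not in self.inputs:
            return self.output           # nothing depended on it; no work to redo
        self.inputs.remove(bad_input)
        self.withdrawn_outputs.append(self.output)
        # Repeat the merge with just the remaining good inputs.
        self.output = merge(self.inputs) if self.inputs else None
        return self.output

if __name__ == "__main__":
    tracker = MergeTracker(["good1.bam", "good2.bam", "bad.bam"])
    print(tracker.withdraw_input("bad.bam"))
```

The key point is that the system records which inputs went into each output, so that when an input is found to be bad it can work out exactly which outputs to withdraw and which steps to repeat.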

Guides

Future Plans

Improve web interface
Complete POD documentation, and improve/extend this wiki
Add Postgres support