Supplementary specification - VertebrateResequencing/vr-pipe GitHub Wiki

This document describes requirements not captured in the use cases.

Functionality

(common across many use cases)

Logging and error handling. It should be trivial to create a new pipeline, which means it must be trivial to trace what happens when a pipeline runs, and to investigate errors.
Security. Only authorised users should be able to interact with the System, and no users should be able to see passwords.

Usability

There must be a command-line interface for use by developers and automated systems; cron-job tasks must be possible.
There should be a web-based interface for at least common tasks, such as viewing current state and investigating errors.
Other interfaces should be easy to add on.
It should not be possible to accidentally stop pipelines or delete their results using the interfaces.
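
One way to meet the "no accidental deletion" requirement is to make destructive operations demand explicit confirmation. The following is a minimal sketch, not VRPipe's actual interface: `delete_results`, `ConfirmationRequired`, and the `do_delete` callback are hypothetical names introduced for illustration only.

```python
from typing import Callable, Optional


class ConfirmationRequired(Exception):
    """Raised when a destructive request lacks explicit confirmation."""


def delete_results(
    pipeline_name: str,
    do_delete: Callable[[str], None],
    confirm: Optional[str] = None,
) -> None:
    """Delete a pipeline's results only if the caller repeats its name.

    `do_delete` is a hypothetical callback that performs the real deletion;
    it is never invoked unless `confirm` matches `pipeline_name` exactly.
    """
    if confirm != pipeline_name:
        raise ConfirmationRequired(
            f"To delete results of '{pipeline_name}', pass confirm='{pipeline_name}'"
        )
    do_delete(pipeline_name)
```

Requiring the name to be repeated means a mistyped or default invocation fails safely instead of removing data; the same guard could sit behind both a command-line and a web interface.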

Reliability

Temporary problems with file systems, the database, or the DRM should not break the System, nor should they cause it to stop silently. Any failure, temporary or permanent, should result in a notification to users, sent exactly once per event. Temporary failures should be retried automatically, without user intervention, until they succeed.
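
The retry and notification behaviour described here could be sketched as follows. This is an illustration under stated assumptions, not the System's implementation: `TemporaryFailure`, `run_with_retries`, the `notify` callback, and the exponential back-off policy are all hypothetical choices made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TemporaryFailure(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a database timeout)."""


def run_with_retries(
    action: Callable[[], T],
    notify: Callable[[str], None],
    event_name: str,
    delay_seconds: float = 1.0,
    max_delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `action` until it succeeds, notifying users once per event.

    Only TemporaryFailure is retried; any other exception is treated as a
    permanent failure and propagates to the caller, which is expected to
    send its own single notification.
    """
    notified = False
    delay = delay_seconds
    while True:
        try:
            return action()
        except TemporaryFailure as err:
            if not notified:
                # Notify on the first failure only, not on every retry.
                notify(f"{event_name}: temporary failure, retrying: {err}")
                notified = True
            sleep(delay)
            # Back off exponentially (an assumed policy), capped at the maximum.
            delay = min(delay * 2, max_delay_seconds)
```

The `sleep` parameter is injectable so that the loop can be exercised without real waiting; in normal use the default `time.sleep` applies.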

If the System stops running, or stops successfully processing Pipelines, Users must be made aware of that fact.

Performance

Discovering that a new Pipeline Step should be run, and submitting the corresponding Job to the DRM, should take the System no longer than the DRM takes to finish executing that Job. Pipeline processing overhead should not be the bottleneck during analysis.

Supportability

The System should be well documented and well structured, so that external developers unfamiliar with the System can quickly extend it to meet their needs. Best-practice Object-Oriented design should be used throughout.

The System must allow Pipelines to be highly configurable, and must support changing configurations. Because scientific research often involves repeating an analysis with variations, it should be possible to run the same pipeline multiple times with slightly different parameters without having to worry about overwriting previous results.
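
One possible way to keep runs with different parameters from overwriting each other is to derive the output location from the configuration itself. The sketch below assumes this approach purely for illustration; `output_dir_for`, the directory layout, and the 12-character digest length are hypothetical and are not taken from VRPipe.

```python
import hashlib
import json
from pathlib import PurePosixPath
from typing import Any, Mapping


def output_dir_for(
    root: str, pipeline_name: str, params: Mapping[str, Any]
) -> PurePosixPath:
    """Return an output directory that is unique to this parameter set.

    The parameters are serialised in a canonical form (sorted keys) so that
    the same settings always map to the same directory, while any change in
    a value maps to a different one.
    """
    canonical = json.dumps(dict(params), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return PurePosixPath(root) / pipeline_name / digest
```

With this scheme, rerunning a pipeline with an unchanged configuration resolves to the existing directory, and changing a single parameter yields a fresh one, so earlier results are left in place.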

Implementation Constraints

Leadership suggests that VRPipe be implemented using Perl Moose, taking into consideration the local expertise and general familiarity within the Bioinformatics community with Perl.

The System must work at the Sanger, which means it must support, at a minimum, MySQL as the database and LSF as the DRM.

Free Open Source Components

In general, it makes sense to maximise the use of free Perl modules from CPAN on this project to minimise development time and increase quality (due to those modules being well tested externally).

VRPipe itself is to be a publicly available Open Source project from the outset.