Use Cases - VertebrateResequencing/vr-pipe GitHub Wiki
This document describes the functional requirements.
- Scope: VRPipe
- Level: user goal
- Primary Actor: Developer
- Success Guarantee: VRPipe system is usable on the compute cluster. Site-wide configuration parameters can be recalled automatically or manually when needed, and are returned as they were defined. Passwords are not visible to those without privileges.
- Developer: Wants an easy way to install the system and to test if installation was successful. Wants the system to be compatible with his hardware/software, or at least needs to be able to easily make it compatible. Wants to be able to define site-wide configuration options such as the database name and password, to save time in redefining these options (that don't change) every time a new pipeline is set up.
- System Administrator: Does not want to do a system-wide installation, or allow root access. Does not want passwords stored insecurely.
1. Developer acquires the VRPipe code.
2. Developer runs the installation script.
3. System prompts for a site-wide configuration option.
4. Developer provides the configuration option.
5. System stores the option on disc for later installation.
3-5 repeated until all options provided.
6. System runs a test suite.
7. System confirms the tests were successful.
8. System proceeds with installation.
9. System provides summary of actions and confirmation of installation success.
2a. System detects missing external dependencies:
1. System offers to install missing dependencies.
2. Developer allows this.
3. Dependencies are installed.
2a. Developer does not allow this:
1. System exits in failure.
3a. These options include:
1. What distributed resource management systems (DRM) are available?
2. Which DRM should be used by default?
3. Which queues in the DRM should be used in order of preference?
4. What are the limits (memory, time, disc space) for each queue?
5. What should be the default root directory in which STDOUT and STDERR files will be stored?
6. What is the global maximum number of jobs to be scheduled per-user across all queues?
7. Which database systems are available?
8. Which database should be used by default?
9. What credentials are needed to read and write to the database?
4a. System detects an invalid supplied option:
1. System explains what options are allowed and repeats step 3 for the same option.
6a. One or more tests failed:
1. System shows which tests failed and exits in failure.
8a. Installation fails for any reason:
1. If known, system reports why installation failed, e.g. no permission to write files.
2. System exits in failure.
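Steps 3-5 together with extension 4a amount to a prompt, validate, store loop. The following is a minimal Python sketch of that loop, not VRPipe's installer: the option names (`default_drm`, `default_db`, `log_root`), their allowed values and the JSON file format are all hypothetical. Credential handling is deliberately left out, since passwords must not be stored in a plain-text file like this one. The `ask` callable is injected so the loop can be driven without a terminal.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical option definitions: name -> allowed values (empty tuple = free text).
OPTIONS: Dict[str, Tuple[str, ...]] = {
    "default_drm": ("lsf", "local"),
    "default_db": ("mysql", "sqlite"),
    "log_root": (),
}

def collect_options(ask: Callable[[str], str]) -> Dict[str, str]:
    """Prompt for each option, re-asking until the answer is valid (extension 4a)."""
    answers: Dict[str, str] = {}
    for name, allowed in OPTIONS.items():
        while True:
            value = ask(name).strip()
            if value and (not allowed or value in allowed):
                answers[name] = value
                break
            # Explain what is allowed, then repeat step 3 for the same option.
            hint = ", ".join(allowed) or "any non-empty text"
            print(f"Invalid value for {name}; allowed: {hint}")
    return answers

def save_options(answers: Dict[str, str], path: str) -> None:
    """Store the answers on disc (step 5)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(answers, fh, indent=2, sort_keys=True)

def load_options(path: str) -> Dict[str, str]:
    """Recall the stored options exactly as they were defined."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

In interactive use, `collect_options(lambda name: input(f"{name}: "))` would drive the same loop from the terminal.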
- Database and DRM abstraction, to improve portability
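One way to read "DRM abstraction" is that pipeline code depends only on a small scheduler interface, with one backend per DRM. The sketch below illustrates that idea in Python under stated assumptions: the `Scheduler` interface, its method names, the status strings and the `InMemoryScheduler` stand-in are all hypothetical and do not describe VRPipe's actual classes.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class Scheduler(ABC):
    """Hypothetical interface the pipeline code depends on instead of a concrete DRM."""

    @abstractmethod
    def submit(self, command: List[str], queue: str) -> str:
        """Submit a command and return a scheduler-specific job id."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return one of 'pending', 'running', 'done' or 'failed'."""

class InMemoryScheduler(Scheduler):
    """Stand-in backend for illustration; a real backend would wrap the site's DRM."""

    def __init__(self) -> None:
        self._jobs: Dict[str, str] = {}

    def submit(self, command: List[str], queue: str) -> str:
        job_id = f"{queue}-{len(self._jobs) + 1}"
        self._jobs[job_id] = "pending"
        return job_id

    def status(self, job_id: str) -> str:
        return self._jobs[job_id]

def run_step(scheduler: Scheduler, command: List[str], queue: str = "normal") -> str:
    """Pipeline code sees only the Scheduler interface, never a specific DRM."""
    return scheduler.submit(command, queue)
```

Porting to a different DRM would then mean writing one new `Scheduler` subclass, leaving callers such as `run_step` untouched.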
- Scope: VRPipe
- Level: user goal
- Primary Actor: Researcher
- Preconditions: User is authorised to use the system.
- Success Guarantee: All input data is analysed by the defined series of tools (run with their defined parameters) for this pipeline. Results are integrity-checked, stored, and remain retrievable and verifiable. The actions that were carried out to generate the results are stored and remain retrievable. This particular pipeline won't be run again if neither input nor parameters change.
- Researcher: Needs a very easy way of carrying out an analysis on their dataset by choosing or creating a pipeline and running it. They need to be notified of non-recoverable errors and should have an option of monitoring progress, but otherwise wish to continue working on other things and not worry about the pipeline whilst it runs. They need to be notified when the pipeline has completed. They need to be able to remind themselves of what exactly the pipeline did, and what result files it generated.
- Manager: Wants to be able to define ongoing analysis projects and automate the start up and completion of their pipelines so that analyses occur without any human action at all.
- System Administrator: Wants resources of the compute cluster to be used efficiently and in the prescribed way.
- Team Leader/Principal Investigator and External Collaborators: Need to know when analyses (pipelines) have started or finished, have good estimates on completion time whilst they are still running, and be assured that once finished the results are complete and correct (their integrity has not been compromised).
- Scientific Research Publication: May require complete details of an analysis (the specifics of each step of a pipeline), so this must be easily recallable after a pipeline has finished running.
1. Researcher accesses one of the VRPipe interfaces for initiating a pipeline.
2. System asks for a definition of the input data.
3. Researcher provides details.
4. System presents a choice of pre-defined whole pipelines and individual actions that are compatible with the type
of input or previous selection, and also options to custom define a new action, or to finish selection.
5. Researcher makes a selection.
6. System records the selection, appending it to the new pipeline being created.
4-6 repeated until Researcher chooses to finish selection.
7. System presents a summary of the newly defined pipeline.
9. System asks for a user-friendly name for the pipeline.
10. User supplies a name.
11. System finalises and enables the new pipeline, then terminates the interface for pipeline initialisation.
12. In the background, the System begins executing steps of the pipeline by submitting jobs to the DRM.
13. System emails Researcher when the pipeline completes.
* At any time, any authorised user can request to see the status of the system:
1. User accesses one of the VRPipe interfaces for monitoring pipelines.
2. System presents a summary of all running and recently completed pipelines.
* User supplies a pipeline name parameter or selects one of the listed pipelines:
1. System presents more detailed information on the state of that single pipeline, including the time it
was started, the estimated time to (or actual time of) completion, overall "health" (number of
successful commands compared to failed ones), and details of any current problems. The interface
allows the User to optionally further drill down into the finer details.
* At any time, any authorised user can request to see the set-up details of a particular pipeline:
1. User accesses one of the VRPipe interfaces for viewing pipelines.
2. System presents a summary of what actions the pipeline consists of, with an optional drill-down to see the
individual command lines that were/will be run. It also describes the input and output data (type, location,
size).
2a. If the pipeline has completed, User has the option of verifying the result files have not lost
integrity:
1. System compares the file checksums stored immediately after the result files were created to the
current file checksums.
1a. Checksum comparison fails:
1. System offers to roll-back the pipeline as little as possible in order to recreate the bad
result files.
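The integrity check in extension 2a can be sketched as a comparison of stored checksums against freshly computed ones. This is a minimal Python illustration, not VRPipe's implementation: the use cases do not name a checksum algorithm, so MD5 is an assumption here, and the `stored` mapping stands in for whatever the system recorded when the result files were created.

```python
import hashlib
from typing import Dict, List, Optional

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large result files need not fit in memory."""
    digest = hashlib.md5()  # assumed algorithm; the source only says "checksums"
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_corrupt(stored: Dict[str, str]) -> List[str]:
    """Return the paths whose current checksum differs from the stored one.

    Missing or unreadable files are reported as corrupt too.
    """
    bad: List[str] = []
    for path, expected in stored.items():
        current: Optional[str]
        try:
            current = file_md5(path)
        except OSError:
            current = None
        if current != expected:
            bad.append(path)
    return sorted(bad)
```

A non-empty return value corresponds to extension 1a: those are the files for which a roll-back would be offered.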
12a. System detects a job hasn't been scheduled by the DRM for a long time:
1. System queries state of the DRM.
2. When state is OK, another attempt at scheduling (submitting) the job is made.
1a. The DRM cannot be reached for a long time:
1. System sends an email warning to affected users.
2. System keeps trying (12a:1), but will not email again for this event.
1b. DRM reports that queues are closed:
1. System silently keeps checking until queues are open.
1c. DRM reports there is a problem with the job submission:
1. System sends an error email to affected user.
2. A block is placed on the affected part of the pipeline so that further submission attempts are not
made until a user manually fixes the problem and removes the block. An interface is provided to
investigate these problems easily.
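Extension 1a requires that a warning is emailed once per event and not repeated while the system keeps retrying. The Python sketch below shows one way to suppress the repeats, under stated assumptions: `OnceNotifier`, the `event_key` string and the injected `send` callable are hypothetical, and the `clear` method (so that a later recurrence warns again) is an addition the use case does not specify. In a real system this state would presumably live in the database, not in memory.

```python
from typing import Callable, Set, Tuple

class OnceNotifier:
    """Send at most one warning per (user, event) pair."""

    def __init__(self, send: Callable[[str, str], None]) -> None:
        self._send = send  # e.g. a function that emails `message` to `user`
        self._sent: Set[Tuple[str, str]] = set()

    def warn(self, user: str, event_key: str, message: str) -> bool:
        """Send the warning unless it was already sent; report whether it was sent."""
        key = (user, event_key)
        if key in self._sent:
            return False
        self._send(user, message)
        self._sent.add(key)
        return True

    def clear(self, event_key: str) -> None:
        """Forget a resolved event so that a new occurrence can warn again (assumption)."""
        self._sent = {k for k in self._sent if k[1] != event_key}
```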
12b. System detects that a job has been in the pending state for a long time:
1. System queries state of queue and pending reason according to the DRM.
1a. Queue is full:
1. No action is taken.
1b. Resource requirements will never be met:
1. Job switched to queue offering sufficient resources
1a. No such queue exists:
1. Job cancelled and resubmitted with lower requirements
1a. Job fails a single time with the lower requirements due to running out of resources:
1,2 see (12a:1c)
1c. Resource requirements are much higher than the average resources used by previous similar jobs:
1. Job cancelled and resubmitted with lower requirements (1 s.d. above average)
1a. Job had previously had its requirements increased due to failing at lower requirements:
1. System sends a warning email to affected user (only 1 for this reason, action and pipeline)
2. The job is left to continue pending in the hope it eventually runs.
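The "1 s.d. above average" rule in extension 1c can be worked through as follows. This is a hedged Python sketch: the use case does not define "much higher", so the `much_higher_factor` threshold of 2.0 is an assumption, as are the function names and the choice of megabytes as the unit. The sample standard deviation is used.

```python
from statistics import mean, stdev
from typing import Sequence

def typical_requirement(previous_usage_mb: Sequence[float]) -> float:
    """Average usage of previous similar jobs plus one sample standard deviation."""
    if len(previous_usage_mb) < 2:
        raise ValueError("need at least two previous jobs to estimate spread")
    return mean(previous_usage_mb) + stdev(previous_usage_mb)

def adjusted_request(
    requested_mb: float,
    previous_usage_mb: Sequence[float],
    previously_increased: bool,
    much_higher_factor: float = 2.0,  # assumed meaning of "much higher"
) -> float:
    """Return the requirement the pending job should be resubmitted with.

    The request is returned unchanged when it is not much higher than average,
    or when it was raised earlier after a failure (extension 1a: warn once and
    leave the job pending).
    """
    if previously_increased:
        return requested_mb
    if requested_mb <= much_higher_factor * mean(previous_usage_mb):
        return requested_mb
    return typical_requirement(previous_usage_mb)
```

For example, with previous usage of 100, 120 and 140 MB the mean is 120 MB and the sample standard deviation is 20 MB, so a pending job that asked for 1000 MB would be resubmitted asking for 140 MB.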
12c. System detects that a running job has been running long enough that it will soon exceed run-time limits of the
queue it is running in:
1. Job switched to queue offering greater time limit and greater or equal other resources
1a. No such queue exists:
1. No action taken.
12d. System finds a job that was killed by the DRM due to using non-time resources greater than requested or
allowed by the queue the job was submitted to:
1. System adjusts the job's required resources by increasing the requested problem resource (either
speculatively if no previous similar jobs failed for the same reason, or to the average amount needed by
other similar jobs that failed for the same reason).
2. System resubmits the job with the new requirements, in a different queue if necessary.
1-2 repeated until the job no longer fails due to running out of resources.
2a. No queue allows for the increased resources desired:
1,2 see (12a:1c)
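Extension 12d combines two decisions: how much of the exhausted resource to ask for next, and which queue can accommodate the new request. A minimal Python sketch, assuming memory is the problem resource: the `Queue` fields, the 1.5x speculative growth factor, and the fallback to a speculative increase when the average of similar failures is not above the current request are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Queue:
    name: str
    max_memory_mb: int
    max_hours: int

def next_memory_request(
    current_mb: int,
    similar_failures_needed_mb: Sequence[int],
    growth: float = 1.5,  # assumed size of the speculative increase
) -> int:
    """Pick the next request: average need of similar failed jobs, else speculative."""
    if similar_failures_needed_mb:
        averaged = round(sum(similar_failures_needed_mb) / len(similar_failures_needed_mb))
        if averaged > current_mb:
            return averaged
    return int(current_mb * growth)

def pick_queue(queues: Sequence[Queue], memory_mb: int, hours: int) -> Optional[Queue]:
    """Return the smallest queue that fits the request, or None (extension 2a)."""
    fitting = [q for q in queues if q.max_memory_mb >= memory_mb and q.max_hours >= hours]
    return min(fitting, key=lambda q: (q.max_memory_mb, q.max_hours), default=None)
```

A `None` result from `pick_queue` is the point at which the system would email the user and block the affected part of the pipeline, as in (12a:1c).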
12e. System finds a job that was killed by the DRM due to some non-resource-related reason:
1. Job is resubmitted with no alterations.
12f. System detects a job that failed whilst running, not due to being killed by the DRM:
1. System resubmits the job with no alterations.
1a. Job fails again:
1. System resubmits the job again.
1a. Job fails again, and the failing action is defined with a failure behaviour:
1. The failure behaviour is carried out. This may be to roll-back to a previous action and try
from there.
1b. Job fails again, and the failing action is normal:
1,2 see (12a:1c)
3. One option available to the user to resolve the problem is to force the system to go back
one or more steps and try again from there, possibly with alternate parameters provided by
the user for use in this particular case only.
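The escalation in extension 12f reduces to a small decision function on the number of failures so far. The Python sketch below encodes the sequence described above (two unaltered resubmissions, then either the action's failure behaviour or the email-and-block handling of 12a:1c); the `Action` names and the function signature are hypothetical.

```python
from enum import Enum

class Action(Enum):
    RESUBMIT = "resubmit"                    # resubmit with no alterations
    FAILURE_BEHAVIOUR = "failure_behaviour"  # e.g. roll back to a previous action
    EMAIL_AND_BLOCK = "email_and_block"      # as (12a:1c)

def on_job_failure(failure_count: int, has_failure_behaviour: bool) -> Action:
    """Decide what to do after the job's Nth failure (counting from 1)."""
    if failure_count < 1:
        raise ValueError("failure_count starts at 1")
    if failure_count <= 2:
        # First and second failures: simply try again.
        return Action.RESUBMIT
    if has_failure_behaviour:
        return Action.FAILURE_BEHAVIOUR
    return Action.EMAIL_AND_BLOCK
```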
12g. System detects a job that was running, has not exited, but is no longer responding:
1. System queries DRM about job state.
1a. Job has been suspended by the DRM for a long time, or DRM cannot be reached for a long time:
1,2 see (12a:1a)
1b. DRM claims job is running, or user chooses to trigger this behaviour following (12g:1a):
1. System kills the process associated with the problematic step, clears the job from the DRM if
possible, then as (12f).
- Database-based storage of state: minimise file-system access; increase concurrency; survive hardware and software crashes.
- Store meta-information about and track locations of files in the database: minimise file-system access
- Discovering which jobs need to be submitted next must not be done by a single process for all pipelines: this minimises the time that later pipelines waste while waiting for earlier pipelines to be processed
- There must be protection to ensure critical VRPipe processes are always running at any given moment, so that pipelines do not grind to a halt because a key process was killed.