This is not a pipe - HealthHackAu2014/HealthHack2014 GitHub Wiki

"This is not a pipe": VLSCI Bioinformatics Pipeline

Developers

Problem Owners

The Problem

Whole genome sequencing experiments involve complicated workflows that are broken down into a sequence of stages. Each stage reads its input data from one file and writes its results to another. That output file is then used as the input for the next stage.

When all these stages are chained together, they are referred to as a pipeline. Pipelines are often very complicated, yet they need to be changed for each new analysis.
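The file-to-file chaining described above can be sketched in a few lines of Python. This is only an illustration and does not use Rubra's actual interface: the stage functions `strip_headers` and `count_bases`, the helper `run_pipeline`, and the toy FASTA content are all hypothetical names invented for this example.

```python
import tempfile
from pathlib import Path


def strip_headers(src: Path, dst: Path) -> None:
    """Hypothetical stage 1: drop FASTA header lines (those starting with '>')."""
    lines = src.read_text().splitlines()
    sequence = [line.strip() for line in lines if line and not line.startswith(">")]
    dst.write_text("\n".join(sequence) + "\n")


def count_bases(src: Path, dst: Path) -> None:
    """Hypothetical stage 2: count each base in the header-free sequence file."""
    counts = {}
    for base in src.read_text().replace("\n", "").upper():
        counts[base] = counts.get(base, 0) + 1
    report = [f"{base}\t{counts[base]}" for base in sorted(counts)]
    dst.write_text("\n".join(report) + "\n")


def run_pipeline(stages, first_input: Path, workdir: Path) -> Path:
    """Run the stages in order; each stage's output file feeds the next stage."""
    current = first_input
    for index, stage in enumerate(stages):
        output = workdir / f"{index:02d}_{stage.__name__}.txt"
        stage(current, output)
        current = output
    return current


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        fasta = workdir / "example.fa"
        fasta.write_text(">toy_sequence\nACGT\nAACC\n")
        final = run_pipeline([strip_headers, count_bases], fasta, workdir)
        print(final.read_text(), end="")
```

Changing the analysis then amounts to editing the list of stages passed to `run_pipeline`, which hints at why a pipeline needs to stay easy to reconfigure.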

Rubra is one of the pipeline management tools in use at VLSCI. Unfortunately, Rubra is not straightforward to configure, especially for people from non-computational backgrounds. It would also be desirable to control or query Rubra after it has launched. Finally, Rubra's current logging and error output is inadequate and disorganised.

Our goal is to make some progress on improving Rubra in these areas.

The Solution

We progressed only part of the way towards a final solution. We formulated an approach to job control but were unable to complete the implementation. The approach was to write a command shell able to query and control jobs, built on threading and message queues, neither of which Rubra previously used. We did finish prototyping Rubra with threads.
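To make the threading and message queue idea concrete, here is a minimal sketch using Python's standard `threading` and `queue` modules. It is not the team's prototype and does not reflect Rubra's internals: the `PipelineWorker` class, the `status` and `stop` commands, and the stage names are assumptions made for illustration. A worker thread runs the stages and checks a command queue between stages, so a shell in the main thread could query or stop it.

```python
import queue
import threading


class PipelineWorker(threading.Thread):
    """Hypothetical worker: runs stages in order, answering commands between stages."""

    def __init__(self, stages):
        super().__init__(daemon=True)
        self.stages = list(stages)      # list of (name, callable) pairs
        self.commands = queue.Queue()   # controller -> worker
        self.replies = queue.Queue()    # worker -> controller
        self.completed = []

    def run(self):
        for name, action in self.stages:
            if self._handle_commands():
                self.replies.put(("stopped", list(self.completed)))
                return
            action()
            self.completed.append(name)
        self.replies.put(("finished", list(self.completed)))

    def _handle_commands(self):
        """Drain pending commands; return True if a stop was requested."""
        stop = False
        while True:
            try:
                command = self.commands.get_nowait()
            except queue.Empty:
                return stop
            if command == "status":
                self.replies.put(("status", list(self.completed)))
            elif command == "stop":
                stop = True


def demo():
    """Query and then stop a pipeline while its first stage is still running."""
    started = threading.Event()
    gate = threading.Event()

    def align():
        started.set()   # tell the controller that the stage is underway
        gate.wait()     # stand-in for long-running work

    worker = PipelineWorker([
        ("align", align),
        ("call_variants", lambda: None),
        ("annotate", lambda: None),
    ])
    worker.start()
    started.wait()
    worker.commands.put("status")
    worker.commands.put("stop")
    gate.set()
    worker.join(timeout=5)

    messages = []
    while not worker.replies.empty():
        messages.append(worker.replies.get())
    return messages


if __name__ == "__main__":
    for message in demo():
        print(message)
```

In this sketch commands are only noticed between stages, so a long-running stage cannot be interrupted partway through. An interactive front end could be layered on top by turning each typed command into a message on `commands` and printing whatever arrives on `replies`.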

Application/Relevance

The work we have completed so far could be used as a seed for developing a more complete pipeline control solution. Analysing the problem with the problem owners helped them clarify what the final solution should look like and where efforts should be focused from here.

Datasets

Rubra itself contains a basic example. We were also provided with a fruit fly genome in FASTA format.

Links

Tech stack

Tradeoffs/analysis

One problem we found was that none of the developers had deep expertise in the areas required to address all aspects of the problem. All of the problem solvers had at least some experience working over ssh in a Unix environment, and two had at least a passing familiarity with Python, but given the scope of the problem we found ourselves somewhat out of our depth. We would have felt more comfortable with more expertise in Python and some previous experience implementing interprocess communication, threading and message queuing, and writing interpreters for domain-specific languages.

Future functionality

Given the opportunity, we would enjoy continuing to work with the problem owners to flesh out the solution further.
