This is not a pipe - HealthHackAu2014/HealthHack2014 GitHub Wiki
- Mike Walsh
- Ross Anderson (@evadoross, Linkedin)
- Timothy Rice (@0x7472)
Whole genome sequencing experiments involve complicated workflows broken down into a sequence of stages. Each stage reads its input from one file and writes its output to another, and that output file then serves as the input to the next stage.
Chained together, these stages form a pipeline. Pipelines are often very complicated, yet they need to be modified for each new analysis.
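The file-in, file-out chaining described above can be sketched in a few lines of Python. This is only an illustration, not how Rubra is implemented: the stage functions (`to_upper`, `count_bases`), the `run_pipeline` helper and the file names are all hypothetical.

```python
import tempfile
from pathlib import Path


def to_upper(src: Path, dst: Path) -> None:
    """Hypothetical stage: copy src to dst, upper-casing sequence lines."""
    with src.open() as fin, dst.open("w") as fout:
        for line in fin:
            # FASTA header lines start with '>' and are copied unchanged.
            fout.write(line if line.startswith(">") else line.upper())


def count_bases(src: Path, dst: Path) -> None:
    """Hypothetical stage: write the total number of sequence characters."""
    total = 0
    with src.open() as fin:
        for line in fin:
            if not line.startswith(">"):
                total += len(line.strip())
    dst.write_text(f"{total}\n")


def run_pipeline(stages, first_input: Path, workdir: Path) -> Path:
    """Run each stage in order; the output of one is the input of the next."""
    current = first_input
    for i, stage in enumerate(stages):
        output = workdir / f"stage{i}_{stage.__name__}.out"
        stage(current, output)
        current = output
    return current


def demo_pipeline() -> str:
    """Build a tiny made-up FASTA file and push it through both stages."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        source = workdir / "input.fa"
        source.write_text(">seq1\nacgt\nac\n")
        final = run_pipeline([to_upper, count_bases], source, workdir)
        return final.read_text().strip()
```

Changing the analysis then amounts to changing the list passed to `run_pipeline`, which is the kind of per-analysis modification that a pipeline tool has to make easy.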
Rubra is one pipeline management tool in use at VLSCI. Unfortunately, configuring Rubra is not straightforward, especially for people from non-computational backgrounds. It would also be useful to be able to control or query Rubra after it has launched. Finally, Rubra's current logging and error output is inadequate and disorganised.
Our goal is to make some progress on improving Rubra in these areas.
We progressed only part way towards a final solution. We formulated an approach for job control but were unable to complete the implementation. The idea was to write a command shell able to query and control jobs, built on threading and message queues, neither of which Rubra previously used. So far we have finished a prototype of Rubra that uses threads.
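To make the intended approach concrete, here is a minimal sketch of a command shell that queries and controls a job running in a worker thread, with the two sides talking over message queues. It is not the prototype we wrote and does not use any Rubra code: `JobRunner`, `JobShell`, the command names and the stage names are all assumptions made for illustration, and only the standard-library `threading`, `queue` and `cmd` modules are used.

```python
import cmd
import io
import queue
import threading


class JobRunner(threading.Thread):
    """Hypothetical worker: runs stages in order, obeying queued commands."""

    def __init__(self, stages):
        super().__init__(daemon=True)
        self.stages = list(stages)      # list of (name, callable) pairs
        self.commands = queue.Queue()   # shell -> worker
        self.replies = queue.Queue()    # worker -> shell

    def run(self):
        paused, index = False, 0
        while True:
            # Block for a command when there is nothing to run;
            # otherwise just poll so pending commands are handled first.
            idle = paused or index >= len(self.stages)
            try:
                command = self.commands.get(block=idle)
            except queue.Empty:
                command = None
            if command == "status":
                done = index >= len(self.stages)
                state = "finished" if done else "paused" if paused else "running"
                self.replies.put(f"{state}: {index}/{len(self.stages)} stages done")
            elif command == "pause":
                paused = True
            elif command == "resume":
                paused = False
            elif command == "stop":
                self.replies.put("stopped")
                return
            elif command is None:
                _name, func = self.stages[index]
                func()                  # run the next stage
                index += 1


class JobShell(cmd.Cmd):
    """Hypothetical shell that forwards each command to the worker."""

    prompt = "(jobs) "

    def __init__(self, runner, **kwargs):
        super().__init__(**kwargs)
        self.runner = runner

    def _send(self, command, expect_reply=False):
        self.runner.commands.put(command)
        if expect_reply:
            print(self.runner.replies.get(timeout=5), file=self.stdout)

    def do_status(self, arg):
        """Report the job state and how many stages have completed."""
        self._send("status", expect_reply=True)

    def do_pause(self, arg):
        """Pause before the next stage starts."""
        self._send("pause")

    def do_resume(self, arg):
        """Resume a paused job."""
        self._send("resume")

    def do_stop(self, arg):
        """Stop the worker and leave the shell."""
        self._send("stop", expect_reply=True)
        return True                     # a true return value ends cmdloop()


def demo_shell() -> str:
    """Scripted session: the job is paused before any stage has run."""
    stages = [("align", lambda: None), ("call_variants", lambda: None)]
    runner = JobRunner(stages)
    runner.commands.put("pause")        # queued before start()
    runner.start()
    out = io.StringIO()
    shell = JobShell(runner, stdout=out)
    for line in ("status", "stop"):
        shell.onecmd(line)
    runner.join(timeout=5)
    return out.getvalue()
```

For interactive use, `JobShell(runner).cmdloop()` would read commands from the terminal instead. In this sketch commands are only acted on between stages, so a long-running stage cannot be interrupted part way through; a fuller solution would need to decide how to handle that.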
The work we have completed so far could be used as a seed for developing a more complete pipeline control solution. Analysing the problem with the problem owners also helped them clarify what the final solution should look like and where effort should be focused from here.
Rubra itself contains a basic example. We were also provided with a fruit fly genome in FASTA format.
- Link to GitHub repository for Rubra.
One problem was that none of the developers was strong in all the areas needed to address every aspect of the problem. All the problem solvers had at least some experience working over ssh in a Unix environment, and two had a passing familiarity with Python, but given the scope of the problem we found ourselves somewhat out of our depth. We would have been more comfortable with more expertise in Python and some previous experience of interprocess communication, threading, message queuing, and writing interpreters for domain-specific languages.
Given the opportunity, we would enjoy continuing to work with the problem owners to flesh out the solution further.