Gene Machine - HealthHackAu2013/wiki GitHub Wiki

Team Members

Andy Kitchen (Developer, Silverpond, @auastro)
Jono Chang (Developer, Silverpond, @jonochang)
Toby Sargeant (Data scientist, Walter and Eliza Hall Institute)
Sally Hunter (Biologist, Peter MacCallum Cancer Centre)
Maria Doyle (Bioinformatician, Peter MacCallum Cancer Centre)
Les Kitchen (Developer, Computing and Information Systems, University of Melbourne)

The Problem

A lack of tools that allow researchers to easily and effectively assess their experimental data.

Targeted genetic screening enables the detection of changes (mutations) in selected regions of an individual’s DNA sample. New and rapidly evolving technologies are revolutionising this area, enabling a massive scaling-up of genetic testing. New tests are released for medical research on a regular basis, and the field moves so quickly that it is often difficult to know how robust a new test is. Samples may also perform differently on these tests, with some performing worse than others either across the entire test or only in certain genomic regions. Knowing which data can be trusted and which are unreliable is critically important. We need a tool to help medical researchers quickly and easily assess the performance of both tests and samples.

The Solution

After considering various alternatives, we settled on a web-based data-visualization tool. The experimental data is fetched from the server in JSON format, displayed as SVG, and manipulated interactively using d3. On the server side, Python with Flask wrangles the data into JSON.
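The server-side wrangling step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the row layout `(gene, sample, reads)`, the gene and sample names, and the function names `rows_to_payload` and `payload_json` are all assumptions made for the example. Only the standard library is used; in the real tool a Flask route would return a string like this to the browser.

```python
import json
from collections import defaultdict

# Hypothetical input: one (gene, sample, reads) row per measurement.
ROWS = [
    ("GENE_A", "sample_1", 120),
    ("GENE_A", "sample_2", 95),
    ("GENE_B", "sample_1", 3),
    ("GENE_B", "sample_2", 210),
]


def rows_to_payload(rows):
    """Group flat rows into one record per gene, each holding its per-sample reads."""
    by_gene = defaultdict(list)
    for gene, sample, reads in rows:
        by_gene[gene].append({"sample": sample, "reads": reads})
    # A list of records (not a dict keyed by gene) keeps the gene order explicit.
    return [{"gene": gene, "points": points} for gene, points in sorted(by_gene.items())]


def payload_json(rows):
    """Serialise the grouped records to the JSON text the client would fetch."""
    return json.dumps(rows_to_payload(rows))
```

A list of `{"gene": ..., "points": [...]}` records is one shape that binds naturally to one scatterplot per gene on the client; other layouts would work equally well.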

In its current form, the visualization tool displays, for each gene, a scatterplot of reads for each sample. This view can be scrolled, zoomed, scaled and sorted. Individual samples can be picked out and highlighted. Users can readily see which probes or samples are behaving badly.
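The sorting, scaling and sample-highlighting operations could look something like the sketch below. In the tool itself these happen in the browser via d3; the logic is shown here in Python purely for illustration. The choice of median read count as the sort key and of a log10 scale are assumptions, as are the data and the function names.

```python
import math
from statistics import median

# Hypothetical per-gene read counts, keyed by gene and then by sample.
READS = {
    "GENE_A": {"sample_1": 120, "sample_2": 95, "sample_3": 110},
    "GENE_B": {"sample_1": 3, "sample_2": 210, "sample_3": 190},
    "GENE_C": {"sample_1": 0, "sample_2": 40, "sample_3": 55},
}


def genes_sorted_by_median(reads):
    """Order genes by median read count, lowest first, so weak regions group together."""
    return sorted(reads, key=lambda gene: median(reads[gene].values()))


def log_scaled(count):
    """Return log10(count + 1), which compresses large counts and keeps zero plottable."""
    return math.log10(count + 1)


def highlight(reads, sample):
    """Return (gene, scaled reads) pairs for one picked sample, in sorted gene order."""
    return [(gene, log_scaled(reads[gene][sample])) for gene in genes_sorted_by_median(reads)]
```

With this data, `highlight(READS, "sample_1")` traces one sample across the sorted genes, which is the kind of view that makes a sample with unusually low reads in a single gene stand out.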

MVP Criteria:

Visualisation/Effective representation of the data
- Meaningful information can be derived
- Aesthetically pleasing
Interactivity
- Ability to zoom and scroll
- Ability to sort
- Filter on specified criteria
- Ability to select individual samples

Application/Relevance

The new generation of DNA sequencing technologies is transforming medical research and has the potential to produce a paradigm shift in healthcare. Translating current and newly emerging medical knowledge into reliable predictive and diagnostic clinical tests could shift society to a more preventative model of health management, thereby making health spending more effective. For genetic testing, this translational capacity is largely determined at the basic medical research level, in the discovery of disease-causing genes and the development of genetic tests, both of which depend heavily on the quality of the data. A quality assessment tool that medical researchers can apply to a range of data types to rapidly identify anomalies and problems will greatly expedite the refinement of clinical tests. Effective tools of this nature will allow researchers to focus more of their time on the important biological questions, thereby increasing the cost-effectiveness of our research spending and ensuring a higher quality of scientific findings released into the public domain.

Datasets

An in-house targeted sequencing dataset was used. Since this data is sensitive, dummy data is used for the publicly deployed sample webpage.

Tech stack

d3 - http://d3js.org/ - JavaScript visualization framework

flask - http://flask.pocoo.org/ - Python microframework

Tradeoffs/analysis

We faced two main tradeoffs: one on the visualization side, the other on the processing side. On the visualization side, we want to give the user a global view of the data while still letting them focus on details. On the processing side, we had to decide how much to process client-side in the browser and how much server-side.
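The processing tradeoff can be made concrete with a small sketch comparing two possible payloads for one gene: every raw point (leaving the summarising to the browser) versus a summary computed on the server. This is an illustration under assumed data and hypothetical function names, not a description of what the tool actually sends; quartiles are used as the example summary.

```python
import json
from statistics import quantiles

# Hypothetical read counts for one gene across eight samples.
GENE_READS = [12, 15, 18, 20, 22, 25, 30, 400]


def raw_payload(gene, reads):
    """Client-side processing: ship every point and let the browser summarise."""
    return {"gene": gene, "reads": reads}


def summary_payload(gene, reads):
    """Server-side processing: ship only a five-number summary, losing per-sample detail."""
    q1, q2, q3 = quantiles(reads, n=4, method="inclusive")
    return {"gene": gene, "min": min(reads), "q1": q1, "median": q2, "q3": q3, "max": max(reads)}
```

The summary stays the same size however many samples there are, which favours the server-side route for a global overview, while the raw payload is what a zoomed-in, per-sample view needs. A mix of the two is one plausible way to serve both ends of the visualization tradeoff.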

Future functionality

Additional, more detailed views for individual genes, with:
- Individual sample, multi-sample and aggregate view capabilities
- Genome location
- Dynamic summary stats e.g. quartiles
Simultaneous view of two or more subgroups, e.g. cases vs controls
Ability to switch between views/data levels
Links out to existing tools – e.g. IGV to visualise raw data
Export data summaries
Export graphics (jpg, png, pdf etc.)
Save analyses as session/project
Expand to other data types