Home - mmclellan/SuperTransript GitHub Wiki

SuperTranscript - Masters of Bioinformatics research project

The objective of this research project is to construct a SuperTranscript using only a set of transcripts from a given gene, without using a reference genome.

Background

This project's goal is to produce a single transcript sequence from de novo assembled transcripts. The produced transcript will contain all the unique sequences from the assembled transcripts and will allow RNA-seq reads to be mapped to a single transcript for further study of the gene.

This project will not do any read mapping or de novo assembly. The transcripts given are assumed to be from the same gene.

Approach

A graph-based approach is used for this project. First, the given transcripts are aligned to one another using BLAT. Aligning the transcripts shows which regions of sequence are shared between transcripts and which are exclusive to one. The tool then constructs a graph from this alignment information.
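As an illustration of how alignment information might be read in, the sketch below parses one line of PSL, BLAT's default tab-separated output format, into ungapped aligned blocks. It assumes any header lines have already been removed and handles only plus-strand alignments. The names `Block` and `parse_psl_line` are hypothetical and not part of this project, and the example line is hand-written toy data rather than real BLAT output.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """One ungapped aligned block between a query and a target transcript."""
    q_name: str
    q_start: int  # 0-based start of the block in the query
    t_name: str
    t_start: int  # 0-based start of the block in the target
    size: int     # number of aligned bases in the block


def parse_psl_line(line: str) -> List[Block]:
    """Split one PSL record into its aligned blocks (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 21:
        raise ValueError(f"expected 21 PSL columns, got {len(fields)}")
    if fields[8] != "+":
        # Minus-strand query coordinates need converting; out of scope here.
        raise ValueError("only plus-strand alignments are handled in this sketch")
    q_name, t_name = fields[9], fields[13]
    # blockSizes, qStarts and tStarts are comma-separated with a trailing comma.
    sizes, q_starts, t_starts = (
        [int(x) for x in fields[i].split(",") if x] for i in (18, 19, 20)
    )
    return [
        Block(q_name, qs, t_name, ts, n)
        for n, qs, ts in zip(sizes, q_starts, t_starts)
    ]


# Toy record: transcript B (6 bases) aligned to transcript A (9 bases),
# with B skipping the middle 3 bases of A.
example = "\t".join([
    "6", "0", "0", "0", "0", "0", "1", "3", "+",
    "B", "6", "0", "6",
    "A", "9", "0", "9",
    "2", "3,3,", "0,3,", "0,6,",
])
blocks = parse_psl_line(example)
```

Each block says that `size` consecutive bases of one transcript line up with `size` consecutive bases of another, which is the information needed to decide which bases should share a node.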

Each node in the graph represents a single base. Each node stores its base and the coordinate at which that base is found in each transcript. If the base is not present in a certain transcript, the node stores a null value for that transcript. With this approach, a base shared by several transcripts is stored only once.
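A minimal sketch of such a node and of graph construction is shown below. It is not the project's implementation: the `Node` class, the `build_graph` function and the block tuple layout `(query, query_start, target, target_start, size)` are assumptions made for illustration. Aligned bases are merged with a small union-find, and the sketch assumes no transcript aligns to itself, so each node holds at most one position per transcript.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

AlignedBlock = Tuple[str, int, str, int, int]  # (query, q_start, target, t_start, size)


@dataclass
class Node:
    base: str
    # transcript name -> 0-based position of this base, or None if absent
    positions: Dict[str, Optional[int]]
    successors: Set[int] = field(default_factory=set)


def build_graph(transcripts: Dict[str, str], blocks: List[AlignedBlock]) -> List[Node]:
    # Union-find over (transcript, position) keys; aligned bases share a root.
    parent = {(n, i): (n, i) for n, s in transcripts.items() for i in range(len(s))}

    def find(key: Tuple[str, int]) -> Tuple[str, int]:
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for q, q_start, t, t_start, size in blocks:
        for off in range(size):
            # Merge only bases that actually match; mismatches stay separate.
            if transcripts[q][q_start + off] == transcripts[t][t_start + off]:
                parent[find((q, q_start + off))] = find((t, t_start + off))

    node_id: Dict[Tuple[str, int], int] = {}
    nodes: List[Node] = []
    for name, seq in transcripts.items():
        prev = None
        for i, base in enumerate(seq):
            root = find((name, i))
            if root not in node_id:
                node_id[root] = len(nodes)
                nodes.append(Node(base, {n: None for n in transcripts}))
            nid = node_id[root]
            nodes[nid].positions[name] = i
            if prev is not None:
                nodes[prev].successors.add(nid)  # edge between consecutive bases
            prev = nid
    return nodes


# Toy example: B is A with the middle three bases skipped.
transcripts = {"A": "AAACCCGGG", "B": "AAAGGG"}
nodes = build_graph(transcripts, [("B", 0, "A", 0, 3), ("B", 3, "A", 6, 3)])
```

In this toy example the fifteen bases of the two transcripts collapse into nine nodes, and the node for the last shared `A` has two successors: one into the `CCC` region found only in transcript A and one straight to the `GGG` region.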

Once the graph is constructed, it is simplified by condensing nodes into sequences. The graph is then topologically sorted, and the sequence is read off the sorted nodes to give the SuperTranscript.
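The simplification and sorting steps can be sketched as follows, assuming the graph is acyclic and is given as a mapping from node id to base plus a mapping from node id to successor ids. The function names are hypothetical, and Kahn's algorithm is used here as one common way to obtain a topological order; the project may do this differently.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def compact(bases: Dict[int, str], succ: Dict[int, Set[int]]) -> Tuple[Dict[int, str], Dict[int, Set[int]]]:
    """Merge unbranched chains of single-base nodes into sequence nodes."""
    pred: Dict[int, Set[int]] = {n: set() for n in bases}
    for n, outs in succ.items():
        for m in outs:
            pred[m].add(n)

    seqs: Dict[int, str] = {}
    new_succ: Dict[int, Set[int]] = {}
    for n in bases:
        # A node continues a chain if its only predecessor has it as only successor.
        if len(pred[n]) == 1 and len(succ.get(next(iter(pred[n])), ())) == 1:
            continue
        chain, cur = [n], n
        while len(succ.get(cur, ())) == 1:
            nxt = next(iter(succ[cur]))
            if len(pred[nxt]) != 1:
                break
            chain.append(nxt)
            cur = nxt
        seqs[n] = "".join(bases[c] for c in chain)
        new_succ[n] = set(succ.get(cur, ()))  # successors of the chain's tail
    return seqs, new_succ


def topological_order(succ: Dict[int, Set[int]]) -> List[int]:
    """Kahn's algorithm; raises if the graph contains a cycle."""
    indegree = {n: 0 for n in succ}
    for outs in succ.values():
        for m in outs:
            indegree[m] += 1
    queue = deque(n for n in succ if indegree[n] == 0)
    order: List[int] = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in sorted(succ[n]):
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(succ):
        raise ValueError("graph has a cycle; no topological order exists")
    return order


def super_transcript(bases: Dict[int, str], succ: Dict[int, Set[int]]) -> str:
    seqs, edges = compact(bases, succ)
    return "".join(seqs[n] for n in topological_order(edges))


# Toy graph: a linear path AAACCCGGG plus an edge that skips the CCC region.
bases = dict(enumerate("AAACCCGGG"))
succ = {i: {i + 1} for i in range(8)}
succ[2] = {3, 6}
succ[8] = set()
result = super_transcript(bases, succ)
```

Here the nine single-base nodes condense into three sequence nodes (`AAA`, `CCC`, `GGG`), and the only valid topological order places the skipped region between its two flanks. When several orders are valid, the choice among them affects the SuperTranscript, which this sketch resolves arbitrarily by node id.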

Validation

Because the project uses de novo assembled transcripts, it is very difficult to validate the produced SuperTranscript. The validation methods considered so far include aligning the transcripts to the SuperTranscript to check that there are no gaps, repeats or regions of sequence in the wrong order. Another method is to reproduce the transcripts from the graph to check that each node is correctly stored and connected within the graph.
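The second method could look something like the sketch below, which rebuilds each transcript from the per-transcript coordinates stored on the nodes and compares it with the input. The node representation (a base plus a mapping from transcript name to position or `None`) and the function names are assumptions for illustration. This version checks only the stored bases and coordinates, not the edges between nodes.

```python
from typing import Dict, List, Optional, Tuple

NodeRecord = Tuple[str, Dict[str, Optional[int]]]  # (base, transcript -> position)


def reconstruct(nodes: List[NodeRecord], name: str) -> str:
    """Rebuild one transcript from the positions stored on the nodes."""
    placed = sorted(
        (pos[name], base) for base, pos in nodes if pos.get(name) is not None
    )
    coords = [p for p, _ in placed]
    # Every position 0..n-1 must appear exactly once: no gaps, no repeats.
    if coords != list(range(len(coords))):
        raise ValueError(f"positions for {name!r} are not contiguous: {coords}")
    return "".join(base for _, base in placed)


def validate(nodes: List[NodeRecord], transcripts: Dict[str, str]) -> Dict[str, bool]:
    """Report, per transcript, whether the graph reproduces it exactly."""
    report = {}
    for name, seq in transcripts.items():
        try:
            report[name] = reconstruct(nodes, name) == seq
        except ValueError:
            report[name] = False
    return report


# Toy graph for A = AAACCCGGG and B = AAAGGG (B skips the CCC region).
transcripts = {"A": "AAACCCGGG", "B": "AAAGGG"}
b_pos = [0, 1, 2, None, None, None, 3, 4, 5]
nodes = [(base, {"A": i, "B": b_pos[i]}) for i, base in enumerate("AAACCCGGG")]
report = validate(nodes, transcripts)

# A deliberately corrupted copy: one base of the shared region is wrong.
broken = list(nodes)
broken[7] = ("T", broken[7][1])
broken_report = validate(broken, transcripts)
```

A transcript that fails this check points to a node with a wrong base or a missing or duplicated coordinate. Checking that nodes are also correctly connected would additionally require walking the edges in coordinate order.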