Algorithm - mmclellan/SuperTransript GitHub Wiki

Algorithm

As this project consists of multiple steps, this section will outline the approach to each section.

BLAT output parsing

BLAT outputs a .psl file with each line describing the alignment of two transcripts. BLAT returns "blocks", regions that align that BLAT believe to be exons. Important information from the .psl file include the the size of the aligned section; the size of the transcripts; the number of blocks that align and the starting point of the blocks in each transcript.

Example of BLAT .psl file

Using the information on the blocks given by BLAT the sequences for each block is then found from the original .fasta file that contained the transcript sequences.

Graph Construction

As each block sequence is produced it is passed to the function that builds the graph. For graph construction the library iGraph is used. iGraph has functions for creating graph objects. As the graph object is quite slow to iterate over and search, a hash table is also created to store which vertices have already been added to the graph.

Firstly an empty graph is constructed as well as an empty hash table. The iGraph object is able to store attributes for each node. When the graph is constructed we set global attributes so that each node created is given the same set of attributes. Each attribute works like a hash table, it stores a "key value" which has to be unique and then can store an object (a string, integer, list). For the graph designate an attribute to be "Base" which will store the base pair, and then an attribute for each transcript that will store the genomic co-ordinate that base is found in that transcript. Each vertex will also have a unique index for the graph.

Vertex attribute example with three transcripts. This base represents a Cytosine that is found in each transcript.

Vertex 0 - "Base": C, "ENST00000338799.5":152, "ENST00000456483.2":123, "ENST00000206249.3":132

As the BLAT output is parsed the graph structure is extended. When the algorithm reaches a base that is not in the graph it is added to the graph with the genomic co-ordinates for the transcripts it is found in. If the base is present in the graph (previously added from another transcript) the node attributes are updated, adding the new transcript and co-ordinate to the node.

Once the BLAT output is fully parsed the graph will contain every unique base only once as a single node in the graph, with attributes showing the transcripts that base is in and the co-ordinate for that transcript.

The graph at this point contains no edges. Edges in the graph represent sequential bases within transcripts. So if two nodes have a directed edge between them, they are present side by side in a transcript. With this information we join nodes that have transcript co-ordinates that are different by only 1.

For example if Node:145 is present in transcript: 100 with coordinate 1 and Node: 364 is present in transcript 100 with coordinate 2, an edge from Node:145 to Node:364 will connect the two. If Node:145 is also present in Transcript 3 with coordinate 1 and Node:10 is present in Transcript 3 with coordinate 2, then an edge is also added between these two nodes. Forks in the graph are a result of the alternative splicing of the gene.

Traversing the graph structure to find the nodes that are adjacent in transcripts is computationally inefficient requires searching the graph multiple times. To avoid searching the graph multiple times, the algorithm uses a set of hash tables. Each transcript is assigned a hash table with the key value being the transcripts ID and storing an empty list the size of the length of the transcript. The index in each list is representative of the genomic co-ordinate for that transcript. Each index will store the node ID that the base at that coordinate can be found.

Example: The sequence for transcript EXMPL000001 is - GCTAGTGGTC The hash table for adding the edges looks like - "Transcript- EXMPL000001 : {1,2,3,4,5,56,57,58,59, 10}" This shows that the first base "G" is stored in Node 1 and the second base "C" is stored in Node 2.

To populate these hash tables the algorithm visits each node only once and updates the hash tables using the information stored at each node. Once the hash tables are fully populated the algorithm iterates over each table adding a directed edge from one Node ID to the next in the table. By using only the Node IDs to add edges is a much faster method for adding nodes to the graph.