Taylor Dev Diary - TheEvergreenStateCollege/bioinformatics GitHub Wiki

Taylor Dev Diary

Week 9

What I want to do: Investigate how to reduce memory in the suffix tree, specifically if we can use suffix links to remove redundant paths with reasonable compute/space complexity.

My goal this week is to put 12 hours into Smarty Plants (Outside of meetings/pair programming)

I spent 3 hours Tuesday and 4 hours Wednesday and an hour Thursday summarizing my finding into a document on the topic of finding redundant paths using suffix links for the implicit suffix trees. Now after rereading and considering my ideas, I'm realizing that I'd originally thought that Ukkonen's original paper suggested suffix links could be useful in search, but I'm realizing that it was an original idea I had formulated from reading the paper and not something proven. But I typed up this document here with what I found and I don't think that it's plausible to use suffix links for finding duplicate paths because there isn't any provable relationship that suffix linked nodes must have for alphabet sizes other than 2. Therefore I think it doesn't present any promise as a useful method. Spent another hour on Thursday talking about it in the group chat on Discord.

Here's the document I wrote:

Can We Use Suffix Links To Easily Find Duplicate Paths?

5/28 (3hr)

5/29 (4hr)

5/30 (2hr)

Week 8

What I want to do: Continue in helping the team understand and implement Ukkonen's Algorithm for the Suffix Tree Constructor. Continue learning Rust (It's been a struggle) and checking for bugs in the constructor. Also research more about reducing space complexity of constructor

My goal this week is to put 12 hours into Smarty Plants (Outside of meetings/pair programming)

Over the weekend I talked to Gavin about getting redundant memory deleted. Further considered recursive function that checks if a node has a suffix link and if it points to a node other than root than have a recursive function that deletes all paths from that node to a leave in the suffix tree. Turns out that two nodes being suffix linked doesn’t guarantee that they share the same children so I have to rework some of my ideas.

On Sunday I started looking into that, also studied some more Rust with the Rust by example but gonna write any tree compression algorithms in C++ and then translate to Rust as that’s my language of choice.

On Tuesday I started thinking that it may end being far more expensive to run tree compression than just having a bigger tree Also want to ensure I don’t reduce its information. Thinking about some ideas and sketched some of the possible techniques on my blackboard. Really was just brainstorming.

On Thursday I continued further thinking about compressing suffix trees, decided I’m going to work backwards in complexity, so first I’m looking at nodes with suffix links that have only leaves as children. I wrote a python file p_analysis_of_duplicate_walks.py under the tools section to calculate the bounds of duplicate nodes. Not sure if it’s helpful yet. Might manipulate it for different distributions of data. I think I spent way too much time on this, whoops oh well.

At the Friday meeting I helped the group cover a review of Ukkonen's Algorithm with an example of S = {abcabczcx} as the tree it constructs serves as a good counter example to the theory I had that suffix links representing a universal relationship between nodes of identical paths.

The visualizer I like: Visualization of Ukkonen's Algorithm

And the thesis I've used for review:

Implementation of Suffix Trees Using Two Linear Time Algorithms and Their Application To String Matching by Tongue Li

5/19 (4hr)

5/21 (3hr)

5/23 (6hr)

Week 7

My goal this week is to put 12 hours into Smarty Plants (Outside of meetings/pair programming)

Helped Gavin check that the reference C++ code that we’re working with is correct. It appears that it does not handle suffix links properly.

I wrote a test for checking if root has an alphabet sized number of children

Over the weekend talked to Gavin as the suffix_trees.rs seems to construct a proper suffix tree, but it’s running into memory issue so I'm thinking about how we can reduce memory possibly by removing redundant paths via a recursive function? We were also talking about other ideas to reduce memory or at least move it over to time length.

On Monday Worked on unit test for like 4 hours for checking if root has an alphabet number of children Spent an hour before and after just going over what the unit tests should be.

Started writing up some notes on memory, spent some time Friday working on Rust by example

5/13 (4hr)

5/14 (4hr)

5/17 (4hr)

Week 6

My goal this week is to put 12 hours into Smarty Plants (Outside of meetings/pair programming)

At this point the studying has paid off and my understanding for Ukkonen's is pretty solid, I spent more time doing some by hand, most of the work was towards the end of the week once the constructor was up and running. I met with Gavin on Thursday to go over the results. We noticed an off by one error that we weren't sure the solution was for, after some analysis I found at home that we needed to consider subtracting our ends for internal nodes by 1 to account for overlap. That seemed to generate the right tree for S = 'XABXAC' but we would later find out that it fail in other tests but I left that to next week. My study with Ukkonen's helped to graph them by hand and check for validity quick but this has been a useful tool for generating them quickly:

https://brenden.github.io/ukkonen-animation/

Just make sure though that you account for having no termination character as ours does.

Over the weekend I kept going through the Rust by example but admittedly it's been a challenge learning Rust, I think it's just so syntactically verbose and I'm pretty much finding what's relevant to the codebase and learning those parts as I go.

5/6 (2hr)

5/8 (2hr)

5/10 (4hr)

5/11 (3hr)

5/12 (2hr)

Week 5

What I want to do: Continue in helping the team understand and implement Ukkonen's Algorithm for the Suffix Tree Constructor.

My goal this week is to put 12 hours into Smarty Plants (Outside of meetings/pair programming)

I spent most of my time practicing construction suffix trees by hand with a special focus on UKkonen’s the three rules of extensions.

My main reference has been the Masters thesis I found a bit ago, “Implementation of Suffix Trees Using Two Linear Time Algorithms and Their Application to String Matching” by Tongyu Li as I’m slowly understanding more and more of Ukkonen’s original paper.

Thesis: https://shareok.org/bitstream/handle/11244/11877/Thesis-1999-L6935i.pdf?sequence=1

The best video I found was here: watch

I met up with Dominic, Gavin and Cassidy on Friday to pair program walk_down for suffix_tree.rs we’ve made a significant amount of progress.

Went through a few relevant parts of Rust by example

5/29 (3hr)

5/1 (2hr)

5/2 (2hr)

5/3 (5hr)

Week 4

What I want to do: Help the team with understanding and implementing Ukkonen's Algorithm for the Suffix Tree Constructor.

My goal this week is to put 12 hours into Smarty Plants (Outside of meetings/pair programming)

I spent most of my time practicing construction suffix trees by hand with a focusing on some general concepts with graph theory, and what a correct suffix tree result looks like.

Thesis: https://shareok.org/bitstream/handle/11244/11877/Thesis-1999-L6935i.pdf?sequence=1

The best video I found was here: watch

Also went through a few relevant parts of Rust by example

4/22 (4hr)

4/23 (1hr)

4/24 (3hr)

4/25 (4hr)

Week 3

What I want to do: Since we're moving to read-alignment it seems, I'll look further into long read alignment algorithms and help with figuring that out. I think the immediate next step specifically is working on getting a suffix tree constructor programmed so we can have an idea of how it'll implement with our read alignment algorithm.

My goal this week is to put 12 hours into Smarty Plants (Outside of meetings/pair programming)

4/18 (8hr)

Started working at like 9:30 and stopping as I write this at 5:30. Went to the library and saw Dominic and Gavin and we discussed the papers for Ukkonen's further.

4/17 (5hr)

Spent about 2 hours in the morning then 3 in the afternoon reading the thesis linked below. Also just working to understand both. I think I have a decent understanding of both so far so now just looking at edge case differences between Ukkonen's and McCreight's and seeing if there's any big plus or con to using either.

Implementation of Suffix Trees Using Two Linear Time Algorithms And Their Application To String Matching

4/16 (4hr)

Had a good meeting today, I think the plan is to meetup soon (perhaps before Friday?) And we can all pair program with myself focusing on explaining and helping implement the algorithm the best I can.

Got home from a long day and studied the algorithm further (Ukkonen's), I think it's slowly making more sense.

I'm mainly using this as my reference in case anyone else is interested. Ukkonen's Suffix Tree Algorithm

4/15 (4hr)

Spent a lot of time considering our options with Suffix Trees specifically. My understanding so far is that in read alignment the suffix tree will only need to be constructed for the reference genome, given its length is often very large this is where the indexing becomes an important tool. I believe this means also that it’s probably best for us to get a working “Tree constructor” up and running ASAP in Rust so we can know exactly how we’ll use it with our chosen read alignment algorithm.

The most common algorithms seem to be Ukkonen’s and McCreights. The former seems to offer more efficient construction. So, I lean towards that but it’s also substantially more complicated. Additionally I would choose to go with Ukkonen’s as there’s a far larger collection of resources for understanding it compared to McCreight’s. I can spend most of my time this week on learning/implementing this with Dom so that by Friday or in the next week we can have a working suffix tree constructor?

4/14 (1hr)

Spent about an hour today reading specifically on suffix trees and posted some questions and comments in the discord along with resources.

Week 2

What I want to do: I want to keep studying Rust as I have no background in it. I think I've become the designated maths person, so I want to spend a lot time ensuring I have a solid understanding of De Bruijn graphs to help the team.

My goal this week is to put 12 hours into Smarty Plants.

4/5 (3hr)

I spent 3 hours reading up on De Bruijn Graphs as that seems to be what the paper suggests, and sent some ideas to Cassidy and Gavin. This will take some time to implement.

4/6 (1hr)

Spent an hour reading the Smarty Plants repo and responding to an issue.

4/7 (1hr)

Spent an hour studying Rust

4/8 (4hr)

Spent four hours studying De Bruijn Graphs and Rust.

4/10 (4hr)

Spent 2 hours this morning, first looking over graph.rs and what we have so far. I also noted in the discord that we should make sure we can record adjacency matrix for digraphs properly which led to a question regarding if we need to have knowledge of previous nodes. I believe since it's a digraph we only need knowledge of the next node at least for construction. I sent a video in the discord.

I also reorganized the resources page and added headers for ease of use when navigating. I also added some resources I think the group will find helpful.

Spent 2 additional hours this afternoon continuing my studying and work.

4/11 (1hr)

Spent an hour this morning looking into what might be needed from the reads so like I/O. Realized that the FATSFA file has a read id with read={id}. Posted a message in the discord.

4/12 (2hr)

I think we had a good meeting this morning.

Also spent about two hours this afternoon reading about read alignment and suffix trees. I also posed a question to the discord regarding our read lengths and had a small conversation with Dom and Gavin about that. It seems in using read alignment that certain algorithms work better with certain read lengths but I think given our goal to build a more general program it’s perhaps best to just be indifferent towards read lengths and have for now an algorithm that’s decent for either short or long reads so I’ll look more into that this weekend and as a group as we can decide the best path forward with the algorithm.