Paul‐Pham‐Smarty‐Plants‐Dev‐Diary - TheEvergreenStateCollege/bioinformatics GitHub Wiki
Paul Pham Smarty Plants Dev Diary
2024-07-02
I wrote up PR #27 to merge Gavin's latest work on the read_alignment branch into main, and was able to load and run the program to process up to about 73 million characters before the operating system killed it.
Writing up some Smarty Plants notes on adding the Prisma Rust client to write into Postgres with Docker.
2024-06-07
I spent about 4 hours in the past week thinking about Smarty Plants.
Took meeting notes
I talked with Dom before the demo day about presenting an overview of the team's work.
I'm reading through Pull Request 23 to understand the new find_substring tests.
The new test is a meta-test that scans through all possible substrings of a given string and exhaustively checks the (start, end) indices returned by our suffix tree.
I'd like to print out a few of these indices to make sure they are not matching trivially, but this gives me more confidence in the suffix tree.
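A minimal sketch of what such an exhaustive meta-test could look like. This is not the code from the pull request: the find_substring below is a hypothetical stand-in built on the standard library's str::find rather than our suffix tree, and it assumes ASCII input and an exclusive end index, neither of which I have confirmed against the real implementation.

```rust
/// Hypothetical stand-in for the suffix tree lookup (assumption, not the PR's code).
/// Returns the (start, end) byte range of the first occurrence of `pattern`
/// in `text`, with `end` exclusive.
fn find_substring(text: &str, pattern: &str) -> Option<(usize, usize)> {
    text.find(pattern).map(|start| (start, start + pattern.len()))
}

/// Checks every non-empty substring of `text` and returns how many were checked.
/// Assumes ASCII text so that byte offsets are valid slice boundaries.
fn check_all_substrings(text: &str) -> usize {
    let mut checked = 0;
    for start in 0..text.len() {
        for end in (start + 1)..=text.len() {
            let pattern = &text[start..end];
            let (s, e) = find_substring(text, pattern)
                .expect("every substring of the text must be found");
            // Guard against trivial matches: the returned range must be
            // non-empty and must spell out exactly the pattern we asked for.
            assert!(e > s);
            assert_eq!(&text[s..e], pattern);
            checked += 1;
        }
    }
    checked
}
```

For a 6-character string such as "banana" this checks 6 * 7 / 2 = 21 substrings. Swapping the stand-in for the real suffix tree lookup would turn this into the kind of check described above.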
2024-05-31
This past week I spent about 3 hours total thinking about smarty plants and looking at pull requests.
I worked with Dom for about 1 hour last Friday, making him late for OS class, on converting our edit distance functionality.
Started the edit distance with a dynamic programming grid in PR #15.
I looked at and merged PR #17 from Cassidy, which improves our .gitignore.
I also spent an hour trying to understand suffix links in Ukkonen's algorithm for suffix tree construction, and a big PR #14, which I'd like to spend time on today so we can request changes and merge it.
2024-05-24
This past week I spent about 2 hours thinking about smarty plants and looking at some pull requests:
- #12
  - Worked with @poperigby
- #11 Unit test changes
  - made unit tests more modular, added a suffix tree test we expected to pass for a particular string
I'm interested to learn more about the code for suffix trees added in the past week.
Reviewed the brute force edit_distance algorithm from Skiena and read some slides on how to improve it with dynamic programming.
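As a reference point, here is a minimal sketch of the dynamic programming version, assuming unit costs for insertion, deletion, and substitution (plain Levenshtein distance). The function name and signature are my own for illustration and may not match what is in the repo.

```rust
/// Levenshtein edit distance between `a` and `b` using a full DP grid.
/// grid[i][j] holds the distance between the first i chars of `a`
/// and the first j chars of `b`.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut grid = vec![vec![0usize; b.len() + 1]; a.len() + 1];

    // Turning a prefix into the empty string costs one deletion per char,
    // and building a prefix from nothing costs one insertion per char.
    for i in 0..=a.len() {
        grid[i][0] = i;
    }
    for j in 0..=b.len() {
        grid[0][j] = j;
    }

    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let substitution = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            grid[i][j] = (grid[i - 1][j - 1] + substitution) // match or substitute
                .min(grid[i - 1][j] + 1) // delete from a
                .min(grid[i][j - 1] + 1); // insert into a
        }
    }
    grid[a.len()][b.len()]
}
```

The brute force recursion recomputes the same prefix pairs many times; the grid computes each pair once, so the running time is proportional to the product of the two string lengths.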
More to come soon.
2024-05-17
This past week I spent about 1 hour thinking about smarty plants and merging a pull request.
I read about a state grant from Richard and others that might continue funding for Smarty Plants past this summer. I would also like to offer it to Catherine Kehl, a new faculty member starting next year who has bioinformatics interest and experience.
I would like to spend 4 hours this week.
2024-05-10
Last week, I spent about 1 hour thinking about Smarty Plants, mostly about the code organization.
I'll propose the team create pull requests and protect the main branch, so that every change gets at least one code review, reviewers can ask for unit / integration tests that demonstrate the intended correct operation of the code, and understanding of the code spreads across the team.
I've also been wanting to use git submodules to organize the upper-division-cs monorepo.
It seems the team has been meeting on Wednesdays as well, which seems beneficial. I'll propose scheduling another space or Zoom call for Wednesdays.
This week, I'd like to work on and get feedback about:
- combining edit_distance with the suffix tree or any string trees that the rest of the team is working on
- writing some unit tests for what approximate string matching would look like
- writing some unit tests for the graph to use with edit_distance
- having individual meetings with team members to better understand how we're thinking about progress
- using Criterion for performance graphs
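To make the approximate string matching item more concrete, here is a rough sketch of one possible definition: the smallest edit distance between a pattern and any substring of a text. It uses the standard edit distance grid with a free starting position in the text, one column at a time. The name best_approximate_match is hypothetical, and this is only one of several definitions the team could choose.

```rust
/// Smallest edit distance between `pattern` and any substring of `text`.
/// Same grid as ordinary edit distance, except that a match may start at
/// any text position for free (row 0 is all zeros) and may end anywhere
/// (we take the minimum over the last row).
fn best_approximate_match(text: &str, pattern: &str) -> usize {
    let p: Vec<char> = pattern.chars().collect();

    // Column for the empty text prefix: matching i pattern chars against
    // nothing costs i.
    let mut prev: Vec<usize> = (0..=p.len()).collect();
    let mut best = prev[p.len()];

    for tc in text.chars() {
        // curr[0] stays 0: a match is allowed to start after this text char.
        let mut curr = vec![0usize; p.len() + 1];
        for i in 1..=p.len() {
            let substitution = if p[i - 1] == tc { 0 } else { 1 };
            curr[i] = (prev[i - 1] + substitution) // match or substitute
                .min(prev[i] + 1) // extra char in the text
                .min(curr[i - 1] + 1); // extra char in the pattern
        }
        best = best.min(curr[p.len()]);
        prev = curr;
    }
    best
}
```

A unit test could then assert, for example, that searching "hello world" for "wxrld" gives 1 (one substitution against "world") and that any exact substring gives 0.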
2024-05-03
Traveling today, so I won't be able to make the regular meeting.
I read through some of the code. It looks like transcriptomes encode some methods that are useful for operating on the large whole-genome dataset from NCBI for the M. pudica plant, or per species.
Tried to understand the new walk_down method of Node and how the CLI tool works.
Fixed some syntax errors in PR #7.
2024-04-19
I spent about one hour this week on mostly administrative tasks and some reading through the Discord chat and repo.
Team members have been added to a Science Safety Canvas course. There is an online Canvas quiz part, and an in-person class given by Jenna Nelson that I'm proposing we schedule for a Tuesday morning or Friday morning in Week 9 or 10.
CAL West to give the group a space option for Tuesday morning meetings.
I'd like to spend 4 hours this week, between the morning of Friday the 19th and the morning of Friday the 26th, doing:
- reading through the group notes and dev diaries
- having short meetings with team members
- drawing an architecture diagram showing how the different parts of the project work together, and who is working on what
- writing some unit tests similar to the graph-demo project in the repo to better understand the abstractions given
Watched this video on suffix trees posted in Cassidy's Dev Diary. My leading question going in is, why not prefix trees ("tries")? It seems like any approach to finding a longest common substring (which could be used for either de novo assembly or read alignment) could be done with either suffix or prefix trees, but perhaps suffix is faster.
Also, can it handle errors? For example, take these two strings:
abcababac
cbabacaca
If we require a perfect match with 0 errors, the longest common substring (LCS) is babac with length 5.
If we allow an LCS with at most 1 error, the LCS is ?babac with length 6, where ? is a wildcard character. In the first string it is ababac and in the second string it is cbabac.
The error could be a transcription or other error introduced to our reads.
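To check the example by hand, here is a brute-force sketch of a longest common substring search that tolerates up to k mismatched positions (substitutions only, no insertions or deletions). It does not use a suffix tree, and the function name is hypothetical; it only pins down what answer a tree-based version would need to reproduce. It assumes ASCII input.

```rust
/// Longest common substring of `a` and `b` allowing at most `k` mismatched
/// positions. Returns (start_in_a, start_in_b, length) of the first longest
/// alignment found. Brute force: tries every pair of start offsets.
fn longest_common_substring_k_mismatches(a: &str, b: &str, k: usize) -> (usize, usize, usize) {
    let a = a.as_bytes();
    let b = b.as_bytes();
    let mut best = (0, 0, 0);

    for i in 0..a.len() {
        for j in 0..b.len() {
            let mut mismatches = 0;
            let mut len = 0;
            // Extend the alignment until either string ends or the
            // mismatch budget would be exceeded.
            while i + len < a.len() && j + len < b.len() {
                if a[i + len] != b[j + len] {
                    if mismatches == k {
                        break;
                    }
                    mismatches += 1;
                }
                len += 1;
            }
            if len > best.2 {
                best = (i, j, len);
            }
        }
    }
    best
}
```

On the two strings above, k = 0 gives length 5 (babac) and k = 1 gives length 6, aligning ababac in the first string with cbabac in the second.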
2024-04-08
Worked on getting all team members added back to the Science Safety Canvas course and keys issued for the Nancy Murray lab.
2024-04-06
Uploaded Kelsea Jewell's short talk on sequencing software, and tasks within the genome sequencing pipeline.
2023-11-03 mRNA Sequencing with Kelsea Jewell https://youtu.be/xTQK6NP1ZZE
2024-04-05
My goal is to spend four hours on Smarty Plants this week, working on Rust interfaces between the submodules that the team determines and helping them start writing code.
Today, we had a Smarty Plants meeting in CAL with the following members in attendance:
- Dee Dee
- Dominic
- Rain
- Taylor (on Discord)
- Ellie
- Cassidy
We talked about creating issues to keep track of questions and concerns about the project. My first question, which is not necessary to answer until later, is
https://github.com/TheEvergreenStateCollege/smarty-plants/issues/1