2024‐07‐16 Meeting Notes - TheEvergreenStateCollege/bioinformatics GitHub Wiki

2024-07-16 Tuesday Meeting

Dominic and Paul

Revisit the suffix tree and suffix link compression idea from Taylor and Paul's discussion last time.

From the notes, it still looks possible.

We weren't able to to find a counterexample last time.

Another idea for suffix tree compression

  • Construct in memory as far as possible
  • Monitor free system memory
  • When it is low, begin selecting something to write into db
    • What is this something? Suffix linked subtrees?

Updating a suffix tree in DB appear expensive, which is the task as stated in #31

If we separate an "in-memory" and a "in-database" portions of the tree, how do we demarcate the two?

Updating the in-memory tree, we've already been doing.

Can we defer operations to "in-database" part, and apply them in batch if we ever need to query this subtree again in the future?

Like journaled filesystems, ext2, ext3, ext4 .

Grapevine Hypothesis

This assumes that the "Grapevine Hypothesis" is true.

If we prune a sub-vine off the parent vine, and then continue growing it, when we graft it back to the parent vine, the entire combined vine will look exactly as if the sub-vine had never been pruned. This is true if the sub-vine doesn't need any information from "higher up" the vine, its parent or siblings, to decide how to grow.

Pseudo Garbage Collection

When we choose a portion of the suffix tree to offload, we need to mark it as "no longer needing to be updated", and possible its memory as "available for re-use". This is similar to garbage collection.

What data structure representation would make it easy to "deallocate in chunks" like this?

Maybe deque's from C++ (array of arrays), to be able deallocate middle chunks, or store un-contiguous memory regions that play nicely with Rust's memory allocation.

A "sparse double array."

image This sketch shows (or tries to) this idea of being able to eliminate segments of data by adding a second layer of indexing. As nodes know there children by index. This second layer allows us to evict children (or entire data portions) but leave the slot in the first layer so they can be loaded back in at a later time.

Created GitHub Issues

We have a new GitHub Project for issue tracking.

We decomposed the following tasks from the larger project