2024‐04‐12 Meeting Notes - TheEvergreenStateCollege/bioinformatics GitHub Wiki

2024-04-12 | 10:00 AM - 12:00 PM

ATTENDING

  • Dee Dee
  • Dom
  • Cassidy
  • Ellie
  • Gavin (absent)
  • Rain
  • Taylor
  • Paul

AGENDA

  1. Overview of dev diaries
    • formed questions below...
  2. trying to answer those questions?

Last Meeting Follow-Up

New Business

NOTES

Questions formed:

  • General management
    • What are our goals?
      • what, EXACTLY are we doing?...
    • How does our graph work thus far?
  • Rust
    • Rust: how to write it?
    • File Hierarchy for rust. What does cargo want? (also an organization question)
    • unit tests, how does rust do that?
  • Graph
    • What does our graph need to become? (a de Bruijn graph)
    • What is a de Bruijn graph?
      • is it weighted?
      • What is a k-mer?
  • Bio
    • What even is read alignment?
    • We care about read ids, right? (probably, for error correction)
      • should check if they are unique between files.
    • DNA RNA, what is it good for. (absolutely something)
  • String Operations
    • Why suffix trees? (because fast)
    • split strings but rust?
  • Misc
    • genomic glossary? (what even is that)

ACTION ITEMS

  1. Paul

    • present our understanding of read alignment versus sequencing to Nancy Murray for feedback
    • share biologist feedback for our previous questions
    • invite to a future meeting
  2. Read Alignment

    • Rain, Taylor, Ellie, Dee Dee
    • explore different approaches
  3. I/O

    • Gavin, Cassidy
    • how are we reading them, what detail are we leaving in and out?
  4. Suffix Trees

    • Dom

NEXT MEETING’S AGENDA

  • Give a short presentation on the Topics

Questions Collected Today

Cultural

2. What are our community goals, values, and expectations?

Is our community us as developers, biology researchers, or students in DSA next year?

As developers (not in order of importance):

  • Have fun
  • Make it work
  • Get a working code prototype.
  • Foster a healthy learning environment together, where people are not shamed or embarrassed for not knowing something
    • where people feel okay that they haven't learned something yet
  • Engage more deeply in bioinformatics because it is interesting scientifically.
  • Mutually uplifting environment where we are all learning and teaching each other, respectfully.
  • Able to hear each other's voices on a level playing field
  • Learn and practice Rust
  • Architect large, complex, interesting systems
  • Motivate and create some compelling DSA assignments for future students
  • Serve cutting-edge biology research without being a biologist
  • More collaboration between CS and other sciences at Evergreen for solidarity

Rust

++1) What are suffix trees, and how can we use them?

They are good for finding patterns, especially in DNA. How are we doing this for DNA processing? (i.e. How do they interact with the graphs that we are considering / building?)
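As a starting point for the pattern-finding question, here is a minimal sketch that uses a suffix array, a simpler relative of the suffix tree, rather than a suffix tree itself. The function names `suffix_array` and `find_all` are made up for this example, and the construction is the naive sort-all-suffixes version, so it is only suitable for short test strings.

```rust
/// Build a suffix array: the start positions of all suffixes of `text`,
/// sorted by the suffix each position points to.
/// Naive construction (sorts whole suffixes), fine for small inputs only.
fn suffix_array(text: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    let mut sa: Vec<usize> = (0..bytes.len()).collect();
    sa.sort_by(|&a, &b| bytes[a..].cmp(&bytes[b..]));
    sa
}

/// Return every start position of `pattern` in `text`, in ascending order,
/// using binary search over the suffix array `sa`.
fn find_all(text: &str, sa: &[usize], pattern: &str) -> Vec<usize> {
    let t = text.as_bytes();
    let p = pattern.as_bytes();
    // Index of the first suffix that is >= pattern in sorted order.
    let lo = sa.partition_point(|&i| &t[i..] < p);
    // From there on, the suffixes that start with the pattern are contiguous.
    let len = sa[lo..].partition_point(|&i| t[i..].starts_with(p));
    let mut hits = sa[lo..lo + len].to_vec();
    hits.sort_unstable();
    hits
}
```

For example, with the text `GATTACATTAG`, searching for `TTA` finds positions 2 and 7. A real suffix tree answers the same query and can be built in linear time, but the array version is easier to get right as a first test oracle.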

++3) In graph.rs, where do we want to go with read alignment?

Currently it increases by 1000. Can / should we make it dynamic?

++11) What should our file hierarchy look like? What is the hierarchy that Cargo wants (for modules)? How do we make use of graph.rs? (By writing tests, to start with)
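A rough sketch of the usual Cargo convention: the crate root is `src/lib.rs` or `src/main.rs`, a line `mod graph;` there pulls in `src/graph.rs`, and unit tests sit in a `#[cfg(test)] mod tests` block inside the same file and run with `cargo test`. The `Graph` type and its methods below are hypothetical stand-ins, not the actual contents of our graph.rs, and the module is written inline only so the example is self-contained.

```rust
// Stand-in for the contents of src/graph.rs. In a real crate this would be a
// separate file, declared from src/lib.rs or src/main.rs with `mod graph;`.
mod graph {
    /// Hypothetical minimal graph: node labels plus directed edges by index.
    #[derive(Default)]
    pub struct Graph {
        labels: Vec<String>,
        edges: Vec<(usize, usize)>,
    }

    impl Graph {
        pub fn new() -> Self {
            Self::default()
        }

        /// Add a node and return its index.
        pub fn add_node(&mut self, label: &str) -> usize {
            self.labels.push(label.to_string());
            self.labels.len() - 1
        }

        pub fn add_edge(&mut self, from: usize, to: usize) {
            self.edges.push((from, to));
        }

        pub fn node_count(&self) -> usize {
            self.labels.len()
        }

        pub fn edge_count(&self) -> usize {
            self.edges.len()
        }
    }

    // Unit tests live next to the code they test and only compile under
    // `cargo test`.
    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn adding_nodes_and_edges_updates_counts() {
            let mut g = Graph::new();
            let a = g.add_node("ACG");
            let b = g.add_node("CGT");
            g.add_edge(a, b);
            assert_eq!(g.node_count(), 2);
            assert_eq!(g.edge_count(), 1);
        }
    }
}
```

Tests that exercise the crate from the outside (integration tests) conventionally go in a top-level `tests/` directory instead.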

++7) I am still learning Rust. How do I get good at Rust?

Genomics

+4) What is read alignment? How is it different than sequencing?

Can we do read alignment inside of sequencing, and what does that mean?

Gavin: Read alignment is a special case of sequencing, which is a more general category of tasks. (From scratch producing a consistent genome is called de novo assembly)

Read alignment is good for RNA, because RNA reads may not be contiguous. E.g. if we try to create a consistent sequence, it may be impossible if the RNA comes from non-overlapping parts of the DNA.

Cassidy: If they don't overlap, isn't de novo assembly impossible?

Taylor: Do we already need a complete genome (DNA) to do read alignment?

Gavin: Some sections of DNA have introns, sections that get skipped over. (Other things I missed). We have to take into account these things for DNA sequencing, that we don't for RNA read alignment (?)

There are many methods for read alignment.

We could download other read alignment algorithms to check our algorithm against.

In practice, approximate matching is used.

Proposal: We focus on read alignment.

De novo assembly from our RNA fragments, which don't result in a single consistent sequence, is the same as read alignment, where the resulting sequences have "gaps" that correspond to where the DNA is not covered.

Concern: error correction and de Bruijn graphs are more complex.

+6) Originally, when we read data in, one thought was to make it one big string with no denotation. FASTA files have many fragments, each with its own unique read ID. Is this useful to keep track of, for error-checking at the end, after file I/O?
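One way to keep read IDs instead of flattening everything into one string is sketched below. It assumes the plain FASTA convention (a header line starting with `>`, followed by one or more sequence lines) and treats the first whitespace-delimited token of the header as the ID. `Record`, `parse_fasta`, and `duplicate_ids` are names invented for this example. It also shows the string splitting involved (`lines`, `strip_prefix`, `split_whitespace`).

```rust
use std::collections::HashSet;

/// One FASTA record: the ID from the header line and the joined sequence.
#[derive(Debug, PartialEq)]
struct Record {
    id: String,
    seq: String,
}

/// Parse FASTA text into records, keeping each read ID with its sequence.
/// Assumption: the ID is the first whitespace-delimited token after '>'.
fn parse_fasta(text: &str) -> Vec<Record> {
    let mut records: Vec<Record> = Vec::new();
    for line in text.lines() {
        let line = line.trim();
        if let Some(header) = line.strip_prefix('>') {
            let id = header.split_whitespace().next().unwrap_or("").to_string();
            records.push(Record { id, seq: String::new() });
        } else if let Some(current) = records.last_mut() {
            // Sequence lines may wrap; append them to the current record.
            current.seq.push_str(line);
        }
        // Lines before the first header are ignored.
    }
    records
}

/// Return the IDs that appear more than once, each listed once.
fn duplicate_ids(records: &[Record]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut dups = Vec::new();
    for r in records {
        if !seen.insert(r.id.as_str()) && !dups.contains(&r.id) {
            dups.push(r.id.clone());
        }
    }
    dups
}
```

To check whether IDs are unique between files, the records from each file could be collected into one vector before calling `duplicate_ids`.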

+8) Is the graph weighted?

+9) Can we do a walk-through of the Pevzner paper (with Paul and Taylor et al.)? What are the expectations of this paper?

Should we be implementing it exactly, or to what extent? To the extent that it's useful for an end-to-end test: split up a 200 base-pair read, then reconstruct it.

+13) What are de Bruijn graphs? How are they useful for our work? Is each node supposed to have a specific number of edges, and if so, how do we guarantee that, or count on it?
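To make the question concrete, here is a small sketch of one common way to build a de Bruijn graph from reads: every k-mer becomes a directed edge from its first k-1 bases to its last k-1 bases. Storing a count per edge is one possible answer to the "is it weighted?" question, not something we have decided on. The names `DeBruijn`, `build_de_bruijn`, and `out_degree` are made up for this example.

```rust
use std::collections::BTreeMap;

/// Edge-weighted de Bruijn graph. Each key is (prefix, suffix), the two
/// (k-1)-mers of one k-mer; the value counts how often that k-mer occurred.
type DeBruijn = BTreeMap<(String, String), usize>;

/// Build the graph from all k-mers of all reads. Assumes ASCII bases.
fn build_de_bruijn(reads: &[&str], k: usize) -> DeBruijn {
    let mut graph = DeBruijn::new();
    if k < 2 {
        return graph; // nodes are (k-1)-mers, so k must be at least 2
    }
    for read in reads {
        // `windows(k)` yields every overlapping run of k bytes.
        for kmer in read.as_bytes().windows(k) {
            let prefix = String::from_utf8_lossy(&kmer[..k - 1]).into_owned();
            let suffix = String::from_utf8_lossy(&kmer[1..]).into_owned();
            *graph.entry((prefix, suffix)).or_insert(0) += 1;
        }
    }
    graph
}

/// Number of distinct outgoing edges from `node`.
fn out_degree(graph: &DeBruijn, node: &str) -> usize {
    graph.keys().filter(|(from, _)| from.as_str() == node).count()
}
```

With a four-letter alphabet a node can have at most four distinct outgoing edges (one per possible next base), but nothing guarantees a fixed number: how many edges a node actually has depends on the data.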

  1. What exactly to do?

  2. We're pretty sure we have RNA data. We think it's a single strand, instead of double, like DNA. Can we get clean data from somewhere?

Why does it matter whether we have DNA or RNA?

  1. (unit) Testing. How should we add unit tests?

  2. Can we have a genomics glossary? What are some unfamiliar terms that would help us to define (together)?

  3. What is a k-mer?

  4. How do we split strings in Rust?

  5. How can the serde crate be useful to us?

  6. Are we using adjacency (linked) lists, or matrices in graph.rs? Can we use linked lists of indices, rather than direct Rust references, to get around borrowing / ownership difficulties?
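The index idea from the last question could look roughly like this: nodes live in one `Vec`, and each adjacency list holds plain `usize` positions into it, so no node ever borrows another. `IndexGraph` and its methods are hypothetical names for illustration and are not taken from graph.rs.

```rust
/// Graph stored as adjacency lists of node indices. Edges refer to nodes by
/// position, so cycles and self-loops need no references or lifetimes.
struct IndexGraph {
    labels: Vec<String>,
    neighbors: Vec<Vec<usize>>,
}

impl IndexGraph {
    fn new() -> Self {
        IndexGraph { labels: Vec::new(), neighbors: Vec::new() }
    }

    /// Add a node and return its index, which callers use as a handle.
    fn add_node(&mut self, label: &str) -> usize {
        self.labels.push(label.to_string());
        self.neighbors.push(Vec::new());
        self.labels.len() - 1
    }

    /// Add a directed edge; panics if either index is unknown.
    fn add_edge(&mut self, from: usize, to: usize) {
        assert!(
            from < self.labels.len() && to < self.labels.len(),
            "unknown node index"
        );
        self.neighbors[from].push(to);
    }

    /// Labels of the nodes reachable in one step from `node`.
    fn successors(&self, node: usize) -> Vec<&str> {
        self.neighbors[node]
            .iter()
            .map(|&i| self.labels[i].as_str())
            .collect()
    }
}
```

The trade-off is that an index is only meaningful for the graph that issued it, and removing nodes would invalidate later indices unless slots are reused carefully.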

Table for Long-Term Thinking

  1. Can we test our algorithm against something well-known and studied, e.g. E. coli DNA?

An End-to-End Test we can already do.

Take one fragment F from a FASTA file, 200 base pairs long.

Split it into overlapping 10-mers (200 - 10 + 1 = 191 k-mers) and see whether our algorithm gives us back the original fragment.
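A minimal sketch of this test's scaffolding, assuming ASCII bases. `kmers` and `rebuild_in_order` are names invented here, and `rebuild_in_order` is only a placeholder for our algorithm: it relies on the k-mers still being in their original order, which a real assembler cannot assume.

```rust
/// All overlapping k-mers of `seq`, in order. A sequence of length n yields
/// n - k + 1 of them (none if k is 0 or k > n). Assumes ASCII bases.
fn kmers(seq: &str, k: usize) -> Vec<&str> {
    if k == 0 || k > seq.len() {
        return Vec::new();
    }
    (0..=seq.len() - k).map(|i| &seq[i..i + k]).collect()
}

/// Rebuild a sequence from k-mers that are already in their original order:
/// keep the first k-mer whole, then append the last base of each later one.
fn rebuild_in_order(kmers: &[&str]) -> String {
    let mut out = String::new();
    for (i, kmer) in kmers.iter().enumerate() {
        if i == 0 {
            out.push_str(kmer);
        } else if let Some(last) = kmer.chars().last() {
            out.push(last);
        }
    }
    out
}
```

For a 200-base fragment and k = 10, `kmers` returns 191 slices, and the round trip through `rebuild_in_order` should give back the fragment. Swapping the placeholder for the real graph-based reconstruction, with the k-mers shuffled first, would turn this into the actual end-to-end test.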