Meetings minutes - Atemia/Lab-book GitHub Wiki

Monday, July 24th

Meeting of @Joseph Atemia and @Asis today; minutes:

Quotations are almost done; please make sure we have quotations that are higher than those of the companies we want to order from
- We'll talk to Alicia first to decide upon which routes to take
Heatmap plot:
- Increase line size of dendrograms to at least 2 pixels
- Increase size (spread) of row dendrogram, so that it stays visible even close to the leaves
- In short: Make the dendrograms bigger in comparison with the matrix itself!
- Back to the beginning:
  - Results:
    - Most of the traits (climatic conditions and phenotypic traits) have only a few protein function categories they are significantly associated with.
    - This means that the protein function by trait matrix is very sparse (most cells are zero).
    - We want to have a single plot that informs the viewer / reader about which traits are associated with which protein function categories!
    - We also want to inform about the relatedness of traits among each other (column dendrogram) and of function categories (row dendrogram). So those two are at least as important as the matrix itself.
    - Additionally, we want to include the information about how many proteins are significantly associated with the respective trait and are also annotated with the respective function category.
    - Bear in mind that this information is secondary! The number of proteins we find close to a SNP depends strongly on the analysis, and the annotation frequency, i.e. the number of proteins annotated with a certain category, varies considerably between categories. Thus, the counts are not comparable between function categories and are therefore not that informative.
  - Mode of visualization:
    - Use the non Z-transformed numbers to represent the sparsity!
    - Use a color-scheme that represents the zeros with white and the higher numbers with red
      - So, the sparsity will be visualized by the color.
      - You might need to draw lines around the cells so that rows and columns stay identifiable.
      - An alternative might be to start with a light grey or light yellow and grade into red. But I believe white is better and more intuitive.
    - Alternative (maybe quicker) "minimal example": Do not use colors for the cells. Draw the dendrograms and put numbers in the cells. Do not plot zeros; only print 1, 2, .... This is because what we are actually interested in is more of a boolean matrix: which function categories are associated with which traits. In theory, we could just as well fill the matrix cells with checkmarks, indicating e.g. that annual temperature has SNPs (proteins) whose function falls in the "transcription regulation" category.
    - All cells that have non-zero values should have a light red background. Start with the minimal example and add the colors in the second iteration. Put a time limit of two full working days on this issue.
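To make the "minimal example" concrete, here is a rough sketch in plain Python of a text version of the matrix: zeros are left blank and only non-zero counts (or marks, for the boolean view) are printed. The function names, the counts and most labels are made up for illustration; only "annual temperature" and "transcription regulation" come from the notes above. Dendrograms and colors are not covered here; they would come from the plotting library in the real figure.

```python
# Hypothetical helpers, not part of any existing analysis code.

def render_minimal_matrix(counts, row_labels, col_labels, boolean=False, mark="x"):
    """Render a count matrix as text, leaving zero cells blank.

    With boolean=True every non-zero cell is shown as `mark` instead of
    its count (the "checkmark" view of the matrix).
    """
    def cell(value):
        if value == 0:
            return ""
        return mark if boolean else str(value)

    cells = [[cell(v) for v in row] for row in counts]
    label_width = max(len(label) for label in row_labels)
    col_widths = [
        max(len(col_labels[j]), max(len(row[j]) for row in cells))
        for j in range(len(col_labels))
    ]
    header = " " * label_width + " | " + " | ".join(
        col_labels[j].ljust(col_widths[j]) for j in range(len(col_labels))
    )
    lines = [header]
    for label, row in zip(row_labels, cells):
        lines.append(label.ljust(label_width) + " | " + " | ".join(
            row[j].ljust(col_widths[j]) for j in range(len(col_labels))
        ))
    return "\n".join(lines)


def sparsity(counts):
    """Fraction of cells that are zero."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for v in row if v == 0)
    return zeros / total


# Made-up example data: rows are function categories, columns are traits.
functions = ["transcription regulation", "photosynthesis", "protein degradation"]
traits = ["annual temperature", "altitude", "tiller number"]
example_counts = [
    [2, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]

print(render_minimal_matrix(example_counts, functions, traits))
print(render_minimal_matrix(example_counts, functions, traits, boolean=True))
print("sparsity:", sparsity(example_counts))
```

The same cell labels (empty string for zero, the count otherwise) could be passed as annotations to whatever heatmap function is used for the real plot.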
Random Forest GWAS approach experiment:
- Write the code test-driven, if possible, to ensure re-usability and correctness
- Start with one single climatic or phenotypic trait for a taxon where you have very convincing results in the standard (linear models) GWAS, ideally where several linear methods agree
- Then check what the random forest says
- Take two to three weeks for this experiment: August 21st would be a good goal to report some results
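As a starting point for the test-driven setup, the comparison step could be sketched as two small pure functions: one that collects the SNPs several linear methods agree on, and one that checks how many of the top-ranked random forest SNPs are among them. Everything here is an assumption for illustration: the function names, the method names, the SNP IDs and the importance values are invented, and the random forest itself is not shown.

```python
# Hypothetical helpers for comparing linear GWAS hits with random forest results.

def consensus_hits(hits_by_method, min_methods=2):
    """Return the SNPs reported as significant by at least `min_methods` methods."""
    counts = {}
    for snps in hits_by_method.values():
        for snp in set(snps):
            counts[snp] = counts.get(snp, 0) + 1
    return {snp for snp, n in counts.items() if n >= min_methods}


def top_k_overlap(importances, reference_hits, k):
    """Fraction of the k most important random forest SNPs that are reference hits."""
    ranked = sorted(importances, key=importances.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    return sum(1 for snp in ranked if snp in reference_hits) / len(ranked)


# Made-up example: significant SNPs per linear method for one trait.
linear_hits = {
    "method_a": {"snp1", "snp2", "snp3"},
    "method_b": {"snp2", "snp3"},
    "method_c": {"snp3", "snp4"},
}
# Made-up random forest feature importances for the same trait.
rf_importance = {"snp1": 0.05, "snp2": 0.30, "snp3": 0.40, "snp4": 0.02, "snp5": 0.23}

consensus = consensus_hits(linear_hits, min_methods=2)
print(sorted(consensus))
print(top_k_overlap(rf_importance, consensus, k=3))
```

Small functions like these are easy to cover with unit tests before the actual model is trained, which fits the test-driven goal above.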

Monday, August 14th

I'd like to share a few notes coming up during my meeting with Joseph (happening right now in real-time).
I just realized that I messed up a bit and our planned Teosinte meeting is already a week away. I'll send out the invitation after this meeting, and hope that everyone is still able to make it. If not, we'll need to reschedule.
If the meeting goes ahead as planned, we will give a talk with the following structure and content:
- Brief summary of the Teosinte project and work plan
- GWAS and trait correlation analysis: Results for the paper
- Status of the sequencing: Extractions and decisions on sequencing platforms
- Discussions with the participants
- Planning: Next steps and next meeting
@Joseph Atemia, using the photos of the gels and the uploaded material in our Google Drive, please prepare a few slides for the sequencing status section of the talk, covering:
- Which eight individuals we decided to sequence in depth, and why (brief, to remind everyone)
- What material we have extracted (RNA, DNA, which individuals and genotypes)
- Where we have problems, e.g. with purity
- How we are going to tackle them: two options
  - Purification kit with beads
  - Or ask the sequencing provider to do that
    - Please investigate how much this would cost
Please let us know, @Alicia, if you are available one day this week for, say, max. 20 minutes to decide which genome sequencing provider (including HiC) we select as the preferred option. The reason is that we need to decide in order to get a discount, which in turn ensures that the other quotes are above their offer. Then I can go ahead with the ordering and start the process for good.
Please send me a short calculation about what the genome sequencing with PacBio Revio will cost for the eight candidates all inclusive (preparation kits, and sequencing at the provider).
For finishing the heatmap analyses:
- Plot for level two also (mid priority)
- Plot for all GWAS results (taxa) (lowest priority)
- In the taxon-specific plots, visually highlight those traits that are "specific for the habitat or phenotype of that taxon" (highest priority)
  - "Specific" here means: traits for which this taxon is an outlier in the PCA plot.
  - For the slides:
    - put in a little map (the one used many times) and mark where the taxon has its main habitat
    - write a few words about the specific climatic conditions of that habitat
    - write in a few words, if applicable, where the phenotype of this taxon "significantly" deviates from the rest
    - mark in color the climatic traits associated with that habitat
    - also mark in color, if applicable, those phenotypic traits where this taxon is "significantly" different from the others, e.g. has more tillers, more plant surface etc.
As a result of our conversation, the following characteristics seem to apply to the three taxa that have significant and trustworthy GWAS results:
- Mexicana: Higher altitudes and drier climate
- Parviglumis: Lower altitudes and moderate rainfall
- Chalco: Like Mexicana
- Mesa central: Dry and moderately high altitude
Based on the PCA and the trait distribution plots, try to find good descriptions for the climatic growing conditions and phenotypic characteristics of the respective taxa
When you do this characterisation, record the trait distribution plot and mark (with an arrow or something similar) where each taxon is (https://scidb-conabio.slack.com/archives/C0372V0NYE5/p1692030425575039). Add those raincloud plots to the slide with the protein functions.

So, to summarize the goal of our integrated figure(s):

Show a condensed representation of the climatic and phenotypic traits (one per cluster) and how they correlate
- the latter especially to show which climatic trait influences which phenotypic one
Using the raincloud plots, show the climatic conditions specific to each taxon and, if applicable, the phenotypic characteristics specific to each taxon
Finally, show the protein functions associated with those climatic and phenotypic traits (highlight the columns in the heatmap)
For the highlighted traits, have the specific MapMan bins at hand for the meeting with Rainer