Answers - GeertsManon/EEG_Metagenomics GitHub Wiki

02 Installing Anvi'o for Metagenomics Analysis

Question 1: Conda version: 25.11.1, but this can vary depending on when you installed Miniconda.

03 Explore input data

Question 1: 60.11x

Question 2: All bases (100%)

Question 3: 8,516 contigs

Question 4: 29,138,064 bp

04 Pre‐binning

Question 1: Among the archaeal single-copy core genes, tRNA-synt_1c has the most hits (21). Among the bacterial single-copy core genes, PGK, Ribosomal_L9_c, and Ribosomal_S6 have the most hits (14 each).

Question 2: Gene secY is present 11 times in bacteria, 11 times in archaea, and not at all in protists.

Question 3: It is estimated that this sample contains 12 bacterial genomes and no archaeal or protist genomes.

Question 4: None, trick question 🤓

- Number of sequences in the contigs DB .............: 8,516
- Number of contigs to be conisdered (after -M) .....: 8,516

Question 5: Pseudomonadota appears seven times.

Question 6: Sediminibacterium with a Ribosomal S6 coverage of 94.43

Question 7: Yes, two (Sediminibacterium and Methylomonas koyamae_C)

05 Binning

Question 1: Yes. For example, Sediminibacterium has a Ribosomal S6 gene coverage of 94.43x, as shown in taxonomy_summary.txt. In the interactive interface, the coverage of each contig in this particular bin (black) is around 100x. The taxonomy assigned to the bin also matches.

Question 2: Not all bins were identified to the species level: two received only a genus-level identification.

Question 3: We are comparing the identified SCGs against a reference database, so our taxonomic predictions depend on the quality and completeness of that database. Additionally, the microbial community of this cave system is relatively unexplored, so it's possible we have discovered some novel lineages!

Question 4: This can vary from person to person, because binning is not an exact science. Here is an example:

| Bin Name | Taxonomy (full lineage) | Identified to | Completeness (%) | Contamination (%) | Genome Size (bp) | Average coverage of Ribosomal S6 gene in cave A |
|---|---|---|---|---|---|---|
| Bin_1 | Bacteria; Pseudomonadota; Gammaproteobacteria; Methylococcales; Methylomonadaceae; Methyloglobulus; sp016874115 | species level | 98.6 | 0.0 | 3,030,873 | 62.76x |
| Bin_2 | Bacteria; Bacteroidota; Bacteroidia; Chitinophagales; Chitinophagaceae; Sediminibacterium | genus level | 94.4 | 0.0 | 2,060,207 | 94.43x |
| Bin_3 | Bacteria; Pseudomonadota; Gammaproteobacteria; Burkholderiales; Burkholderiaceae_B; CADEEN01; CADEEN01 sp022841885 | species level | 94.4 | 2.8 | 2,504,219 | 9.55x |
| Bin_4 | Bacteria; Bacteroidota; Bacteroidia; AKYH767; 2-12-FULL-35-15; 2-12-FULL-35-15; 2-12-FULL-35-15 sp005772895 | species level | 91.5 | 4.2 | 3,313,797 | 5.36x |
| Bin_5 | Bacteria; Bacteroidota; Bacteroidia; NS11-12g; UBA955; UBA955 | genus level | 98.6 | 0.0 | 2,284,883 | 8.96x |

Assignment

Question 1 (1p)

Grading:

  • 0.2p per high-quality bin correctly identified (completeness > 90%, contamination < 5%) → max 1p
  • −0.2p if part of the bin tab is missing (either the phylogram or the bin info panel)
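The high-quality thresholds in the grading rule can be written as a small filter. The sketch below is illustrative only: it is not part of the anvi'o workflow, the dictionary layout and function name are assumptions, and the completeness and contamination values are copied from the Question 2 model-answer table.

```python
# Completeness (%) and contamination (%) per bin, copied from the
# Question 2 model-answer table. The data layout is an assumption.
bins = {
    "Bin_1": (98.6, 0.0),
    "Bin_2": (94.4, 0.0),
    "Bin_3": (94.4, 2.8),
    "Bin_4": (93.0, 4.2),
    "Bin_5": (98.6, 0.0),
}


def is_high_quality(completeness, contamination):
    """Grading thresholds: completeness > 90% and contamination < 5%."""
    return completeness > 90.0 and contamination < 5.0


high_quality = [
    name for name, (comp, cont) in bins.items() if is_high_quality(comp, cont)
]

# 0.2p per high-quality bin, capped at the 1p maximum.
points = min(round(0.2 * len(high_quality), 1), 1.0)

print(high_quality, points)
```

All five example bins pass both thresholds, so this example would score the full 1p before any deduction for a missing bin tab.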

Model answer:

Question 2 (1p)

Grading:

  • Full marks for correct data retrieval based on the bins identified in Q1, regardless of whether those bins are high-quality or not
  • −0.2p if a column is left empty or filled in incorrectly

Model answer:

| Bin Name | Taxonomy (full lineage) | Identified to | Completeness (%) | Contamination (%) | Genome Size (bp) | Average coverage of Ribosomal S6 gene in cave A | Average coverage of Ribosomal S6 gene in cave B | Average coverage of Ribosomal S6 gene in cave C |
|---|---|---|---|---|---|---|---|---|
| Bin_1 | Bacteria; Pseudomonadota; Gammaproteobacteria; Methylococcales; Methylomonadaceae; Methyloglobulus; sp016874115 | species level | 98.6 | 0.0 | 3,030,873 | 62.76x | 0x | 0x |
| Bin_2 | Bacteria; Bacteroidota; Bacteroidia; Chitinophagales; Chitinophagaceae; Sediminibacterium | genus level | 94.4 | 0.0 | 2,060,207 | 94.43x | 278.89x | 228.17x |
| Bin_3 | Bacteria; Pseudomonadota; Gammaproteobacteria; Burkholderiales; Burkholderiaceae_B; CADEEN01; CADEEN01 sp022841885 | species level | 94.4 | 2.8 | 2,504,219 | 9.55x | 7.60x | 0x |
| Bin_4 | Bacteria; Bacteroidota; Bacteroidia; AKYH767; 2-12-FULL-35-15; 2-12-FULL-35-15; 2-12-FULL-35-15 sp005772895 | species level | 93.0 | 4.2 | 3,187,169 | 5.36x | 0x | 2.18x |
| Bin_5 | Bacteria; Bacteroidota; Bacteroidia; NS11-12g; UBA955; UBA955 | genus level | 98.6 | 0.0 | 2,284,883 | 8.96x | 134.31x | 13.44x |

Question 3 (1p)

Grading: Three subquestions, 0.33p each:

  • Which organisms are found in all three caves?
  • Are there organisms missing from specific locations?
  • Are there organisms exclusive to certain caves?

Model answer: Not all five organisms are present in all three caves. Bin_2 (Sediminibacterium) and Bin_5 (UBA955) are found in all three caves (coverage > 0 in A, B, and C) (0.33p). Bin_3 (CADEEN01) is absent from cave C, and Bin_4 (2-12-FULL-35-15) is absent from cave B (0.33p). Bin_1 (Methyloglobulus) is exclusive to cave A, with zero coverage in both B and C (0.33p).
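The three subquestions reduce to a presence/absence check on the coverage columns. The following is a minimal sketch, assuming that a Ribosomal S6 coverage above zero means the organism is present in that cave. The variable names are hypothetical and the values are copied from the Question 2 model-answer table.

```python
# Ribosomal S6 coverage per bin and cave, from the Question 2 table.
coverage = {
    "Bin_1": {"A": 62.76, "B": 0.0, "C": 0.0},
    "Bin_2": {"A": 94.43, "B": 278.89, "C": 228.17},
    "Bin_3": {"A": 9.55, "B": 7.60, "C": 0.0},
    "Bin_4": {"A": 5.36, "B": 0.0, "C": 2.18},
    "Bin_5": {"A": 8.96, "B": 134.31, "C": 13.44},
}
ALL_CAVES = {"A", "B", "C"}

# Assumption: coverage > 0 is treated as "present".
presence = {
    name: {cave for cave, cov in per_cave.items() if cov > 0}
    for name, per_cave in coverage.items()
}

# Subquestion 1: bins found in all three caves.
in_all = sorted(n for n, caves in presence.items() if caves == ALL_CAVES)

# Subquestion 2: bins missing from exactly one cave, with that cave.
missing = {
    n: sorted(ALL_CAVES - caves)
    for n, caves in presence.items()
    if len(caves) == 2
}

# Subquestion 3: bins found in a single cave only.
exclusive = {n: next(iter(caves)) for n, caves in presence.items() if len(caves) == 1}

print(in_all, missing, exclusive)
```

A stricter detection threshold than zero could be substituted in the comprehension if very low coverages are considered unreliable.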

Question 4 (1p)

Grading: Two subquestions, 0.5p each:

  • Which cave has the highest diversity? (0.5p)
  • Which cave has the lowest diversity? (0.5p)

Note: Diversity = richness (number of families) + evenness (how evenly distributed abundances are). You are not expected to make this distinction explicitly, but it is the complete answer.

Partial credit:

  • Stating cave A has the highest number of families → 0.5p (awarded even without mention of evenness)
  • −0.25p if you state cave B has more families than cave C (the sixth family in cave C is nearly invisible in the barplot, so I will not deduct the full 0.5p)
  • +0.25p bonus if you mention evenness in addition to richness

Model answer: Cave A exhibits the highest richness with nine distinct families detected in the Ribosomal S6 family barplot. Caves B and C each show six families. Cave A and cave C are both dominated by two highly abundant families, while cave B shows a more even distribution of abundances across its six families. Based on family richness alone, the ranking is: cave A (highest) > caves B and C (equal, six families each). When evenness is also considered, the full diversity ranking becomes: cave A (highest richness) > cave B (lower richness than A, but higher evenness than C) > cave C (lowest, due to both low evenness and equal richness to B).
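Richness and evenness can also be quantified rather than read off the barplot. The sketch below is one possible way to do it, not the method used in the course: it assumes the Ribosomal S6 coverages from the "Additional information" table under Question 5 are an acceptable proxy for abundance, sums them per family, and uses Pielou's evenness (Shannon entropy divided by the log of the richness) as the evenness measure. All function and variable names are hypothetical.

```python
import math
from collections import defaultdict

CAVES = ("A", "B", "C")

# (family, coverage in A, B, C), one row per taxon, copied from the
# "Additional information" table under Question 5.
ROWS = [
    ("2-12-FULL-35-15", 5.36, 0.0, 2.19),
    ("Burkholderiaceae_B", 9.55, 7.60, 0.0),
    ("Chitinophagaceae", 94.43, 278.89, 228.17),
    ("CSP1-5", 2.61, 0.0, 0.0),
    ("Methylomonadaceae", 62.76, 0.0, 0.0),
    ("Methylomonadaceae", 7.11, 0.0, 0.70),
    ("Methylomonadaceae", 6.08, 0.0, 0.0),
    ("Methylomonadaceae", 7.52, 0.0, 0.0),
    ("Methylophilaceae", 4.59, 108.63, 0.0),
    ("Rhodobacteraceae", 6.02, 227.37, 319.28),
    ("UA16", 4.40, 268.11, 25.77),
    ("UBA955", 8.96, 134.31, 13.44),
    ("Flavobacteriaceae", 0.0, 0.0, 0.0),
]


def family_totals(rows):
    """Sum coverage per family for each cave."""
    totals = {cave: defaultdict(float) for cave in CAVES}
    for family, *coverages in rows:
        for cave, cov in zip(CAVES, coverages):
            totals[cave][family] += cov
    return totals


def richness(family_cov):
    """Number of families with coverage above zero."""
    return sum(1 for cov in family_cov.values() if cov > 0)


def pielou_evenness(family_cov):
    """Shannon entropy divided by ln(richness); needs at least 2 families."""
    present = [cov for cov in family_cov.values() if cov > 0]
    total = sum(present)
    shannon = -sum((c / total) * math.log(c / total) for c in present)
    return shannon / math.log(len(present))


totals = family_totals(ROWS)
rich = {cave: richness(totals[cave]) for cave in CAVES}
even = {cave: pielou_evenness(totals[cave]) for cave in CAVES}

for cave in CAVES:
    print(f"cave {cave}: richness={rich[cave]}, evenness={even[cave]:.2f}")
```

With these coverages the richness is 9, 6, and 6 families for caves A, B, and C, and cave B comes out as the most even and cave C as the least even, in line with the model answer.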

Question 5 (1p)

Grading:

  • Any plausible flow diagram proposed (even if incorrect) β†’ 0.25p
  • Correct reasoning but incorrect flow direction (e.g., "cave A has the highest number of taxa, therefore water exits there and gathers organisms from B and C") → 0.75p
  • Correct flow (B → A and C → A) with correct interpretation → 1p

Model answer: I hypothesize that cave A is nearer to the cave exit, with water from caves B and C flowing as separate upstream sources that converge at cave A. This is supported by both the high-quality genome data and the Ribosomal S6 data (family-level): all taxa observed in caves B and C are also present in cave A, whereas cave A additionally hosts organisms found in neither B nor C. Crucially, the taxa unique to cave B and the taxa unique to cave C do not overlap with each other. Cave A therefore holds the highest diversity, representing the combined and non-overlapping diversity of caves B and C, indicating it acts as a confluence point where the two separate water sources meet. The answer is thus: B → A and C → A.

Additional information:

| Family | Cave A coverage | Cave B coverage | Cave C coverage |
|---|---|---|---|
| 2-12-FULL-35-15 | 5.36 | 0 | 2.19 |
| Burkholderiaceae_B (CADEEN01) | 9.55 | 7.60 | 0 |
| Chitinophagaceae (Sediminibacterium) | 94.43 | 278.89 | 228.17 |
| CSP1-5 | 2.61 | 0 | 0 |
| Methylomonadaceae (Methyloglobulus) | 62.76 | 0 | 0 |
| Methylomonadaceae (Methylovulum) | 7.11 | 0 | 0.70 |
| Methylomonadaceae (Methylomonas koyamae_C) | 6.08 | 0 | 0 |
| Methylomonadaceae (Methylomonas sp009925045) | 7.52 | 0 | 0 |
| Methylophilaceae (Methylotenera) | 4.59 | 108.63 | 0 |
| Rhodobacteraceae (Tabrizicola_A) | 6.02 | 227.37 | 319.28 |
| UA16 (CAMCTV01) | 4.40 | 268.11 | 25.77 |
| UBA955 | 8.96 | 134.31 | 13.44 |
| Flavobacteriaceae (Flavobacterium terrigena) | 0 | 0 | 0 |
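The set relationships behind the model answer can be checked directly from this table. This is a minimal sketch under the assumption that coverage above zero means a taxon is present. The names are hypothetical, and the check only tests consistency with the B → A and C → A hypothesis; it does not prove the flow direction.

```python
# (cave A, cave B, cave C) Ribosomal S6 coverage per taxon, from the table above.
taxa = {
    "2-12-FULL-35-15": (5.36, 0.0, 2.19),
    "Burkholderiaceae_B (CADEEN01)": (9.55, 7.60, 0.0),
    "Chitinophagaceae (Sediminibacterium)": (94.43, 278.89, 228.17),
    "CSP1-5": (2.61, 0.0, 0.0),
    "Methylomonadaceae (Methyloglobulus)": (62.76, 0.0, 0.0),
    "Methylomonadaceae (Methylovulum)": (7.11, 0.0, 0.70),
    "Methylomonadaceae (Methylomonas koyamae_C)": (6.08, 0.0, 0.0),
    "Methylomonadaceae (Methylomonas sp009925045)": (7.52, 0.0, 0.0),
    "Methylophilaceae (Methylotenera)": (4.59, 108.63, 0.0),
    "Rhodobacteraceae (Tabrizicola_A)": (6.02, 227.37, 319.28),
    "UA16 (CAMCTV01)": (4.40, 268.11, 25.77),
    "UBA955": (8.96, 134.31, 13.44),
    "Flavobacteriaceae (Flavobacterium terrigena)": (0.0, 0.0, 0.0),
}

# Assumption: coverage > 0 is treated as "present" in that cave.
present = {
    cave: {name for name, cov in taxa.items() if cov[i] > 0}
    for i, cave in enumerate("ABC")
}

# Everything seen upstream (B or C) should also be seen at the confluence (A).
b_within_a = present["B"] <= present["A"]
c_within_a = present["C"] <= present["A"]

# Taxa that one upstream cave has and the other lacks.
only_b = present["B"] - present["C"]
only_c = present["C"] - present["B"]

# Taxa seen in cave A but in neither upstream cave.
a_only = present["A"] - present["B"] - present["C"]

print(b_within_a, c_within_a)
print(sorted(only_b), sorted(only_c), sorted(a_only))
```

Both subset checks hold for these values, each upstream cave carries two taxa that the other lacks, and four taxa occur in cave A only.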

The opposite flow (A → B and A → C) does not match the data. If the river were to exit at caves B and C, its diversity would have to split as the water diverges, with some taxa selectively flowing into B and others into C. However, there is no biological mechanism that would cause such partitioning: all three sites share the same environment (flowing water with comparable conditions), meaning all organisms are equally likely to be carried in any direction by the current. This argument would not hold if sediment were sampled instead, as sediment communities are more spatially constrained and locally shaped, but for water-borne microorganisms in a connected river system, selective partitioning at a divergence point is biologically implausible. What we actually observe, namely that B and C each carry a different, non-overlapping subset of cave A's diversity, is precisely what you would expect if B and C are independent entry points whose communities accumulate at the exit point in cave A.

Any other chain flow such as A → B → C is also incorrect. As shown on the map provided in the assignment (blue lines indicate streams), caves B and C are not directly connected. Any chain topology would require a direct hydrological link between B and C, which the map does not show.