Answers - GeertsManon/EEG_Metagenomics GitHub Wiki
02 Installing Anvi'o for Metagenomics Analysis
Question 1: Conda version: 25.11.1, but this can vary depending on when you installed Miniconda.
03 Explore input data
Question 1: 60.11x
Question 2: All bases (100%)
Question 3: 8,516 contigs
Question 4: 29,138,064 bp
04 Pre-binning
Question 1: Gene tRNA-synt_1c is most commonly found in archaea and appears 21 times. Genes PGK, Ribosomal_L9_c, and Ribosomal_S6 are most common in bacteria, each appearing 14 times.
Question 2: Gene secY is present 11 times in bacteria, 11 times in archaea, and not at all in protists.
Question 3: It is estimated that this sample contains 12 bacterial genomes and no archaeal or protist genomes.
Question 4: None, trick question.
- Number of sequences in the contigs DB .............: 8,516
- Number of contigs to be conisdered (after -M) .....: 8,516
Question 5: Pseudomonadota appears seven times.
Question 6: Sediminibacterium, with a Ribosomal S6 coverage of 94.43x
Question 7: Yes, two (Sediminibacterium and Methylomonas koyamae_C)
05 Binning
Question 1: Yes. For example, Sediminibacterium has a Ribosomal S6 gene coverage of 94.43x, as shown in taxonomy_summary.txt. In the interactive interface, the coverage of each contig in this particular bin (black) is around 100x, which is in the same range. The taxonomy is also similar.
Question 2: Not all bins were identified to the species level: two of them received only a genus-level identification.
Question 3: We are comparing the identified single-copy core genes (SCGs) against a reference database. Our taxonomic predictions thus depend on the quality and completeness of that database. Additionally, the microbial community of this cave system is relatively unexplored, so it's possible we have discovered some novel lineages!
Question 4: This is not an exact science, so your bins can differ from someone else's. Here's an example:
| Bin Name | Taxonomy (full lineage) | Identified to | Completeness (%) | Contamination (%) | Genome Size (bp) | Average coverage of Ribosomal S6 gene in cave A |
|---|---|---|---|---|---|---|
| Bin_1 | Bacteria; Pseudomonadota; Gammaproteobacteria; Methylococcales; Methylomonadaceae; Methyloglobulus; sp016874115 | species level | 98.6 | 0.0 | 3,030,873 | 62.76x |
| Bin_2 | Bacteria; Bacteroidota; Bacteroidia; Chitinophagales; Chitinophagaceae; Sediminibacterium | genus level | 94.4 | 0.0 | 2,060,207 | 94.43x |
| Bin_3 | Bacteria; Pseudomonadota; Gammaproteobacteria; Burkholderiales; Burkholderiaceae_B; CADEEN01; CADEEN01 sp022841885 | species level | 94.4 | 2.8 | 2,504,219 | 9.55x |
| Bin_4 | Bacteria; Bacteroidota; Bacteroidia; AKYH767; 2-12-FULL-35-15; 2-12-FULL-35-15; 2-12-FULL-35-15 sp005772895 | species level | 91.5 | 4.2 | 3,313,797 | 5.36x |
| Bin_5 | Bacteria; Bacteroidota; Bacteroidia; NS11-12g; UBA955; UBA955 | genus level | 98.6 | 0.0 | 2,284,883 | 8.96x |
Assignment
Question 1 (1p)
Grading:
- 0.2p per high-quality bin correctly identified (completeness > 90%, contamination < 5%), up to a maximum of 1p
- −0.2p if part of the bin tab is missing (either the phylogram or the bin info panel)
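The high-quality criterion above can be checked with a few lines of Python. This is only a sketch: the completeness and contamination values are copied from the Question 2 model-answer table, the function name `is_high_quality` is made up for illustration, and your own bins will likely give slightly different numbers.

```python
# Completeness and contamination (%) per bin, copied from the
# Question 2 model-answer table (your own bins may differ).
bins = {
    "Bin_1": (98.6, 0.0),
    "Bin_2": (94.4, 0.0),
    "Bin_3": (94.4, 2.8),
    "Bin_4": (93.0, 4.2),
    "Bin_5": (98.6, 0.0),
}


def is_high_quality(completeness, contamination):
    """Grading thresholds: completeness > 90% and contamination < 5%."""
    return completeness > 90.0 and contamination < 5.0


high_quality = [
    name for name, (comp, cont) in bins.items() if is_high_quality(comp, cont)
]

# 0.2p per high-quality bin, capped at 1p.
score = min(1.0, 0.2 * len(high_quality))
print(high_quality, score)
```

All five bins in the model answer pass both thresholds, which is why the maximum score corresponds to five bins.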
Model answer:
Question 2 (1p)
Grading:
- Full marks for correct data retrieval based on the bins identified in Q1, regardless of whether those bins are high-quality or not
- −0.2p if a column is not filled in or is filled in incorrectly
Model answer:
| Bin Name | Taxonomy (full lineage) | Identified to | Completeness (%) | Contamination (%) | Genome Size (bp) | Average coverage of Ribosomal S6 gene in cave A | Average coverage of Ribosomal S6 gene in cave B | Average coverage of Ribosomal S6 gene in cave C |
|---|---|---|---|---|---|---|---|---|
| Bin_1 | Bacteria; Pseudomonadota; Gammaproteobacteria; Methylococcales; Methylomonadaceae; Methyloglobulus; sp016874115 | species level | 98.6 | 0.0 | 3,030,873 | 62.76x | 0x | 0x |
| Bin_2 | Bacteria; Bacteroidota; Bacteroidia; Chitinophagales; Chitinophagaceae; Sediminibacterium | genus level | 94.4 | 0.0 | 2,060,207 | 94.43x | 278.89x | 228.17x |
| Bin_3 | Bacteria; Pseudomonadota; Gammaproteobacteria; Burkholderiales; Burkholderiaceae_B; CADEEN01; CADEEN01 sp022841885 | species level | 94.4 | 2.8 | 2,504,219 | 9.55x | 7.60x | 0x |
| Bin_4 | Bacteria; Bacteroidota; Bacteroidia; AKYH767; 2-12-FULL-35-15; 2-12-FULL-35-15; 2-12-FULL-35-15 sp005772895 | species level | 93.0 | 4.2 | 3,187,169 | 5.36x | 0x | 2.18x |
| Bin_5 | Bacteria; Bacteroidota; Bacteroidia; NS11-12g; UBA955; UBA955 | genus level | 98.6 | 0.0 | 2,284,883 | 8.96x | 134.31x | 13.44x |
Question 3 (1p)
Grading: Three subquestions, 0.33p each:
- Which organisms are found in all three caves?
- Are there organisms missing from specific locations?
- Are there organisms exclusive to certain caves?
Model answer: Not all five organisms are present in all three caves. Bin_2 (Sediminibacterium) and Bin_5 (UBA955) are found in all three caves (coverage > 0 in A, B, and C) (0.33p). Bin_3 (CADEEN01) is absent from cave C, and Bin_4 (2-12-FULL-35-15) is absent from cave B (0.33p). Bin_1 (Methyloglobulus) is exclusive to cave A, with zero coverage in both B and C (0.33p).
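The same classification can be derived mechanically from the Ribosomal S6 coverages in the Question 2 table. The sketch below treats any coverage above zero as "present", as the model answer does; the variable names are illustrative only.

```python
# Ribosomal S6 coverage per bin and cave, copied from the Question 2 table.
coverage = {
    "Bin_1": {"A": 62.76, "B": 0.0, "C": 0.0},
    "Bin_2": {"A": 94.43, "B": 278.89, "C": 228.17},
    "Bin_3": {"A": 9.55, "B": 7.60, "C": 0.0},
    "Bin_4": {"A": 5.36, "B": 0.0, "C": 2.18},
    "Bin_5": {"A": 8.96, "B": 134.31, "C": 13.44},
}
CAVES = ("A", "B", "C")

# A bin counts as present in a cave when its coverage there is above zero.
presence = {
    name: {cave for cave in CAVES if cov[cave] > 0}
    for name, cov in coverage.items()
}

# Bins found in all three caves.
in_all_caves = sorted(n for n, p in presence.items() if p == set(CAVES))
# Bins found in exactly one cave, mapped to that cave.
exclusive = {n: next(iter(p)) for n, p in presence.items() if len(p) == 1}
# Bins found in exactly two caves, mapped to the cave they are missing from.
missing_from = {
    n: sorted(set(CAVES) - p) for n, p in presence.items() if len(p) == 2
}

print("all caves:", in_all_caves)
print("exclusive:", exclusive)
print("missing from:", missing_from)
```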
Question 4 (1p)
Grading: Two subquestions, 0.5p each:
- Which cave has the highest diversity? (0.5p)
- Which cave has the lowest diversity? (0.5p)
Note: Diversity = richness (number of families) + evenness (how evenly abundances are distributed across those families). You are not expected to make this distinction explicitly, but a complete answer covers both.
Partial credit:
- Stating that cave A has the highest number of families → 0.5p (awarded even without mention of evenness)
- −0.25p if you state that cave B has more families than cave C (the sixth family in cave C is nearly invisible in the barplot, so I will not deduct the full 0.5p)
- +0.25p bonus if you mention evenness in addition to richness
Model answer: Cave A exhibits the highest richness, with nine distinct families detected in the Ribosomal S6 family barplot; caves B and C each show six. Cave A and cave C are both dominated by two highly abundant families, while cave B shows a more even distribution of abundances across its six families. Based on family richness alone, the ranking is cave A (highest) > caves B and C (tied at six families each). When evenness is also considered, the full diversity ranking becomes cave A (highest richness) > cave B (lower richness than A, but higher evenness than C) > cave C (lowest: the same richness as B, but lower evenness).
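Richness and evenness can also be estimated numerically. The sketch below is not part of the assignment and rests on several assumptions: it uses the Ribosomal S6 coverages from the additional-information table under Question 5 as a proxy for abundance, sums rows that belong to the same family, and uses Pielou's evenness (Shannon index divided by the natural log of richness) as one common evenness measure. The function names are illustrative.

```python
import math
from collections import defaultdict

# (family, cave A, cave B, cave C) Ribosomal S6 coverages, copied from the
# additional-information table under Question 5.
ROWS = [
    ("2-12-FULL-35-15", 5.36, 0, 2.19),
    ("Burkholderiaceae_B", 9.55, 7.60, 0),
    ("Chitinophagaceae", 94.43, 278.89, 228.17),
    ("CSP1-5", 2.61, 0, 0),
    ("Methylomonadaceae", 62.76, 0, 0),
    ("Methylomonadaceae", 7.11, 0, 0.70),
    ("Methylomonadaceae", 6.08, 0, 0),
    ("Methylomonadaceae", 7.52, 0, 0),
    ("Methylophilaceae", 4.59, 108.63, 0),
    ("Rhodobacteraceae", 6.02, 227.37, 319.28),
    ("UA16", 4.40, 268.11, 25.77),
    ("UBA955", 8.96, 134.31, 13.44),
    ("Flavobacteriaceae", 0, 0, 0),
]


def family_totals(rows):
    """Sum coverage per family, keeping one total per cave."""
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])
    for family, *covs in rows:
        for i, cov in enumerate(covs):
            totals[family][i] += cov
    return totals


def richness_and_evenness(abundances):
    """Return (richness, Pielou's evenness) for a list of abundances."""
    present = [a for a in abundances if a > 0]
    richness = len(present)
    if richness < 2:
        return richness, 0.0
    total = sum(present)
    shannon = -sum((a / total) * math.log(a / total) for a in present)
    return richness, shannon / math.log(richness)


totals = family_totals(ROWS)
results = {
    cave: richness_and_evenness([cov[i] for cov in totals.values()])
    for i, cave in enumerate(("A", "B", "C"))
}
for cave, (richness, evenness) in results.items():
    print(f"cave {cave}: {richness} families, evenness {evenness:.2f}")
```

With these inputs the sketch counts nine, six, and six families for caves A, B, and C, and cave B comes out as the most even and cave C as the least even, in line with the model answer.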
Question 5 (1p)
Grading:
- Any plausible flow diagram proposed (even if incorrect) → 0.25p
- Correct reasoning but incorrect flow direction (e.g., "cave A has the highest number of taxa, therefore water exits there and gathers organisms from B and C") → 0.75p
- Correct flow (B → A and C → A) with correct interpretation → 1p
Model answer: I hypothesize that cave A is nearer to the cave exit, with water from caves B and C flowing as separate upstream sources that converge at cave A. This is supported by both the high-quality genome data and the Ribosomal S6 data (family-level): all taxa observed in caves B and C are also present in cave A, whereas cave A additionally hosts organisms found in neither B nor C. Crucially, the taxa unique to cave B and the taxa unique to cave C do not overlap with each other. Cave A therefore holds the highest diversity, representing the combined and non-overlapping diversity of caves B and C, indicating it acts as a confluence point where the two separate water sources meet. The answer is thus: B → A and C → A.
Additional information:
| Family | Cave A coverage | Cave B coverage | Cave C coverage |
|---|---|---|---|
| 2-12-FULL-35-15 | 5.36 | 0 | 2.19 |
| Burkholderiaceae_B (CADEEN01) | 9.55 | 7.60 | 0 |
| Chitinophagaceae (Sediminibacterium) | 94.43 | 278.89 | 228.17 |
| CSP1-5 | 2.61 | 0 | 0 |
| Methylomonadaceae (Methyloglobulus) | 62.76 | 0 | 0 |
| Methylomonadaceae (Methylovulum) | 7.11 | 0 | 0.70 |
| Methylomonadaceae (Methylomonas koyamae_C) | 6.08 | 0 | 0 |
| Methylomonadaceae (Methylomonas sp009925045) | 7.52 | 0 | 0 |
| Methylophilaceae (Methylotenera) | 4.59 | 108.63 | 0 |
| Rhodobacteraceae (Tabrizicola_A) | 6.02 | 227.37 | 319.28 |
| UA16 (CAMCTV01) | 4.40 | 268.11 | 25.77 |
| UBA955 | 8.96 | 134.31 | 13.44 |
| Flavobacteriaceae (Flavobacterium terrigena) | 0 | 0 | 0 |
The opposite flow (A → B and A → C) does not match the data. If the river were to exit at caves B and C, its diversity would have to split as the water diverges, with some taxa selectively flowing into B and others into C. However, there is no biological mechanism that would cause such partitioning: all three sites share the same environment (flowing water with comparable conditions), meaning all organisms are equally likely to be carried in any direction by the current. This argument would not hold if sediment were sampled instead, as sediment communities are more spatially constrained and locally shaped, but for water-borne microorganisms in a connected river system, selective partitioning at a divergence point is biologically implausible. What we actually observe is that B and C each carry a different subset of cave A's diversity, and that the taxa distinguishing B from C do not overlap. This is precisely what you would expect if B and C are independent entry points whose communities accumulate at the exit point in cave A.
Any chain flow such as A → B → C is also incorrect. As shown on the map provided in the assignment (blue lines indicate streams), caves B and C are not directly connected. Any chain topology would require a direct hydrological link between B and C, which the map does not show.
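The set relationships behind this reasoning can be verified directly from the coverage table above. The following sketch assumes, as elsewhere in these answers, that a taxon is present in a cave when its Ribosomal S6 coverage there is above zero; the taxon labels are shortened for readability.

```python
# taxon -> (cave A, cave B, cave C) Ribosomal S6 coverage, copied from the
# additional-information table.
TAXA = {
    "2-12-FULL-35-15": (5.36, 0, 2.19),
    "CADEEN01": (9.55, 7.60, 0),
    "Sediminibacterium": (94.43, 278.89, 228.17),
    "CSP1-5": (2.61, 0, 0),
    "Methyloglobulus": (62.76, 0, 0),
    "Methylovulum": (7.11, 0, 0.70),
    "Methylomonas koyamae_C": (6.08, 0, 0),
    "Methylomonas sp009925045": (7.52, 0, 0),
    "Methylotenera": (4.59, 108.63, 0),
    "Tabrizicola_A": (6.02, 227.37, 319.28),
    "CAMCTV01": (4.40, 268.11, 25.77),
    "UBA955": (8.96, 134.31, 13.44),
    "Flavobacterium terrigena": (0, 0, 0),
}

# Set of taxa with non-zero coverage in each cave.
present = {
    cave: {taxon for taxon, cov in TAXA.items() if cov[i] > 0}
    for i, cave in enumerate(("A", "B", "C"))
}

# Everything seen upstream (B or C) should also be seen at the confluence (A).
b_within_a = present["B"] <= present["A"]
c_within_a = present["C"] <= present["A"]

# Taxa seen only in A, and taxa that distinguish B from C and vice versa.
only_in_a = present["A"] - (present["B"] | present["C"])
b_not_c = present["B"] - present["C"]
c_not_b = present["C"] - present["B"]

print("B subset of A:", b_within_a, "| C subset of A:", c_within_a)
print("only in A:", sorted(only_in_a))
print("in B but not C:", sorted(b_not_c))
print("in C but not B:", sorted(c_not_b))
```

Both subset checks hold for this table, which is the pattern expected for B → A and C → A.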