CBW 2021 Metagenomic Taxonomic and Functional Composition Tutorial Answers - LangilleLab/microbiome_helper GitHub Wiki

These are the answers for the Metagenomic Taxonomic and Functional Composition Tutorial created for the 2021 microbiome data analysis Canadian Bioinformatics Workshop.

  1. The reverse FASTQ for each sample should contain the same number of reads as the forward FASTQ. Because each read occupies four lines in a FASTQ file, that is 100000/4 = 25000 reads for sample CSM79HR8 and 100400/4 = 25100 reads for sample HSM7J4QT.

  2. Only 119 reads were removed due to matching the human and/or PhiX genomes, which again highlights that this data has already been stringently filtered.

  3. Researchers have different preferences and opinions about how raw data should be pre-processed. It's important to upload all the raw data so that your work is fully reproducible and so that others can apply different pipelines to it.

  4. The text {/.}.kraken.txt and {/.}.kreport sets the names of the output files from our kraken2 command. Remember that the text {} is replaced by each input file supplied by the argument at the end of our command, ::: cat_reads/*.fastq. Including /. within {} removes both the directory part of the input file's PATH and its file extension, so only the base file name is kept.

  5. We can count the taxa identified across all of our samples by running wc -l bracken_out_merged/merged_output.species.bracken and subtracting one for the header line. This gives a total of 248 taxa.

  6. The sequence HKWJVBCXY170606:2:2116:8029:10262/1 aligned with 25 protein sequences (the maximum number we allowed in our mmseqs command). Four protein sequences share the highest bitscore/lowest E-value with this sequence: UniRef90_A7LXV1, UniRef90_A7LXV1, UniRef90_A0A174F8K1 and UniRef90_A0A1F0I3S4.

  7. The RPKM of EC 2.1.2.9 (Methionyl-tRNA formyltransferase) contributed by Bacteroides vulgatus is 985.117.

  8. The enzyme with EC number 6.1.1.4 is named Leucine--tRNA ligase.

  9. We can get the total number of pathways identified by counting the lines in the unstratified table with the command zcat pathways_out/path_abun_unstrat.tsv.gz | wc -l and then subtracting 1 for the header line. In total, 4 pathways were identified.

  10. Inspecting the stratified pathway abundances with less pathways_out/path_abun_strat.tsv.gz shows that two taxa contribute to the PANTO-PWY pathway.
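
The division by four in answer 1 can be wrapped in a small helper. This is only a sketch: count_fastq_reads is a hypothetical name, and it assumes an uncompressed FASTQ with exactly four lines per read (no wrapped sequence lines).

```shell
# Count reads in an uncompressed FASTQ file, assuming exactly four lines
# per record (header, sequence, "+", qualities).
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Example with a tiny two-read file written to a temporary location.
demo_fastq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo_fastq"
count_fastq_reads "$demo_fastq"   # prints 2
rm -f "$demo_fastq"
```

Running the function on the forward and reverse files of one sample should print the same number if the pairs are intact.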
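
To see what {/.} from answer 4 expands to, the same result can be imitated with plain shell parameter expansion. This sketch does not call parallel itself; strip_path_and_ext is a hypothetical helper, and the input path is an illustrative one built from a sample name used in the tutorial.

```shell
# Mimic GNU parallel's {/} and {/.} replacement strings with shell expansion.
strip_path_and_ext() {
    base=${1##*/}          # like {/}: drop everything up to the last "/"
    echo "${base%.*}"      # like {/.}: also drop the last extension
}

# Illustrative input path (assumed, not taken from the tutorial output).
name=$(strip_path_and_ext "cat_reads/CSM79HR8.fastq")
echo "${name}.kraken.txt"   # prints CSM79HR8.kraken.txt
echo "${name}.kreport"      # prints CSM79HR8.kreport
```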
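
Answers 5 and 9 both count the lines of a table and subtract one for the header. A minimal sketch that does the subtraction in one step is shown here; count_data_rows is a hypothetical helper, and it assumes the table has exactly one header line.

```shell
# Count the rows of a table, skipping a single header line.
# Files ending in .gz are decompressed on the fly.
count_data_rows() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Usage with the tutorial's files (run from the tutorial working directory):
#   count_data_rows bracken_out_merged/merged_output.species.bracken
#   count_data_rows pathways_out/path_abun_unstrat.tsv.gz
```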
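
The tied top hits in answer 6 can also be pulled out without scrolling through the alignments. The sketch below assumes BLAST-style tabular output with the query in column 1, the target in column 2 and the bit score in column 12; check that your mmseqs output uses this layout. top_hits is a hypothetical helper and mmseqs_hits.m8 is a placeholder file name.

```shell
# Print every target tied for the highest bit score for one query.
# Arguments: <tab-separated hits file> <query ID>
top_hits() {
    awk -F '\t' -v q="$2" '
        $1 == q {
            n++
            score = $12 + 0
            if (n == 1 || score > best) { best = score; hits = $2 }
            else if (score == best)     { hits = hits "\n" $2 }
        }
        END { if (n) print hits }
    ' "$1"
}

# Usage (placeholder file name):
#   top_hits mmseqs_hits.m8 "HKWJVBCXY170606:2:2116:8029:10262/1"
```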
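
For answer 10, the contributing taxa can be listed directly instead of paging through the file with less. This sketch assumes the stratified table is tab-separated with the pathway ID in the first column and the contributing taxon in the second; confirm this against the header of your own file. pathway_contributors is a hypothetical helper.

```shell
# List the distinct contributors to one pathway in a gzipped stratified table.
# Arguments: <stratified table (.tsv.gz)> <pathway ID>
pathway_contributors() {
    gzip -dc "$1" |
        awk -F '\t' -v p="$2" 'NR > 1 && $1 == p && !seen[$2]++ { print $2 }'
}

# Usage with the tutorial's file; append "| wc -l" to count the taxa:
#   pathway_contributors pathways_out/path_abun_strat.tsv.gz PANTO-PWY
```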