Metagenomics (answers; IMPACTT 2022) - LangilleLab/microbiome_helper GitHub Wiki

These are answers for the 2022 IMPACTT bioinformatics workshop running from May 30th - June 1st.

Authors: Robyn Wright and Morgan Langille

This tutorial has been adapted from the previous 2021 version written for the 2021 Canadian Bioinformatics Workshop by Jacob Nearing and Morgan Langille.

The answers shown here are for the Metagenomic Taxonomic Composition Tutorial and the Metagenomic Functional Composition Tutorial.

Taxonomic composition

Question 1: Based on the above line counts can you tell how many reads should be in the reverse FASTQs? Remember that a standard convention is to mark the reverse FASTQs with the characters R2

The number of reads should be the same in the reverse FASTQs as it is in the forward FASTQs.

Question 2: Take a look at kneaddata_read_counts.txt: how many reads in sample CSM7KOMH were dropped?

You can look at the number of reads in the "final pair1" or "final pair2" columns for this - they should show 24,007 reads, meaning that 993 reads were dropped.

Question 3: It's reasonable that the processed data was provided by IBDMDB since their goal is to provide standardized data that can be used for many projects. However, why is it important that normally all raw data be uploaded to public repositories?

There are a few different reasons that it's important that all raw data be uploaded to repositories, primarily that someone else should be able to verify your results by repeating exactly what you did, but also the best standards for processing sequencing data are constantly evolving and someone else may want to look at something different in those samples. For example, you may have looked at only the bacterial reads in your samples, but someone else may be interested in the archaeal, fungal or viral reads.

Question 4: What does the text {1/.}.kraken.txt and {1/.}.kreport do in the above command? (Hint look at our above discussion of parallel).

It removes the directory (cat_reads/) and file suffix (.fastq) from the file names, and also adds on .kraken.txt or .kreport to these file names.

Question 5: How many more reads in each sample are classified using the full RefSeq database (RefSeqCompleteV205) than with the minikraken database (with no confidence threshold)? *Hint: you can use the less command to look at both the .kreport and the .kraken.txt files.

For example, if we look at the CSM7KOMH_0.0_minikraken.kreport and CSM7KOMH_0.0_RefSeqCompleteV205.kreport files, we can see that with the minikraken database, 33,117 reads or 68.97% of reads were classified, while with the RefSeqCompleteV205 database, 49,463 reads or 98.98% of reads were classified.

Question 6: How many more reads in each sample are classified using a confidence threshold of 0.0 compared with a confidence threshold of 0.1 for the minikraken database?

For example, if we look at the same samples but with a confidence threshold of 0.1 (CSM7KOMH_0.1_minikraken.kreport and CSM7KOMH_0.1_RefSeqCompleteV205.kreport), then we find that 26,558 reads or 55.31% of reads were classified with the minikraken database, while 49,409 reads or 98.87% of reads were classified with the RefSeqCompleteV205 database. So we can see that there was a fairly large decrease in the number of reads classified with the minikraken database when we went from a confidence threshold of 0.0 to 0.1, while with the RefSeqCompleteV205 database, this difference was very small. As you can see, we need to take into account both the database and the confidence threshold when choosing which to use for your samples. If you want to read more about this, you can do so in our preprint.

Question 7: Take a look at the raw output for the same sample but run with a confidence threshold of 0.1. Was the second read classified? Why or why not?

In this example, you should see that the second read with a confidence threshold of 0.0 reads: C CB1APANXX170426:1:1101:10001:37312/1 Bacteroides (taxid 816) 101 0:57 816:5 0:4 816:1 And with a confidence threshold of 0.1 reads: U CB1APANXX170426:1:1101:10001:37312/1 unclassified (taxid 0) 101 0:57 816:5 0:4 816:1

If we calculate the confidence as we had before for the taxid 816, which the read is initially classified as, then this gives us (5+1)/(57+5+4+1) = 6/67 = 0.089, so this is below the confidence threshold of 0.1. Because the other k-mers within the read were unclassified, this read becomes unclassified when we set the confidence threshold to 0.1.

Question 8: How would you go about figuring out the total number of species we found within our ten different samples?

You can simply count the number of lines within the Bracken summary files, like so:

wc -l bracken_out_merged/merged_output_minikraken.species.bracken
wc -l bracken_out_merged/merged_output_RefSeqCompleteV205.species.bracken

Remember that these files also have a header row, so we'll need to subtract one from each, giving 415 species in the MiniKraken-classified samples and 2,381 species in the RefSeqCompleteV205-classified samples.

Question 9: How does the number of species classified by each of the confidence thresholds differ, and how does it differ between the minikraken and the RefSeqCompleteV205 databases?

You can use some commands like this to see the number of species classified within each file, in different groups at a time:

wc -l bracken_out/*minikraken*
wc -l bracken_out/*RefSeqCompleteV205*
wc -l bracken_out/*0.0_minikraken*
wc -l bracken_out/*0.1_minikraken*
wc -l bracken_out/*0.0_RefSeqCompleteV205*
wc -l bracken_out/*0.1_RefSeqCompleteV205*

Functional composition

Question 1: How many protein sequences did the sequence CB1APANXX170426:1:1101:2664:65700/2 align with in the sample CSM7KOMH? Which alignment/alignments have the lowest E-value/highest bitscore?

There were two alignments to that sequence in sample CSM7KOMH. The first one, matching UniRef90_A0A081U0T2, has a lower E-value (2.973E-11) than the other match, UniRef90_A0A2M9V2V5 (E-value of 4.088E-11), but the bitscore was the same for both of them (64). If we want to see only the alignments for a single sequence, we can use grep, like so:

grep "CB1APANXX170426:1:1101:2664:65700/2" mmseqs_m8_files/mmseqs-CSM7KOMH-s1.m8

grep will print out all lines of the file where the string that we gave it are found.

Question 2: What is the RPKM contributed to the sample CSM79HR8 for the EC:6.3.5.2 contributed by Bacteroides intestinalis?

You should be looking at the second column of the row named "EC:6.3.5.2|Bacteroides (taxid 816)", so this is 425.6836791354688.

Question 3: What is the name of the enzyme with the EC number EC:6.3.5.2?

EC:6.3.5.2: GMP synthase (glutamine-hydrolyzing).

Question 4: How many total pathways were identified?

We can use zcat like we did previously to help us count this, although there aren't too many lines in this file so you would be able to just count them: zcat pathways_out/path_abun_unstrat.tsv.gz | wc -l

This gives us 23, but remember that the header is also included in this count, so the answer is 22.

Question 5: how many taxa contribute to the pathway PANTO-PWY?

You can just count these manually, showing us that there are 9, but we can also use the grep command again to check this: grep -c "PANTO-PWY" pathways_out/path_abun_strat.tsv or grep "PANTO-PWY" pathways_out/path_abun_strat.tsv In the first command, the -c flag will count the number of times that this string is found, while leaving out the -c will just print the lines where the string is found.

Question 6: Which samples is the Streptococcus genus found in?

The Streptococcus genus is only found in the CD samples.

Question 7: Are there any pathways that are only found in one sample and not the other?

This is a bit difficult to tell with so many lines (and would be even more difficult in a sample that hadn't had the reads down-sampled). If we want to, we can go back to the files and double check anything that we think we see there, either using the less command, or we can also use the grep command again:

grep "PWY-6385" pathways_out/path_abun_strat.tsv
grep - "CSM79HR8" pathways-stratified-SankeyFormat.txt