CBW 2018 PICRUSt2 Tutorial Answers - LangilleLab/microbiome_helper GitHub Wiki

Answers for the PICRUSt2 tutorial presented as part of the 2018 Canadian Bioinformatics Workshop on microbiome data analysis.

R version 3.4.1 is installed in the picrust2-dev conda environment, and the R executable is located at /home/ubuntu/.conda/envs/picrust2-dev/bin/R
There are 36 samples. You could either count the rows of the metadata table or type: wc -l input_files/picrust2_lab_metadata.tsv (which would include the header line). Alternatively, you could get the number of columns in the sequence abundance table with this command:

```
awk '{print NF}' input_files/ASV_abun.tsv | head -n 1
```
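
To count samples without the header line, one option is to skip the first line before counting. This is a minimal sketch that assumes one sample per row and a single header line; `count_samples` is a hypothetical helper name, not something the tutorial provides.

```shell
# Hypothetical helper: count data rows in a table, skipping the header line.
count_samples() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}

# Only runs if the tutorial's metadata file is present in the working directory.
metadata=input_files/picrust2_lab_metadata.tsv
if [ -f "$metadata" ]; then
    count_samples "$metadata"
fi
```

Using awk rather than `wc -l` on a pipe avoids the leading whitespace that some `wc` implementations print.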
You can count how many sequences there are in a FASTA file by counting how many header-lines (i.e. lines that begin with ">") there are:
```
grep -c "^>" input_files/ASVs.fna
```
The default input MSA and tree are /usr/local/picrust2/default_files/prokaryotic/img_centroid_16S_aligned.fna and /usr/local/picrust2/default_files/prokaryotic/img_centroid_16S_aligned.tree, respectively. There are 16444 sequences in the MSA.
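
The header-line counting approach shown earlier should also work for the reference alignment, since an aligned FASTA file still has one ">" line per sequence. A small sketch, where `count_fasta_seqs` is a hypothetical helper name and the path is the one quoted in the answer above:

```shell
# Hypothetical helper: count FASTA records by counting lines that start with ">".
count_fasta_seqs() {
    grep -c '^>' "$1"
}

# Only runs if the default reference alignment exists on this machine.
msa=/usr/local/picrust2/default_files/prokaryotic/img_centroid_16S_aligned.fna
if [ -f "$msa" ]; then
    count_fasta_seqs "$msa"
fi
```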
Setting a random seed makes a command that involves some randomness give the same result each time you run it on the same input data. This is important to set in bioinformatic pipelines so that your work can be reproduced by someone else and/or your future self!
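
The effect of a seed can be seen with a generic example unrelated to PICRUSt2 itself. The sketch below uses awk's `srand()` and `rand()`; the `draw` helper and the seed value 42 are assumptions made purely for illustration.

```shell
# Hypothetical helper: draw one pseudo-random number using the given seed.
draw() {
    awk -v seed="$1" 'BEGIN { srand(seed); print rand() }'
}

first=$(draw 42)
second=$(draw 42)

# With the same seed, both draws should be identical.
if [ "$first" = "$second" ]; then
    echo "same seed, same result"
fi
```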
The lowest NSTI value is 0.00010, which you can figure out with this R command: min(hsp_16S_nsti$metadata_NSTI).
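
The same minimum could also be found outside R. This is a minimal sketch that assumes the NSTI values are stored in a tab-separated file with a header row containing a metadata_NSTI column (the column name is taken from the R command above); `min_column` is a hypothetical helper name.

```shell
# Hypothetical helper: print the smallest numeric value in a named column
# of a tab-separated file with a header row.
min_column() {
    awk -F'\t' -v col="$2" '
        NR == 1 {
            # Find the index of the requested column in the header.
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            next
        }
        c && (!seen || $c + 0 < min) { min = $c + 0; seen = 1 }
        END { if (seen) print min }
    ' "$1"
}
```

It would be called as `min_column <table> metadata_NSTI`, with the path of your own NSTI table as the first argument.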
The genome represented by sequence 3bc9d66614c8c98d398ace7483422449 is predicted to have 6 copies of the 16S rRNA gene.
The normalized abundance is 3.67, which means that 3 copies of the marker gene must have been predicted (11/3 = 3.67).
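
The division can be checked on the command line. This sketch assumes, as the 11/3 in the answer implies, that the abundance before normalization was 11 and the predicted copy number was 3; `normalize` is a hypothetical helper name.

```shell
# Hypothetical helper: divide an abundance by a predicted marker gene copy
# number and print the result rounded to two decimal places.
normalize() {
    awk -v abun="$1" -v copies="$2" 'BEGIN { printf "%.2f\n", abun / copies }'
}

normalize 11 3   # prints 3.67
```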
The column sums wouldn't typically be equal since there are different numbers of gene families for each predicted genome. Remember that each predicted genome is based on each input ASV, which will be at variable relative abundances across your samples!
Statement #2 is correct: "The stratified pathway abundances represent the abundances of the community-wide pathway levels contributed by an individual predicted genome." This is important to remember: because of how PICRUSt2 outputs the stratified pathway abundances, you can't know whether all the genes necessary for expressing a pathway are present in an individual predicted genome. The pathway inference is done at the community level.
There are 6 possible placement positions in the tree for ASV d3d5bc15a5f947217d626ad4a99c5757.
Since these are human-associated microbial communities, they have been well characterized (the majority of reference genomes are from human-associated species).