PICRUSt2 - Michael-D-Preston/PrestonLab GitHub Wiki
By Angus Ball
Introduction
PICRUSt2 isn't actually that bad. The developers have taken quite a few steps to make sure it's very easy to use. That being said, this workflow is an R/Linux mix, so make sure you have access to a Linux box of some kind.
This tutorial will be in three parts:
- Getting your data into a PICRUSt2 format (an OTU table)
- Running PICRUSt2 on your Linux box (or HPC)
- Running the analysis of the PICRUSt2 data in R
If you're emotionally ready (I'm not), here's some pre-reading:
First go read this paper: PICRUSt2 paper
Here's the github: PICRUSt2 github
Here's a list of limitations with this kind of analysis
A link to the main tutorial
Part one the Crying begins: Getting your data into an OTU format
It's harder than it looks! Click this link
Part two maybe it'll be okay?: running PICRUSt2
quick reference to how I'll format this part
- This is the step
This is the code you'll run, note the copy button --->
This is the output
- Step 2
- These are bonus facts that I want to say,
- or sub steps
installation of picrust
Load up your Linux box and install PICRUSt2:
wget https://github.com/picrust/picrust2/archive/v2.6.2.tar.gz
tar xvzf v2.6.2.tar.gz
cd picrust2-2.6.2/
conda env create -f picrust2-env.yaml
conda activate picrust2
pip install --editable .
PS: the period at the end of that last command is important; it tells pip to install from the current directory.
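A quick sanity check after installing can save some head-scratching later. The sketch below only asks the shell whether the pipeline script is visible on your PATH. Note that `have_cmd` is a made-up helper name for this tutorial, not part of PICRUSt2, and the check assumes you have already activated the picrust2 environment.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper (not part of PICRUSt2):
# it reports whether a command can be found on PATH.
have_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Run this inside the activated picrust2 environment.
have_cmd picrust2_pipeline.py || echo "hint: try 'conda activate picrust2' first"
```

If it prints "missing", the most likely culprits are a conda environment that isn't active or a pip install that was run from the wrong folder.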
running the pipeline
Activate your picrust2 conda environment and run this command from the folder that contains your OTU table and sequence files:
picrust2_pipeline.py -s rep-seqs.fna -i OTU.txt -o picrust2_out_pipeline -p 10 --stratified --verbose
It... just worked? I didn't need to make another tutorial page, but alas.
Pretty much, it's saying: run the PICRUSt2 Python script (picrust2_pipeline.py) with the sequence file (-s) and the OTU table (-i) being rep-seqs.fna and OTU.txt, respectively. The output (-o) is a folder called picrust2_out_pipeline. Use 10 processors (-p 10) to run this (change it to the number you have), include the stratified files (--stratified), and tell me about the progress (--verbose).
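Before burning an hour of compute, it can be worth checking that the two input files actually agree with each other. The sketch below assumes the OTU table is tab-separated with feature IDs in the first column and a single header line (lines starting with `#` are skipped too), and that the FASTA headers begin with those same IDs. The helper names (`fasta_ids`, `table_ids`, `ids_missing_from_fasta`) are hypothetical and not part of PICRUSt2.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers (not part of PICRUSt2).

# Sequence IDs from a FASTA file: header text up to the first space.
fasta_ids() {
  grep '^>' "$1" | sed 's/^>//' | cut -d ' ' -f 1 | LC_ALL=C sort -u
}

# Feature IDs from a tab-separated table: first column,
# skipping the first (header) line and any line starting with '#'.
table_ids() {
  awk -F'\t' 'NR > 1 && $1 !~ /^#/ { print $1 }' "$1" | LC_ALL=C sort -u
}

# IDs that are in the table but have no sequence in the FASTA file.
# Usage: ids_missing_from_fasta TABLE FASTA
ids_missing_from_fasta() {
  tmp_table="$(mktemp)"
  tmp_fasta="$(mktemp)"
  table_ids "$1" > "$tmp_table"
  fasta_ids "$2" > "$tmp_fasta"
  LC_ALL=C comm -23 "$tmp_table" "$tmp_fasta"
  rm -f "$tmp_table" "$tmp_fasta"
}

# Only runs when the tutorial's file names are present in this folder.
if [ -f OTU.txt ] && [ -f rep-seqs.fna ]; then
  ids_missing_from_fasta OTU.txt rep-seqs.fna
  # nproc (GNU coreutils) prints the number of available processors, for -p.
  echo "processors available: $(nproc 2>/dev/null || echo 1)"
fi
```

If no IDs are printed, every row in the table has a matching sequence. Any ID that does get printed is worth chasing down before you start the pipeline.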
part three moments of calm: analyzing PICRUSt2 data in R
So you want to analyze the PICRUSt2 data, but you just got an absolute load of data, and what does it all mean? Well, the first major choice, it seems, is whether to use the KO (KEGG) data or the EC/MetaCyc data. KEGG seems to have a larger database than MetaCyc, BUT when I tried KEGG it mapped to some weird pathways, and the database seemed to be biased towards human pathways rather than bacterial ones. So I recommend using MetaCyc.
An interesting thing to note is that this data is formatted almost exactly like the OTU and taxa tables that phyloseq expects, and y'know what? All my analyses are based on phyloseq objects, so we're just going to convert the PICRUSt2 data into a phyloseq object, and then you can go into the various tutorials and rerun them with this phyloseq data instead!
PS: I tried to use a package called ggpicrust2. It's an R package made specifically to analyze PICRUSt2 data. Guess what? It didn't work for squat (03-2024), so try it at your own risk if you think you need to.
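To see the OTU-table-like layout for yourself before moving into R, you can peek at the pathway abundance table from the shell. This is a rough sketch: the file path is the one I'd expect from the PICRUSt2 tutorial's output layout and is an assumption (it may differ between versions), and `table_shape` is a hypothetical helper, not a PICRUSt2 command.

```shell
#!/bin/sh
# table_shape is a hypothetical helper (not part of PICRUSt2):
# it prints the dimensions of a gzipped, tab-separated table whose
# first line is a header and whose first column holds the feature IDs.
table_shape() {
  gzip -dc "$1" | awk -F'\t' '
    NR == 1 { cols = NF - 1 }
    END     { printf "%d features x %d samples\n", NR - 1, cols }'
}

# Assumed location of the unstratified MetaCyc pathway table; adjust if yours differs.
pathways=picrust2_out_pipeline/pathways_out/path_abun_unstrat.tsv.gz

if [ -f "$pathways" ]; then
  table_shape "$pathways"
  # First few rows and columns: IDs down the side, samples across the top.
  gzip -dc "$pathways" | head -n 5 | cut -f 1-4
fi
```

If the first column is a list of pathway IDs and every other column is one of your samples, you're looking at the same shape as an OTU table, which is why the phyloseq trick works.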
Citations to cite PICRUSt2:
Based on this link
PICRUSt2
Douglas, G.M., Maffei, V.J., Zaneveld, J.R., Yurgel, S.N., Brown, J.R., Taylor, C.M., Huttenhower, C., Langille, M.G.I., 2020. PICRUSt2 for prediction of metagenome functions. Nat Biotechnol 38, 685–688. https://doi.org/10.1038/s41587-020-0548-6
HMMER
HMMER 3.4 (Aug 2023); Copyright (C) 2023 Howard Hughes Medical Institute; http://hmmer.org/
EPA-NG
Barbera, P., Kozlov, A.M., Czech, L., Morel, B., Darriba, D., Flouri, T., Stamatakis, A., 2018. Data from: EPA-ng: massively parallel evolutionary placement of genetic sequences. https://doi.org/10.5061/DRYAD.KB505NC
gappa
Czech, L., Barbera, P., Stamatakis, A., 2020. Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data. Bioinformatics 36, 3263–3265. https://doi.org/10.1093/bioinformatics/btaa070
castor
Louca, S., Doebeli, M., 2018. Efficient comparative phylogenetics on large trees. Bioinformatics 34, 1053–1055. https://doi.org/10.1093/bioinformatics/btx701
SEPP
Mirarab, S., Nguyen, N., Warnow, T., 2011. SEPP: SATé-Enabled Phylogenetic Placement, in: Biocomputing 2012. Presented at the Proceedings of the Pacific Symposium, WORLD SCIENTIFIC, Kohala Coast, Hawaii, USA, pp. 247–258. https://doi.org/10.1142/9789814366496_0024
MinPath
Ye, Y., Doak, T.G., 2009. A Parsimony Approach to Biological Pathway Reconstruction/Inference for Genomes and Metagenomes. PLoS Comput Biol 5, e1000465. https://doi.org/10.1371/journal.pcbi.1000465
ggpicrust2
Chen Yang, Aaron Burberry, Jiahao Mai, Liangliang Zhang. (2023). ggpicrust2: an R package for PICRUSt2 predicted functional profile analysis and visualization. arXiv preprint arXiv:2303.10388.