PICRUSt2 - Michael-D-Preston/PrestonLab GitHub Wiki

By Angus Ball

Introduction

PICRUSt2 isn't actually that bad. The developers have taken quite a few steps to make sure it's very easy to use. That being said, this workflow is an R/Linux mix (PICRUSt2 runs on Linux, the analysis happens in R), so make sure you have access to a Linux box of some kind.

This tutorial will be in three parts:

Getting your data into a PICRUSt2 format (an OTU table)
Running PICRUSt2 on your Linux box (or HPC)
Running the analysis of the PICRUSt2 data in R

If you're emotionally ready (I'm not), here's some pre-reading:

First go read this paper: PICRUSt2 paper

Here's the GitHub: PICRUSt2 github

Here's a list of limitations with this kind of analysis

A link to the main tutorial

Part one, the crying begins: getting your data into an OTU format

It's harder than it looks! Click this link

Part two, maybe it'll be okay?: running PICRUSt2

Quick reference to how I'll format this part:

This is the step

This is the code you'll run, note the copy button --->

This is the output

Step 2
- These are bonus facts that I want to say,
- or sub steps

Installation of PICRUSt2

Load up your Linux box and install PICRUSt2:

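# Download and unpack the v2.6.2 release
wget https://github.com/picrust/picrust2/archive/v2.6.2.tar.gz
tar xvzf v2.6.2.tar.gz
cd picrust2-2.6.2/
# Create and activate the conda environment that ships with the release
conda env create -f picrust2-env.yaml
conda activate picrust2
# Install PICRUSt2 itself into that environment
pip install --editable .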

PS: the period at the end of the pip command is important. It tells pip to install from the current directory.
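If you want a quick sanity check that the install took, something like the sketch below can help. It only asks the shell whether picrust2_pipeline.py (the script used in the next step) is visible on your PATH. Note that check_picrust2 is a made-up helper name for this page, not part of PICRUSt2.

```shell
# Hypothetical helper: report whether the PICRUSt2 entry point is on PATH.
check_picrust2() {
    if command -v picrust2_pipeline.py >/dev/null 2>&1; then
        echo "picrust2_pipeline.py found"
    else
        echo "picrust2_pipeline.py not found: is the picrust2 conda environment active?"
    fi
}

check_picrust2
```

If it says "not found", the usual suspect is a forgotten conda activate picrust2.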

Running the pipeline

Activate your picrust2 conda environment and run this command from the folder that contains your OTU and sequence files:

picrust2_pipeline.py -s rep-seqs.fna -i OTU.txt -o picrust2_out_pipeline -p 10 --stratified --verbose

It... just worked? I didn't need to make another tutorial page, but alas.

Pretty much it's saying: run the PICRUSt2 pipeline script (picrust2_pipeline.py) with the sequence file (-s) and the OTU table (-i) being rep-seqs.fna and OTU.txt respectively. The output (-o) is a folder called picrust2_out_pipeline. Use 10 processors (-p 10) to run this (change it to however many you have), include the stratified files (--stratified), and report progress as it goes (--verbose).
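One thing worth checking before a long run is that every ID in the OTU table also has a sequence in the FASTA file. Here's a minimal sketch, assuming OTU.txt is a tab-delimited table with a header row and the IDs in the first column. missing_ids is a hypothetical helper written for this page, not a PICRUSt2 command.

```shell
# Hypothetical helper: print table IDs that have no matching FASTA header.
# $1 = FASTA of representative sequences, $2 = tab-delimited OTU table
missing_ids() {
    awk -F '\t' '
        NR == FNR {                       # first file: the FASTA
            if (/^>/) {
                id = substr($0, 2)        # drop the leading ">"
                sub(/[ \t].*/, "", id)    # keep only the ID, not the description
                seen[id] = 1
            }
            next
        }
        FNR == 1 || /^#/ { next }         # second file: skip header row and comment lines
        !($1 in seen) { print $1 }        # ID in the table but not in the FASTA
    ' "$1" "$2"
}

# Only runs if the files from the command above are in the current folder.
if [ -f rep-seqs.fna ] && [ -f OTU.txt ]; then
    missing_ids rep-seqs.fna OTU.txt
fi
```

No output means every table ID was found in the FASTA file.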

Part three, moments of calm: analyzing PICRUSt2 data in R

So you want to analyze the PICRUSt2 data, but you just got an absolute load of data and what does it all mean! Well, the first major choice seems to be whether to use the KO (KEGG) data or the EC/MetaCyc data. KEGG seems to have a larger database than MetaCyc, BUT when I tried KEGG it mapped to some weird pathways, and the database seemed to be biased towards human pathways rather than bacterial ones. So I recommend using MetaCyc.

An interesting thing to note is that this data is formatted almost exactly like the OTU and taxa tables that phyloseq expects, and you know what? All my analyses are based on phyloseq objects, so we're just going to convert the PICRUSt2 data into a phyloseq object. Then you can go into the various tutorials and just rerun them with this phyloseq data instead!

This is how to convert your PICRUSt2 data into a phyloseq object, but if we're being honest you know how to do this already

PS. I tried to use a program called ggpicrust2. It's an R package intentionally made to analyze PICRUSt2 data. Guess what? It didn't work for squat (as of 03-2024), so try it at your own risk if you think you need to.
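Before jumping into R, it can help to confirm that the pathway table really does have the "features down the side, samples across the top" shape of an OTU table. The sketch below is just that: a rough check, assuming the pipeline wrote the MetaCyc pathway abundances to pathways_out/path_abun_unstrat.tsv.gz inside the output folder. That path is an assumption about the default layout, so check it against your own folder. table_shape is a hypothetical helper, not part of PICRUSt2.

```shell
# Hypothetical helper: print "<features> <samples>" for a gzipped, tab-delimited table.
table_shape() {
    gzip -dc "$1" | awk -F '\t' '
        NR == 1 { samples = NF - 1; next }   # header row: first column is the feature ID
        NF > 0  { features++ }               # every other non-empty row is one feature
        END     { print features + 0, samples + 0 }
    '
}

# Assumed default location of the unstratified MetaCyc pathway table.
path_table=picrust2_out_pipeline/pathways_out/path_abun_unstrat.tsv.gz
if [ -f "$path_table" ]; then
    table_shape "$path_table"
fi
```

The first number is the count of pathways (your "OTUs") and the second is the count of samples, which should match your original OTU table.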

Citations for PICRUSt2:

Based on this link

PICRUSt2

Douglas, G.M., Maffei, V.J., Zaneveld, J.R., Yurgel, S.N., Brown, J.R., Taylor, C.M., Huttenhower, C., Langille, M.G.I., 2020. PICRUSt2 for prediction of metagenome functions. Nat Biotechnol 38, 685–688. https://doi.org/10.1038/s41587-020-0548-6

HMMER

EPA-NG

Barbera, P., Kozlov, A.M., Czech, L., Morel, B., Darriba, D., Flouri, T., Stamatakis, A., 2018. Data from: EPA-ng: massively parallel evolutionary placement of genetic sequences. https://doi.org/10.5061/DRYAD.KB505NC

gappa

Czech, L., Barbera, P., Stamatakis, A., 2020. Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data. Bioinformatics 36, 3263–3265. https://doi.org/10.1093/bioinformatics/btaa070

castor

Louca, S., Doebeli, M., 2018. Efficient comparative phylogenetics on large trees. Bioinformatics 34, 1053–1055. https://doi.org/10.1093/bioinformatics/btx701

SEPP

Mirarab, S., Nguyen, N., Warnow, T., 2011. SEPP: SATé-Enabled Phylogenetic Placement, in: Biocomputing 2012. Presented at the Proceedings of the Pacific Symposium, WORLD SCIENTIFIC, Kohala Coast, Hawaii, USA, pp. 247–258. https://doi.org/10.1142/9789814366496_0024

MinPath

Ye, Y., Doak, T.G., 2009. A Parsimony Approach to Biological Pathway Reconstruction/Inference for Genomes and Metagenomes. PLoS Comput Biol 5, e1000465. https://doi.org/10.1371/journal.pcbi.1000465

ggpicrust2

Yang, C., Burberry, A., Mai, J., Zhang, L., 2023. ggpicrust2: an R package for PICRUSt2 predicted functional profile analysis and visualization. arXiv preprint arXiv:2303.10388.