PICRUSt2‐SC database - picrust/picrust2 GitHub Wiki

As of PICRUSt2-v2.6.0, the default database used by PICRUSt2 has been updated, and this Wiki has been revised to reflect the change. The previous database, built from IMG genomes, is still included in PICRUSt2 and can still be used for functional predictions for now (see here for how to do this), although we may remove it in the future.

You can find full details on the new database in our preprint here.

This page describes how to install and run the new database, the improvements it brings, and how it was constructed. If you want to see all of the code that can be used to construct this database, see this page. Aside from containing different genomes, one major change is that, whereas the previous default PICRUSt2 database contained a single phylogenetic tree covering both bacteria and archaea, the updated database contains a separate tree for each domain. This means that some steps need to be run more than once (once per domain) and the outputs of these separate runs combined. All steps can be run together with the picrust2_pipeline.py script. If you want to run the previous database, PICRUSt2-oldIMG, you can do so with the picrust2_pipeline_oldIMG.py script.
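As a rough sketch of how the two entry points might be called from Python: the script names come from this page, and the `-s`, `-i`, `-o` and `-p` options follow the usual PICRUSt2 pipeline usage. That `picrust2_pipeline_oldIMG.py` accepts the same options is an assumption here, the file names are placeholders, and `build_pipeline_command` and `run_pipeline` are hypothetical helpers, so check `--help` for your installed version before relying on this.

```python
import subprocess


def build_pipeline_command(seqs, table, out_dir, processes=1, use_old_img=False):
    """Build the argument list for a full pipeline run (hypothetical helper).

    seqs and table are placeholder paths to the study sequences (FASTA)
    and the feature table; out_dir is the output directory to create.
    """
    # Script names are taken from this page; sharing the same options
    # between the two scripts is an assumption.
    script = "picrust2_pipeline_oldIMG.py" if use_old_img else "picrust2_pipeline.py"
    return [script, "-s", seqs, "-i", table, "-o", out_dir, "-p", str(processes)]


def run_pipeline(command):
    """Run a command built above and raise if the pipeline exits with an error."""
    subprocess.run(command, check=True)


# Build (but do not run) the command for the new default database.
sc_command = build_pipeline_command("study_seqs.fna", "study_seqs.biom", "picrust2_out")
# Build the equivalent command for the previous PICRUSt2-oldIMG database.
old_command = build_pipeline_command(
    "study_seqs.fna", "study_seqs.biom", "picrust2_out_oldIMG", use_old_img=True
)
```

Keeping the command construction separate from `run_pipeline` makes it easy to print and inspect the command before launching what can be a long-running job.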

The genomes within the PICRUSt2 database have now been annotated using EggNOG, which means that predictions can be obtained for EC numbers, KEGG orthologs, BiGG reactions, Carbohydrate-Active Enzymes (CAZy), gene names, GO annotations and Pfams. We also now provide instructions for adding custom traits/annotations to the default database.

The PICRUSt2-SC database

This database uses Genome Taxonomy Database (GTDB) r214 genomes. r214 of GTDB contained 402,709 genomes in 85,205 species clusters. We annotated the genomes for all 85,205 species clusters using EggNOG v2.1.2, and 27,870 of these (26,868 bacteria and 1,002 archaea) met the quality criteria for inclusion. This is a roughly 1.4-fold increase in the number of genomes over the previous PICRUSt2 database, PICRUSt2-oldIMG (19,493 bacteria and 406 archaea), with the number of archaeal genomes more than doubling. Information on all of the included genomes can be found in the *_metadata.csv.gz files within default_files/bacteria and default_files/archaea. We use the phylogenetic trees that are released with GTDB for sequence insertion.

The database now contains BiGG reaction, CAZy, EC number, gene name, GO, KO and Pfam annotations. This gives ~1.3-fold more KOs and EC numbers than the previous database.

We verified the performance of the new PICRUSt2-SC database using simulated samples constructed from genomes that were not present in the updated database. The median weighted Nearest Sequenced Taxon Index (NSTI) was lower for all datasets with the PICRUSt2-SC database than with the PICRUSt2-oldIMG database (average 0.069 vs 0.099), with the largest improvements seen in the Blueberry soil, Cameroon and Primate datasets. The median Spearman's correlation coefficients are higher (0.802 vs 0.757) and the Bray-Curtis dissimilarity indices are lower (0.291 vs 0.341) for the PICRUSt2-SC database than for the PICRUSt2-oldIMG database. More details are available in our preprint.
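The fold changes quoted above can be checked with a few lines of arithmetic. This is only a worked check using the genome counts given in the text; the variable and function names are illustrative.

```python
# Genome counts quoted in the text above.
SC_GENOMES = {"bacteria": 26_868, "archaea": 1_002}
OLD_IMG_GENOMES = {"bacteria": 19_493, "archaea": 406}
SPECIES_CLUSTERS = 85_205  # GTDB r214 species clusters that were annotated


def fold_change(new, old):
    """Return how many times larger `new` is than `old`."""
    return new / old


sc_total = sum(SC_GENOMES.values())
old_total = sum(OLD_IMG_GENOMES.values())

total_fold = fold_change(sc_total, old_total)
bacteria_fold = fold_change(SC_GENOMES["bacteria"], OLD_IMG_GENOMES["bacteria"])
archaea_fold = fold_change(SC_GENOMES["archaea"], OLD_IMG_GENOMES["archaea"])
# Fraction of annotated species clusters that passed the quality criteria.
passing_fraction = sc_total / SPECIES_CLUSTERS

print(f"Total genomes: {sc_total} vs {old_total} ({total_fold:.2f}-fold)")
print(f"Bacteria: {bacteria_fold:.2f}-fold, archaea: {archaea_fold:.2f}-fold")
print(f"Species clusters passing quality criteria: {passing_fraction:.1%}")
```

With these counts, the totals are 27,870 and 19,899 genomes, which is about a 1.40-fold increase overall (about 1.38-fold for bacteria and 2.47-fold for archaea), and roughly a third of the annotated species clusters passed the quality criteria.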

All commands used for constructing the database are here.

Figure 1. Comparison of the PICRUSt2-oldIMG and PICRUSt2-SC databases showing: (a) the steps in the construction of the PICRUSt2-SC database; (b) the number of functions annotated within different frameworks for the PICRUSt2-oldIMG and PICRUSt2-SC databases (note that not all frameworks were included in both databases); (c) the number of taxa included for each step of the PICRUSt2-SC construction (top) and for each phylogenetic rank (bottom) for bacteria and archaea; (d) composition at the class level for the simulated samples (the mean relative abundance is shown for each dataset); and (e) the performance of the PICRUSt2-oldIMG and PICRUSt2-SC databases on the simulated samples from each dataset and overall (bottom). Spearman’s correlation and Bray-Curtis dissimilarity are shown for KEGG orthologs. Individual points are shown for each sample, coloured pink for the PICRUSt2-oldIMG database and yellow for the PICRUSt2-SC database. Boxplots represent the median and the upper and lower quartiles, whiskers show the range of the data (1.5 times the interquartile range), and values in boxes are medians. The results of t-tests between PICRUSt2-oldIMG and PICRUSt2-SC are shown with grey shading for significant (p <= 0.05) tests.

Changes to the steps run by PICRUSt2

Running PICRUSt2 now involves a few extra steps. With the previous default database, PICRUSt2-oldIMG, the steps are (the links will take you to a page detailing each step):

1. Place sequences into reference tree (details)
2. Run hidden-state prediction for 16S copy numbers, KOs and EC numbers (details)
3. Predict KOs and EC number abundances in metagenome (details)
4. Predict pathway abundances and coverage (details)

Because there are now two phylogenetic trees and two sets of functional trait tables, the steps for PICRUSt2 are now:

1. Place sequences into bacterial and archaeal reference trees
    1. Place sequences into reference bacterial tree
    2. Place sequences into reference archaeal tree
2. Run hidden-state prediction for 16S copy numbers, KOs and EC numbers
    1. Run hidden-state prediction for 16S copy numbers for bacteria
    2. Run hidden-state prediction for 16S copy numbers for archaea
    3. Determine the best domain for each sequence (lowest NSTI) (details)
    4. Run hidden-state prediction for KOs and EC numbers separately for bacteria and archaea, with only the sequences that fit best with each domain
    5. Combine bacterial and archaeal predictions for each of KOs and EC numbers (details)
3. Predict KOs and EC number abundances in metagenome (details)
4. Predict pathway abundances and coverage
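
The domain-selection and combining steps above can be illustrated with a minimal sketch. This is not the PICRUSt2 implementation: the function names, the plain-dictionary data layout, the toy NSTI values and KO counts, and the tie-breaking rule (ties go to bacteria) are all assumptions made for illustration.

```python
def assign_best_domain(nsti_bacteria, nsti_archaea):
    """Assign each sequence to the domain whose tree gives the lowest NSTI.

    Both arguments map sequence IDs to NSTI values. A sequence missing from
    one domain is treated as infinitely distant from that domain's tree.
    Ties are assigned to bacteria (an arbitrary choice for this sketch).
    """
    best = {}
    for seq_id in sorted(set(nsti_bacteria) | set(nsti_archaea)):
        bac = nsti_bacteria.get(seq_id, float("inf"))
        arc = nsti_archaea.get(seq_id, float("inf"))
        best[seq_id] = "archaea" if arc < bac else "bacteria"
    return best


def combine_predictions(best_domain, pred_bacteria, pred_archaea):
    """Merge per-domain predictions into one table keyed by sequence ID.

    Each prediction table maps sequence ID -> {function ID: predicted count}
    and only needs to contain the sequences assigned to that domain.
    """
    combined = {}
    for seq_id, domain in best_domain.items():
        source = pred_archaea if domain == "archaea" else pred_bacteria
        combined[seq_id] = dict(source[seq_id])
    return combined


# Toy NSTI values (invented): ASV1 sits close to the bacterial tree,
# ASV2 sits close to the archaeal tree.
nsti_bacteria = {"ASV1": 0.02, "ASV2": 0.85}
nsti_archaea = {"ASV1": 0.91, "ASV2": 0.05}
best_domain = assign_best_domain(nsti_bacteria, nsti_archaea)

# Toy KO predictions (invented), run only on each domain's own sequences.
ko_bacteria = {"ASV1": {"K00001": 2, "K00002": 0}}
ko_archaea = {"ASV2": {"K00001": 0, "K00002": 1}}
combined_ko = combine_predictions(best_domain, ko_bacteria, ko_archaea)
```

Here ASV1 is assigned to bacteria and ASV2 to archaea, so the combined table takes each sequence's predictions from its own domain. The same pattern would apply to EC numbers or any other trait table.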