Hidden state prediction - picrust/picrust2 GitHub Wiki

PICRUSt2 wraps the castor R package to run hidden-state prediction (hsp) to predict gene family abundances. See below for how the --chunk_size and -p options can be used to maximize memory and run-time efficiency.

Please note that before PICRUSt2-v2.6.0 the default running of this command was with the PICRUSt2-oldIMG database. As of PICRUSt2-v2.6.0 the default database will be the PICRUSt2-SC database. See here for further details on this new database. See the details for the -r and the -db options for using the PICRUSt2-oldIMG database with PICRUSt2-v2.6.0.

Hidden-state prediction for 16S copy, E.C. numbers, and KO abundances per-genome can be run with these commands:

hsp.py -i 16S -r bacteria -t bac_placed_seqs.tre -o bac_marker_nsti_predicted.tsv.gz -p 1 -n
hsp.py -i 16S -r archaea -t arc_placed_seqs.tre -o arc_marker_nsti_predicted.tsv.gz -p 1 -n

hsp.py -i EC -r bacteria -t bac_reduced_placed_seqs.tre -o bac_EC_predicted.tsv.gz -p 1
hsp.py -i EC -r archaea -t arc_reduced_placed_seqs.tre -o arc_EC_predicted.tsv.gz -p 1

hsp.py -i KO -r bacteria -t bac_reduced_placed_seqs.tre -o bac_KO_predicted.tsv.gz -p 1
hsp.py -i KO -r archaea -t arc_reduced_placed_seqs.tre -o arc_KO_predicted.tsv.gz -p 1

Note that for EC number and KO abundance, it is recommended that you first determine the best domain for each of your sequences using the pick_best_domain.py script (details here). See details of the recommended order for commands to be run here.

The input arguments and options are:

  • -t TREEFILE: Newick tree with study sequences placed amongst reference sequences.
  • -i TRAIT_OPTION: Which default pre-calculated count table to use (one of '16S', 'BIGG', 'CAZY', 'EC', 'GENE_NAME', 'GO', 'KO', 'PFAM' with the new SC database or one of '16S', 'COG', 'EC', 'KO', 'PFAM', 'TIGRFAM', 'PHENO' with the oldIMG database)
  • -r DOMAIN: Which of the default reference sets to use. Can be 'bacteria'/'bac' or 'archaea'/'arc'.
  • -db DATABASE: Whether to use the oldIMG or the updated SC database for predictions. Default is SC and will use the option given for -r. If -db oldIMG then -r is not used.
  • -o OUTPUT: Named of output file containing predicted counts.
  • --observed_trait_table TRAIT_COUNTFILE: Trait file to use if a non-default file is needed (most users should use one of the default trait options above).
  • -m METHOD: Hidden-state prediction method to use, which needs to be one of: maximum parsimony (mp), empirical probabilities (emp_prob), subtree averaging (subtree_average), phylogenetic independent contrast (pic), or squared-change parsimony (scp).
  • -p INT: Number of processes to run in parallel.
  • -e float - Setting for maximum parisomony hidden-state prediction. Specifies weighting transition costs by the inverse length of edge lengths. If 0, then edge lengths do not influence predictions. Must be a non-negative real-valued number (default: 0.5).
  • --chunk_size INT: Number of gene families to read in for a given processor. Note that increasing this value will not always speed up execution! The trait table is split into subsets of size chunk_size before running hidden state prediction to reduce memory usage. Each subset can be run through on a different processor, but the number of simultaneous processors is equal to the number of data subsets! For example, if there are 1500 total gene families and the chunk_size is 500, the maximum number of processes that can be run is parallel is 3 (which will be the case even if you set something higher!).
  • -n: Indicates that Nearest-sequenced taxon index (NSTI) values should be calculated. This metric can be used to identify study sequences that are highly distant from all reference sequences (the predictions for these sequences are less reliable!). By default the maximum NSTI cut-off for subsequent commands is 2. Any study sequences with a NSTI value higher than 2 are typically either from uncharacterized phyla or off-target sequences. Users can visualize the NSTI distribution for their placed reads to help determine the best cut-off for their dataset for the next steps.

As of version 2.6.0, the file containing the NSTI values will also contain a column that gives details of the closest reference genome to each study sequence. You can find more details on these reference genomes in picrust2/default_files.