Standard PICRUSt2 output - picrust/picrust2 GitHub Wiki

This page describes the output that you can expect from running the entire pipeline as well as from running each of the steps individually. It is for PICRUSt2-v2.6.0 onwards, but most of this applies to earlier versions also. They are separated by the step within the pipeline that they are produced. For this, it is important to understand what exactly each part of the PICRUSt2 pipeline is doing, so I have included some details on that also.

Sequence placement

Uses: runs the function place_seqs_pipeline within picrust2/place_seqs.py, either by running scripts/place_seqs.py or within scripts/picrust2_pipeline.py.

Aim: place study sequences into tree containing reference genomes based on alignment to multiple sequence alignment (MSA) of reference sequences.

Note that these steps will be run for both bacteria and archaea by default.

Input: fasta file containing representative nucleotide sequences for study.

Key output:

  • [arc/bac].tre - newick formatted phylogenetic tree containing all reference genomes for this domain along with any study sequences that could be placed within it.

You can see details of the intermediate files created below, but arc.tre and bac.tre are the only files that are needed to continue with the PICRUSt2 pipeline. Note that if you have run each of the commands separately, you could have named this file something different, but the default is for them to be named arc.tre and bac.tre.

Other intermediate output:
  • intermediate/place_seqs_[arc/bac] (entire folder)
  • intermediate/place_seqs_[arc/bac]/query_align.stockholm
  • intermediate/place_seqs_[arc/bac]/ref_seqs_hmmalign.fasta (if using EPA-NG)
  • intermediate/place_seqs_[arc/bac]/study_seqs_hmmalign.fasta (if using EPA-NG)
  • intermediate/place_seqs_[arc/bac]/epa_out/ (entire folder - standard output from EPA-NG)
  • intermediate/place_seqs_[arc/bac]/study_seqs_filtered.fasta (if using SEPP)
  • intermediate/place_seqs_[arc/bac]/sepp_out/ (entire folder - standard output from SEPP)

Note that these are described below in Output obtained from each step.

Output obtained from each step:
  1. Run hmmalign to align study sequences with reference MSA - this is necessary for EPA-NG and is used as a check for both EPA-NG and SEPP pipelines to exclude sequences that align very little
    Output: intermediate/place_seqs_[arc/bac]/query_align.stockholm - multiple sequence alignment in stockholm format
  2. Filter this file to include only sequences that align to the reference (default >0.8 alignment)
  3. EPA-NG: Write fasta files of reference and study sequences that align
    Output: intermediate/place_seqs_[arc/bac]/ref_seqs_hmmalign.fasta and intermediate/place_seqs_[arc/bac]/study_seqs_hmmalign.fasta
  4. EPA-NG: Run EPA-NG using the fasta files of reference and study sequences that align and the reference .model file
    Output: intermediate/place_seqs_[arc/bac]/epa_out/epa_info.log, intermediate/place_seqs_[arc/bac]/epa_out/epa_result.jplace and intermediate/place_seqs_[arc/bac]/epa_out/epa_result_parsed.jplace (these are the standard output files from EPA-NG)
  5. SEPP: Write fasta containing study sequences only
    Output: intermediate/place_seqs_[arc/bac]/study_seqs_filtered.fasta
  6. SEPP: Run SEPP using the fasta file containing the study sequences that align to the reference and the reference .model file
    Output: intermediate/place_seqs_[arc/bac]/sepp_out/
  7. Convert jplace output file to newick tree format Output: [arc/bac].tre

Hidden-state prediction for 16S copy number and NSTI calculation

Uses: runs the function castor_hsp_workflow within picrust2/wrap_hsp.py, either by running scripts/hsp.py or within scripts/picrust2_pipeline.py.

Aim: perform hidden state prediction on tips in the input tree with unknown trait values. Note that it is used here for prediction of 16S copy number per sequence in the input and will be used again for functional abundance prediction. It will also by default calculate the Nearest Sequenced Taxon Index (unless specified otherwise).

Note that these steps will be run for both bacteria and archaea by default.

Input: newick-formatted tree file output from the sequence placement step that contains reference genomes as well as study sequences.

Key output:

  • [arc/bac]_marker_predicted_and_nsti.tsv.gz - tab-separated file containing four columns. In the first column there will be ASV/sequence names, in the second, the predicted 16S copy number, in the third, the Nearest Sequenced Taxon Index (NSTI - the distance between each study sequence and the nearest reference sequence), and in the fourth, the name of the closest reference genome in the database (you can look up details of this in the metadata files within the default files provided for bacteria and archaea).

You can see details of the steps carried out below, but these files - arc_marker_predicted_and_nsti.tsv.gz and bac_marker_predicted_and_nsti.tsv.gz are the only output from this step. Note that scripts/hsp.py would have to be run twice to replicate the default behaviour within scripts/picrust2_pipeline.py.

Output obtained from each step:
  1. Calculate NSTI values using the R package Castor (unless specified otherwise)
  2. Run Hidden State Prediction using the R package Castor
    Output: [arc/bac]_marker_predicted_and_nsti.tsv.gz

Determine the best domain for each sequence

Uses: runs the function get_lowest_nsti within picrust2/split_domains.py, either by running scripts/pick_best_domain.py or within scripts/picrust2_pipeline.py.

Aim: determine whether the best domain is bacteria or archaea for each input sequence.

Input:

  • arc.tre - output from sequence placement for archaea
  • bac.tre - output from sequence placement for bacteria
  • arc_marker_predicted_and_nsti.tsv.gz - output from hidden state prediction for archaea
  • bac_marker_predicted_and_nsti.tsv.gz - output from hidden state prediction for bacteria

Key output:

  • combined_marker_nsti_predicted.tsv.gz - tab-delimited file containing details of which tree each of the study sequences has been placed in. The file should contain four columns - ASV/sequence name, 16S copy number (for the best-fitting domain), NSTI (for the best-fitting domain), and the best domain (if the NSTI is equal for both domains, whichever was input first will be used - bacteria by default).
  • arc_reduced.tre - input tree filtered to only contain study sequences which fit best into the archaeal tree
  • bac_reduced.tre - input tree filtered to only contain study sequences which fit best into the bacterial tree
  • arc_reduced_marker_predicted_and_nsti.tsv.gz - input file from above filtered to only contain study sequences that fit best into the archaeal tree
  • bac_reduced_marker_predicted_and_nsti.tsv.gz - input file from above filtered to only contain study sequences that fit best into the bacterial tree

You can see details of the steps that are run to create these files below. They will all be used in the next steps.

Output obtained from each step:
  1. For each sequence, determine which of the domains has the lowest NSTI and then filter the input tables to include only the sequences that fit best in each domain
    Output: combined_marker_nsti_predicted.tsv.gz, arc_reduced_marker_predicted_and_nsti.tsv.gz and bac_reduced_marker_predicted_and_nsti.tsv.gz
  2. Prune the trees to include only the sequences that fit best in each domain
    Output: arc_reduced.tre and bac_reduced.tre

Hidden-state prediction for EC number and KO abundances (or other traits given as input)

Uses: runs the function castor_hsp_workflow within picrust2/wrap_hsp.py, either by running scripts/hsp.py or within scripts/picrust2_pipeline.py.

Aim: perform hidden state prediction on tips in the input tree with unknown trait values. Note that it is used here for prediction of EC number and KO abundances (or other functional category) per sequence in the input. This time it will not be used for NSTI prediction (unless specified otherwise).

Note that these steps will be run for both bacteria and archaea by default.

Input: reduced newick-formatted tree file output from determining which domain is the best fit for each sequence. As above, it contains reference genomes as well as study sequences.

Key output:

  • [arc/bac]_[EC/KO/other]_predicted.tsv.gz - tab-separated file containing predicted abundances for all EC numbers/KOs/other functions within the reference database (columns) in each ASV/sequence (rows).

You can see details of the steps carried out below, but these files - arc_EC_predicted.tsv.gz, arc_KO_predicted.tsv.gz, bac_EC_predicted.tsv.gz, bac_KO_predicted.tsv.gz or any other functional category given is the only output from this step. Note that scripts/hsp.py would need to be run four separate times to replicate the default behaviour within scripts/picrust2_pipeline.py.

Output obtained from each step:
  1. Run Hidden State Prediction using the R package Castor
    Output: [arc/bac]_[EC/KO/other]_predicted.tsv.gz

Combine the files from hidden-state prediction for each functional category

Uses: runs the function combine_domain_predictions within picrust2/split_domains.py, either by running scripts/combine_domains.py or within scripts/picrust2_pipeline.py.

Aim: combine the hidden-state predictions obtained separately for functional abundances within the bacteria and archaea above.

Input:

  • arc_[EC/KO/other]_predicted.tsv.gz - output from hidden-state prediction for archaea
  • bac_[EC/KO/other]_predicted.tsv.gz - output from hidden-state prediction for bacteria

Note that it should be the same functional trait used for each of these files!

Key output:

  • combined_[EC/KO/other]_predicted.tsv.gz - tab-delimited file containing per sequence functional predictions for each sequence that could be placed well in one of the trees

The input files are really just combined to give the output, with a few checks performed for overlaps in sequence names or traits between the two, but any of these output files will be needed to construct the predicted metagenome for each sample.

Predict functional abundances within samples (metagenome prediction)

Uses: runs the function run_metagenome_pipeline within picrust2/metagenome_pipeline.py, either by running scripts/metagenome_pipeline.py or within scripts/picrust2_pipeline.py.

Aim: determine the predicted functional abundance profiles per sample after normalising the input sequence abundances by the predicted marker gene copy number.

Input:

  • feature table in either tab-delimited or BIOM format containing the abundance of each ASV/sequence within each sample
  • combined_marker_nsti_predicted.tsv.gz - output from determining the best domain for each sequence, containing the predicted marker gene copy number for each study sequence
  • combined_[EC/KO/other]_predicted.tsv.gz - output from combining the hidden-state predictions for a functional category of interest (default) or one of the hidden-state prediction outputs if only a single domain is wanted for metagenome prediction

Key output:

  • [EC/KO/other]_metagenome_out/pred_metagenome_unstrat.tsv.gz - tab-delimited file containing predicted abundances for each function (rows) within each sample (columns)
  • [EC/KO/other]_metagenome_out/pred_metagenome_contrib.tsv.gz (only output if --strat_out is run) - this file contains several different columns: sample, function, taxon, taxon_abun, taxon_rel_abun, genome_function_count, taxon_function_abun, taxon_rel_function_abun and norm_taxon_function_contrib.

Other output:

  • [EC/KO/other]_metagenome_out/seqtab_norm.tsv.gz - abundance of all ASVs/sequences within the input feature table normalised by predicted marker gene copy number
  • [EC/KO/other]_metagenome_out/weighted_nsti.tsv.gz - weighted NSTI for each sample. These give an indication of how accurate the predictions for each sample are likely to be, with lower values being better and indicating that ASVs/sequences on average match well with those in the reference database
Output obtained from each step:
  1. Drop any sequences if the NSTI is above the maximum NSTI allowed (by default this is 2)
  2. Drop sequence IDs that don't overlap between the study sequences table, marker gene copy number table and predicted functional abundance table
  3. Normalise the abundances of sequences within samples by the marker gene copy number
    Output: [EC/KO/other]_metagenome_out/seqtab_norm.tsv.gz
  4. Calculate the weighted NSTIs for all samples (divide the per-sequence NSTIs by the abundance of the sequences within the sample)
    Output: [EC/KO/other]_metagenome_out/weighted_nsti.tsv.gz
  5. Get unstratified function abundances for each sample - multiply the (normalised - if applicaple) sequence abundance within a sample by the predicted abundance of the function for the sequence
    Output: [EC/KO/other]_metagenome_out/pred_metagenome_unstrat.tsv.gz
  6. Get stratified function abundances for each sample - similar to above, but this is carried out separately for each sequence within each sample, with the intermediate predicted values retained in the output
    Output: [EC/KO/other]_metagenome_out/pred_metagenome_contrib.tsv.gz

Infer pathway abundances

Uses: runs the function pathway_pipeline within picrust2/pathway_pipeline.py, either by running scripts/pathway_pipeline.py or within scripts/picrust2_pipeline.py.

Aim: predict pathway abundance and coverage within metagenome samples. Note that by default, the only option for predicting pathway abundance/coverage is that of MetaCyc pathways using EC numbers.

If you have a KEGG database subscription, you may obtain these files for KEGG pathways.

Input:

  • EC_metagenome_out/pred_metagenome_[unstrat/contrib/strat].tsv.gz - the output of the above metagenome prediction step for EC numbers in either stratified or unstratified format. Note that if you want the pathway abundances to be stratified, you must input a stratified file - this will be automatically detected within this step

Key output:

  • pathways_out/path_abun_[unstrat/contrib/strat].tsv.gz - tab-delimited file containing the abundance of pathways in either unstratified or stratified format (depending on whether the input was unstratified or stratified)
  • pathways_out/path_cov_[unstrat/contrib/strat].tsv.gz - tab-delimited file containing the coverage of pathways in either unstratified or stratified format (depending on whether the input was unstratified or stratified)
Other intermediate output:

This can optionally be saved by giving the directory for it to be saved to. By default, it will be saved within intermediate/pathways/

  • parsed_mapfile.tsv
  • regrouped_infile.tsv (if regroup option is set)
  • minpath_running/[sample_name]_minpath_details.txt (for every sample within the feature table)
  • minpath_running/[sample_name]_minpath_in.txt (for every sample within the feature table)
Output obtained from each step:
  1. If a mapping file is provided, regroup the functions in the input table to these (by default, the EC numbers are regrouped to MetaCyc reactions using pathway_mapfiles/ec_level4_to_metacyc_rxn_new.tsv)
    Output: intermediate/pathways/regrouped_infile.tsv
  2. Read in the pathways and if Minpath is being used, get the pathways that are present in the samples
    Output: intermediate/pathways/parsed_mapfile.tsv
  3. Run Minpath for each sample to get predictions of the pathways present within each sample
    Output: minpath_running/[sample_name]_minpath_details.txt and minpath_running/[sample_name]_minpath_in.txt (for each sample)
  4. Combine the Minpath output for each sample (for coverage as well as abundance if coverage is set)
    Output: pathways_out/path_abun_[unstrat/contrib/strat].tsv.gz and pathways_out/path_cov_[unstrat/contrib/strat].tsv.gz
⚠️ **GitHub.com Fallback** ⚠️