Creating a custom trait table - picrust/picrust2 GitHub Wiki

In this page, I give directions for making a custom trait file that can be used with the existing PICRUSt2 database of genomes (PICRUSt2-SC). I am assuming some basic bioinformatics knowledge (and in fact, would not recommend this if you are a complete beginner in the field), so I'm assuming that e.g. you can install the command line programs for yourself if you come across something that isn't found.

A basic overview of the steps that we'll be carrying out:

  1. a) Download the GTDB reference genomes (r214.1) (gtdb_genomes_reps_r214.tar.gz in here)
  2. b) Filter these genomes to only those included within PICRUSt2 (i.e. the genomes in default files: archaea/archaea_metadata.csv and bacteria/bacteria_metadata.csv)
  3. Annotate the genomes with your function of interest (I give directions for doing this using a Hidden Markov Model)
  4. Count the number of copies of this function within each genome to create a table containing these counts for each of the bacteria and archaea
  5. Run PICRUSt2 with your custom trait tables

You will need

Either:

  1. A method for annotating the genomes with your function of interest
  2. A file containing nucleotide sequences for genes known to carry out the function that you are interested in

1. Get the PICRUSt2-SC genomes

1.1 Download copies of the lists of genomes included within PICRUSt2

Download the csv files and unzip them:

wget https://raw.githubusercontent.com/picrust/picrust2/master/picrust2/default_files/archaea/archaea_metadata.csv.gz
wget https://raw.githubusercontent.com/picrust/picrust2/master/picrust2/default_files/bacteria/bacteria_metadata.csv.gz
gunzip *

Create copies of these files with only the first column (this contains the genome names) and with the header removed:

awk -F"," '{print $1}' archaea_metadata.csv | sed '1d' > archaea_genomes.txt
awk -F"," '{print $1}' bacteria_metadata.csv | sed '1d' > bacteria_genomes.txt

1.2 Download the GTDB reference genomes (r214.1)

Download the genomes and unzip the tar file:

wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/genomic_files_reps/gtdb_genomes_reps_r214.tar.gz
tar -xvf gtdb_genomes_reps_r214.tar.gz

Now we'll filter these to only include the genomes of interest to us.

First, move all of the genomes into a single folder:

mkdir gtdb_genomes
find gtdb_genomes_reps_r214/ -type f -print0 | xargs -0 mv -t gtdb_genomes/

You should see that you have 85,205 genomes in this folder now (e.g. with ls gtdb_genomes | wc -l.

1.3 Filter to only the genomes included in PICRUSt2

Now take out only the bacterial and archaeal genomes:

mkdir archaea_genomes
mkdir bacteria_genomes
parallel -j 12 -a archaea_genomes.txt 'mv gtdb_genomes/{}_genomic.fna.gz archaea_genomes/'
parallel -j 12 -a bacteria_genomes.txt 'mv gtdb_genomes/{}_genomic.fna.gz bacteria_genomes/'

You should again have a look to see that you have the right number of genomes:

ls archaea_genomes | wc -l 
less archaea_genomes.txt | wc -l
ls bacteria_genomes | wc -l
less bacteria_genomes.txt | wc -l

You should see 1,002 archaeal genomes and 26,868 bacterial genomes.

At this point, you could delete the other files if you want, i.e. in gtdb_genomes, gtdb_genomes_reps_r214 and gtdb_genomes_reps_r214.tar.gz.

3. Annotate genomes with function of interest

I'm giving directions here for annotating the genomes using a HMM, but you could use any method you like.

3.1 Build a HMM with sequences of interest

I'm demonstrating this using alkB (alkane monooxygenase) because this is one that I've used recently. You can use any function you like, but ideally, this would be a list of sequences for which this function is experimentally verified. It should be a fasta file containing nucleotide sequences. If you want to use amino acid sequences, it wouldn't be impossible, but would be a lot more leg work. You would need to get hold of the amino acid sequences for all of the PICRUSt2 genomes (from NCBI) and then run this on them instead. You could match the genome ID's from the NCBI RefSeq/Genbank accession lists and find the URL's to download these from the NCBI FTP site. I could add details on how to do this at some point in the future if enough people requested it.

My input file of nucleotide sequences is called alkB_nt.fasta, has 194 sequences, and looks like this:

>ENA|CUH38979|CUH38979.1 Jannaschia seosinensis Alkane 1-monooxygenase 2
ATGACGATTGTCACGCTGGAGGAGGACCGGGTCCGCGGACCCGCCGCGGCGCTGCCGTTCTGGATCTCGCTGACCATGGTGCCGGTCATCTGGTTCAGCGCGGTGCAGGGCGGAT
GGTGGGTTCTGTCGGTGCCGGTTTATGCTTGGGGCCTGTACTCTGTCCTCGACCTTGTGGCCGGAACGGAGGAGCGCAATGCCGACCCCCTGACGGATCGATCGCAGCTGGGCTG
GTACCGCGCGATCACGCTGCTCTGGGGGCCGGTGCAGATTGCGACGCTTTTCGCGCTGATCGACTACGCGACCCGCGCGCCGCATCTGGGCGTGTGGGAACTCGTGGGCCTGTTC
TTCGGGATGGGTGTCGTGTCGGGCACGATCGGCATCAACTATGCCCATGAGTTGCTGCACCAGAAGCCGAAATGGGACCGGTGGACGGCGGATCTGCTGCTGGCGTCGGTGCTCT
ATTCGCATTTCCGGTCGGAGCATCTTCTGGTCCATCATCGCCATGTCGCCACCCCGCGCGACCCGGCCACGGCCCGTTACAACGAGGGATTCCACCGCTTCTATCCACGCGTCCT
GCGCCAGAGCCTCGTGTCGGCCTGGCGGGCGGAGGCGGCGATGCTGGCGCGCCGGGGGTTGCCGTTCTGGCACATTGGCAACCCGTTCTGTCGCTATGGTGCGTTGCAGGCGGCG
ATGCTGTCGATGGCGCTGCTCCTGGGGGGATGGACGGGGCTGGGGCTGTTTCTGGTGCAGGCGGGAACCGCGATCTGGCAGCTCGAGCTCGTCAATTATGTCGAGCATTATGGCC
TGACGCGGCGCCATCTGGGCGGCGGCAGATACGAACCCGTGCGCCCGCATCACTCGTGGAACTCGGCGCGGAAGGTGTCGAACTGGATGCTGATCAACCTGCAGCGTCATTCCGA
CCATCACTATAAGCCCGACCGGCCTTTTCCCTTGCTGCAGACCTACGACGAAGAGGAGGCGCCGCAACTGCCTGCGGGCTACCTGGCGATGACCGCGGCGGCGATGGTCCCGCCG
CTCTGGCGGCGCATGATGAACCCGCGGGTGCGCGCATGGCGGCGGCGGCATTACCCCGACATCGGGGATTGGTCCGCTTGCCGCGCCGCCTCGACCCCCTTGAGACGCTAG
>ENA|CUH42521|CUH42521.1 Ruegeria atlantica Alkane 1-monooxygenase 2
ATGATCGCATCAGACAGTCTGTCACGCTGGCAAGCGGTATTGCCCTTCTGGGTGTCCTTCCTGCTGATCCCCGTCATCTGGCTTACCGCCGTTGCCGGAGGCTGGTGGGTCTTGCT
TGTGCCTCTTTCGACTTGGTGGGCGTTTGCAGTTCTTGATCGGTTAACGGGGCTGAACCTTGAAAATGCCGATCCGGCCACGCCGGACGGCGGTCTCAAATGGTACATTCTGCTGA
CCAAGGCATGGGCCCCGGTTCAGTTTCTTACGCTGTTTGGGTTGCTCTGGTACGCCACTCATACCGATCACCTGTCTGTTTTTAGCAAGATTGGATTGTTCTTCGGGATGGGTGTG
ATCTCGGGCACCATCGGGATCAACTACAGCCATGAATTGATGCACCAGAATGGGCGGTCAGAACGTTGGCTGGGTGATATTCTACTGGCCATGGTGCTGTATTCTCACTTCCGGTC
AGAGCACCTGCTGGTTCACCATCGCTATGTCGGAACACCACGCGATCCGGTCACTGCACGGTACAATGAAGGGTTTCACCGCTTCTATCCGCGCGTTCTGTGGCAGTGTCTTGCAT
CCGCCTTCCACGCAGAAAGGGATAAGCTGGCGCGCAAGCAACGTCCGTGGACCGACCCGGCGAACCCGTTCTGGCGTTATTGGGCGCTTCAGGCAGGCATGCTGATTTTGGCCTTC
GTGGTCGGGGGATGGTCAGGTGTGGCCCTCTTTCTGGTGCAAGCGGGCGTCGCCATATGGCAGCTGGAACTGGTGAACTACGTCGAACATTACGGCCTGACCCGCAAACATCTTGG
CGATGGCAAGTACGAGCATGTCCAGCCACGCCATTCGTGGAACGCCTCGCAAAAAGCGTCGAACTGGCTGCTGATCAACCTACAGCGTCACTCGGACCATCACTACAAACCCAACC
GGCGCTTTCCGCTGCTGCAGCATCATTCGGAAGATGCGGCCCCGCAACTGCCCTATGGCTATCCGGTTATGACCGTCGCTGCGATGGTCCCACCGCTTTGGCGGCGGGTCATGAAC
CCCCGCGTTCGGGCCTGGCGGCGTCAATATTATCCCGAAGTCACCGATTGGAAACCGTACAACAGATCTCTCAACCCGTTGCCGCGTTCGTGA

Make a multiple sequence alignment using clustal-omega:

clustalo --in=alkB_nt.fasta --out=alkB_nt.sto --force --outfmt=st --wrap=80 --threads 12

Convert alignment to fasta format with esl-reformat:

esl-reformat -o alkB_nt.afa afa alkB_nt.sto

Build HMM with HMMer:

hmmbuild --cpu 12 alkB_nt.hmm alkB_nt.sto

3.2 Search genomes using HMM

Make the output directories:

mkdir hmmsearch_bacteria
mkdir hmmsearch_archaea

Call genes in genomes with Prodigal:

mkdir archaea_prodigal_out
mkdir bacteria_prodigal_out
parallel -j 12 --eta -a archaea_genomes.txt 'gunzip archaea_genomes/{}_genomic.fna.gz'
parallel -j 12 --eta -a bacteria_genomes.txt 'gunzip bacteria_genomes/{}_genomic.fna.gz'
parallel -j 12 --eta -a archaea_genomes.txt 'prodigal -i archaea_genomes/{}_genomic.fna -d archaea_prodigal_out/{}.fna'
parallel -j 12 --eta -a bacteria_genomes.txt 'prodigal -i bacteria_genomes/{}_genomic.fna -d bacteria_prodigal_out/{}.fna'

Search the genomes:

parallel -j 12 --eta -a archaea_genomes.txt 'hmmsearch --tblout hmmsearch_archaea/{}.out alkB_nt.hmm archaea_prodigal_out/{}.fna'
parallel -j 12 --eta -a bacteria_genomes.txt 'hmmsearch --tblout hmmsearch_bacteria/{}.out alkB_nt.hmm bacteria_prodigal_out/{}.fna'

3.3 Collate the output from the HMM searches

Get the count_hmm_hits.py script and run this (this script is here or will be pre-installed with PICRUSt2):

python count_hmm_hits.py -hf hmmsearch_archaea -g alkB -o archaea_alkB.txt
python count_hmm_hits.py -hf hmmsearch_bacteria -g alkB -o bacteria_alkB.txt

Note that this script uses an e-value of 10 to the power of -25 by default. This is what we have found to work well in our own work, but you can change the value using the -e flag within this script, e.g. -e -35.

4. Run PICRUSt2 with the new custom trait tables

For this, you can just run PICRUSt2 in the normal way but specify these tables for your traits:

picrust2_pipeline.py -s chemerin_16S/seqs.fna -i chemerin_16S/table.biom -o picrust2_out_pipeline_custom_traits -p 12 --custom_trait_tables_ref1 bacteria_alkB.txt --custom_trait_tables_ref2 archaea_alkB.txt --no_pathways

Note that you will likely need to set the --no_pathways option, unless your custom tables contain EC numbers that match the Metacyc pathways files.