Creating taxon specific slices of the Pfam HMM database - ababaian/serratus GitHub Wiki

Introduction

Until now the DARTH annotation pipeline for the Serratus project has relied on a coronavirus-targeted set of curated HMM models for annotating our assemblies. Given the out-groups that we need to annotate, and some of the distant clades we believe we are uncovering, this CoV-centric resource is possibly too constraining. We need a resource that can cover all of Nidovirales domains, just to be sure we are not missing domains.

Approach

UniProt queries

We can use the manually-curated proteins in SwissProt to figure out what are the Pfam domains associated with Nidovirales.

On UniProt.org, you can specify that you want to search the SwissProt (i.e., manually curated) portion of the database. On the left-hand margin, you can click on 'Taxonomy' to get a taxonomic view of the results. Browsing to Nidovirales provides a view of the distribution of sequences.

You can click on (541 results) next to Nidovirales on the tree to get the following result set. Modify the displayed columns so that there's a column of the PFam annotation.

Scraping out the unique Pfam accessions

You can then click on 'Download' and select the 'text' format, to get the attribute-value format used internally by UniProt. From there, scraping out the unique Pfam domains is easy-peasy-lemon-squeezy:

awk -F';' '$1 ~ /Pfam/ { print $2 }' ~/Downloads/uniprot-reviewed\ yes+taxonomy\ 76804.txt | tr -d ' ' | sort | uniq | wc -l

... which reports that there are 91 unique Pfam accessions used across the 541 curated Nidovirales proteins in UniProt.

Slicing the Pfam-A.hmm file with the list of Pfam accessions

Now that we have the accessions, we compare them with the existing CoV Pfam database, to make sure that we now have a super-set of that set:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/data/pfam$ comm -12 <(awk '/^ACC/ { gsub(/\..*$/,""); print $2}' ../Pfam-A.SARS-CoV-2.hmm | sort ) <(awk '/^ACC/ { gsub(/\..*$/,""); print $2 }' Pfam-A.hmm | sort )| wc -l
40

This demonstrates that we have exactly 40 entries which are present in both databases, which is the number of HMMs in Pfam-A.SARS-CoV-2.hmm.

We now slice the full Pfam-A.hmm file to get what we want:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/data/pfam$ time awk 
'NR==FNR   { ids[$1]; next} 
 /^HMMER3/ { print_p = 0; hmmer_line = $0} 
 /^NAME/   { name_line=$0} 
 /^ACC/ && acc_key=$2 && gsub(/\..*$/,"",acc_key) && acc_key in ids { print hmmer_line; print name_line; print_p = 1} 
 print_p' ../../src/nidovirales-pfam-ids.txt Pfam-A.hmm > Nido-Pfam-A.hmm

Then, we index the HMMs:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/data/pfam$ hmmpress Nido-Pfam-A.hmm 
Working...    done.
Pressed and indexed 91 HMMs (91 names and 91 accessions).
Models pressed into binary file:   Nido-Pfam-A.hmm.h3m
SSI index for binary model file:   Nido-Pfam-A.hmm.h3i
Profiles (MSV part) pressed into:  Nido-Pfam-A.hmm.h3f
Profiles (remainder) pressed into: Nido-Pfam-A.hmm.h3p

Here is the Nidovirales-specific set of HMMs that I created from the latest release of Pfam-A: https://serratus-public.s3.amazonaws.com/tmp/Nido-Pfam-A.hmm

Results

Looking at a specific example, I worked on SRR5234495, which is the most-distant assembly that we have recovered so far. Using the stock CoV Pfam DB, and stock hmmsearch parameters, we were getting just three domains (RdRP_1, CoV_NSP15_C, and Viral_helicase1).

Playing with the hmmsearch parameters, cranking up the sensitivity, now I get the following for SRR5234495:

AAA_12  PF13087 0.028
Arteri_Gl       PF00951 0.023
Arteri_Gl       PF00951 0.023
Arteri_nucleo   PF01481 0.0006
Corona_5a       PF06336 0.017
Corona_NS1      PF06145 0.047
Corona_NS1      PF06145 0.047
Corona_NS12-7   PF04753 0.0044
Corona_NS12-7   PF04753 0.071
Corona_NS12-7   PF04753 0.071
CoV_E   PF02723 0.078
CoV_Methyltr_1  PF06471 3.5e-12
CoV_Methyltr_2  PF06460 2.8e-11
CoV_NSP15_C     PF19215 3.6e-21
CoV_NSP15_C     PF19215 0.017
CoV_NSP4_N      PF19217 0.036
CoV_RPol_N      PF06478 7.8e-08
CoV_S1  PF01600 0.0099
CoV_S1_C        PF19209 0.032
CoV_S2  PF01601 1.8e-13
Peptidase_C30   PF05409 0.08
Peptidase_C31   PF05410 0.059
PRRSV_Env       PF02340 0.06
RdRP_1  PF00680 1e-25
SARS_3b PF12383 0.085
Spike_torovirin PF17072 1.8e-09
Viral_helicase1 PF01443 1.4e-21

Specifically, this is how hmmsearch was called:

time hmmsearch --max \                                                                                                         
        -A match-alignments-nido-max.sto \                                                                                                                                                     
        --domtblout hmmsearch-matches-nido-max.txt \                                                                                                                                           
        ~/repos/darth/data/pfam/nido-pfam/Nido-Pfam-A.hmm \                                                                                                                                    
        trans-6-frame.fasta  \                                                                                                                                                                 
        > hmmstdout-nido-max.txt

The use of --max is critical, as this turns off the Blast-like search accelearation heuristics that HMMER3 uses by default. For very divergent sequences, you want to run the HMM over everything, cutting no corners.

Why bother creating a slimmed down database, and not just run all of Pfam-A.hmm? Two reasons:

Much slower (2 seconds versus 7.5 minutes)
Running with the full Pfam database leads to a huge increase in false-positive alignments, especially when operating at max sensitivity.