Creating taxon specific slices of the Pfam HMM database - ababaian/serratus GitHub Wiki
Introduction
Until now the DARTH annotation pipeline for the Serratus project has relied on a coronavirus-targeted set of curated HMM models for annotating our assemblies. Given the out-groups that we need to annotate, and some of the distant clades we believe we are uncovering, this CoV-centric resource is possibly too constraining. We need a resource that can cover all of Nidovirales domains, just to be sure we are not missing domains.
Approach
UniProt queries
We can use the manually-curated proteins in SwissProt to figure out what are the Pfam domains associated with Nidovirales.
On UniProt.org, you can specify that you want to search the SwissProt (i.e., manually curated) portion of the database. On the left-hand margin, you can click on 'Taxonomy' to get a taxonomic view of the results. Browsing to Nidovirales provides a view of the distribution of sequences.
You can click on (541 results) next to Nidovirales on the tree to get the following result set. Modify the displayed columns so that there's a column of the PFam annotation.
Scraping out the unique Pfam accessions
You can then click on 'Download' and select the 'text' format, to get the attribute-value format used internally by UniProt. From there, scraping out the unique Pfam domains is easy-peasy-lemon-squeezy:
awk -F';' '$1 ~ /Pfam/ { print $2 }' ~/Downloads/uniprot-reviewed\ yes+taxonomy\ 76804.txt | tr -d ' ' | sort | uniq | wc -l
... which reports that there are 91 unique Pfam accessions used across the 541 curated Nidovirales proteins in UniProt.
Slicing the Pfam-A.hmm file with the list of Pfam accessions
Now that we have the accessions, we compare them with the existing CoV Pfam database, to make sure that we now have a super-set of that set:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/data/pfam$ comm -12 <(awk '/^ACC/ { gsub(/\..*$/,""); print $2}' ../Pfam-A.SARS-CoV-2.hmm | sort ) <(awk '/^ACC/ { gsub(/\..*$/,""); print $2 }' Pfam-A.hmm | sort )| wc -l
40
This demonstrates that we have exactly 40 entries which are present in both databases, which is the number of HMMs in Pfam-A.SARS-CoV-2.hmm
.
We now slice the full Pfam-A.hmm file to get what we want:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/data/pfam$ time awk
'NR==FNR { ids[$1]; next}
/^HMMER3/ { print_p = 0; hmmer_line = $0}
/^NAME/ { name_line=$0}
/^ACC/ && acc_key=$2 && gsub(/\..*$/,"",acc_key) && acc_key in ids { print hmmer_line; print name_line; print_p = 1}
print_p' ../../src/nidovirales-pfam-ids.txt Pfam-A.hmm > Nido-Pfam-A.hmm
Then, we index the HMMs:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/data/pfam$ hmmpress Nido-Pfam-A.hmm
Working... done.
Pressed and indexed 91 HMMs (91 names and 91 accessions).
Models pressed into binary file: Nido-Pfam-A.hmm.h3m
SSI index for binary model file: Nido-Pfam-A.hmm.h3i
Profiles (MSV part) pressed into: Nido-Pfam-A.hmm.h3f
Profiles (remainder) pressed into: Nido-Pfam-A.hmm.h3p
Here is the Nidovirales-specific set of HMMs that I created from the latest release of Pfam-A:
https://serratus-public.s3.amazonaws.com/tmp/Nido-Pfam-A.hmm
Results
Looking at a specific example, I worked on SRR5234495
, which is the most-distant assembly that we have recovered so far. Using the stock CoV Pfam DB, and stock hmmsearch
parameters, we were getting just three domains (RdRP_1, CoV_NSP15_C, and Viral_helicase1).
Playing with the hmmsearch parameters, cranking up the sensitivity, now I get the following for SRR5234495
:
AAA_12 PF13087 0.028
Arteri_Gl PF00951 0.023
Arteri_Gl PF00951 0.023
Arteri_nucleo PF01481 0.0006
Corona_5a PF06336 0.017
Corona_NS1 PF06145 0.047
Corona_NS1 PF06145 0.047
Corona_NS12-7 PF04753 0.0044
Corona_NS12-7 PF04753 0.071
Corona_NS12-7 PF04753 0.071
CoV_E PF02723 0.078
CoV_Methyltr_1 PF06471 3.5e-12
CoV_Methyltr_2 PF06460 2.8e-11
CoV_NSP15_C PF19215 3.6e-21
CoV_NSP15_C PF19215 0.017
CoV_NSP4_N PF19217 0.036
CoV_RPol_N PF06478 7.8e-08
CoV_S1 PF01600 0.0099
CoV_S1_C PF19209 0.032
CoV_S2 PF01601 1.8e-13
Peptidase_C30 PF05409 0.08
Peptidase_C31 PF05410 0.059
PRRSV_Env PF02340 0.06
RdRP_1 PF00680 1e-25
SARS_3b PF12383 0.085
Spike_torovirin PF17072 1.8e-09
Viral_helicase1 PF01443 1.4e-21
Specifically, this is how hmmsearch
was called:
time hmmsearch --max \
-A match-alignments-nido-max.sto \
--domtblout hmmsearch-matches-nido-max.txt \
~/repos/darth/data/pfam/nido-pfam/Nido-Pfam-A.hmm \
trans-6-frame.fasta \
> hmmstdout-nido-max.txt
The use of --max
is critical, as this turns off the Blast-like search accelearation heuristics that HMMER3 uses by default. For very divergent sequences, you want to run the HMM over everything, cutting no corners.
Why bother creating a slimmed down database, and not just run all of Pfam-A.hmm? Two reasons:
- Much slower (2 seconds versus 7.5 minutes)
- Running with the full Pfam database leads to a huge increase in false-positive alignments, especially when operating at max sensitivity.