Rfam based structural annotation of a viral genome sequence - ncbi/vadr GitHub Wiki

When building a VADR model of a virus sequence, secondary structural elements can be included in the covariance model used for annotating new sequences. Including these structural elements can make alignment of new sequences more accurate, as well as allow annotation of those structural elements in the VADR output.

The secondary structural elements must be added to a single sequence 'alignment' in Stockholm format provided as input to the v-build.pl program using the --stk option. This page will walk you through how to use the Rfam database of RNA families and Infernal to create a structure annotated Stockholm file to use with v-build.pl.

Required files/programs:

(The 'Preliminary steps' section below includes instructions for downloading and installing these required files/programs)

  • infernal: software for RNA sequence and secondary structure analysis using covariance models (CMs)

  • Rfam: database of RNA families with a library of CMs

  • jiffy-infernal-hmmer-scripts: collection of perl scripts for handling Infernal/HMMER/Easel output


Preliminary steps:

install infernal

$ wget http://eddylab.org/software/infernal/infernal.tar.gz
$ tar zxf infernal.tar.gz
$ cd infernal-1.1.5
$ ./configure --prefix /your/install/path
$  make
$ make check                 # optional: run automated tests
$ make install               # optional: install Infernal programs, man pages
$ (cd hmmer; make install)   # optional: install HMMER programs
$ (cd easel; make install)   # optional: install Easel tools

download, gunzip and index the Rfam CM library:

$ wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
$ gunzip Rfam.cm.gz
$ cmpress Rfam.cm

download the set of required perl scripts

$ git clone https://github.com/nawrockie/jiffy-infernal-hmmer-scripts.git

Step 1: Download a viral genome sequence from GenBank

For this example we'll use the Dengue serotype 2 RefSeq sequence NC_001474. You can download this file from GenBank (Follow links 'Send to' -> 'Complete Record' + 'File' + 'Format:Fasta' -> 'Create File' ).

It is important that the sequence name be specifically in the format 'Accession.version' (e.g. NC_001474.2).

Note, this procedure is not exactly how the secondary structure for the NC_001474 model included in the VADR package was determined. That structure was determined in collaboration with Dengue experts and only includes trusted structural elements that were decided should be annotated on all incoming Dengue type 2 GenBank sequences. Following this procedure will result in a different secondary structure than the VADR model.


Step 2: Scan the sequence for all Rfam hits using Infernal's `cmscan' program

The cmscan program will scan all Rfam models against the NC_001474 genome sequence and report any high-scoring hits.

$ cmscan Rfam.cm NC_001474.fa > NC_001474.cmscan

Step 3. Determine which set of non-overlapping hits you want to include in your global secondary structure prediction.

Look at the cmscan output in NC_001474.cmscan.

Query:       gi|158976983|ref|NC_001474.2|  [L=10723]
Description: Dengue virus 2, complete genome
Hit scores:
 rank     E-value  score  bias  modelname        start    end   mdl trunc   gc  description
 ----   --------- ------ -----  --------------- ------ ------   --- ----- ----  -----------
  (1) !   1.9e-22  109.9   0.0  Flavivirus-5UTR      1    148 +  cm    no 0.40  Flavivirus 5' UTR
  (2) !   2.9e-19   90.0   0.0  Flavivirus_DB    10541  10612 +  cm    no 0.60  Flavivirus DB element
  (3) !   4.4e-17   91.5   0.0  Flavi_CRE        10631  10723 +  cm    no 0.49  Flavivirus 3' UTR cis-acting replication element (CRE)
  (4) !   4.5e-15   74.4   0.0  Flavivirus_DB    10454  10525 +  cm    no 0.57  Flavivirus DB element
  (5) !   9.9e-15   80.2   0.0  DENV_SLA             1     70 +  cm    no 0.44  Dengue virus SLA
  (6) !   3.3e-11   62.6   0.0  cHP                 97    141 +  cm    no 0.38  Flavivirus capsid hairpin cHP
 ------ inclusion threshold ------
  (7) ?     0.087   30.3   0.0  Flavi_ISFV_CRE   10629  10721 +  cm    no 0.51  Insect-specific Flavivirus 3' UTR cis-acting replication eleme
  (8) ?       2.8   13.3   0.1  TTC28-AS1_2       7835   7919 + hmm     - 0.46  TTC28 antisense RNA 1 conserved region 2
  (9) ?       4.5   17.5   0.0  EAV_LTH           1041    999 -  cm    no 0.47  Equine arteritis virus leader TRS hairpin (LTH)
 (10) ?       7.6   19.9   0.0  mir-1305         10723  10645 -  cm    5' 0.48  mir-1305 microRNA precursor family

You'll want to strongly consider all the hits above the inclusion threshold, and in this case all of the E-values are less than 1e-10 so these are all compelling hits that should be included. However, the set of hits you choose to use for structure annotation must be non-overlapping.

Note that hits (1) overlaps with hits (5) and (6). You should choose to include the structure from either hit (1) or hits (5) and (6). In this case, I would lean towards using (5) and (6) for two reasons:

A. Both hit (1) and the pair of hits (5) and (6) span about the first 140 or 150 nts, but the summed score of hits (5) and (6) is higher than the score of hit (1) by about 30 bits.

B. The model for hit (5) is specific to Dengue virus (description field), while the model for hit (1) seems to be more general to Flavivirus.

You can also look at the per-hit alignments in the NC_001474.cmscan output file to see which hit is better. For help on interpreting these alignments, see the Infernal user's guide.

You may also want to consider some hits below the inclusion threshold especially those with E-values below 1 and/or with model names and descriptions that suggest they are expected in viruses like the one you've just scanned.

In this case hit (7) is a Flavivirus model, but it overlaps nearly completely with hit (3) which has a much more significant E-value, so we should keep hit (3) instead.

So the set of hits we'll use to structurally annotate our viral genome is:

Query:       NC_001474.2  [L=10723]
Description: Dengue virus 2, complete genome
Hit scores:
 rank     E-value  score  bias  modelname        start    end   mdl trunc   gc  description
 ----   --------- ------ -----  --------------- ------ ------   --- ----- ----  -----------
  (2) !   2.9e-19   90.0   0.0  Flavivirus_DB    10541  10612 +  cm    no 0.60  Flavivirus DB element
  (3) !   4.4e-17   91.5   0.0  Flavi_CRE        10631  10723 +  cm    no 0.49  Flavivirus 3' UTR cis-acting replication element (CRE)
  (4) !   4.5e-15   74.4   0.0  Flavivirus_DB    10454  10525 +  cm    no 0.57  Flavivirus DB element
  (5) !   9.9e-15   80.2   0.0  DENV_SLA             1     70 +  cm    no 0.44  Dengue virus SLA
  (6) !   3.3e-11   62.6   0.0  cHP                 97    141 +  cm    no 0.38  Flavivirus capsid hairpin cHP

Step 4: Run cmsearch to create alignments of each structural element

For each of the hits above we'll use Infernal's cmsearch program to create a structure annotated 'alignment' of the subsequence of our viral genome that matches the Rfam model. In this case we have 4 models so we do 4 cmsearch calls, saving the output alignments in Stockholm format to a file using the -A option:

$ cmfetch Rfam.cm Flavivirus_DB  | cmsearch -A db.stk  - NC_001474.fa
$ cmfetch Rfam.cm Flavi_CRE      | cmsearch -A cre.stk - NC_001474.fa
$ cmfetch Rfam.cm DENV_SLA       | cmsearch -A sla.stk - NC_001474.fa
$ cmfetch Rfam.cm cHP            | cmsearch -A chp.stk - NC_001474.fa

After these 4 commands, we will have 4 stockholm formatted alignments. Each of these stockholm alignments includes lines that begin with #=GC SS_cons which indicate the predicted secondary structure of the viral genome subsequence that matched well to the corresponding Rfam model. For example take a look at the chp.stk file:

# STOCKHOLM 1.0
#=GF AU Infernal 1.1.4

#=GS NC_001474.2/97-141 DE Dengue virus 2, complete genome

NC_001474.2/97-141         AUGAAUAACCAACGGAAAAAGGCGAAAAACACGCCUUUCAAUAUG
#=GR NC_001474.2/97-141 PP *********************************************
#=GC SS_cons               ::::::::::::::::<<-<<<<<<_____>>>>>>->>::::::
#=GC RF                    AUGAaUAAcCAACGAAaAAaGgCgGGAAAACcGcCuuUcAAUAUG
//

Step 5: Split any multiple sequence alignments into single sequence 'alignments'

For the subsequent steps, we want each of our alignment files to only have a single sequence in it. Three of these alignments will have a single sequence in them, but one of them has 2 sequences in it: db.stk because two of the five hits that we saved (2) and (4) were to this model, while the other 3 models only had 1 hit.

You may not have any models with more than 1 sequence in the output cmsearch alignment, but if you do, use the ali-multiseq-to-singleseq.pl perl script to split those alignments into single sequence 'alignment' files, like this:

$ perl jiffy-infernal-hmmer-scripts/ali-multiseq-to-singleseq.pl db.stk
Saved single-sequence alignment with sequence NC_001474.2/10541-10612 to db.stk.1
Saved single-sequence alignment with sequence NC_001474.2/10454-10525 to db.stk.2

Step 6: Convert the stockholm alignments to one-line-per-sequence 'Pfam' stockholm format and remove any gap columns

The esl-alimask program is installed with infernal.

$ esl-alimask --outformat pfam --keepins -g db.stk.1 > db1.pfam
$ esl-alimask --outformat pfam --keepins -g db.stk.2 > db2.pfam
$ esl-alimask --outformat pfam --keepins -g cre.stk > cre.pfam
$ esl-alimask --outformat pfam --keepins -g sla.stk > sla.pfam
$ esl-alimask --outformat pfam --keepins -g chp.stk > chp.pfam

Step 7: Merge the secondary structure annotations into a single structure annotation string the length of the full genome

The next step is to merge the secondary structure annotation from each of the alignment files into a single secondary structure annotation string. It is crucial that none of the Rfam hits overlap in sequence coordinates in the viral genome, or else this step will not work.

You'll use the ali-sscons-merge.pl script for this and you'll have to figure out how many positions are missing from the 5' and 3' ends of each of the hits. For example, The first Flavivirus_DB hit in db1.pfam spans positions 10541 to 10612 (see cmscan output above). The NC_001474.2 sequence is 10723 nucleotides long (you can determine this with the command esl-seqstat NC_001474.fa), so it is missing 10540 positions on the 5' end and 10723-10612=111 positions on the 3' end. The second Flavivirus_DB hit in db2.pfam spans positions 10631 to 10723. You then use the --one_m5, --one_m3, --two_m5, and --two_m3 options to ali-sscons-merge.pl to merge them:

perl jiffy-infernal-hmmer-scripts/ali-sscons-merge.pl --one_m5 10540 --one_m3 111 --two_m5 10453 --two_m3 198 db1.pfam db2.pfam > merge1.pfam

After this the merge1.pfam 'alignment' will include the sequence from the first alignment file used in the previous ali-sscons-merge.pl call but will have gaps added to make it the full length (10723) of the viral genome, and it will also include a SS_cons annotation line that includes the predicted secondary structure from both db1.pfam and db2.pfam. You then use this as the first alignment in subsequent ali-sscons-merge.pl calls to add in the additional secondary structure from the additional hits with the below three commands. Because the first alignment in these cases will be the full length, the --one_m5 and --one_m3 options should not be used.

$ perl jiffy-infernal-hmmer-scripts/ali-sscons-merge.pl --two_m5 10630 --two_m3 0 merge1.pfam cre.pfam > merge2.pfam
$ perl jiffy-infernal-hmmer-scripts/ali-sscons-merge.pl --two_m5 0     --two_m3 10653 merge2.pfam sla.pfam > merge3.pfam
$ perl jiffy-infernal-hmmer-scripts/ali-sscons-merge.pl --two_m5 96    --two_m3 10582 merge3.pfam chp.pfam > merge4.pfam

Step 8: Combine the merged secondary structure with the full length genome sequence.

There are several commands required for this step.

First, reformat the fasta file to a single sequence pfam format alignment:

$ esl-reformat pfam NC_001474.fa > NC_001474.pfam

And add RF annotation to it:

$ perl jiffy-infernal-hmmer-scripts/ali-pfam-add-x-rf.pl NC_001474.pfam > NC_001474.rf.pfam

Next, extract the full length SS_cons annotation as a string and output it to a new text file:

$ cat merge4.rf.pfam | grep SS_cons | awk '{ print $3 }' > merge.sscons

And finally, add that SS_cons annotation to the NC_001474 alignment and convert it to interleaved Stockholm format so it is easier to look at:

$ perl jiffy-infernal-hmmer-scripts/ali-pfam-add-rflen-sscons.pl NC_001474.rf.pfam merge.sscons > NC_001474.sscons.pfam
$ esl-reformat stockholm NC_001474.sscons.pfam > NC_001474.sscons.stk

Step 9: Sanity check that basepairs are mostly Watson-Crick

There's many ways we could have screwed something up. One of the commands may not have not executed properly, or there could be a bug in one or more of the scripts, or alternatively these steps may just not work for every virus and set of Rfam hits, so we want to check that the structure is valid before continuing. One way to do this is to count how many of the basepairs are Watson-Crick or GU/UG versus non-canonical. If most of the basepairs are non-canonical then we know that something went wrong.

To check this:

$ esl-alistat --bpinfo NC_001474.bpinfo NC_001474.sscons.stk
$ perl jiffy-infernal-hmmer-scripts/esl-alistat-bpinfo-count-compensatory-and-consistent.pl NC_001474.bpinfo > NC_001474.bpinfo.ct

Take a look at the NC_001474.bpinfo.ct file, extra column #8 (mcwcgu) will be 'yes' if most common basepair is WC or GU/UG, else 'no'.

$ head -n 24 NC_001474.bpinfo.ct 
# Per-column basepair counts:
# Alignment file: NC_001474.sscons.stk
# Alignment idx:  1
# Number of sequences: 1
# Only basepairs involving two canonical (non-degenerate) residues were counted.
# Sequence weights from alignment were ignored (if they existed).
#
# extra column  1: '#wc|gu': number of Watson-Crick+GU+UG basepairs
# extra column  2: '#wc':    number of Watson-Crick basepairs
# extra column  3: '#gu':    number of GU+UG basepairs
# extra column  4: 'mc':     identity of most common basepair
# extra column  5: '#mc':    number of most common basepair
# extra column  6: '#bp!mc': number of basepairs != most common basepair
# extra column  6: '#nt!mc': number of nucleotide changes in basepairs != most common basepair
# extra column  8: 'mcwcgu': 'yes' if most common basepair is WC or GU/UG, else 'no'
# extra column  9: '#comp':  if 'mcwcgu' is 'yes', number of compensatory nucleotide changes to WC or GU/UG (2 per each mutated basepair), else '-'
# extra column 10: '#cons':  if 'mcwcgu' is 'yes', number of consistent nucleotide changes to WC or GU/UG (1 per each mutated basepair), else '-'
# extra column 11: '#incon': if 'mcwcgu' is 'yes', number of inconsistent changes to non-WC/GU/UG (1 or 2 per each mutated basepair), else '-'
#
#    lpos     rpos    AA      AC      AG      AT      CA      CC      CG      CT      GA      GC      GG      GT      TA      TC      TG      TT    #wc|gu     #wc     #gu      mc     #mc  #bp!mc  #nt!mc  mcwcgu   #comp   #cons  #incon
# -------  -------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------
        4       69       0       0       0       0       0       0       0       0       0       0       0       0       1       0       0       0       1       1       0      UA       1       0       0     yes       0       0       0
        5       68       0       0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       1       1       0      GC       1       0       0     yes       0       0       0
        6       67       0       0       0       0       0       0       0       0       0       0       0       0       1       0       0       0       1       1       0      UA       1       0       0     yes       0       0       0

The mcwcgu field will be the 26th space-delimited field for every non-blank line that does not begin with #, so to count them:

$ grep -v ^\# NC_001474.bpinfo.ct | awk '{ print $26 }' | grep . | sort | uniq -c
      5 no
     84 yes

If there were more no values than yes values, it would be an indication that our secondary structure annotation was probably messed up somehow.

Another way we can catch problems isf if any of the steps above fail, hopefully with an informative error message.


Step 10: Use the structurally annotated stockholm alignment in VADR

Option 1: Use NC_001474.sscons.stk as input to v-build.pl script using the --stk option:

If you haven't yet built your VADR model, you can use v-build.pl to build it and specify that the cmbuild command used to build the CM use your new stockholm alignment NC_001474.sscons.stk with the command:

$ v-build.pl -f --stk NC_001474.sscons.stk NC_001474 test

Option 2: Rebuild the CM using cmbuild

If you've already put some work into manually refining your model and do not want to start from scratch again using v-build.pl you can rebuild the model using cmbuild. You'll want to use the same command that v-build.pl uses. You can find this in the vadr.cmd output file in the directory that was created when you originally built the model with v-build.pl. That command should look something like:

$ cmbuild -n NC_001474 --verbose --occfile NC_001474/NC_001474.vadr.cmbuild.occ --cp9occfile NC_001474/NC_001474.vadr.cmbuild.cp9occ --fp7occfile NC_001474/NC_001474.vadr.cmbuild.fp7occ NC_001474/NC_001474.vadr.cm NC_001474/NC_001474.vadr.stk > NC_001474/NC_001474.vadr.cmbuild

The *--occfile options are not essential, so to rerun this, you'd do:

$ cmbuild -n NC_001474 --verbose NC_001474.vadr.cm NC_001474.sscons.stk

And you'll want to copy that CM over the original one, probably after saving a copy of the original, if it is in the directory NC_001474 those commands would be:

$ cp NC_001474/NC_001474.vadr.cm NC_001474/orig.NC_001474.vadr.cm 
$ cp NC_001474.vadr.cm NC_001474/NC_001474.vadr.cm 

Finally, you need to run cmpress to index the file, but first delete the old index files for the original model:

$ rm NC_001474/NC_001474.vadr.cm.*
$ cmpress NC_001474/NC_001474.vadr.cm

Step 11 (optional): Add the Rfam hits as features in your VADR model

If you'd like to include the Rfam hits as features in your annotation output, you can manually add them to the .minfo file that v-build.pl created. Take a look at that file:

MODEL NC_001474 blastdb:"test.vadr.protein.fa" length:"10723"
FEATURE NC_001474 type:"gene" coords:"97..10272:+" parent_idx_str:"GBNULL" gene:"POLY"
FEATURE NC_001474 type:"CDS" coords:"97..10272:+" parent_idx_str:"GBNULL" gene:"POLY" product:"polyprotein"
FEATURE NC_001474 type:"mat_peptide" coords:"97..438:+" parent_idx_str:"1" product:"anchored capsid protein ancC"
FEATURE NC_001474 type:"mat_peptide" coords:"97..396:+" parent_idx_str:"1" product:"capsid protein C"
FEATURE NC_001474 type:"mat_peptide" coords:"439..936:+" parent_idx_str:"1" product:"membrane glycoprotein precursor prM"
FEATURE NC_001474 type:"mat_peptide" coords:"439..711:+" parent_idx_str:"1" product:"protein pr"
FEATURE NC_001474 type:"mat_peptide" coords:"712..936:+" parent_idx_str:"1" product:"membrane glycoprotein M"
FEATURE NC_001474 type:"mat_peptide" coords:"937..2421:+" parent_idx_str:"1" product:"envelope protein E"
FEATURE NC_001474 type:"mat_peptide" coords:"2422..3477:+" parent_idx_str:"1" product:"nonstructural protein NS1"
FEATURE NC_001474 type:"mat_peptide" coords:"3478..4131:+" parent_idx_str:"1" product:"nonstructural protein NS2A"
FEATURE NC_001474 type:"mat_peptide" coords:"4132..4521:+" parent_idx_str:"1" product:"nonstructural protein NS2B"
FEATURE NC_001474 type:"mat_peptide" coords:"4522..6375:+" parent_idx_str:"1" product:"nonstructural protein NS3"
FEATURE NC_001474 type:"mat_peptide" coords:"6376..6756:+" parent_idx_str:"1" product:"nonstructural protein NS4A"
FEATURE NC_001474 type:"mat_peptide" coords:"6757..6825:+" parent_idx_str:"1" product:"protein 2K"
FEATURE NC_001474 type:"mat_peptide" coords:"6826..7569:+" parent_idx_str:"1" product:"nonstructural protein NS4B"
FEATURE NC_001474 type:"mat_peptide" coords:"7570..10269:+" parent_idx_str:"1" product:"RNA-dependent RNA polymerase NS5"

You'll want to add new FEATURE lines for each of the 5 Rfam hits that look something like:

FEATURE NC_001474 type:"ncRNA" coords:"1..70:+" parent_idx_str:"GBNULL" product:"Dengue virus SLA (Rfam 14.9)"
FEATURE NC_001474 type:"ncRNA" coords:"97..141:+" parent_idx_str:"GBNULL" product:"Flavivirus capsid hairpin cHP (Rfam 14.9)"
FEATURE NC_001474 type:"ncRNA" coords:"10454..10525:+" parent_idx_str:"GBNULL" product:"Flavivirus DB element (Rfam 14.9)"
FEATURE NC_001474 type:"ncRNA" coords:"10541..10612:+" parent_idx_str:"GBNULL" product:"Flavivirus DB element 1 (Rfam 14.9)"
FEATURE NC_001474 type:"ncRNA" coords:"10631..10723:+" parent_idx_str:"GBNULL" product:"Flavivirus DB element 2 (Rfam 14.9)"

After doing this, when you run v-annotate.pl the ncRNA features will be annotated in the .tbl and .ftr output files.


Questions, comments, feature requests or problems? Open a github issue or send a mail to [email protected].

⚠️ **GitHub.com Fallback** ⚠️