RSV annotation - ncbi/vadr GitHub Wiki


How to annotate RSV sequences with VADR:

The steps below explain how to use VADR for RSV sequence validation and annotation. Testing of VADR for RSV is still ongoing and these recommendations are subject to change.

GenBank is not currently using VADR to automatically process RSV sequence submissions like it does for SARS-CoV-2. GenBank RSV sequence submissions are still all manually reviewed, but indexers typically use VADR as part of their analysis, using the command-line options listed in step 4 below.

Steps for using VADR for RSV annotation:

  1. Download and install the latest version of VADR (v1.5.1 or later), following the instructions on this page. Alternatively, you can use the StaPH-B VADR 1.5.1 (or later) docker image created by Curtis Kapsak (docker image names: staphb/vadr:1.5.1 and staphb/vadr:latest), available on dockerhub and quay. A brief README for the docker image is here.

  2. Download the latest RSV VADR models (version 1.5-2, gzipped tarball) from here, unpack them (e.g. tar xfz <tarball.gz>). Note the path to the directory name created (<rsv-models-dir-path>) for step 3. (If you are using the docker image you can skip this step as the RSV VADR models are included.)

  3. Remove terminal ambiguous nucleotides from your input fasta sequence file using the fasta-trim-terminal-ambigs.pl script in $VADRSCRIPTSDIR/miniscripts/. GenBank processing of viral sequences typically includes removing ambiguous nucleotides from the beginning and ends of sequences, and also removing sequences that are less than 150nt or more than 15,500nt (after trimming). RSV sequences that are more than 15,500nt are longer than expected.

    WARNING: the fasta-trim-terminal-ambigs.pl script will not exactly reproduce the trimming that the GenBank pipeline does in some rare cases, but should fix the large majority of the discrepancies you might see between local VADR results and GenBank results.

    To remove terminal ambiguous nucleotides from your sequence file <input-fasta-file> and to remove short and long sequences to create a new trimmed file <trimmed-fasta-file>, execute:

$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 150 --maxlen 15500 <input-fasta-file> > <trimmed-fasta-file>
  1. Run the v-annotate.pl program on an input trimmed fasta file with RSV sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command).

    *NOTE: The following command runs multithreaded on up to 4 CPUs, and so is only recommended if you have at least 4 CPUs and 128Gb RAM available. To run on <n> CPUs instead, replace --cpu 4 with --cpu <n> (which will require <n>32Gb of RAM). To run single threaded on a single CPU remove the --cpu 4 option. The --split and --cpu options are incompatible with -p.

v-annotate.pl --split --cpu 4 -r --mkey rsv --xnocomp --mdir <rsv-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>

This section shows output from an example v-annotate.pl annotation of three RSV sequences from GenBank using the above command and options.

The fasta file of those three sequences can be downloaded here.

(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)

For this example, the RSV model directory is in /usr/local/vadr-models-rsv-1.5-2 and the pretrim.rsv.3.fa sequence file is in the current directory. We will create a new file rsv.3.fa with trimmed sequences in the next step.

As explained above, remove terminal ambiguous nucleotides and sequences that are shorter than 150nt or longer than 15,500nt with the command:

$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 150 --maxlen 15500 pretrim.rsv.3.fa > rsv.3.fa

Next, to annotate the trimmed sequences using the above v-annotate.pl options for RSV, run the following command (scroll to the right to see full command), which will create a new directory called my3 into which VADR output files will be written.

v-annotate.pl --split --cpu 4 -r --mkey rsv --xnocomp --mdir /usr/local/vadr-models-rsv-1.5-2 rsv.3.fa my3

The options used are explained further below.

When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:

# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.5.1 (Feb 2023)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Fri Feb 10 12:25:31 2023
# $VADRBIOEASELDIR:  /home/nawrocki/vadr-install-dir/Bio-Easel
# $VADRBLASTDIR:     /home/nawrocki/vadr-install-dir/ncbi-blast/bin
# $VADREASELDIR:     /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRINFERNALDIR:  /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRMODELDIR:     /home/nawrocki/vadr-install-dir/vadr-models-calici
# $VADRSCRIPTSDIR:   /home/nawrocki/vadr-install-dir/vadr
#
# sequence file:                                                                  rsv.3.fa
# output directory:                                                               my3
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  rsv [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         /usr/local/vadr-models-rsv-1.5-2 [--mdir]
# turn off composition-based for blastx statistics with -comp_based_stats 0:      yes [--xnocomp]
# replace stretches of Ns with expected nts, where possible:                      yes [-r]
# split input file into chunks, run each chunk separately:                        yes [--split]
# parallelize across <n> CPU workers (requires --split or --glsearch):            4 [--cpu]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Next, v-annotate.pl will output information as it proceeds through different steps of the analysis:

# Validating input                                                                                          ... done. [    0.1 seconds]
# Splitting sequence file into chunks to run independently in parallel on 4 processors                      ... done. [    0.4 seconds]
# Executing 3 scripts in parallel on 4 processors to process 3 partition(s) of all 3 sequence(s)            ... 
#	   3 of    3 jobs finished (0.1 minutes spent waiting)
# done. [   64.9 seconds]
# Merging and finalizing output                                                                             ... done. [    0.8 seconds]

With the --split --cpu 4 options, the input fasta script is split up into chunks and runs v-annotate.pl separately on those chunks on 4 different CPUs in parallel. When all sequences are finished processing the main script merges the output together.

The v-annotate.pl output then includes a summary of the classification of sequences, and the alerts reported:

# Summary of classified sequences:
#
#                                 num   num   num
#idx  model     group  subgroup  seqs  pass  fail
#---  --------  -----  --------  ----  ----  ----
1     KY654518  RSV    A            2     1     1
2     MZ516105  RSV    B            1     1     0
#---  --------  -----  --------  ----  ----  ----
-     *all*     -      -            3     2     1
-     *none*    -      -            0     0     0
#---  --------  -----  --------  ----  ----  ----
#
# Summary of reported alerts:
#
#     alert     causes   short                              per    num   num  long
#idx  code      failure  description                       type  cases  seqs  description
#---  --------  -------  -----------------------------  -------  -----  ----  -----------
1     ambgnt3f  no       AMBIGUITY_AT_FEATURE_END       feature      1     1  final nucleotide of non-CDS feature is an ambiguous nucleotide
2     ambgnt3c  no       AMBIGUITY_AT_CDS_END           feature      1     1  final nucleotide of CDS is an ambiguous nucleotide
#---  --------  -------  -----------------------------  -------  -----  ----  -----------
3     unexleng  yes      UNEXPECTED_LENGTH              feature      1     1  length of complete coding (CDS or mat_peptide) feature is not a multiple of 3
4     cdsstopn  yes      CDS_HAS_STOP_CODON             feature      1     1  in-frame stop codon exists 5' of stop position predicted by homology to reference
5     fsthicft  yes      POSSIBLE_FRAMESHIFT_HIGH_CONF  feature      1     1  high confidence possible frameshift in CDS (frame not restored before end)
6     indf5pst  yes      INDEFINITE_ANNOTATION_START    feature      1     1  protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint
#---  --------  -------  -----------------------------  -------  -----  ----  -----------

And finally a list of the output files created:

# Output printed to screen saved in:                              my3.vadr.log
# List of executed commands saved in:                             my3.vadr.cmd
# List and description of all output files saved in:              my3.vadr.filelist
# esl-seqstat -a output for input fasta file saved in:            my3.vadr.seqstat
# 5 column feature table output for passing sequences saved in:   my3.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in:   my3.vadr.fail.tbl
# list of passing sequences saved in:                             my3.vadr.pass.list
# list of failing sequences saved in:                             my3.vadr.fail.list
# list of alerts in the feature tables saved in:                  my3.vadr.alt.list
# fasta file with passing sequences saved in:                     my3.vadr.pass.fa
# fasta file with failing sequences saved in:                     my3.vadr.fail.fa
# per-sequence tabular annotation summary file saved in:          my3.vadr.sqa
# per-sequence tabular classification summary file saved in:      my3.vadr.sqc
# per-feature tabular summary file saved in:                      my3.vadr.ftr
# per-model-segment tabular summary file saved in:                my3.vadr.sgm
# per-alert tabular summary file saved in:                        my3.vadr.alt
# alert count tabular summary file saved in:                      my3.vadr.alc
# per-model tabular summary file saved in:                        my3.vadr.mdl
# alignment doctoring tabular summary file saved in:              my3.vadr.dcr
# replaced stretches of Ns summary file (-r) saved in:            my3.vadr.rpn
#
# All output files created in directory ./my3/
#
# Elapsed time:  00:01:06.44
#                hh:mm:ss
# 
[ok]

Note that all the output files will be in the newly created my3 directory. The Summary of classified sequences listed that two sequences passed and one failed. The file my3.vadr.pass.list, lists the two sequences that passed:

KU839633.1
OM062710.1

and my3.vadr.fail.list lists the sequence that failed:

MZ516127.1

Also, FASTA-formatted sequence files for each the passing and failing sequences are my3.vadr.pass.fa and my3.vadr.fail.fa.

For the two sequences that passed, the annotation is available in the output my3.vadr.pass.tbl file and for the two sequences that failed the annotation is in the my3.vadr.fail.tbl file.

Here is the output for the first sequence in the my3.vadr.pass.tbl file: (with the middle of the table for each sequence removed for brevity)

>Feature KU839633.1
29	448	gene
			gene	NS1
29	448	CDS
			product	nonstructural protein 1
			protein_id	KU839633.1_1
556	930	gene
			gene	NS2
556	930	CDS
			product	nonstructural protein 2
			protein_id	KU839633.1_2
1069	2244	gene
			gene	N
1069	2244	CDS
			product	nucleoprotein
			protein_id	KU839633.1_3

...snip...

7600	8187	gene
			gene	M2
7600	8187	CDS
			product	M2-1 protein
			protein_id	KU839633.1_9
8153	8425	gene
			gene	M2
8153	8425	CDS
			product	M2-2 protein
			protein_id	KU839633.1_10
8491	14991	gene
			gene	L
8491	14991	CDS
			product	RNA-dependent RNA polymerase
			protein_id	KU839633.1_11

And the sequence in the my3.vadr.fail.tbl (with the middle removed for brevity):

>Feature MZ516127.1
96	515	gene
			gene	NS1
96	515	CDS
			product	nonstructural protein 1
			protein_id	MZ516127.1_1
625	999	gene
			gene	NS2
625	999	CDS
			product	nonstructural protein 2
			protein_id	MZ516127.1_2
1137	2312	gene
			gene	N
1137	2312	CDS
			product	nucleoprotein
			protein_id	MZ516127.1_3
...snip...
7666	8250	gene
			gene	M2-1
7666	8250	CDS
			product	M2-1 protein
			protein_id	MZ516127.1_9
8225	8491	gene
			gene	M2-2
8225	8491	CDS
			product	M2-2 protein
			protein_id	MZ516127.1_10
8558	>15029	gene
			gene	L
8558	>15029	misc_feature
			note	similar to RNA-dependent RNA polymerase

Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (CDS:RNA-dependent RNA polymerase) in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:4649,M:4649]; seq-coords:10403..10405:+; mdl-coords:10407..10409:+; mdl:KY654518;
ERROR: POSSIBLE_FRAMESHIFT_HIGH_CONF: (CDS:RNA-dependent RNA polymerase) high confidence possible frameshift in CDS (frame not restored before end) [cause:delete,S:10384,M:10388(1); frame:1(3); length:1827:(4670); shifted_avgpp:0.974; exp_avgpp:0.968;]; seq-coords:10385..15054:+; mdl-coords:10388..15058:+; mdl:KY654518;
ERROR: INDEFINITE_ANNOTATION_START: (CDS:RNA-dependent RNA polymerase) protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [1826>5]; seq-coords:8558..10383:+; mdl-coords:8561..8561:+; mdl:KY654518;
ERROR: UNEXPECTED_LENGTH: (CDS:RNA-dependent RNA polymerase) length of complete coding (CDS or mat_peptide) feature is not a multiple of 3 [6497]; seq-coords:8558..15054:+; mdl-coords:8561..15058:+; mdl:KY654518;

Feature table format is described at https://www.ncbi.nlm.nih.gov/WebSub/html/help/feature-table.html.

Note that the end of the fail.tbl file lists ERRORs for MZ516127.1. In this case, the RNA-dependent RNA polymerase coding region includes many ambiguous nucleotides (Ns) and one of those N-regions seems to be of unexpected length, causing a frameshift. To investigate issues like these further, it can be helpful to add the --keep option to v-annotate.pl which results in additional output files being created, including alignment files.

ERROR lines such as this are meant to highlight potential problems for manual review or regions of interest to the user, but they do not necessarily mean that there is a problem with the sequence.

The annotation information is also available in other files with different formats, such as the my3/my3.vadr.ftr file, which may be easier to parse for some applications.

How to interpret VADR alert/error messages and other output

For examples of alerts/errors and more information on how to interpret the VADR output related to those alerts in the output feature tables, .alt.list file and the GenBank submission portal detailed error report .tsv files, see this vadr documentation page.

See the vadr README documentation section for more information on how to interpret VADR results and output, including information on file formats.

You can find information on two papers on VADR (below).

Explanation of options used in example command

The options used in the above command are the recommended set of options for RSV annotation. These options are currently used by GenBank indexers when they run VADR for validating and annotating incoming RSV sequence submissions. (All submissions are currently manually reviewed, no automated processing/approval using VADR is in place for RSV sequences at this time.) The options are each briefly explained in the table below. More information can be found here,

option explanation
--split split input file into chunks of about 300Kb and run each chunk separately (300Kb can be changed to <n> by adding option --nkb <n>
--cpu 4 for input sequence files > 300Kb, run multi-threaded by parallelizing across up to 4 CPU workers (4 can be changed to <n1> with --cpu <n1>), requires --split
-r turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible
--mkey rsv use the model files with prefix rsv in the directory from --mdir
--xnocomp turn off composition-based stats for blastx, which seems to increase the length of RSV blastx alignments in some cases
--mdir /usr/local/vadr-models-rsv-1.5-2 specify that the models to use are in directory /usr/local/vadr-models-rsv-1.5-2

The -r option is also used for SARS-CoV-2 annotation with VADR, for more information on the option in that context, see this page


RSV VADR model library

As of January 31, 2023, the VADR model library for RSV annotation (vadr-models-rsv-1.5-2 model file) includes two RSV models. One model is based on KY654518.1 a human RSV A GenBank sequence of length 15,277 nt. The other model is based on MZ516105.1 a human RSV B GenBank sequence of length 15,276 nt.

The model files were developed during testing on existing INSDC RSV A and B sequences. The initial model files were created with the following command with VADR 1.5 in December 2022, and then later modified during testing as explained more below.

v-build.pl KY654518 KY654518
v-build.pl MZ516105 MZ516105

The files in the two output directories KY654518 and MZ516105 were then combined to make a model 'library' for RSV (consisting of two models) as explained here with the exception that rsv was used as the root for naming the library files instead of my.vadr.

After that, the changes discussed below were made to the model library files:

Changes made to rsv.minfo file

In the rsv.minfo file the following changes were made manuually: The two lines that begin with MODEL were changed to:

MODEL KY654518 blastdb:"KY654518.vadr.protein.fa" cmfile:"KY654518.vadr.cm" length:"15277" group:"RSV" subgroup:"A"
MODEL MZ516105 blastdb:"MZ516105.vadr.protein.fa" cmfile:"MZ516105.vadr.cm" length:"15276" group:"RSV" subgroup:"B"

Additionally in rsv.minfo, the lines that begin with FEATURE KY654518 and include gene:"G" were removed and replaced with:

FEATURE KY654518 type:"gene" coords:"4681..5643:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"15"
FEATURE KY654518 type:"gene" coords:"4681..5646:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"16"
FEATURE KY654518 type:"gene" coords:"4681..5649:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"17"
FEATURE KY654518 type:"CDS" coords:"4681..5643:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5493:72;5494:72;5495:72;5496:72;5497:72;5498:72;5499:72;5500:72;5531:72;" xmaxdel_exc:"259:72;260:72;261:72;262:72;263:72;264:72;265:72;266:72;267:72;268:72;269:72;270:72;271:72;272:72;273:72;274:72;275:72;276:72;" alternative_ftr_set:"attachment(cds)"
FEATURE KY654518 type:"CDS" coords:"4681..5646:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5493:72;5494:72;5495:72;5496:72;5497:72;5498:72;5499:72;5500:72;5531:72;" xmaxdel_exc:"259:72;260:72;261:72;262:72;263:72;264:72;265:72;266:72;267:72;268:72;269:72;270:72;271:72;272:72;273:72;274:72;275:72;276:72;" alternative_ftr_set:"attachment(cds)"
FEATURE KY654518 type:"CDS" coords:"4681..5649:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5493:72;5494:72;5495:72;5496:72;5497:72;5498:72;5499:72;5500:72;5531:72;" xmaxdel_exc:"259:72;260:72;261:72;262:72;263:72;264:72;265:72;266:72;267:72;268:72;269:72;270:72;271:72;272:72;273:72;274:72;275:72;276:72;" alternative_ftr_set:"attachment(cds)"

And the lines that begin with FEATURE MZ516105 and include gene:"G" were removed and replaced with:

FEATURE MZ516105 type:"gene" coords:"4688..5620:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"16"
FEATURE MZ516105 type:"gene" coords:"4688..5629:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"17"
FEATURE MZ516105 type:"gene" coords:"4688..5635:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"18"
FEATURE MZ516105 type:"gene" coords:"4688..5641:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"19"
FEATURE MZ516105 type:"CDS" coords:"4688..5620:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5441:60;" xmaxins_exc:"250:60;251:60;252:60;253:60;254:60;255:60;256:60;257:60;258:60;259:60;260:60" alternative_ftr_set:"attachment(cds)"
FEATURE MZ516105 type:"CDS" coords:"4688..5629:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5441:60;" xmaxins_exc:"250:60;251:60;252:60;253:60;254:60;255:60;256:60;257:60;258:60;259:60;260:60" alternative_ftr_set:"attachment(cds)"
FEATURE MZ516105 type:"CDS" coords:"4688..5635:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5441:60;" alternative_ftr_set:"attachment(cds)"
FEATURE MZ516105 type:"CDS" coords:"4688..5641:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5441:60;" xmaxins_exc:"235:60;236:60;237:60;238:60;239:60;240:60;241:60;242:60;243:60;244:60;245:60;246:60;247:60;248:60;249:60;250:60;251:60;252:60;253:60;254:60;255:60;256:60;257:60;258:60;259:60;260:60;" alternative_ftr_set:"attachment(cds)"

These sets of lines allow different RSV sequences with alternative stop codons in the attachment glycoprotein coding sequence to be appropriately handled by v-annotate.pl through the alternative_ftr_set and alternative_ftr_set_subn attributes. Additionally, the nmaxdel_exc and xmaxins_exc attributes enable exceptions for the maximum allowed deletion and insertions, respectively, at given reference positions. These allow additional known diversity in the attachment glycoprotein to be appropriately handled (and not cause valid RSV sequences to fail due to these indels).

Changes made to covariance models (rsv.cm)

In many RSV A genomes, there is a common deletion of length 72 positions relative to the KY654518.1 sequence starting at KY654518 position 5496, and there is a similar deletion in many RSV B genomes relative to MZ516105 of length 60 starting at position 5441. To allow VADR to better handle these deletions, especially at correctly aligning sequences to the model reference sequences, the CMs in the RSV 1.5-2 model file rsv.cm were built from alignments with two sequences each. The KY654518 CM was built from the KY654518.2.stk alignment (included in the model library) which includes 2 sequences: KY654518.1 and a modified version of KY654518.1 with a deletion of length 72 starting at position 5496, which is a commonly deleted region in RSV A. Similarly, the MZ516105 CM was built from the MZ516105.2.stk alignment (included in the model library) which includes 2 sequences: MZ516105.1 and a modified version of MZ516105.1 with a deletion of length 60 starting at position 5441.

Those models were built using infernal v1.1.4 with the commands:

cmbuild -F --noss --hand -n KY654518 --eset 1.0 KY654518.2.cm KY654518.2.stk
cmbuild -F --noss --hand -n MZ516105 --eset 1.0 MZ516105.2.cm MZ516105.2.stk
cat MZ516105.2.cm KY654518.2.cm > rsv.cm

Changes made to blastx protein library (.vadr.protein.fa files)

The v-annotate.pl script validates protein coding sequences using blastx by default. To accomodate natural variation in protein coding sequences in RSV, mainly the attachment glycoprotein coding sequence, additional proteins besides the model proteins from KY654518 and MZ516105 were added to the blastx protein libraries. These were chosen based on empirical performance during testing of the RSV models with v-annotate.pl on all existing INSDC RSV sequences, by identifying sequences that failed for similar reasons, typically indf5pst and indf3pst alerts (with descriptions INDEFINITE_ANNOTATION_START and INDEFINITE_ANNOTATION_END) due to stop codons that occur upstream of the stop codon position in the KY654518 or MZ516105 sequences. Specifically, the .vadr.protein.fa files contain the following sequences: (the sequences from KY654518.1 and MZ516105.1 were added initially by v-build.pl and all other sequences were manually added based on testing results):

KY654518 (RSV A) blastx protein library

sequence name in .vadr.protein.fa file source sequence source sequence positions protein length reference model positions gene CDS product
KY654518.1/99..518:+ KY654518.1 99..518:+ 140 99..518:+ NS1 nonstructural protein 1
KY654518.1/628..1002:+ KY654518.1 628..1002:+ 125 628..1002:+ NS2 nonstructural protein 2
KY654518.1/1140..2315:+ KY654518.1 1140..2315:+ 392 1140..2315:+ N nucleoprotein
KY654518.1/2347..3072:+ KY654518.1 2347..3072:+ 242 2347..3072:+ P phosphoprotein
KY654518.1/3255..4025:+ KY654518.1 3255..4025:+ 257 3255..4025:+ M matrix protein
KY654518.1/4295..4489:+ KY654518.1 4295..4489:+ 65 4295..4489:+ SH small hydrophobic protein
KY654518.1/4681..5646:+ KY654518.1 4681..5646:+ 322 4681..5646:+ G attachment glcyoprotein
KY654518.1/5726..7450:+ KY654518.1 5726..7450:+ 575 5726..7450:+ F fusion glcyoprotein
KY654518.1/7669..8253:+ KY654518.1 7669..8253:+ 195 7669..8253:+ M2-1 M2-1 protein
KY654518.1/8228..8494:+ KY654518.1 8228..8494:+ 89 8228..8494:+ M2-2 M2-2 protein
KY654518.1/8561..15058:+ KY654518.1 8561..15058:+ 2166 8561..15058:+ L RNA-dependent RNA polymerase
proteins below were manually added
OM857255.1:4629..5591:+/4681..5643:+ OM857255.1 4629..5591:+ 321 4681..5643:+ G attachment glcyoprotein
AF065254.1:16..909:+/4681..5646:+ AF065254.1 16..909:+ 298 4681..5646:+ G attachment glcyoprotein
OK649616.1:4670..5563:+/4681..5646:+ OK649616.1 4670..5563:+ 298 4681..5646:+ G attachment glcyoprotein
hybrid:KY654518.1:4681..4695:+:AF065410.1:1..879:+/4681..5646:+ KY654518.1 4681..5646:+ 322 4681..5646:+ G attachment glcyoprotein
KF826850.1:4675..5568:+/4681..5646:+ KF826850.1 4675..5568:+ 298 4681..5646:+ G attachment glcyoprotein
KU316092.1:4620..5516:+/4681..5646:+ KU316092.1 4620..5516:+ 299 4681..5646:+ G attachment glcyoprotein
KU316164.1:4611..5504:+/4681..5646:+ KU316164.1 4611..5504:+ 298 4681..5646:+ G attachment glcyoprotein
NC_038235.1:4688..5584:+/4681..5649:+ NC_038235.1 4688..5584:+ 299 4681..5649:+ G attachment glcyoprotein
MZ515659.1:4681..5649:+/4681..5649:+ MZ515659.1 4681..5649:+ 323 4681..5649:+ G attachment glcyoprotein
HQ699266.1:1..897:+/4681..5649:+ HQ699266.1 1..897:+ 299 4681..5649:+ G attachment glcyoprotein
KJ641590.1:4630..5526:+/4681..5649:+ KJ641590.1 4630..5526:+ 299 4681..5649:+ G attachment glcyoprotein
OK649616.1:4670..5566:+/4681..5649:+ OK649616.1 4670..5566:+ 299 4681..5649:+ G attachment glcyoprotein
M17212.1:16..912:+/4681..5649:+ M17212.1 16..912:+ 299 4681..5649:+ G attachment glcyoprotein
OM857351.1:7614..8180:+/7669..8235:+ OM857351.1 7614..8180:+ 189 7669..8235:+ M2-1 M2-1 protein

MZ516105 (RSV B) blastx protein library

sequence name in .vadr.protein.fa file source sequence source sequence positions protein length reference model positions gene CDS product
MZ516105.1/99..518:+ MZ516105.1 99..518:+ 140 99..518:+ NS1 nonstructural protein 1
MZ516105.1/626..1000:+ MZ516105.1 626..1000:+ 125 626..1000:+ NS2 nonstructural protein 2
MZ516105.1/1139..2314:+ MZ516105.1 1139..2314:+ 392 1139..2314:+ N nucleoprotein
MZ516105.1/2347..3072:+ MZ516105.1 2347..3072:+ 242 2347..3072:+ P phosphoprotein
MZ516105.1/3262..4032:+ MZ516105.1 3262..4032:+ 257 3262..4032:+ M matrix protein
MZ516105.1/4301..4498:+ MZ516105.1 4301..4498:+ 66 4301..4498:+ SH small hydrophobic protein
MZ516105.1/4688..5620:+ MZ516105.1 4688..5620:+ 311 4688..5620:+ G attachment glcyoprotein
MZ516105.1/5718..7442:+ MZ516105.1 5718..7442:+ 575 5718..7442:+ F fusion glcyoprotein
MZ516105.1/7669..8256:+ MZ516105.1 7669..8256:+ 196 7669..8256:+ M2-1 M2-1 protein
MZ516105.1/8222..8494:+ MZ516105.1 8222..8494:+ 91 8222..8494:+ M2-2 M2-2 protein
MZ516105.1/8560..15060:+ MZ516105.1 8560..15060:+ 2167 8560..15060:+ L RNA-dependent RNA polymerase
proteins below were manually added
MT040088.1:4679..5572:+/4688..5620:+ MT040088.1 4679..5572:+ 298 4688..5620:+ G attachment glcyoprotein
KJ627364.1:4618..5550:+/4688..5620:+ KJ627364.1 4618..5550:+ 311 4688..5620:+ G attachment glcyoprotein
MH760718.1:4597..5529:+/4688..5620:+ MH760718.1 4597..5529:+ 311 4688..5620:+ G attachment glcyoprotein
MG642047.1:4666..5565:+/4688..5620:+ MG642047.1 4666..5565:+ 300 4688..5620:+ G attachment glcyoprotein
LC474547.1:4663..5595:+/4688..5620:+ LC474547.1 4663..5595:+ 311 4688..5620:+ G attachment glcyoprotein
MG431253.1:4674..5567:+/4688..5620:+ MG431253.1 4674..5567:+ 298 4688..5620:+ G attachment glcyoprotein
MZ962122.1:1..933:+/4688..5620:+ MZ962122.1 1..933:+ 311 4688..5620:+ G attachment glcyoprotein
KC297442.1:1..933:+/4688..5620:+ KC297442.1 1..933:+ 311 4688..5620:+ G attachment glcyoprotein
LC311384.1:1..933:+/4688..5620:+ LC311384.1 1..933:+ 311 4688..5620:+ G attachment glcyoprotein
MZ515748.1:4689..5621:+/4688..5620:+ MZ515748.1 4689..5621:+ 311 4688..5620:+ G attachment glcyoprotein
KC297470.1:1..882:+/4688..5629:+ KC297470.1 1..882:+ 294 4688..5629:+ G attachment glcyoprotein
KP856962.1:4618..5505:+/4688..5629:+ KP856962.1 4618..5505:+ 296 4688..5629:+ G attachment glcyoprotein
KU950619.1:4663..5604:+/4688..5629:+ KU950619.1 4663..5604:+ 314 4688..5629:+ G attachment glcyoprotein
KP258745.1:4620..5507:+/4688..5629:+ KP258745.1 4620..5507:+ 296 4688..5629:+ G attachment glcyoprotein
MF185751.1:4640..5527:+/4688..5629:+ MF185751.1 4640..5527:+ 296 4688..5629:+ G attachment glcyoprotein
KU316181.1:4618..5505:+/4688..5629:+ KU316181.1 4618..5505:+ 296 4688..5629:+ G attachment glcyoprotein
KJ627249.1:4618..5565:+/4688..5635:+ KJ627249.1 4618..5565:+ 316 4688..5635:+ G attachment glcyoprotein
NC_001781.1:4690..5589:+/4688..5641:+ NC_001781.1 4690..5589:+ 300 4688..5641:+ G attachment glcyoprotein
KU316144.1:4618..5517:+/4688..5641:+ KU316144.1 4618..5517:+ 300 4688..5641:+ G attachment glcyoprotein
MN365572.1:4676..5629:+/4688..5641:+ MN365572.1 4676..5629:+ 318 4688..5641:+ G attachment glcyoprotein
OK649740.1:4675..5574:+/4688..5641:+ OK649740.1 4675..5574:+ 300 4688..5641:+ G attachment glcyoprotein
LC474543.1:8538..15017:+/8560..15039:+ LC474543.1 8538..15017:+ 2160 8560..15039:+ L RNA-dependent RNA polymerase

Construction of the profile hidden Markov models (rsv.hmm)

The protein validation stage of v-annotate.pl can use profile HMMs with HMMER instead of blastx by using the --pv_hmmer option. By default, v-build.pl will create 'profile' HMMs from the single sequence proteins encoded by CDS in the model sequence, in this case, KY654518.1 and MZ516105.1. For the rsv v1.5-2 model library, for both the RSV A and RSV B models, additional attachment glycoprotein profile HMMs corresponding to the different possible stop codon positions were created and the default glycoprotein profile HMM was replaced with one built from an alignment of multiple sequences.

Specifically, the following additional profile HMMs were created:

reference model reference model positions gene number of sequences in hmmbuild alignment alignment file in hmm-alignments subdirectory
KY654518 (RSV A) 4681..5643 G 1 KY654518.4681..5643.protein.fa
KY654518 (RSV A) 4681..5646 G 7 KY654518.4681..5646.protein.stk
KY654518 (RSV A) 4681..5649 G 6 KY654518.4681..5649.protein.stk
MZ516105 (RSV B) 4688..5620 G 11 MZ516105.4688..5620.protein.stk
MZ516105 (RSV B) 4688..5629 G 6 MZ516105.4688..5629.protein.stk
MZ516105 (RSV B) 4688..5635 G 1 MZ516105.4688..5629.protein.fa
MZ516105 (RSV B) 4688..5641 G 4 MZ516105.4688..5641.protein.stk

These profile HMMs were added to the default set created by v-build.pl (after removing the original attachment glycoprotein profile HMMs) to make the RSV profile HMM library file rsv.hmm. All files necessary for reproducing the construction of rsv.hmm are available in the hmm-alignments subdirectory in the RSV 1.5-2 model directory. The following commands will recreate the rsv.hmm file within that subdirectory:

sh ./A.build.sh
sh ./B.build.sh
cat KY*hmm MZ*hmm > rsv.hmm

Link to related RSV model building tutorial

A tutorial based on building an RSV model library is included in the VADR documentation. That tutorial includes a long discussion of how to build a model library and includes a procedure similar to the one taken to build the RSV vadr-models-rsv-1.5-2 model set.



Reference

  • The recommended citation for using VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3

  • The following article describes changes made to VADR for faster SARS-CoV-2 annotation, including the -r option: Eric P Nawrocki; Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR. NAR Genom Bioinform. Vol 5, Issue 1 (2023) https://doi.org/10.1093/nargab/lqad002


Questions, comments, feature or model requests? Send a mail to [email protected].

⚠️ **GitHub.com Fallback** ⚠️