Coronavirus annotation - ncbi/vadr Wiki
Identifying and annotating Coronaviridae sequences other than SARS-CoV-2 using a larger VADR model library
How to annotate SARS-CoV-2 sequences with VADR v1.4.1:
Download and install the latest version of VADR, following the instructions on this page
Download the latest SARS-CoV-2 vadr models (version 1.3-2, gzipped tarball) from here, unpack them (e.g.
tar xfz <tarball.gz>). Note the path to the directory name created (
<sarscov2-models-dir-path>) for step 3.
Remove terminal ambiguous nucleotides from your input fasta sequence file using the
$VADRSCRIPTSDIR/miniscripts/. The GenBank SARS-CoV-2 submission processing pipeline removes ambiguous nucleotides from the beginning and ends of sequences and also sequences that are less than 50nt or more than 30,000nt (after trimming) prior to running VADR, so to make sure your local VADR results are consistent with GenBank's VADR results, you should trim terminal ambiguous nucleotides first.
fasta-trim-terminal-ambigs.plscript will not exactly reproduce the trimming that the GenBank pipeline does in some rare cases, but should fix the large majority of the discrepancies you might see between local VADR results and GenBank results.
To remove terminal ambiguous nucleotides from your sequence file
<input-fasta-file>and to remove short and long sequences to create a new trimmed file
$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 50 --maxlen 30000 <input-fasta-file> > <trimmed-fasta-file>
v-annotate.plprogram on an input trimmed fasta file with SARS-CoV-2 sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command).
NOTE: The following command runs multithreaded on up to 8 CPUs, and so is only recommended if you have at least 8 CPUs and 16Gb RAM available. To run on
<n>CPUs instead, replace
--cpu <n>. To run single threaded on a single CPU remove the
--cpu 8option. The
--cpuoptions are incompatible with
v-annotate.pl --split --cpu 8 --glsearch -s -r --nomisc --mkey sarscov2 --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --mdir <sarscov2-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>
SARS-CoV-2 VADR models
As of December 3, 2021, the VADR model library used by GenBank for
SARS-CoV-2 annotation (
includes a single SARS-CoV-2 model based on the
RefSeq sequence of length 29903 nt. Previously, four SARS-CoV-2 models
were used but our internal testing showed that using multiple models
sometimes leads to undesired results. Some features introduced in vadr
v1.3 make it so that using the single RefSeq model is sufficient.
The following table includes information on the 12 CDS features in the NC_045512 RefSeq:
Many alerts/errors in ORF3a, ORF6, ORF7a, ORF7b, ORF8, and ORF10 do not cause a sequence to FAIL
As of August 5, 2021, many common alerts (e.g. early stop codons,
frameshift mutations, etc.) in ORF3a, ORF6, ORF7a, ORF7b, ORF8, and
ORF10 will no longer cause a sequence to fail VADR as they did
previously. Instead such problems will cause that feature to not be
annotated as a CDS but rather as a
misc_feature, and no protein
translation product and corresponding entry in the GenBank Protein
database will be created. (Since February 2021, ORF8 was
misc_featur-izable in this way, but ORF3a, ORF6, ORF7a, ORF7b, and
ORF10 were not.)
For a list of what specific alerts for a CDS will cause the feature
to be annotated as a
misc_feature instead of failing the sequence,
see the third column of this
or use the
--alt_list option with
A sequence will fail if 3 or more of the listed ORF features are
annotated as misc_features with the
(TOO_MANY_MISC_FEATURES). In the GenBank submission portal, this
alert will trigger manual review by GenBank staff.
Because the alerts related to these problems no longer cause failure
they will not be reported in the
.alt.list or GenBank submission
portal detailed error report
.tsv file but information on them will
still be output to the VADR
.alt output file.
Example annotation of SARS-CoV-2 sequences
This section shows output from an example
v-annotate.pl annotation of three
SARS-CoV-2 sequences from GenBank using the same command and options that
GenBank currently uses to screen incoming SARS-CoV-2 sequences.
The fasta file of those three sequences can be downloaded here.
(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)
For this example, the SARS-CoV-2 model directory is in
pretrim.sars-cov2.4.fa sequence file is in the current directory. We will create a new file
sars-cov2.4.fa with trimmed sequences in the next step.
As explained above, remove terminal ambiguous nucleotides and sequences that are shorter than 50nt or longer than 30,000nt with the command:
$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 50 --maxlen 30000 pretrim.sars-cov2.4.fa > sars-cov2.4.fa
Next, to annotate the trimmed sequences using the recommended
v-annotate.pl options for SARS-CoV-2, run the following command (scroll to the
right to see full command), which will create a new directory called
my4 into which VADR output files will be written.
v-annotate.pl --split --cpu 8 --glsearch -s -r --nomisc --mkey sarscov2 --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --mdir /usr/local/vadr-models-sarscov2-1.3-2 sars-cov2.4.fa my4
The options used are explained further below.
When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:
# v-annotate.pl :: classify and annotate sequences using a model library # VADR 1.4 (Dec 2021) # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - # date: Tue Dec 21 18:00:51 2021 # $VADRBIOEASELDIR: /usr/local/vadr-install-1.3/Bio-Easel # $VADRBLASTDIR: /usr/local/vadr-install-1.3/ncbi-blast/bin # $VADREASELDIR: /usr/local/vadr-install-1.3/infernal/binaries # $VADRINFERNALDIR: /usr/local/vadr-install-1.3/infernal/binaries # $VADRMODELDIR: /usr/local/vadr-install-1.3/vadr-models-calici # $VADRSCRIPTSDIR: /usr/local/vadr-install-1.3/vadr # # sequence file: sars-cov2.4.fa # output directory: my4 # specify that alert codes in <s> cause FAILure: lowscore,insertnn,deletinn [--alt_fail] # .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr': sarscov2 [--mkey] # model files are in directory <s>, not in $VADRMODELDIR: /usr/local/vadr-models-sarscov2-1.3-2 [--mdir] # in feature table for failed seqs, never change feature type to misc_feature: yes [--nomisc] # lowsim5s/LOW_SIMILARITY_START minimum length is <n>: 6 [--lowsim5seq] # lowsim3s/LOW_SIMILARITY_END minimum length is <n>: 6 [--lowsim3seq] # align with glsearch from the FASTA package, not to a cm with cmalign: yes [--glsearch] # use top-scoring HSP from blastn to seed the alignment: yes [-s] # replace stretches of Ns with expected nts, where possible: yes [-r] # split input file into chunks, run each chunk separately: yes [--split] # parallelize across <n> CPU workers (requires --split or --glsearch): 8 [--cpu] # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
v-annotate.pl will output information as it proceeds through different steps of the analysis:
# Validating input ... done. [ 0.3 seconds] # Splitting sequence file into chunks to run independently in parallel on 8 processors ... done. [ 0.9 seconds] # Executing 4 scripts in parallel on 8 processors to process 4 partition(s) of all 4 sequence(s) ... # 4 of 4 jobs finished (0.1 minutes spent waiting) # done. [ 12.4 seconds] # Merging and finalizing output ... done. [ 0.8 seconds]
--split --cpu 8 options, the input fasta script is split up
into chunks, in this case placing one sequence per chunk but typically
about 10 sequences per chunk for larger files, and runs
v-annotate.pl separately on those chunks on 8 different CPUs in
parallel. When all sequences are finished processing the main script
merges the output together.
v-annotate.pl output then includes a summary of the classification of sequences, and the alerts reported:
# Summary of classified sequences: # # num num num #idx model group subgroup seqs pass fail #--- --------- ------------ ---------- ---- ---- ---- 1 NC_045512 Sarbecovirus SARS-CoV-2 4 2 2 #--- --------- ------------ ---------- ---- ---- ---- - *all* - - 4 2 2 - *none* - - 0 0 0 #--- --------- ------------ ---------- ---- ---- ---- # # Summary of reported alerts: # # alert causes short per num num long #idx code failure description type cases seqs description #--- -------- ------- --------------------------- ------- ----- ---- ----------- 1 cdsstopn yes* CDS_HAS_STOP_CODON feature 4 2 in-frame stop codon exists 5' of stop position predicted by homology to reference 2 cdsstopp yes* CDS_HAS_STOP_CODON feature 4 2 stop codon in protein-based alignment 3 peptrans yes* PEPTIDE_TRANSLATION_PROBLEM feature 26 1 mat_peptide may not be translated because its parent CDS has a problem #--- -------- ------- --------------------------- ------- ----- ---- -----------
And finally a list of the output files created:
# Output printed to screen saved in: my4.vadr.log # List of executed commands saved in: my4.vadr.cmd # List and description of all output files saved in: my4.vadr.filelist # esl-seqstat -a output for input fasta file saved in: my4.vadr.seqstat # 5 column feature table output for passing sequences saved in: my4.vadr.pass.tbl # 5 column feature table output for failing sequences saved in: my4.vadr.fail.tbl # list of passing sequences saved in: my4.vadr.pass.list # list of failing sequences saved in: my4.vadr.fail.list # list of alerts in the feature tables saved in: my4.vadr.alt.list # fasta file with passing sequences saved in: my4.vadr.pass.fa # fasta file with failing sequences saved in: my4.vadr.fail.fa # per-sequence tabular annotation summary file saved in: my4.vadr.sqa # per-sequence tabular classification summary file saved in: my4.vadr.sqc # per-feature tabular summary file saved in: my4.vadr.ftr # per-model-segment tabular summary file saved in: my4.vadr.sgm # per-alert tabular summary file saved in: my4.vadr.alt # alert count tabular summary file saved in: my4.vadr.alc # per-model tabular summary file saved in: my4.vadr.mdl # alignment doctoring tabular summary file saved in: my4.vadr.dcr # ungapped seed alignment summary file (-s) saved in: my4.vadr.sda # replaced stretches of Ns summary file (-r) saved in: my4.vadr.rpn # # All output files created in directory ./my4/ # # Elapsed time: 00:00:13.71 # hh:mm:ss # [ok]
Note that all the output files will be in the newly created
Summary of classified sequences listed that two sequences passed and two failed.
my4.vadr.pass.list, lists the two sequences that passed:
my4.vadr.fail.list lists the two sequences that failed:
Also, FASTA-formatted sequence files for each the passing and failing
For the two sequences that passed, the annotation is available in the
my4.vadr.pass.tbl file and for the two sequences that failed the
annotation is in the
my.vadr.pass.tbl: (with the middle of the table for each sequence removed for brevity)
>Feature MT159720.1 266 21555 gene gene ORF1ab 266 13468 CDS 13468 21555 product ORF1ab polyprotein exception ribosomal slippage protein_id MT159720.1_1 266 13483 CDS product ORF1a polyprotein protein_id MT159720.1_2 266 805 mat_peptide product leader protein protein_id MT159720.1_1 ...snip... 28274 29533 gene gene N 28274 29533 CDS product nucleocapsid phosphoprotein protein_id MT159720.1_11 29558 29674 gene gene ORF10 29558 29674 CDS product ORF10 protein protein_id MT159720.1_12 29609 29644 stem_loop note Coronavirus 3' UTR pseudoknot stem-loop 1 29629 29657 stem_loop note Coronavirus 3' UTR pseudoknot stem-loop 2 29728 29768 stem_loop note Coronavirus 3' stem-loop II-like motif (s2m) >Feature MT308693.1 217 21506 gene gene ORF1ab 217 13419 CDS 13419 21506 product ORF1ab polyprotein exception ribosomal slippage protein_id MT308693.1_1 217 13434 CDS product ORF1a polyprotein protein_id MT308693.1_2 217 756 mat_peptide product leader protein protein_id MT308693.1_1 ...snip... 28225 29484 gene gene N 28225 29484 CDS product nucleocapsid phosphoprotein protein_id MT308693.1_11 29509 29625 gene gene ORF10 29509 29625 CDS product ORF10 protein protein_id MT308693.1_12 29560 29595 stem_loop note Coronavirus 3' UTR pseudoknot stem-loop 1 29580 29608 stem_loop note Coronavirus 3' UTR pseudoknot stem-loop 2 29679 29719 stem_loop note Coronavirus 3' stem-loop II-like motif (s2m)
And the second sequence in the
my.vadr.fail.tbl (with the middle removed for brevity):
>Feature MW422255.1/21610-T-to-A 212 21492 gene gene ORF1ab 212 13405 CDS 13405 21492 product ORF1ab polyprotein exception ribosomal slippage protein_id MW422255.1/21610-T-to-A_1 212 13420 CDS product ORF1a polyprotein protein_id MW422255.1/21610-T-to-A_2 212 751 mat_peptide product leader protein protein_id MW422255.1/21610-T-to-A_1 ...snip... 29556 29584 stem_loop note Coronavirus 3' UTR pseudoknot stem-loop 2 29655 29695 stem_loop note Coronavirus 3' stem-loop II-like motif (s2m) Additional note(s) to submitter: ERROR: CDS_HAS_STOP_CODON: (CDS:surface glycoprotein) in-frame stop codon exists 5' of stop position predicted by homology to reference [TAA, shifted S:3702,M:3711]; seq-coords:21608..21610:+; mdl-coords:21671..21673:+; mdl:NC_045512; ERROR: CDS_HAS_STOP_CODON: (CDS:surface glycoprotein) stop codon in protein-based alignment [-]; seq-coords:21608..21610:+; mdl-coords:21671..21673:+; mdl:NC_045512;
Feature table format is described at https://www.ncbi.nlm.nih.gov/Sequin/table.html.
Note that the end of the
fail.tbl file lists ERRORs for
MW422255.1/21610-T-to-A, the fourth sequence in the input file,
which is identical to GenBank sequence
MW422255.1 with position
21610 changed from a
T to a
A to purposefully introduce an early
stop codon in the surface glycoprotein (spike) for purposes of this
example. Early stop codons are one reason a sequence can fail. Note
CDS_HAS_STOP_CODON errors at the end of the feature table.
The annotation information is also available in other files with
different formats, such as the
my4/my4.vadr.ftr file, which may be
easier to parse for some applications.
How to interpret VADR alert/error messages and other output
For examples of alerts/errors and more information on how to interpret the
VADR output related to those alerts in the output feature tables,
file and the GenBank submission portal detailed error report
files, see this vadr documentation page.
See the vadr README documentation section for more information on how to interpret VADR results and output, including information on file formats.
You can find the VADR 1.0 paper (reference below).
Explanation of options used in example command
The options used in the above command are the recommended set of
options for SARS-CoV-2 annotation because they are currently (as of
January 10, 2022) being used with vadr 1.4.1 by NCBI sequence indexers when they evaluate
an incoming SARS-CoV-2 sequence submission. The options are each briefly
explained in the table below. More information can be found
-s options are explained more below the table as well.
||split input file into chunks of about 300Kb and run each chunk separately (300Kb can be changed to
||for input sequence files > 300Kb, run multi-threaded by parallelizing across up to 8 CPU workers (8 can be changed to
||use the glsearch program from the FASTA package for the alignment stage, not cmalign, reducing required memory|
||turn on the seed acceleration heuristic: use the max length ungapped region from blastn to seed the alignment|
||turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible|
||specify that features for failing sequences not be changed to misc_features in the output .tbl file|
||use the model files with prefix
||set the minimum length for an alert related to sequence of low similarity to the RefSeq at the 5' sequence end to
||set the minimum length for an alert related to sequence of low similarity to the RefSeq at the 3' sequence end to
||specify that lowscore alert and large insertions and deletions (insertnn,deletinn) cause a sequence to fail|
||specify that the models to use are in directory /usr/local/vadr-models-sarscov2-1.3-2|
-r option replaces stretches of Ns in the input sequences with
the expected nucleotides from the RefSeq. Be cautious, as this
strategy actually replaces Ns in your sequence prior to determination
of annotation. If this is not what you want to do, then do not use
-r is the current (as of May 7, 2020) practice for NCBI
indexers analyzing incoming SARS-CoV-2 sequences, which is why it is
included as a recommended option here. By doing this the Ns are
assumed to be the 'expected' sequence in the corresponding regions for
purposes of annotation. More information on
information on other related options is
The second sequence in the
sars-cov2.4.fa file includes 344 Ns,
most of which are contained within consecutive stretches of 36 and 290 Ns in two
regions, from sequence positions 11487-11522 and 19237-19526.
v-annotate.pl replaces the Ns in these two stretches with the
corresponding 'expected' nucleotides from the NC_045512 RefSeq, and then determines annotation
with that doctored sequence. Some information on the Ns in each sequence including on those that were replaced is output to a file with suffix
(format described here).
For the example above, the
my4.vadr.rpn file looks like this (scroll to right to see full file):
#seq seq seq num_Ns num_Ns fract_Ns nregs nregs nregs nregs nregs nnt nnt detail_on_regions[S:seq,M:mdl,D:lendiff,N:#Ns, #idx name len model p/f tot rp rp tot int rp rp-full rp-part rp-full rp-part E:#non_N_match_expected,F:flush_direction,R:region_replaced?]; #--- ----------------------- ----- --------- ---- ------ ------ -------- ----- ----- ----- ------- ------- ------- ------- -------------------------------------------------------------- 1 MT159720.1 29882 NC_045512 PASS 0 0 - 0 0 0 0 0 0 0 - 2 MT308693.1 29788 NC_045512 PASS 344 326 0.948 2 2 2 2 0 326 0 [S:11487..11522,M:11536..11571,D:0,N:36/36,E:?/?,F:-,R:Y];[S:19237..19526,M:19286..19575,D:0,N:290/290,E:?/?,F:-,R:Y]; 3 MT159720.1/1406-G-to-T 29882 NC_045512 FAIL 0 0 - 0 0 0 0 0 0 0 - 4 MW422255.1/21610-T-to-A 29763 NC_045512 FAIL 8 0 0.000 0 0 0 0 0 0 0 -
If we run the above command without
-r, the Ns are not replaced and
MT308693.1 sequence fails because the N regions trigger
alerts/errors to be reported. Specifically, the following three errors
are reported in the
fail.tbl file, amongst other
PEPTIDE_TRANSLATION_PROBLEM errors for mature peptides (not shown):
ERROR: LOW_FEATURE_SIMILARITY: (gene:ORF1ab) region within annotated feature that does not match a CDS lacks significant similarity [36 nt overlap b/t low similarity region of length 36 (11487..11522) and annotated feature (217..21506)]; seq-coords:11487..11522:+; mdl-coords:11536..11571:+; mdl:NC_045512; ERROR: LOW_FEATURE_SIMILARITY: (gene:ORF1ab) region within annotated feature that does not match a CDS lacks significant similarity [290 nt overlap b/t low similarity region of length 290 (19237..19526) and annotated feature (217..21506)]; seq-coords:19237..19526:+; mdl-coords:19286..19575:+; mdl:NC_045512; ERROR: INDEFINITE_ANNOTATION_END: (CDS:ORF1ab polyprotein) protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [2271>8, valid stop codon in nucleotide-based prediction]; seq-coords:19236..21506:+; mdl-coords:21555..21555:+; mdl:NC_045512;
Note that the regions that are causing the
errors are stretches of Ns. Using
-r allows some sequences like this that
only fail due to the Ns to pass, and sometimes changes their
annotation to be more accurate (based on the assumption that the replaced Ns
represent the expected nucleotides at the corresponding positions).
-s option speeds up
v-annotate.pl by identifying seeding
the global alignment of each sequence using the top-scoring HSP
from blastn, keeping that region fixed
and only using a more expensive alignment program
(glsearch or cmalign) for the remainder of the sequence.
This heuristic works particularly well for many SARS-CoV-2 sequences
that are highly similar (~99% identical) to the RefSeq NC_045512. If
we run the above example command without
-s, it would require about 30
seconds per sequence instead of less than ten seconds per sequence. When
-s is used, an additional output file with suffix
.sda is output
(format described here).
More information on
-s, including information on other related
Identifying and annotating Coronaviridae sequences other than SARS-CoV-2 using a larger VADR model library
The command above will only compare your input sequences to the
SARS-CoV-2 NC_045512 model described above. If you want to annotate
v-annotate.pl to perform the additional step of checking if each
input sequence is more similar to SARS-CoV-2 or to a different
Coronaviridae RefSeq, or if your input file contains non-SARS-CoV-2
Coronaviridae sequences, you can download a different model file that includes
the SARS-CoV-2 model and 54 other Coronaviridae RefSeqs (v1.3-3, gzipped tarball) from
After downloading and unpacking that model set,
--mkey option in the example
command above from
But be aware that changing this option will result in slower running times.
Building new Coronaviridae models
You can use VADR's
v-build.pl program to build additional models of
Coronaviridae GenBank sequences. More information is here.
Additional VADR documentation
- VADR README
- VADR installation instructions
v-build.plexample usage and command-line options
v-annotate.plexample usage, command-line options and alert information
- Explanations and examples of
v-annotate.pldetailed alert and error messages
- Output fields with detailed alert and error messages
- Explanation of sequence and model coordinate fields in
toy50toy model used in examples of alert messages
- Examples of different alert types and corresponding
- Posterior probability annotation in VADR output Stockholm alignments
- VADR output file formats
Version history for GenBank submission processing
|date||vadr version||models version||command-line options||models||notes|
|Jun 22, 2020||1.1||1.1-1||
||NC_045512||first version with
|Jul 28, 2020||1.1.1||1.1-1||unchanged||NC_045512||NC_045512|
|Dec 3, 2020||1.1.2||1.1-1||
||NC_045512||only minor changes related to SARS-CoV-2|
|Jan 8, 2021||1.1.2||1.1.2-1||unchanged||NC_045512 and MW422255-NC_045512||adds B.1.1.7 model|
|Feb 19, 2021||1.1.3||1.1.3-1||
||NC_045512, MW422255-NC_045512 and NC_045512-del28254||adds model, many problems in ORF8 no longer cause failure|
|Apr 13, 2021||1.2||1.2-2||
||NC_045512, MW422255-NC_045512, NC_045512-del28254 and NC_045512-MW809059||significantly lower memory requirements, faster processing due to parallelization, adds B.1.525 model|
|Aug 5, 2021||1.3||1.3-1||
||NC_045512, MW422255-NC_045512, NC_045512-del28254 and NC_045512-MW809059||many problems in ORF3a, ORF6, ORF7a, ORF7b and ORF10 no longer cause failure|
|Dec 2, 2021||1.3||1.3-2||unchanged||NC_045512 only||reverts to single NC_045512 model|
|Dec 23, 2021||1.4||1.3-2||unchanged||NC_045512 only||blastn usage changed to yield more acceleration with
|Jan 11, 2022||1.4.1||1.3-2||unchanged||NC_045512 only||fixes bug that prevented some dupregin alerts (DUPLICATE_REGIONS) from being reported, TOO_MANY_MISC_FEATURES minimum raised from 3 to 4 misc_features|
- The recommended citation for using VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3