Mpox virus annotation - ncbi/vadr GitHub Wiki


How to annotate MPXV sequences with VADR v1.5:

The steps below explain how to use VADR for MPXV validation and annotation. These are only recommendations. Testing of VADR for MPXV is still ongoing and these recommendations are subject to change, possibly significantly. Many 'good' MPXV sequences that do not contain any sequencing or assembly artifacts may 'fail' v-annotate.pl using the steps below.

GenBank is not currently using VADR to automatically process MPXV sequence submissions like it does for SARS-CoV-2. GenBank MPXV sequence submissions are still all manually reviewed, but indexers might use VADR as part of their analysis, using the command-line options listed in step 4 below.

WARNING: VADR is slow for long sequences like MPXV and in rare cases it can take 30 minutes or more to annotate some MPXV sequences that differ from the RefSeq NC_063383 in certain ways. The vast majority of sequences should take less than one minute.

Steps for using VADR for MPXV annotation:

  1. Download and install the latest version of VADR (v1.5), following the instructions on this page. Alternatively, you can use the StaPH-B VADR 1.5 docker image created by Curtis Kapsak (docker image names: staphb/vadr:1.5 and staphb/vadr:latest), available on dockerhub and quay. A brief README for the docker image is here.

  2. Download the latest MPXV VADR models (version 1.4.2-1, gzipped tarball) from here, unpack them (e.g. tar xfz <tarball.gz>). Note the path to the directory name created (<mpxv-models-dir-path>) for step 3. (If you are using the docker image you can skip this step as the MPXV VADR models are included.)

  3. Remove terminal ambiguous nucleotides from your input fasta sequence file using the fasta-trim-terminal-ambigs.pl script in $VADRSCRIPTSDIR/miniscripts/. GenBank processing of viral sequences typically includes removing ambiguous nucleotides from the beginning and ends of sequences, and also removing sequences that are less than 50nt or more than 210,000nt (after trimming).

    WARNING: the fasta-trim-terminal-ambigs.pl script will not exactly reproduce the trimming that the GenBank pipeline does in some rare cases, but should fix the large majority of the discrepancies you might see between local VADR results and GenBank results.

    To remove terminal ambiguous nucleotides from your sequence file <input-fasta-file> and to remove short and long sequences to create a new trimmed file <trimmed-fasta-file>, execute:

$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 50 --maxlen 210000 <input-fasta-file> > <trimmed-fasta-file>
  1. Run the v-annotate.pl program on an input trimmed fasta file with MPXV sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command).

    NOTE: The following command runs multithreaded on up to 4 CPUs, and so is only recommended if you have at least 4 CPUs and 16Gb RAM available. To run on <n> CPUs instead, replace --cpu 4 with --cpu <n>. To run single threaded on a single CPU remove the --cpu 4 option. The --split and --cpu options are incompatible with -p.

v-annotate.pl --split --cpu 4 --glsearch --minimap2 -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --s_overhang 150 --mdir <mpxv-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>

MPXV VADR model

As of July 15, 2022, the VADR model library for Mpox annotation (vadr-models-mpxv-1.4.2-1 model file includes a single MPXV model based on the NC_063383 RefSeq sequence of length 197,209 nt.

This model was created using the following command with VADR v1.4.2 on July 13, 2022:

v-build.pl --forcelong -f --skipbuild NC_063383 NC_063383

Many of the output NC_063383 files were then renamed to use 'mpxv' instead of 'NC_063383'. Additionally, the first line of mpxv.minfo file was changed from

MODEL NC_063383 blastdb:"NC_063383.vadr.protein.fa" length:"197209"

to

MODEL NC_063383 blastdb:"NC_063383.vadr.protein.fa" length:"197209" group:"Orthopoxvirus" subgroup:"Monkeypox_virus" indfstrn_exc:"1..6422:+;190788..197209:+;"

The indfstrn_exc part of the addition makes is that an indfstrn (indefinite strand) alert is not reported due to the repeated sequence at the beginning and end of the mpox genome (positions 1 to 6422 and the complement of positions 190788 to 197209).


This section shows output from an example v-annotate.pl annotation of three MPXV sequences from GenBank using the above command and options.

The fasta file of those three sequences can be downloaded here.

(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)

For this example, the MPXV model directory is in /usr/local/vadr-models-mpxv-1.4.2-1 and the pretrim.mpxv.3.fa sequence file is in the current directory. We will create a new file mpxv.3.fa with trimmed sequences in the next step.

As explained above, remove terminal ambiguous nucleotides and sequences that are shorter than 50nt or longer than 210,000nt with the command:

$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 50 --maxlen 210000 pretrim.mpxv.3.fa > mpxv.3.fa

Next, to annotate the trimmed sequences using the above v-annotate.pl options for MPXV, run the following command (scroll to the right to see full command), which will create a new directory called my3 into which VADR output files will be written.

v-annotate.pl --split --cpu 4 --glsearch --minimap2 -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --s_overhang 150 --mdir /usr/local/vadr-models-mpxv-1.4.2-1 mpxv.3.fa my3

The options used are explained further below.

When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:

# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.5 (Sep 2022)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Thu Sep 29 11:26:10 2022
# $VADRBIOEASELDIR:  /usr/local/vadr-install-1.5/Bio-Easel
# $VADRBLASTDIR:     /usr/local/vadr-install-1.5/ncbi-blast/bin
# $VADREASELDIR:     /usr/local/vadr-install-1.5/infernal/binaries
# $VADRINFERNALDIR:  /usr/local/vadr-install-1.5/infernal/binaries
# $VADRMINIMAP2DIR:  /usr/local/vadr-install-1.5/minimap2
# $VADRMODELDIR:     /usr/local/vadr-install-1.5/vadr-models-calici
# $VADRSCRIPTSDIR:   /usr/local/vadr-install-1.5/vadr
#
# sequence file:                                                                  mpxv.3.fa
# output directory:                                                               my3
# specify that alert codes in <s> do not cause FAILure:                           discontn,dupregin [--alt_pass]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  mpxv [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         /usr/local/vadr-models-mpxv-1.4.2-1 [--mdir]
# in feature table for failed seqs, never change feature type to misc_feature:    yes [--nomisc]
# align with glsearch from the FASTA package, not to a cm with cmalign:           yes [--glsearch]
# use top-scoring HSP from blastn to seed the alignment:                          yes [-s]
# for -s, set length of nt overhang for subseqs to align to <n>:                  150 [--s_overhang]
# use minimap2 to derive seed and use it if >= length of blastn seed:             yes [--minimap2]
# replace stretches of Ns with expected nts, where possible:                      yes [-r]
# with -r, do not report lowsim{5,3,i}s within identified N-rich regions:         yes [--r_lowsimok]
# w/--r_lowsimok, maximum length of N-rich region is <n>:                         2000 [--r_lowsimxl]
# w/--r_lowsimok, max diff from expected length for N-rich region is <n>:         100 [--r_lowsimxd]
# split input file into chunks, run each chunk separately:                        yes [--split]
# parallelize across <n> CPU workers (requires --split or --glsearch):            4 [--cpu]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Next, v-annotate.pl will output information as it proceeds through different steps of the analysis:

# Validating input                                                                                          ... done. [    0.4 seconds]
# Splitting sequence file into chunks to run independently in parallel on 4 processors                      ... done. [    0.5 seconds]
# Executing 3 scripts in parallel on 4 processors to process 3 partition(s) of all 3 sequence(s)            ... 
#	   3 of    3 jobs finished (0.1 minutes spent waiting)
# done. [   46.9 seconds]
# Merging and finalizing output                                                                             ... done. [    3.1 seconds]

With the --split --cpu 4 options, the input fasta script is split up into chunks, in this case placing one sequence per chunk but typically about 2 sequences per chunk for larger files, and runs v-annotate.pl separately on those chunks on 4 different CPUs in parallel. When all sequences are finished processing the main script merges the output together.

The v-annotate.pl output then includes a summary of the classification of sequences, and the alerts reported:

# Summary of classified sequences:
#
#                                                 num   num   num
#idx  model      group          subgroup         seqs  pass  fail
#---  ---------  -------------  ---------------  ----  ----  ----
1     NC_063383  Orthopoxvirus  Monkeypox_virus     3     2     1
#---  ---------  -------------  ---------------  ----  ----  ----
-     *all*      -              -                   3     2     1
-     *none*     -              -                   0     0     0
#---  ---------  -------------  ---------------  ----  ----  ----
#
# Summary of reported alerts:
#
#     alert     causes   short                            per    num   num  long
#idx  code      failure  description                     type  cases  seqs  description
#---  --------  -------  --------------------------  --------  -----  ----  -----------
1     dupregin  no       DUPLICATE_REGIONS           sequence     73     3  similarity to a model region occurs more than once
2     discontn  no       DISCONTINUOUS_SIMILARITY    sequence      3     3  not all hits are in the same order in the sequence and the homology model
3     ambgntrp  no       N_RICH_REGION_NOT_REPLACED  sequence      4     3  N-rich region of unexpected length not replaced
#---  --------  -------  --------------------------  --------  -----  ----  -----------
4     indf3pst  yes      INDEFINITE_ANNOTATION_END    feature      1     1  protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint
#---  --------  -------  --------------------------  --------  -----  ----  -----------

And finally a list of the output files created:

# Output printed to screen saved in:                              my3.vadr.log
# List of executed commands saved in:                             my3.vadr.cmd
# List and description of all output files saved in:              my3.vadr.filelist
# esl-seqstat -a output for input fasta file saved in:            my3.vadr.seqstat
# 5 column feature table output for passing sequences saved in:   my3.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in:   my3.vadr.fail.tbl
# list of passing sequences saved in:                             my3.vadr.pass.list
# list of failing sequences saved in:                             my3.vadr.fail.list
# list of alerts in the feature tables saved in:                  my3.vadr.alt.list
# fasta file with passing sequences saved in:                     my3.vadr.pass.fa
# fasta file with failing sequences saved in:                     my3.vadr.fail.fa
# per-sequence tabular annotation summary file saved in:          my3.vadr.sqa
# per-sequence tabular classification summary file saved in:      my3.vadr.sqc
# per-feature tabular summary file saved in:                      my3.vadr.ftr
# per-model-segment tabular summary file saved in:                my3.vadr.sgm
# per-alert tabular summary file saved in:                        my3.vadr.alt
# alert count tabular summary file saved in:                      my3.vadr.alc
# per-model tabular summary file saved in:                        my3.vadr.mdl
# alignment doctoring tabular summary file saved in:              my3.vadr.dcr
# seed alignment summary file (-s) saved in:                      my3.vadr.sda
# replaced stretches of Ns summary file (-r) saved in:            my3.vadr.rpn
#
# All output files created in directory ./my3/
#
# Elapsed time:  00:00:49.90
#                hh:mm:ss
# 
[ok]

Note that all the output files will be in the newly created my3 directory. The Summary of classified sequences listed that two sequences passed and one failed. The file my3.vadr.pass.list, lists the two sequences that passed:

ON853669.1
ON843168.1

and my3.vadr.fail.list lists the sequence that failed:

ON803440.1

Also, FASTA-formatted sequence files for each the passing and failing sequences are my3.vadr.pass.fa and my3.vadr.fail.fa.

For the two sequences that passed, the annotation is available in the output my3.vadr.pass.tbl file and for the two sequences that failed the annotation is in the my3.vadr.fail.tbl file.

Here is the output for the first sequence in the my3.vadr.pass.tbl file: (with the middle of the table for each sequence removed for brevity)

>Feature ON853669.1
1572	832	gene
			gene	OPG001
1572	832	CDS
			product	Chemokine binding protein
			protein_id	ON853669.1_1
2748	1699	gene
			gene	OPG002
2748	1699	CDS
			product	Crm-B secreted TNF-alpha-receptor-like protein
			protein_id	ON853669.1_2
4604	2838	gene
			gene	OPG003
4604	2838	CDS
			product	Ankyrin repeat protein (25)
			protein_id	ON853669.1_3

...snip...

191058	192371	CDS
			product	Ankyrin repeat protein (39)
			protein_id	ON843168.1-like_176
192607	194373	gene
			gene	OPG003
192607	194373	CDS
			product	Ankyrin repeat protein (25)
			protein_id	ON843168.1-like_177
194463	195512	gene
			gene	OPG002
194463	195512	CDS
			product	Crm-B secreted TNF-alpha-receptor-like protein
			protein_id	ON843168.1-like_178
195639	196379	gene
			gene	OPG001
195639	196379	CDS
			product	Chemokine binding protein
			protein_id	ON843168.1-like_179

And the sequence in the my3.vadr.fail.tbl (with the middle removed for brevity):

>Feature ON803440.1
1572	832	gene
			gene	OPG001
1572	832	CDS
			product	Chemokine binding protein
			protein_id	ON803440.1_1
2748	1699	gene
			gene	OPG002
2748	1699	CDS
			product	Crm-B secreted TNF-alpha-receptor-like protein
			protein_id	ON803440.1_2
4604	2838	gene
			gene	OPG003
4604	2838	CDS
			product	Ankyrin repeat protein (25)
			protein_id	ON803440.1_3

...snip...
192596	194362	CDS
			product	Ankyrin repeat protein (25)
			protein_id	ON803440.1_177
194452	195501	gene
			gene	OPG002
194452	195501	CDS
			product	Crm-B secreted TNF-alpha-receptor-like protein
			protein_id	ON803440.1_178
195628	196368	gene
			gene	OPG001
195628	196368	CDS
			product	Chemokine binding protein
			protein_id	ON803440.1_179

Additional note(s) to submitter:
ERROR: INDEFINITE_ANNOTATION_END: (CDS:CPXV166 protein) protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [30>8, valid stop codon in nucleotide-based prediction]; seq-coords:140165..140194:+; mdl-coords:140205..140205:+; mdl:NC_063383;

Feature table format is described at https://www.ncbi.nlm.nih.gov/WebSub/html/help/feature-table.html.

Note that the end of the fail.tbl file lists ERRORs for ON803440.1. In this case, the 3' end of the CPXV166 protein is not as similar as expected with the NC_063383 RefSeq protein. This doesn't mean that there is a problem with a sequence, just that this is one possible difference with the RefSeq that might indicate a problem. It also might represent normal sequence divergence in the virus. Many of the VADR ERRORs are meant to highlight potential problems for manual review or regions of interest to the user.

The annotation information is also available in other files with different formats, such as the my3/my3.vadr.ftr file, which may be easier to parse for some applications.

How to interpret VADR alert/error messages and other output

For examples of alerts/errors and more information on how to interpret the VADR output related to those alerts in the output feature tables, .alt.list file and the GenBank submission portal detailed error report .tsv files, see this vadr documentation page.

See the vadr README documentation section for more information on how to interpret VADR results and output, including information on file formats.

You can find information on the VADR 1.0 paper and a submitted manuscript that describes some of the acceleration strategies that VADR uses for viruses with relatively long genomes (below).

Explanation of options used in example command

The options used in the above command are the recommended set of options for MPXV annotation. These options are currently used by GenBank indexers when they run VADR for validating and annotating incoming MPXV sequence submissions. (All submissions are currently manually reviewed, no automated processing/approval using VADR is in place for MPXV sequences at this time.) The options are each briefly explained in the table below. More information can be found here.

option explanation
--split split input file into chunks of about 300Kb and run each chunk separately (300Kb can be changed to <n> by adding option --nkb <n>
--cpu 4 for input sequence files > 300Kb, run multi-threaded by parallelizing across up to 4 CPU workers (4 can be changed to <n1> with --cpu <n1>), requires --split
--glsearch use the glsearch program from the FASTA package for the alignment stage, required for MPXV annotation
-s turn on the seed acceleration heuristic: use the max length ungapped region from blastn to seed the alignment, required for MPXV annotation
-r turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible
--minimap2 use the minimap2 program to derive seeds and use them instead of blastn-based seeds if they are longer
--nomisc specify that features for failing sequences not be changed to misc_features in the output .tbl file
--mkey mpxv use the model files with prefix mpxv in the directory from --mdir
--r_lowsimok do not report LOW_SIMILARITY* errors due to N-rich regions
--r_lowsimxl 2000 for --r_lowsimok, LOW_SIMILARITY* errors will not be reported only for low similarity regions of at most 2000nt
--r_lowsimxd 100 for --r_lowsimok, LOW_SIMILARITY* errors will not be reported only for low similarity regions in which the difference in expected length compared to the model RefSeq is at most 100nt
--alt_pass discont,dupregin specify that discontn and dupregion alerts that occur due to repetitive regions, which are common in MPXV sequences, do NOT cause a sequence to fail
--s_overhang 150 specify that the number of nucleotides overlap between the seed and flanking regions aligned with glsearch be 150 nt
--mdir /usr/local/vadr-models-mpxv-1.4.2-1 specify that the models to use are in directory /usr/local/vadr-models-mpxv-1.4.2-1

The -r option is also used for SARS-CoV-2 annotation with VADR, for more information on the option in that context, see this page

The -s option is also used for SARS-CoV-2 annotation with VADR, for more information on the option in that context, see this page



Reference

  • The recommended citation for using VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3

  • A submitted manuscript describing changes made to VADR for faster SARS-CoV-2 annotation which are also applied to MPXV sequences in the sequences above: Eric P Nawrocki; Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR. bioRxiv (2022). https://doi.org/10.1101/2022.04.25.489427


Questions, comments, feature or model requests? Send a mail to [email protected].

⚠️ **GitHub.com Fallback** ⚠️