Influenza annotation - ncbi/vadr GitHub Wiki


How to annotate influenza virus sequences with VADR:

The steps below explain how to use VADR for flu sequence validation and annotation. Testing of VADR for flu is still ongoing and these recommendations are subject to change.

GenBank is not currently using VADR to automatically process flu sequence submissions like it does for SARS-CoV-2. GenBank flu sequence submissions are currently validated and annotated using the FLAN annotation tool. Our testing indicates that VADR and FLAN give identical results on most flu sequences.

Steps for using VADR for flu annotation:

  1. Download and install the latest version of VADR (v1.6.3 or later), following the instructions on this page. Alternatively, you can use the StaPH-B VADR 1.6.3 docker image created by Curtis Kapsak (docker image names: staphb/vadr:1.6.3 and staphb/vadr:latest), available on dockerhub and quay. A brief README for the docker image is here.

  2. Download the latest flu VADR models (version 1.6.3-2, gzipped tarball) from here, unpack them (e.g. tar xfz <tarball.gz>). Note the path to the directory name created (<flu-models-dir-path>) for step 3. (If you are using the docker image you can skip this step as the flu VADR models are included.)

  3. Remove terminal ambiguous nucleotides from your input fasta sequence file using the fasta-trim-terminal-ambigs.pl script in $VADRSCRIPTSDIR/miniscripts/. GenBank processing of viral sequences typically includes removing ambiguous nucleotides from the beginning and ends of sequences, and also removing sequences that are less than 60nt (after trimming).

    WARNING: the fasta-trim-terminal-ambigs.pl script will not exactly reproduce the trimming that the GenBank pipeline does in some rare cases, but should fix the large majority of the discrepancies you might see between local VADR results and GenBank results.

    To remove terminal ambiguous nucleotides from your sequence file <input-fasta-file> and to remove short sequences to create a new trimmed file <trimmed-fasta-file>, execute:

$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 60 <input-fasta-file> > <trimmed-fasta-file>
  1. Run the v-annotate.pl program on an input trimmed fasta file with flu sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command).
v-annotate.pl --split --cpu 8 -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3 --mkey flu --mdir <flu-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>

This section shows output from an example v-annotate.pl annotation of three flu sequences GenBank using the above command and options.

The fasta file of those three sequences can be downloaded here.

(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)

For this example, the flu model directory is in /usr/local/vadr-models-flu-1.6.3-2 and the pretrim.flu.3.fa sequence file is in the current directory. We will create a new file flu.3.fa with trimmed sequences in the next step.

As explained above, remove terminal ambiguous nucleotides and sequences that are shorter than 60nt with the command:

$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 60 pretrim.flu.3.fa > flu.3.fa

Next, to annotate the trimmed sequences using the above v-annotate.pl options for flu, run the following command (scroll to the right to see full command), which will create a new directory called my3 into which VADR output files will be written.

v-annotate.pl --split --cpu 8 -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3 --mkey flu --mdir /usr/local/vadr-models-flu-1.6.3 flu.3.fa my3

The options used are explained further below.

When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:

# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.6.3 (Dec 2023)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Wed Dec 20 11:31:39 2023
# $VADRBIOEASELDIR:  /home/nawrocki/vadr-install-1.6.3/Bio-Easel
# $VADRBLASTDIR:     /home/nawrocki/vadr-install-1.6.3/ncbi-blast/bin
# $VADREASELDIR:     /home/nawrocki/vadr-install-1.6.3/infernal/binaries
# $VADRINFERNALDIR:  /home/nawrocki/vadr-install-1.6.3/infernal/binaries
# $VADRMODELDIR:     /home/nawrocki/vadr-install-1.6.3/vadr-models-calici
# $VADRSCRIPTSDIR:   /home/nawrocki/vadr-install-1.6.3/vadr
#
# sequence file:                                                                  flu.3.fa
# output directory:                                                               my3
# only consider ATG a valid start codon:                                          yes [--atgonly]
# specify that alert codes in <s> cause FAILure:                                  extrant5,extrant3 [--alt_fail]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  flu [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         /usr/local/vadr-models-flu-1.6.3 [--mdir]
# in feature table for failed seqs, never change feature type to misc_feature:    yes [--nomisc]
# turn off composition-based for blastx statistics with -comp_based_stats 0:      yes [--xnocomp]
# replace stretches of Ns with expected nts, where possible:                      yes [-r]
# split input file into chunks, run each chunk separately:                        yes [--split]
# parallelize across <n> CPU workers (requires --split or --glsearch):            8 [--cpu]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Next, v-annotate.pl will output information as it proceeds through different steps of the analysis:

# Validating input                                                                                          ... done. [    0.4 seconds]
# Splitting sequence file into chunks to run independently in parallel on 8 processors                      ... done. [    0.5 seconds]
# Executing 3 scripts in parallel on 8 processors to process 3 partition(s) of all 3 sequence(s)            ... 
#	   3 of    3 jobs finished (0.1 minutes spent waiting)
# done. [   10.1 seconds]
# Merging and finalizing output                                                                             ... done. [    0.9 seconds]

With the --split --cpu 8 options, the input fasta script is split up into chunks and runs v-annotate.pl separately on those chunks on 8 different CPUs in parallel. When all sequences are finished processing the main script merges the output together.

The v-annotate.pl output then includes a summary of the classification of sequences, and the alerts reported:

# Summary of classified sequences:
#
#                                      num   num   num
#idx  model      group      subgroup  seqs  pass  fail
#---  ---------  ---------  --------  ----  ----  ----
1     AY191501   fluB-seg6  -            1     0     1
2     CY005979   fluA-seg4  H13          1     1     0
3     NC_036618  fluD-seg4  -            1     1     0
#---  ---------  ---------  --------  ----  ----  ----
-     *all*      -          -            3     2     1
-     *none*     -          -            0     0     0
#---  ---------  ---------  --------  ----  ----  ----
#
# Summary of reported alerts:
#
#     alert     causes   short                   per    num   num  long
#idx  code      failure  description            type  cases  seqs  description
#---  --------  -------  ------------------  -------  -----  ----  -----------
1     cdsstopn  yes*     CDS_HAS_STOP_CODON  feature      1     1  in-frame stop codon exists 5' of stop position predicted by homology to reference
2     cdsstopp  yes*     CDS_HAS_STOP_CODON  feature      1     1  stop codon in protein-based alignment
#---  --------  -------  ------------------  -------  -----  ----  -----------

And finally a list of the output files created:

# Output printed to screen saved in:                              my3.vadr.log
# List of executed commands saved in:                             my3.vadr.cmd
# List and description of all output files saved in:              my3.vadr.filelist
# esl-seqstat -a output for input fasta file saved in:            my3.vadr.seqstat
# 5 column feature table output for passing sequences saved in:   my3.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in:   my3.vadr.fail.tbl
# list of passing sequences saved in:                             my3.vadr.pass.list
# list of failing sequences saved in:                             my3.vadr.fail.list
# list of alerts in the feature tables saved in:                  my3.vadr.alt.list
# fasta file with passing sequences saved in:                     my3.vadr.pass.fa
# fasta file with failing sequences saved in:                     my3.vadr.fail.fa
# per-sequence tabular annotation summary file saved in:          my3.vadr.sqa
# per-sequence tabular classification summary file saved in:      my3.vadr.sqc
# per-feature tabular summary file saved in:                      my3.vadr.ftr
# per-model-segment tabular summary file saved in:                my3.vadr.sgm
# per-alert tabular summary file saved in:                        my3.vadr.alt
# alert count tabular summary file saved in:                      my3.vadr.alc
# per-model tabular summary file saved in:                        my3.vadr.mdl
# alignment doctoring tabular summary file saved in:              my3.vadr.dcr
# replaced stretches of Ns summary file (-r) saved in:            my3.vadr.rpn
#
# All output files created in directory ./my3/
#
# Elapsed time:  00:00:12.12
#                hh:mm:ss
# 
[ok]

Note that all the output files will be in the newly created my3 directory. The Summary of classified sequences listed that two sequences passed and one failed. The file my3.vadr.pass.list, lists the two sequences that passed:

MF147925.1
ON166841.1

and my3.vadr.fail.list lists the sequence that failed:

MH606682.1

Also, FASTA-formatted sequence files for each the passing and failing sequences are my3.vadr.pass.fa and my3.vadr.fail.fa.

For the two sequences that passed, the annotation is available in the output my3.vadr.pass.tbl file and for the two sequences that failed the annotation is in the my3.vadr.fail.tbl file.

Here is the output for the first sequence in the my3.vadr.pass.tbl file:

>Feature MF147925.1
42	1739	gene
			gene	HA
42	1739	CDS
			product	hemagglutinin
			function	receptor binding and fusion protein
			protein_id	MF147925.1_1
42	95	sig_peptide
			protein_id	MF147925.1_1
96	1067	mat_peptide
			product	HA1
			protein_id	MF147925.1_1
1068	1736	mat_peptide
			product	HA2
			protein_id	MF147925.1_1
>Feature ON166841.1
<1	1936	gene
			gene	HEF
<1	1936	CDS
			product	hemagglutinin-esterase precursor
			codon_start	2
			protein_id	ON166841.1_1

And the sequence in the my3.vadr.fail.tbl:

>Feature MH606682.1
35	337	gene
			gene	NB
35	337	CDS
			product	NB protein
			protein_id	MH606682.1_1
42	1442	gene
			gene	NA
42	1442	CDS
			product	neuraminidase
			protein_id	MH606682.1_2

Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (CDS:NB protein) in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:108,M:108]; seq-coords:227..229:+; mdl-coords:239..241:+; mdl:AY191501;
ERROR: CDS_HAS_STOP_CODON: (CDS:NB protein) stop codon in protein-based alignment [-]; seq-coords:227..229:+; mdl-coords:239..241:+; mdl:AY191501;

Feature table format is described at https://www.ncbi.nlm.nih.gov/WebSub/html/help/feature-table.html.

Note that the end of the fail.tbl file lists ERRORs for MH606682.1. In this case, the NB protein coding region has a stop codon 108 nucleotides earlier than expected. To investigate issues like these further, it can be helpful to add the --keep option to v-annotate.pl which results in additional output files being created, including alignment files.

ERROR lines such as this are meant to highlight potential problems for manual review or regions of interest to the user, but they do not necessarily mean that there is a problem with the sequence.

The annotation information is also available in other files with different formats, such as the my3/my3.vadr.ftr file, which may be easier to parse for some applications.

How to interpret VADR alert/error messages and other output

For examples of alerts/errors and more information on how to interpret the VADR output related to those alerts in the output feature tables, .alt.list file and the GenBank submission portal detailed error report .tsv files, see this vadr documentation page.

See the vadr README documentation section for more information on how to interpret VADR results and output, including information on file formats.

You can find information on two papers on VADR (below).

Explanation of options used in example command

The options used in the above command are the recommended set of options for flu annotation. The options are each briefly explained in the table below. More information can be found here,

option explanation
--split split input file into chunks of about 300Kb and run each chunk separately (300Kb can be changed to <n> by adding option --nkb <n>
--cpu 8 for input sequence files > 300Kb, run multi-threaded by parallelizing across up to 8 CPU workers (8 can be changed to <n1> with --cpu <n1>), requires --split
-r turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible
--atgonly specify that 'ATG' is the only start codon that should be considered valid
--xnocomp turn off composition-based stats for blastx, which seems to increase the length of flu blastx alignments in some cases
--nomisc specify that features for failing sequences not be changed to misc_features in the output .tbl file
--alt_fail extrant5,extrant3 specify that extrant5 and extrant3 alerts which are reported when one or more additional nucleotides compared with the reference is detected at the 5' or 3' ends, cause a sequence to fail
--mkey flu use the model files with prefix flu in the directory from --mdir
--mdir /usr/local/vadr-models-flu-1.6.3-2 specify that the models to use are in directory /usr/local/vadr-models-flu-1.6.3-2

The -r option is also used for SARS-CoV-2 annotation with VADR, for more information on the option in that context, see this page


Influenza VADR model library

As of December 20, 2023, the VADR model library for flu annotation (vadr-models-flu-1.6.3-2 model file) includes 70 flu models covering influenza A, B, C and D types. For each type/segment/subtype, the table below lists the accessions the models are derived from (70 total accessions) and the gene and CDS product names for each.

type seg subtype accession gene CDS product
A 1 - CY002079.1, CY103881.1, CY125942.1 PB2 polymerase PB2
A 2 - CY003646.1, CY103882.1, CY125943.1 PB1, PB1-F2 polymerase PB1, PB1-F2 protein
A 3 - CY003645.1, CY103883.1, CY125944.1 PA, PA-X polymerase PA, PA-X protein
A 4 H1 CY000449.2 HA hemagglutinin
A 4 H2 CY003907.1 HA hemagglutinin
A 4 H3 CY002000.1 HA hemagglutinin
A 4 H4 CY004847.1 HA hemagglutinin
A 4 H5 DQ864721.1 HA hemagglutinin
A 4 H6 DQ376635.1 HA hemagglutinin
A 4 H7 CY006037.1 HA hemagglutinin
A 4 H8 CY005970.1 HA hemagglutinin
A 4 H9 CY004642.1 HA hemagglutinin
A 4 H10 CY006001.1 HA hemagglutinin
A 4 H11 CY006005.1 HA hemagglutinin
A 4 H12 CY006008.1 HA hemagglutinin
A 4 H13 CY005979.1 HA hemagglutinin
A 4 H14 M35997.1 HA hemagglutinin
A 4 H15 CY006034.1 HA hemagglutinin
A 4 H16 AY684891.1 HA hemagglutinin
A 4 H17 CY103884.1 HA hemagglutinin
A 4 H18 CY125945.1 HA hemagglutinin
A 4 H19 ON637239.1 HA hemagglutinin
A 5 - CY006079.1, CY103885.1, CY125946.1 NP nucleocapsid protein
A 6 N1 CY002538.1 NA neuraminidase
A 6 N2 CY002010.1 NA neuraminidase
A 6 N3 CY005890.1 NA neuraminidase
A 6 N4 CY005359.1 NA neuraminidase
A 6 N5 CY004429.1 NA neuraminidase
A 6 N6 CY005641.1 NA neuraminidase
A 6 N7 CY004435.1 NA neuraminidase
A 6 N8 CY004056.1 NA neuraminidase
A 6 N9 CY004131.1 NA neuraminidase
A 6 N10 CY103886.1 NA neuraminidase
A 6 N11 CY125947.1 NA neuraminidase-like protein
A 7 - CY002009.1, CY103887.1, CY125948.1 M2, M1 matrix protein 2, matrix protein 1
A 8 - CY002284.1, CY103888.1, CY125949.1 NEP, NS1 nuclear export protein, nonstructural protein 1
B 1 - EF626642.1 PB1 polymerase PB1
B 2 - AY504599.1 PB2 polymerase PB2
B 3 - EF626633.1 PA polymerase PA
B 4 - AF387493.1 HA hemagglutinin
B 5 - EF626631.1 NP nucleoprotein
B 6 - AY191501.1 NB, NA NB protein, neuraminidase
B 7 - AY504605.1 M1, BM2 matrix protein 1, BM2 protein
B 8 - AY504614.1 NEP, NS1 nuclear export protein, nonstructural protein 1
C 1 - NC_006307.2 PB2 polymerase PB2
C 2 - NC_006308.2 PB1 polymerase PB1
C 3 - NC_006309.2 P3 polymerase P3
C 4 - NC_006310.2 HE hemagglutinin-esterase
C 5 - NC_006311.1 NP nucleoprotein
C 6 - NC_006312.2 M1, CM2 matrix protein 1, CM2 protein
C 7 - NC_006306.2 NEP, NS1 nonstructural protein 2, nonstructural protein 1
D 1 - NC_036616.1 PB2 polymerase PB2
D 2 - NC_036615.1 PB1 polymerase PB1
D 3 - NC_036619.1 P3 polymerase 3
D 4 - NC_036618.1 HEF hemagglutinin-esterase precursor
D 5 - NC_036617.1 NP nucleoprotein
D 6 - NC_036620.1 P42 P42
D 7 - NC_036621.1 NS2, NS1 nonstructural protein 2, nonstructural protein 1

Most of the flu VADR models are based on FLAN reference sequences

Currently, and for the past several years, influenza sequence submissions to GenBank are automatically processed by FLAN (FLu ANotation tool), with reference below. The VADR flu models (v1.6.3-2) were derived from the FLAN reference sequences with some changes and additions to improve VADR performance. There is more information on this in the 00NOTES.txt and mapping-flan-to-genbank/00NOTES.txt files included in the vadr-models-flu-1.6.3-2.tar.gz gzipped tarball. We are also preparing a manuscript describing the use of VADR for flu annotation that will include a discussion of how VADR models were created based on FLAN, and a comparison of flu annotation results using the two tools.

Changes made to covariance models (flu.cm)

For two models, additional steps (besides running VADR's v-build.pl program) were needed to create optimally performing covariance models in the flu.cm file. The model tarball file includes the two .stk files used to build those models. These alignments were used as input the the cmbuild program of Infernal to create the models in flu.cm:

alignment file explanation
CY006079.2.stk training alignment used to build the CY006079 (fluA-seg5) model in flu.cm. Includes 2 sequences CY006079.1 and CY006079.1-1542-ins-a, which is identical to CY006079.1 with an insertion of a single 'a' after position 1542. Building a model from this two sequence alignment instead of only CY006079 results in a model less likely to get spurious alerts related to the stop codon of the NP CDS.
CY005970.2.stk training alignment used to build the CY005970 (fluA-seg4) model in flu.cm. Includes 2 sequences CY005970.1 and CY005970.1-g1719a-1721del, which is identical to CY005970.1 with a substitution at position 1719 ('a' for 'g') and a deletion at position 1721. Building a model from this two sequence alignment instead of only CY005970 results in a model less likely to give an alert for an invalid stop codon (e.g. LC339539.1).


Reference

  • There is a preprint on using VADR for influenza annotation on bioRxiv: Vincent C Calhoun, Eneida L Hatcher, Linda Yankie, Eric P Nawrocki; Influenza sequence validation and annotation using VADR. https://www.biorxiv.org/content/10.1101/2024.03.21.585980v1

  • The recommended citation for using VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3

  • The following article describes changes made to VADR for faster SARS-CoV-2 annotation, including the -r option: Eric P Nawrocki; Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR. NAR Genom Bioinform. Vol 5, Issue 1 (2023) https://doi.org/10.1093/nargab/lqad002

  • FLAN reference: Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Tatusova T. FLAN: a web server for influenza virus genome annotation. Nucleic Acids Res. 2007 Jul;35(Web Server issue):W280-4. doi:10.1093/nar/gkm354. Epub 2007 Jun 1. PMID: 17545199; PMCID: PMC1933127.


Questions, comments, feature or model requests? Send a mail to [email protected].

⚠️ **GitHub.com Fallback** ⚠️