Influenza annotation - ncbi/vadr GitHub Wiki

How to annotate influenza virus sequences with VADR

Example annotation of flu sequences

How to interpret VADR alert and error messages
Explanation of options used in example annotation

Influenza VADR model library

Most of the flu VADR models are based on FLAN reference sequences
Changes made to covariance models (flu.cm)

How to annotate influenza virus sequences with VADR:

The steps below explain how to use VADR for flu sequence validation and annotation. Testing of VADR for flu is still ongoing and these recommendations are subject to change.

GenBank is not currently using VADR to automatically process flu sequence submissions like it does for SARS-CoV-2. GenBank flu sequence submissions are currently validated and annotated using the FLAN annotation tool. Our testing indicates that VADR and FLAN give identical results on most flu sequences.

Steps for using VADR for flu annotation:

Download and install the latest version of VADR (v1.6.3 or later), following the instructions on this page. Alternatively, you can use the StaPH-B VADR 1.6.3 docker image created by Curtis Kapsak (docker image names: staphb/vadr:1.6.3 and staphb/vadr:latest), available on dockerhub and quay. A brief README for the docker image is here.
Download the latest flu VADR models (version 1.6.3-2, gzipped tarball) from here, unpack them (e.g. tar xfz <tarball.gz>). Note the path to the directory name created (<flu-models-dir-path>) for step 3. (If you are using the docker image you can skip this step as the flu VADR models are included.)
Remove terminal ambiguous nucleotides from your input fasta sequence file using the fasta-trim-terminal-ambigs.pl script in $VADRSCRIPTSDIR/miniscripts/. GenBank processing of viral sequences typically includes removing ambiguous nucleotides from the beginning and ends of sequences, and also removing sequences that are less than 60nt (after trimming).

WARNING: the fasta-trim-terminal-ambigs.pl script will not exactly reproduce the trimming that the GenBank pipeline does in some rare cases, but should fix the large majority of the discrepancies you might see between local VADR results and GenBank results.

To remove terminal ambiguous nucleotides from your sequence file <input-fasta-file> and to remove short sequences to create a new trimmed file <trimmed-fasta-file>, execute:

$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 60 <input-fasta-file> > <trimmed-fasta-file>

Run the v-annotate.pl program on an input trimmed fasta file with flu sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command).

v-annotate.pl --split --cpu 8 -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3 --mkey flu --mdir <flu-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>

Example annotation of flu sequences

This section shows output from an example v-annotate.pl annotation of three flu sequences GenBank using the above command and options.

The fasta file of those three sequences can be downloaded here.

(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)

For this example, the flu model directory is in /usr/local/vadr-models-flu-1.6.3-2 and the pretrim.flu.3.fa sequence file is in the current directory. We will create a new file flu.3.fa with trimmed sequences in the next step.

As explained above, remove terminal ambiguous nucleotides and sequences that are shorter than 60nt with the command:

$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 60 pretrim.flu.3.fa > flu.3.fa

Next, to annotate the trimmed sequences using the above v-annotate.pl options for flu, run the following command (scroll to the right to see full command), which will create a new directory called my3 into which VADR output files will be written.

v-annotate.pl --split --cpu 8 -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3 --mkey flu --mdir /usr/local/vadr-models-flu-1.6.3 flu.3.fa my3

The options used are explained further below.

When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:

# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.6.3 (Dec 2023)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Wed Dec 20 11:31:39 2023
# $VADRBIOEASELDIR:  /home/nawrocki/vadr-install-1.6.3/Bio-Easel
# $VADRBLASTDIR:     /home/nawrocki/vadr-install-1.6.3/ncbi-blast/bin
# $VADREASELDIR:     /home/nawrocki/vadr-install-1.6.3/infernal/binaries
# $VADRINFERNALDIR:  /home/nawrocki/vadr-install-1.6.3/infernal/binaries
# $VADRMODELDIR:     /home/nawrocki/vadr-install-1.6.3/vadr-models-calici
# $VADRSCRIPTSDIR:   /home/nawrocki/vadr-install-1.6.3/vadr
#
# sequence file:                                                                  flu.3.fa
# output directory:                                                               my3
# only consider ATG a valid start codon:                                          yes [--atgonly]
# specify that alert codes in <s> cause FAILure:                                  extrant5,extrant3 [--alt_fail]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  flu [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         /usr/local/vadr-models-flu-1.6.3 [--mdir]
# in feature table for failed seqs, never change feature type to misc_feature:    yes [--nomisc]
# turn off composition-based for blastx statistics with -comp_based_stats 0:      yes [--xnocomp]
# replace stretches of Ns with expected nts, where possible:                      yes [-r]
# split input file into chunks, run each chunk separately:                        yes [--split]
# parallelize across <n> CPU workers (requires --split or --glsearch):            8 [--cpu]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Next, v-annotate.pl will output information as it proceeds through different steps of the analysis:

# Validating input                                                                                          ... done. [    0.4 seconds]
# Splitting sequence file into chunks to run independently in parallel on 8 processors                      ... done. [    0.5 seconds]
# Executing 3 scripts in parallel on 8 processors to process 3 partition(s) of all 3 sequence(s)            ... 
#	   3 of    3 jobs finished (0.1 minutes spent waiting)
# done. [   10.1 seconds]
# Merging and finalizing output                                                                             ... done. [    0.9 seconds]

With the --split --cpu 8 options, the input fasta script is split up into chunks and runs v-annotate.pl separately on those chunks on 8 different CPUs in parallel. When all sequences are finished processing the main script merges the output together.

The v-annotate.pl output then includes a summary of the classification of sequences, and the alerts reported:

# Summary of classified sequences:
#
#                                      num   num   num
#idx  model      group      subgroup  seqs  pass  fail
#---  ---------  ---------  --------  ----  ----  ----
1     AY191501   fluB-seg6  -            1     0     1
2     CY005979   fluA-seg4  H13          1     1     0
3     NC_036618  fluD-seg4  -            1     1     0
#---  ---------  ---------  --------  ----  ----  ----
-     *all*      -          -            3     2     1
-     *none*     -          -            0     0     0
#---  ---------  ---------  --------  ----  ----  ----
#
# Summary of reported alerts:
#
#     alert     causes   short                   per    num   num  long
#idx  code      failure  description            type  cases  seqs  description
#---  --------  -------  ------------------  -------  -----  ----  -----------
1     cdsstopn  yes*     CDS_HAS_STOP_CODON  feature      1     1  in-frame stop codon exists 5' of stop position predicted by homology to reference
2     cdsstopp  yes*     CDS_HAS_STOP_CODON  feature      1     1  stop codon in protein-based alignment
#---  --------  -------  ------------------  -------  -----  ----  -----------

And finally a list of the output files created:

# Output printed to screen saved in:                              my3.vadr.log
# List of executed commands saved in:                             my3.vadr.cmd
# List and description of all output files saved in:              my3.vadr.filelist
# esl-seqstat -a output for input fasta file saved in:            my3.vadr.seqstat
# 5 column feature table output for passing sequences saved in:   my3.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in:   my3.vadr.fail.tbl
# list of passing sequences saved in:                             my3.vadr.pass.list
# list of failing sequences saved in:                             my3.vadr.fail.list
# list of alerts in the feature tables saved in:                  my3.vadr.alt.list
# fasta file with passing sequences saved in:                     my3.vadr.pass.fa
# fasta file with failing sequences saved in:                     my3.vadr.fail.fa
# per-sequence tabular annotation summary file saved in:          my3.vadr.sqa
# per-sequence tabular classification summary file saved in:      my3.vadr.sqc
# per-feature tabular summary file saved in:                      my3.vadr.ftr
# per-model-segment tabular summary file saved in:                my3.vadr.sgm
# per-alert tabular summary file saved in:                        my3.vadr.alt
# alert count tabular summary file saved in:                      my3.vadr.alc
# per-model tabular summary file saved in:                        my3.vadr.mdl
# alignment doctoring tabular summary file saved in:              my3.vadr.dcr
# replaced stretches of Ns summary file (-r) saved in:            my3.vadr.rpn
#
# All output files created in directory ./my3/
#
# Elapsed time:  00:00:12.12
#                hh:mm:ss
# 
[ok]

Note that all the output files will be in the newly created my3 directory. The Summary of classified sequences listed that two sequences passed and one failed. The file my3.vadr.pass.list, lists the two sequences that passed:

MF147925.1
ON166841.1

and my3.vadr.fail.list lists the sequence that failed:

MH606682.1

Also, FASTA-formatted sequence files for each the passing and failing sequences are my3.vadr.pass.fa and my3.vadr.fail.fa.

For the two sequences that passed, the annotation is available in the output my3.vadr.pass.tbl file and for the two sequences that failed the annotation is in the my3.vadr.fail.tbl file.

Here is the output for the first sequence in the my3.vadr.pass.tbl file:

>Feature MF147925.1
42	1739	gene
			gene	HA
42	1739	CDS
			product	hemagglutinin
			function	receptor binding and fusion protein
			protein_id	MF147925.1_1
42	95	sig_peptide
			protein_id	MF147925.1_1
96	1067	mat_peptide
			product	HA1
			protein_id	MF147925.1_1
1068	1736	mat_peptide
			product	HA2
			protein_id	MF147925.1_1
>Feature ON166841.1
<1	1936	gene
			gene	HEF
<1	1936	CDS
			product	hemagglutinin-esterase precursor
			codon_start	2
			protein_id	ON166841.1_1

And the sequence in the my3.vadr.fail.tbl:

>Feature MH606682.1
35	337	gene
			gene	NB
35	337	CDS
			product	NB protein
			protein_id	MH606682.1_1
42	1442	gene
			gene	NA
42	1442	CDS
			product	neuraminidase
			protein_id	MH606682.1_2

Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (CDS:NB protein) in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:108,M:108]; seq-coords:227..229:+; mdl-coords:239..241:+; mdl:AY191501;
ERROR: CDS_HAS_STOP_CODON: (CDS:NB protein) stop codon in protein-based alignment [-]; seq-coords:227..229:+; mdl-coords:239..241:+; mdl:AY191501;

Feature table format is described at https://www.ncbi.nlm.nih.gov/WebSub/html/help/feature-table.html.

Note that the end of the fail.tbl file lists ERRORs for MH606682.1. In this case, the NB protein coding region has a stop codon 108 nucleotides earlier than expected. To investigate issues like these further, it can be helpful to add the --keep option to v-annotate.pl which results in additional output files being created, including alignment files.

ERROR lines such as this are meant to highlight potential problems for manual review or regions of interest to the user, but they do not necessarily mean that there is a problem with the sequence.

The annotation information is also available in other files with different formats, such as the my3/my3.vadr.ftr file, which may be easier to parse for some applications.

How to interpret VADR alert/error messages and other output

For examples of alerts/errors and more information on how to interpret the VADR output related to those alerts in the output feature tables, .alt.list file and the GenBank submission portal detailed error report .tsv files, see this vadr documentation page.

See the vadr README documentation section for more information on how to interpret VADR results and output, including information on file formats.

You can find information on two papers on VADR (below).

Explanation of options used in example command

The options used in the above command are the recommended set of options for flu annotation. The options are each briefly explained in the table below. More information can be found here,

option	explanation
`--split`	split input file into chunks of about 300Kb and run each chunk separately (300Kb can be changed to `<n>` by adding option `--nkb <n>`
`--cpu 8`	for input sequence files > 300Kb, run multi-threaded by parallelizing across up to 8 CPU workers (8 can be changed to `<n1>` with `--cpu <n1>`), requires `--split`
`-r`	turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible
`--atgonly`	specify that 'ATG' is the only start codon that should be considered valid
`--xnocomp`	turn off composition-based stats for blastx, which seems to increase the length of flu blastx alignments in some cases
`--nomisc`	specify that features for failing sequences not be changed to `misc_feature`s in the output .tbl file
`--alt_fail extrant5,extrant3`	specify that `extrant5` and `extrant3` alerts which are reported when one or more additional nucleotides compared with the reference is detected at the 5' or 3' ends, cause a sequence to fail
`--mkey flu`	use the model files with prefix `flu` in the directory from `--mdir`
`--mdir /usr/local/vadr-models-flu-1.6.3-2`	specify that the models to use are in directory /usr/local/vadr-models-flu-1.6.3-2

The -r option is also used for SARS-CoV-2 annotation with VADR, for more information on the option in that context, see this page

Influenza VADR model library

As of December 20, 2023, the VADR model library for flu annotation (vadr-models-flu-1.6.3-2 model file) includes 70 flu models covering influenza A, B, C and D types. For each type/segment/subtype, the table below lists the accessions the models are derived from (70 total accessions) and the gene and CDS product names for each.

type	seg	subtype	accession	gene	CDS product
A	1	-	`CY002079.1`, `CY103881.1`, `CY125942.1`	PB2	polymerase PB2
A	2	-	`CY003646.1`, `CY103882.1`, `CY125943.1`	PB1, PB1-F2	polymerase PB1, PB1-F2 protein
A	3	-	`CY003645.1`, `CY103883.1`, `CY125944.1`	PA, PA-X	polymerase PA, PA-X protein
A	4	H1	`CY000449.2`	HA	hemagglutinin
A	4	H2	`CY003907.1`	HA	hemagglutinin
A	4	H3	`CY002000.1`	HA	hemagglutinin
A	4	H4	`CY004847.1`	HA	hemagglutinin
A	4	H5	`DQ864721.1`	HA	hemagglutinin
A	4	H6	`DQ376635.1`	HA	hemagglutinin
A	4	H7	`CY006037.1`	HA	hemagglutinin
A	4	H8	`CY005970.1`	HA	hemagglutinin
A	4	H9	`CY004642.1`	HA	hemagglutinin
A	4	H10	`CY006001.1`	HA	hemagglutinin
A	4	H11	`CY006005.1`	HA	hemagglutinin
A	4	H12	`CY006008.1`	HA	hemagglutinin
A	4	H13	`CY005979.1`	HA	hemagglutinin
A	4	H14	`M35997.1`	HA	hemagglutinin
A	4	H15	`CY006034.1`	HA	hemagglutinin
A	4	H16	`AY684891.1`	HA	hemagglutinin
A	4	H17	`CY103884.1`	HA	hemagglutinin
A	4	H18	`CY125945.1`	HA	hemagglutinin
A	4	H19	`ON637239.1`	HA	hemagglutinin
A	5	-	`CY006079.1`, `CY103885.1`, `CY125946.1`	NP	nucleocapsid protein
A	6	N1	`CY002538.1`	NA	neuraminidase
A	6	N2	`CY002010.1`	NA	neuraminidase
A	6	N3	`CY005890.1`	NA	neuraminidase
A	6	N4	`CY005359.1`	NA	neuraminidase
A	6	N5	`CY004429.1`	NA	neuraminidase
A	6	N6	`CY005641.1`	NA	neuraminidase
A	6	N7	`CY004435.1`	NA	neuraminidase
A	6	N8	`CY004056.1`	NA	neuraminidase
A	6	N9	`CY004131.1`	NA	neuraminidase
A	6	N10	`CY103886.1`	NA	neuraminidase
A	6	N11	`CY125947.1`	NA	neuraminidase-like protein
A	7	-	`CY002009.1`, `CY103887.1`, `CY125948.1`	M2, M1	matrix protein 2, matrix protein 1
A	8	-	`CY002284.1`, `CY103888.1`, `CY125949.1`	NEP, NS1	nuclear export protein, nonstructural protein 1

B	1	-	`EF626642.1`	PB1	polymerase PB1
B	2	-	`AY504599.1`	PB2	polymerase PB2
B	3	-	`EF626633.1`	PA	polymerase PA
B	4	-	`AF387493.1`	HA	hemagglutinin
B	5	-	`EF626631.1`	NP	nucleoprotein
B	6	-	`AY191501.1`	NB, NA	NB protein, neuraminidase
B	7	-	`AY504605.1`	M1, BM2	matrix protein 1, BM2 protein
B	8	-	`AY504614.1`	NEP, NS1	nuclear export protein, nonstructural protein 1

C	1	-	`NC_006307.2`	PB2	polymerase PB2
C	2	-	`NC_006308.2`	PB1	polymerase PB1
C	3	-	`NC_006309.2`	P3	polymerase P3
C	4	-	`NC_006310.2`	HE	hemagglutinin-esterase
C	5	-	`NC_006311.1`	NP	nucleoprotein
C	6	-	`NC_006312.2`	M1, CM2	matrix protein 1, CM2 protein
C	7	-	`NC_006306.2`	NEP, NS1	nonstructural protein 2, nonstructural protein 1

D	1	-	`NC_036616.1`	PB2	polymerase PB2
D	2	-	`NC_036615.1`	PB1	polymerase PB1
D	3	-	`NC_036619.1`	P3	polymerase 3
D	4	-	`NC_036618.1`	HEF	hemagglutinin-esterase precursor
D	5	-	`NC_036617.1`	NP	nucleoprotein
D	6	-	`NC_036620.1`	P42	P42
D	7	-	`NC_036621.1`	NS2, NS1	nonstructural protein 2, nonstructural protein 1

Most of the flu VADR models are based on FLAN reference sequences

Currently, and for the past several years, influenza sequence submissions to GenBank are automatically processed by FLAN (FLu ANotation tool), with reference below. The VADR flu models (v1.6.3-2) were derived from the FLAN reference sequences with some changes and additions to improve VADR performance. There is more information on this in the 00NOTES.txt and mapping-flan-to-genbank/00NOTES.txt files included in the vadr-models-flu-1.6.3-2.tar.gz gzipped tarball. We are also preparing a manuscript describing the use of VADR for flu annotation that will include a discussion of how VADR models were created based on FLAN, and a comparison of flu annotation results using the two tools.

Changes made to covariance models (`flu.cm`)

For two models, additional steps (besides running VADR's v-build.pl program) were needed to create optimally performing covariance models in the flu.cm file. The model tarball file includes the two .stk files used to build those models. These alignments were used as input the the cmbuild program of Infernal to create the models in flu.cm:

alignment file	explanation
`CY006079.2.stk`	training alignment used to build the CY006079 (fluA-seg5) model in flu.cm. Includes 2 sequences CY006079.1 and CY006079.1-1542-ins-a, which is identical to CY006079.1 with an insertion of a single 'a' after position 1542. Building a model from this two sequence alignment instead of only CY006079 results in a model less likely to get spurious alerts related to the stop codon of the NP CDS.
`CY005970.2.stk`	training alignment used to build the CY005970 (fluA-seg4) model in flu.cm. Includes 2 sequences CY005970.1 and CY005970.1-g1719a-1721del, which is identical to CY005970.1 with a substitution at position 1719 ('a' for 'g') and a deletion at position 1721. Building a model from this two sequence alignment instead of only CY005970 results in a model less likely to give an alert for an invalid stop codon (e.g. LC339539.1).

Additional VADR documentation

Reference

There is a preprint on using VADR for influenza annotation on bioRxiv: Vincent C Calhoun, Eneida L Hatcher, Linda Yankie, Eric P Nawrocki; Influenza sequence validation and annotation using VADR. https://www.biorxiv.org/content/10.1101/2024.03.21.585980v1
The recommended citation for using VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3
The following article describes changes made to VADR for faster SARS-CoV-2 annotation, including the -r option: Eric P Nawrocki; Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR. NAR Genom Bioinform. Vol 5, Issue 1 (2023) https://doi.org/10.1093/nargab/lqad002
FLAN reference: Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Tatusova T. FLAN: a web server for influenza virus genome annotation. Nucleic Acids Res. 2007 Jul;35(Web Server issue):W280-4. doi:10.1093/nar/gkm354. Epub 2007 Jun 1. PMID: 17545199; PMCID: PMC1933127.

Influenza annotation - ncbi/vadr GitHub Wiki

How to annotate influenza virus sequences with VADR

Example annotation of flu sequences

Influenza VADR model library

Additional VADR documentation

References

How to annotate influenza virus sequences with VADR:

Example annotation of flu sequences

How to interpret VADR alert/error messages and other output

Explanation of options used in example command

Influenza VADR model library

Most of the flu VADR models are based on FLAN reference sequences

Changes made to covariance models (`flu.cm`)

Additional VADR documentation

Reference

Questions, comments, feature or model requests? Send a mail to [email protected].

⚠️ GitHub.com Fallback ⚠️

Influenza annotation - ncbi/vadr GitHub Wiki

How to annotate influenza virus sequences with VADR

Example annotation of flu sequences

Influenza VADR model library

Additional VADR documentation

References

How to annotate influenza virus sequences with VADR:

Example annotation of flu sequences

How to interpret VADR alert/error messages and other output

Explanation of options used in example command

Influenza VADR model library

Most of the flu VADR models are based on FLAN reference sequences

Changes made to covariance models (flu.cm)

Additional VADR documentation

Reference

Questions, comments, feature or model requests? Send a mail to [email protected].

⚠️ **GitHub.com Fallback** ⚠️

Changes made to covariance models (`flu.cm`)

⚠️ GitHub.com Fallback ⚠️