EVE Project Data - giffordlabcvr/Hepadnaviridae-GLUE GitHub Wiki

Sequence Data

These are the raw data generated by database-integrated genome screening (DIGS). The tabular file records the genomic location of each endogenous viral element (EVE). EVEs were classified by comparison to a reference library of polypeptide sequences designed to represent the known diversity of hepadnaviruses, including extinct lineages represented only by EVEs.

These data were obtained via DIGS performed in vertebrate genome assemblies downloaded from NCBI Genome (2020-07-15).

Raw data about the EVEs in tabular format can be found here.

Nucleotide level data in FASTA format (individual files) can be found here.


Reference Sequence Data

We constructed consensus sequences for hepadnaviral paleoviruses by aligning eHBV sequences derived from the same initial germline colonisation event, i.e. orthologs in distinct species and paralogs that have arisen via intragenomic duplication.

Reference sequence data in tabular format are here.

The reference sequences in FASTA format are here.
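
As an illustration of how a consensus can be derived from a set of aligned orthologs and paralogs, the sketch below applies a simple majority rule to each alignment column. This is an assumption made for illustration: the project does not state which consensus rule was used, and the function name `majority_consensus` and the `min_coverage` threshold are hypothetical.

```python
from collections import Counter

GAP = "-"


def majority_consensus(aligned, min_coverage=0.5):
    """Return a majority-rule consensus for equal-length aligned sequences.

    Illustrative only: columns in which fewer than `min_coverage` of the
    sequences carry a residue (i.e. mostly gaps) are emitted as a gap.
    """
    if not aligned:
        raise ValueError("no sequences supplied")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned):
        residues = [c.upper() for c in column if c != GAP]
        if len(residues) < min_coverage * len(column):
            consensus.append(GAP)
            continue
        # Most frequent residue; ties go to the residue encountered first.
        consensus.append(Counter(residues).most_common(1)[0][0])
    return "".join(consensus)
```

For example, `majority_consensus(["ACGT-", "ACGA-", "ATGTC"])` keeps the majority base in each of the first four columns and leaves the final, mostly gapped column as a gap.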


Multiple Sequence Alignments

The Hepadnaviridae-GLUE project contains multiple sequence alignments linking all known eHBV and virus sequences.

Exported alignments can be accessed at the links below.

Nucleotide level data:

| Taxonomic group | Full-length eHBV | Core codons | Surface codons | Pol codons |
|---|---|---|---|---|
| Avihepadnavirus | FASTA MSA | FASTA MSA | FASTA MSA | FASTA MSA |
| Herpetohepadnavirus | FASTA MSA | FASTA MSA | FASTA MSA | FASTA MSA |
| Metahepadnavirus | FASTA MSA | FASTA MSA | FASTA MSA | FASTA MSA |

Protein level data:

| Taxonomic group | Core AA | Surface AA | Pol AA |
|---|---|---|---|
| Avihepadnavirus | FASTA MSA | FASTA MSA | FASTA MSA |
| Herpetohepadnavirus | FASTA MSA | FASTA MSA | FASTA MSA |
| Metahepadnavirus | FASTA MSA | FASTA MSA | FASTA MSA |
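
Once one of the exported alignments has been downloaded, it can be loaded and sanity-checked with a few lines of standard-library Python. The sketch below is a minimal example, not part of the project tooling; the helper names `read_fasta` and `is_aligned` are hypothetical, and it assumes only that the files are ordinary FASTA in which every aligned sequence has the same length.

```python
def read_fasta(lines):
    """Parse FASTA text from an iterable of lines into {header: sequence}."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def is_aligned(records):
    """True if all sequences share one length, as expected in an MSA."""
    return len({len(seq) for seq in records.values()}) <= 1
```

A typical use would be `with open(path) as handle: records = read_fasta(handle)`, where `path` points to a downloaded alignment file, followed by `is_aligned(records)` to confirm the file is a valid alignment before further analysis.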

EVE Nomenclature

Nomenclature for eHBVs

We have applied a systematic approach to naming endogenous hepadnaviruses (eHBVs), following a convention developed for endogenous retroviruses. Each eHBV locus was assigned a unique identifier (ID) constructed from several components, each of which refers to a property of the locus.

[Figure: eHBV nomenclature]

The first component is the classifier ‘eHBV’ (endogenous hepatitis B virus/endogenous hepadnavirus).

The second component is a composite of two distinct subcomponents separated by a period: (i) the name of the eHBV group; (ii) a numeric ID that uniquely identifies the insertion. The numeric ID is an integer that identifies a unique insertion locus that arose as a consequence of an initial germline infection. Thus, orthologous copies in different species are given the same number.

Where an EVE sequence is thought to have been duplicated within the germline following its initial incorporation (e.g. via segmental duplication or transposition), we have appended an additional 'duplicate ID' to the numeric ID, separated by a period. Please note that we have not yet resolved the orthologous relationships among sets of eHBV sequences belonging to multicopy eHBV lineages. We have therefore assigned unique duplicate IDs to each sequence within these lineages.

The third component of the ID defines the set of host species in which the ortholog occurs, or did occur prior to being deleted.
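
The three-part structure described above can be made concrete with a small parser. The sketch below rests on assumptions that the text does not state: that the three components are joined by hyphens, that group and host names contain no periods or extra leading hyphens, and that the duplicate ID can be kept as an opaque string. The class `EhbvId`, the function `parse_ehbv_id`, and the example identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EhbvId:
    classifier: str           # always 'eHBV'
    group: str                # name of the eHBV group
    insertion: int            # numeric ID shared by orthologous copies
    duplicate: Optional[str]  # duplicate ID, if the locus was duplicated
    host: str                 # host species set in which the ortholog occurs


def parse_ehbv_id(identifier, sep="-"):
    """Split an eHBV locus ID into its components (assumed separator: '-')."""
    parts = identifier.split(sep, 2)
    if len(parts) != 3:
        raise ValueError(f"expected three components in {identifier!r}")
    classifier, locus, host = parts
    if classifier != "eHBV":
        raise ValueError(f"unexpected classifier {classifier!r}")

    # Second component: group.numericID, optionally followed by .duplicateID
    fields = locus.split(".")
    if len(fields) not in (2, 3) or not fields[1].isdigit():
        raise ValueError(f"malformed second component {locus!r}")
    duplicate = fields[2] if len(fields) == 3 else None
    return EhbvId(classifier, fields[0], int(fields[1]), duplicate, host)
```

Under these assumptions, a made-up ID such as `eHBV-Avi.2.3-ExampleHost` would parse to group `Avi`, insertion `2`, duplicate ID `3` and host `ExampleHost`, while two IDs that differ only in the host component would be recognised as orthologous copies of the same insertion.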