EVE Project Data - giffordlabcvr/Hepadnaviridae-GLUE GitHub Wiki

Sequence Data

These are the raw data generated by database-integrated genome screening (DIGS). The tabular file records the genomic location of each endogenous viral element (EVE). EVEs were classified by comparison to a reference library of polypeptide sequences designed to represent the known diversity of hepadnaviruses, including extinct lineages represented only by EVEs.

DIGS was performed on vertebrate genome assemblies downloaded from NCBI Genome (2020-07-15).

Raw data about the EVEs in tabular format can be found here.

Nucleotide level data in FASTA format (individual files) can be found here.

Reference Sequence Data

We constructed consensus sequences for hepadnaviral paleoviruses by aligning endogenous hepadnavirus (eHBV) sequences derived from the same initial germline colonisation event, i.e. orthologs in distinct species and paralogs that have arisen via intragenomic duplication.
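The idea of deriving a consensus from aligned sequences can be illustrated with a minimal sketch. It assumes a simple majority rule per alignment column, with gaps ignored and an ambiguity code where no residue has a majority. This is an illustration only and is not necessarily the procedure used to build the project's reference sequences; the function name, the `min_fraction` threshold and the toy sequences are all hypothetical.

```python
from collections import Counter

GAP = "-"


def majority_consensus(aligned_seqs, min_fraction=0.5, ambiguous="N"):
    """Return a majority-rule consensus for equal-length aligned sequences.

    ASSUMPTION: simple per-column majority rule; not necessarily the
    method used to build the project's reference sequences.
    """
    if not aligned_seqs:
        raise ValueError("no sequences supplied")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_seqs):
        # Ignore gaps so that fragmentary elements do not outvote real residues.
        residues = [char.upper() for char in column if char != GAP]
        if not residues:
            # Every sequence has a gap here, so the consensus keeps the gap.
            consensus.append(GAP)
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # Emit the residue only if it exceeds the majority threshold.
        if count / len(residues) > min_fraction:
            consensus.append(residue)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


# Toy aligned fragments (hypothetical, not real eHBV data).
toy_alignment = ["ACGT-", "ACGA-", "AC-TA"]
print(majority_consensus(toy_alignment))
```

Ignoring gaps means a column is decided only by the copies that actually cover it, which suits fragmentary elements.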

Reference sequence data in tabular format are here.

The reference sequences in FASTA format are here.

Multiple Sequence Alignments

The Hepadnaviridae-GLUE project contains multiple sequence alignments linking all known eHBV and hepadnavirus sequences.

Exported alignments can be accessed at the links below.

Nucleotide level data:

Taxonomic group	Full-length eHBV	Core codons	Surface codons	Pol codons
Avihepadnavirus	FASTA MSA	FASTA MSA	FASTA MSA	FASTA MSA
Herpetohepadnavirus	FASTA MSA	FASTA MSA	FASTA MSA	FASTA MSA
Metahepadnavirus	FASTA MSA	FASTA MSA	FASTA MSA	FASTA MSA

Protein level data:

Taxonomic group	Core AA	Surface AA	Pol AA
Avihepadnavirus	FASTA MSA	FASTA MSA	FASTA MSA
Herpetohepadnavirus	FASTA MSA	FASTA MSA	FASTA MSA
Metahepadnavirus	FASTA MSA	FASTA MSA	FASTA MSA
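
After downloading one of the exported alignments listed above, it can be useful to confirm that the file really is an alignment, i.e. that every sequence has the same length. The following is a minimal sketch using only the Python standard library. The function names and the example records are hypothetical, and real exports may use header conventions this sketch does not handle.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping sequence ID to sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header without an identifier")
            # Use the first whitespace-delimited token of the header as the ID.
            current_id = header[0]
            if current_id in records:
                raise ValueError(f"duplicate sequence ID: {current_id}")
            records[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            # Sequences may be wrapped over several lines; collect the chunks.
            records[current_id].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in records.items()}


def alignment_length(records):
    """Return the shared sequence length, or raise if the lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"not a valid alignment; lengths found: {sorted(lengths)}")
    return lengths.pop()


# Hypothetical two-record alignment; replace with the contents of a downloaded file.
example = ">seqA demo\nATG-CC\nGGA\n>seqB\nATGTCCGGA\n"
records = parse_fasta(example)
print(alignment_length(records))
```

To check a downloaded file, read its contents with `open(path).read()` and pass the text to `parse_fasta`.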

EVE Nomenclature

Nomenclature for eHBVs

We use a systematic naming convention for endogenous hepadnaviruses (eHBVs), adapted from a framework established for endogenous retroviruses. Each eHBV locus is assigned a unique identifier (ID) that reflects key properties of the insertion.

The ID consists of three components:

- Classifier: the prefix 'eHBV' (endogenous hepatitis B virus/endogenous hepadnavirus).
- Virus group and locus ID: a composite of:
  - the name of the hepadnavirus taxonomic group from which the element derives;
  - a numeric code uniquely identifying the insertion locus.
- Host species set: a designation indicating the species in which the orthologous locus is present, or was present before deletion.

This standardized approach ensures clarity and consistency in referencing eHBV loci across different hosts and taxonomic contexts.
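
The three-part structure described above can be sketched as a small data type that builds and parses IDs. The text does not state how the components are delimited, so this sketch assumes a hyphen between the three components and a period between the group name and the numeric locus code. The class and function names, and the host set value `HostSetA`, are hypothetical placeholders.

```python
import re
from dataclasses import dataclass

CLASSIFIER = "eHBV"

# ASSUMPTION: "-" separates the three components and "." separates the
# virus group name from the numeric locus code. The source text does not
# specify the delimiters.
ID_PATTERN = re.compile(
    r"^eHBV-(?P<group>[A-Za-z]+)\.(?P<locus>\d+)-(?P<host_set>[A-Za-z]+)$"
)


@dataclass(frozen=True)
class EhbvId:
    virus_group: str  # hepadnavirus taxonomic group the element derives from
    locus: int        # numeric code uniquely identifying the insertion locus
    host_set: str     # species set in which the orthologous locus occurs

    def __str__(self):
        return f"{CLASSIFIER}-{self.virus_group}.{self.locus}-{self.host_set}"


def parse_ehbv_id(text):
    """Split an ID string into its components under the assumed delimiters."""
    match = ID_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognised eHBV ID: {text!r}")
    return EhbvId(match["group"], int(match["locus"]), match["host_set"])


# "HostSetA" is a placeholder, not a real host species set designation.
example_id = EhbvId("Avihepadnavirus", 1, "HostSetA")
print(example_id)
print(parse_ehbv_id(str(example_id)) == example_id)
```

Keeping the locus code as an integer lets loci within a group be sorted numerically.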