EVE Project Data - giffordlabcvr/Parvovirus-GLUE GitHub Wiki

EVE Sequences

EVE sequences were recovered from whole genome sequence (WGS) assemblies via database-integrated genome screening (DIGS) using the DIGS tool.

All data pertaining to this screen are included in this repository.

  • The complete list of vertebrate genomes screened can be found here.

  • The complete list of invertebrate genomes screened can be found here.

  • The set of parvovirus polypeptide sequences used as probes can be found here.

  • The final set of parvovirus and EPV polypeptide sequences used as references can be found here.

  • Input parameters for screening using the DIGS tool can be found here.


EVE Reference Sequences

We reconstructed reference sequences for EPVs from alignments of EPV sequences derived from the same initial germline colonisation event, i.e. orthologous elements in distinct species and paralogous elements that arose via intragenomic duplication of EPV sequences.

Tabular data summarising EPV loci can be found at the following links:

  1. Amdoparvoviruses
  2. Erythroparvoviruses
  3. Dependoparvoviruses
  4. Protoparvoviruses
  5. Ichthamaparvoviruses

Consensus/reference nucleotide sequences (FASTA format) for EPV loci can be found at the following links/directories:

  1. Amdoparvoviruses
  2. Erythroparvoviruses
  3. Dependoparvoviruses
  4. Protoparvoviruses
  5. Ichthamaparvoviruses

EPV Nomenclature

We have applied a systematic approach to naming EPVs, following a convention developed for endogenous retroviruses (ERVs). Each element was assigned a unique identifier (ID) constructed from a defined set of components.

The first component is the classifier ‘EPV’ (endogenous parvovirus element).

The second component is a composite of two distinct subcomponents separated by a period:

(i) the name of the EPV group;
(ii) a numeric ID that uniquely identifies the insertion. The numeric ID is an integer identifying a unique insertion locus that arose as a consequence of an initial germline infection. Thus, orthologous copies in different species are given the same number.

The third component of the ID defines the set of host species in which the ortholog occurs.

This systematic naming approach facilitates clear identification and comparison of EVEs across different species and research contexts.
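
The three-component scheme described above can be sketched in code. This is a minimal illustration, not the project's own tooling: the text specifies only that the two subcomponents of the second component are separated by a period, so the hyphen used here to join the three components is an assumption, and the group label `dependo` and host name `Laurasiatheria` are hypothetical example values rather than entries taken from the repository.

```python
import re
from typing import NamedTuple


class EpvId(NamedTuple):
    classifier: str  # first component: always "EPV"
    group: str       # second component, part (i): name of the EPV group
    locus: int       # second component, part (ii): numeric insertion ID
    host: str        # third component: host species or species group


# Assumption: the three components are joined by hyphens.
# Only the period between group name and numeric ID comes from the text.
_ID_PATTERN = re.compile(r"^(EPV)-([A-Za-z]+)\.(\d+)-(.+)$")


def build_epv_id(group: str, locus: int, host: str) -> str:
    """Compose an ID string from its components."""
    return f"EPV-{group}.{locus}-{host}"


def parse_epv_id(identifier: str) -> EpvId:
    """Split an ID string back into its components."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a recognised EPV ID: {identifier!r}")
    classifier, group, locus, host = match.groups()
    return EpvId(classifier, group, int(locus), host)
```

Because orthologous copies share the same group name and numeric ID, two parsed IDs refer to the same insertion locus when their `group` and `locus` fields match, regardless of the host component.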

Please note the following:

  1. EVEs were assigned to virus taxonomic groups as accurately as possible based on phylogenetic/genomic analysis. For EVEs that could not be confidently assigned to a subgroup, the lowest taxonomic rank possible for the EVE type is given (i.e. family).
  2. We grouped sets of orthologous EVEs using shared numeric IDs. However, some orthologous relationships might have been missed, and some EVEs may have been incorrectly grouped as orthologs when they are actually distinct, paralogous loci.
  3. When EVEs occur in a single species, the corresponding Latin binomial species name is provided. When EVEs occur as orthologs in multiple species, we provide the taxonomic name of the species group. If the species set corresponds to an unranked clade, we use the name of the closest named group at a lower rank and add the abbreviation 'UR' (unranked) to indicate that no named clade perfectly captures the range of species in which the EVE is found.

EPV and EPV-Parvovirus Phylogenies

To reconstruct the evolutionary relationships between EPVs and related viruses, we used GLUE to implement an automated process for deriving midpoint-rooted, annotated trees from the EPV-containing alignments included in our project.

Trees were reconstructed at distinct taxonomic levels:

  1. Recursively populated root phylogeny (Rep)
  2. Genus-level phylogenies
  3. EPV lineage-level phylogenies