Project Data - giffordlabcvr/Hepadnaviridae-GLUE GitHub Wiki

Genome Features

Hepadnaviruses have a small, circular DNA genome of approximately 3 kilobases (kb). The genome is highly streamlined, with extensive gene overlap: the open reading frame (ORF) encoding the viral polymerase (P) protein occupies most of the genome and typically overlaps at least one of the ORFs encoding the core (C) and surface (S) proteins.
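The gene overlap described above can be illustrated with a minimal Python sketch. The genome length and ORF coordinates below are made-up round numbers chosen for illustration only; they are not taken from any reference sequence in this project. The sketch treats an ORF whose start is greater than its end as wrapping past the origin of the circular genome.

```python
from typing import Dict, Set, Tuple

GENOME_LENGTH = 3000  # illustrative round number, not a real genome length

# Hypothetical (start, end) coordinates, 1-based and inclusive.
# A start greater than the end means the ORF wraps past the origin.
FEATURES: Dict[str, Tuple[int, int]] = {
    "P": (2300, 1600),
    "C": (1900, 2450),
    "S": (150, 830),
}


def circular_positions(start: int, end: int, genome_length: int) -> Set[int]:
    """Return the set of positions covered by a feature on a circular genome."""
    if start <= end:
        return set(range(start, end + 1))
    # The feature runs to the end of the sequence and continues from position 1.
    return set(range(start, genome_length + 1)) | set(range(1, end + 1))


def overlap_length(a: Tuple[int, int], b: Tuple[int, int], genome_length: int) -> int:
    """Count the positions shared by two features."""
    return len(circular_positions(*a, genome_length) & circular_positions(*b, genome_length))


for name in ("C", "S"):
    shared = overlap_length(FEATURES["P"], FEATURES[name], GENOME_LENGTH)
    print(f"P overlaps {name} by {shared} nt")
```

Representing each feature as a set of positions is simple rather than efficient, which is acceptable for genomes of this size.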

Hepadnaviridae-GLUE incorporates a standard set of hepadnavirus genome features.

Reference Sequences & Metadata

The sequence data in this project are organized into multiple distinct sources. Each source contains data in either GenBank XML or plain FASTA format. The type of data is indicated by the name of the source (all GenBank XML sources contain 'ncbi' in the name).
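The naming rule above can be expressed as a short Python sketch. The function name and the return labels are assumptions made for illustration, and `fasta-example` is a hypothetical source name; only `ncbi-refseqs` is a source named in this project.

```python
def source_format(source_name: str) -> str:
    """Return the data format implied by a source name."""
    # Rule stated in the text: GenBank XML sources contain 'ncbi' in the name.
    # Any other source is assumed here to hold plain FASTA data.
    return "genbank-xml" if "ncbi" in source_name else "fasta"


# 'ncbi-refseqs' exists in the project; 'fasta-example' is a made-up name.
for name in ("ncbi-refseqs", "fasta-example"):
    print(name, "->", source_format(name))
```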

GenBank XML files are imported into this project directly from NCBI GenBank using an appropriately configured version of GLUE's GenBank importer module. The core Hepadnaviridae-GLUE project contains a single NCBI-derived source, ncbi-refseqs, which holds 'master reference' genome sequences for each hepadnavirus species included in this project.

Hepadnavirus host species

Diverse hepadnavirus host species. Left to right: recent research has identified divergent hepadnaviruses in (i) icefish and (ii) spiny lizards. Viruses closely related to hepatitis B virus (HBV), which infects humans, have been identified in a wide range of mammals, including (iii) woolly monkeys and (iv) duikers.

Hepadnaviridae-GLUE contains reference sequences for all known hepadnavirus species. For each hepadnavirus genus, we defined a 'master' reference sequence as follows:

Orthohepadnavirus: Hepatitis B virus, strain ayw (NC_003977)
Avihepadnavirus: Duck hepatitis B virus, isolate DHBVQCA34 (NC_001344)
Herpetohepadnavirus: Tibetan frog hepatitis B virus, isolate 243398 (NC_030446)
Metahepadnavirus: Bluegill hepatitis B virus (NC_030445)
Parahepadnavirus: White sucker hepadnavirus, isolate RR173 (NC_027922)

We defined the locations of genome features (see above) on master reference sequences.

Reference sequences are linked to auxiliary data in tabular format.

Multiple Sequence Alignments

Multiple sequence alignments constructed in this study are linked together using GLUE's alignment tree data structure. Alignments in the project include:

A root alignment constructed to represent proposed homologies between representative members of major hepadnavirus lineages.
Genus-level alignments constructed to represent proposed homologies between the genomes of representative members of specific hepadnavirus genera.

Multiple Sequence Alignment tree in Hepadnaviridae-GLUE

GLUE projects have the option to use a data structure called an alignment tree to link multiple sequence alignments (MSAs) representing different taxonomic levels. This approach has been used in Hepadnaviridae-GLUE.

Alignment tree concept

The schematic above shows the alignment tree structure in Hepadnaviridae-GLUE. We constructed "tip" alignments at the genus level, as well as a family-level alignment representing the Hepadnaviridae family, located at an internal node in the tree. Additionally, a root alignment includes the recently described "nackednaviruses" as an outgroup.

For lower taxonomic levels (i.e., within and below the genus level), we aligned complete coding sequences. For higher taxonomic levels (i.e., at the root), only the most conserved gene (the viral polymerase) was aligned. The alignment tree links these alignments via a set of common reference sequences. The root alignment contains all reference sequences, while all child alignments inherit at least one reference sequence from their immediate parent. This ensures that all alignments are connected through a set of master reference sequences.
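The linking rule described above can be sketched as a small Python data structure. This is a toy model for illustration, not GLUE's internal data model: the class, its fields, and the `*_example_*` member names are hypothetical, while the alignment names and accession numbers are those used in this project.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AlignmentNode:
    """One alignment in a toy alignment tree."""
    name: str
    constraining_ref: Optional[str] = None  # None for an unconstrained node
    members: List[str] = field(default_factory=list)
    children: List["AlignmentNode"] = field(default_factory=list)

    def add_child(self, child: "AlignmentNode") -> "AlignmentNode":
        # A child is linked to its parent through a reference sequence
        # that is also a member of the parent alignment.
        if child.constraining_ref not in self.members:
            raise ValueError(f"{child.constraining_ref} is not a member of {self.name}")
        self.children.append(child)
        return child

    def collect_members(self, recursive: bool = False) -> List[str]:
        """List members of this node, optionally descending into child alignments."""
        collected = list(self.members)
        if recursive:
            for child in self.children:
                for seq in child.collect_members(recursive=True):
                    if seq not in collected:  # shared references are listed once
                        collected.append(seq)
        return collected


family = AlignmentNode(
    "AL_Hepadnaviridae",
    members=["NC_001344", "NC_003977", "NC_027922", "NC_030445", "NC_030446"],
)
# Child members other than the reference accessions are hypothetical placeholders.
family.add_child(AlignmentNode(
    "AL_Orthohepadnavirus", constraining_ref="NC_003977",
    members=["NC_003977", "ortho_example_1", "ortho_example_2"]))
family.add_child(AlignmentNode(
    "AL_Avihepadnavirus", constraining_ref="NC_001344",
    members=["NC_001344", "avi_example_1"]))

print(len(family.collect_members()), "members at the family node")
print(len(family.collect_members(recursive=True)), "members including child alignments")
```

Because every child shares a reference sequence with its parent, a recursive walk from any node can gather all descendant taxa while the node itself stays small.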

One advantage of this structure is its ease of maintenance. For instance, the node representing the root of Hepadnaviridae contains only the master reference sequences for each genus (just five sequences in total), which keeps it very manageable. However, what if we want to extract an alignment or build a tree at the family level that includes all taxa?

We can use the alignment tree to achieve this. Below is an example of how this works:

On the GLUE console, first list the members of the relevant alignment:

   Mode path: /
   GLUE> project hepadnaviridae alignment AL_Hepadnaviridae list member
   +===================+======================+=====================+
   |  alignment.name   | sequence.source.name | sequence.sequenceID |
   +===================+======================+=====================+
   | AL_Hepadnaviridae | ncbi-refseqs         | NC_001344           |
   | AL_Hepadnaviridae | ncbi-refseqs         | NC_003977           |
   | AL_Hepadnaviridae | ncbi-refseqs         | NC_027922           |
   | AL_Hepadnaviridae | ncbi-refseqs         | NC_030445           |
   | AL_Hepadnaviridae | ncbi-refseqs         | NC_030446           |
   +===================+======================+=====================+
   AlignmentMembers found: 5

As expected, there are only five members. Now let's examine how the AL_Hepadnaviridae alignment is linked to other alignments by using the list children command:

   Mode path: /
   GLUE> project hepadnaviridae alignment AL_Hepadnaviridae list children 
   +========================+==========================+
   |          name          |     refSequence.name     |
   +========================+==========================+
   | AL_Avihepadnavirus     | REF_Avi_MASTER_DHBV      |
   | AL_Herpetohepadnavirus | REF_Herpeto_MASTER_tfHBV |
   | AL_Metahepadnavirus    | REF_Meta_MASTER_bgHBV    |
   | AL_Orthohepadnavirus   | REF_Ortho_MASTER_HBV     |
   | AL_Parahepadnavirus    | REF_Para_MASTER_wsHBV    |
   +========================+==========================+
   Alignments found: 5

As shown, the Hepadnaviridae alignment is linked to five "child" alignments, each representing a different hepadnavirus genus. The table lists the constraining reference sequence for each alignment.

Since these alignments are linked, we can use GLUE's fastaAlignmentExporter module to link across all alignments and export a codon-level alignment containing all taxa. Here's how you can do this:

   GLUE> project hepadnaviridae module fastaAlignmentExporter
   OK
   Mode path: /project/hepadnaviridae/module/fastaAlignmentExporter
   GLUE> export AL_Hepadnaviridae -r REF_Ortho_MASTER_HBV -f Polymerase -a -e -c -p

Explanation of the Command:

-r: Specifies the reference sequence (in this case, the project master reference, HBV).
-f: Selects a feature within the specified reference (here, the polymerase gene reading frame).
-a: Means "all taxa"; used in place of the -w option so that the command applies to every taxon.
-e: Excludes empty rows, preventing taxa that do not span the selected feature from being included in the exported alignment.
-c: This is a crucial option for our example, meaning "export recursively," which includes taxa from child alignments.
-p: Outputs the result to the console for preview. To export the alignment to a file instead, use the -o option with a file name.

Phylogenetic Trees

We used GLUE to implement an automated process for deriving midpoint-rooted, annotated trees from the alignments included in our project.

Trees were constructed at distinct taxonomic levels:

Recursively populated root phylogenies.
Genus-level phylogenies.
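Midpoint rooting, mentioned above, places the root halfway along the longest tip-to-tip path in a tree. The following Python sketch shows the general idea on a small tree. It is not the project's own pipeline, which is not shown here; the tree, its branch lengths, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Tree = Dict[str, Dict[str, float]]  # node -> {neighbour: branch length}


def build_tree(edges: List[Tuple[str, str, float]]) -> Tree:
    """Build an undirected adjacency map from (node, node, length) edges."""
    tree: Tree = {}
    for a, b, length in edges:
        tree.setdefault(a, {})[b] = length
        tree.setdefault(b, {})[a] = length
    return tree


def farthest(tree: Tree, start: str) -> Tuple[str, float, List[str]]:
    """Return (node, distance, path) for the node farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, 0.0, [start])]
    while stack:
        node, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nbr, length in tree[node].items():
            # In a tree, skipping the node we came from prevents revisits.
            if len(path) < 2 or nbr != path[-2]:
                stack.append((nbr, dist + length, path + [nbr]))
    return best


def midpoint(tree: Tree) -> Tuple[str, str, float]:
    """Locate the midpoint of the longest path in the tree.

    Returns (node_a, node_b, offset): the root lies on the branch between
    node_a and node_b, at distance offset from node_a.
    """
    tip_a, _, _ = farthest(tree, next(iter(tree)))
    _, longest, path = farthest(tree, tip_a)
    half = longest / 2.0
    walked = 0.0
    for a, b in zip(path, path[1:]):
        length = tree[a][b]
        if walked + length >= half:
            return a, b, half - walked
        walked += length
    raise ValueError("tree has no branches")


# Hypothetical four-tip tree with two internal nodes (n1, n2).
tree = build_tree([
    ("A", "n1", 1.0), ("B", "n1", 3.0), ("n1", "n2", 2.0),
    ("C", "n2", 1.0), ("D", "n2", 6.0),
])
node_a, node_b, offset = midpoint(tree)
print(f"root on branch {node_a}-{node_b}, {offset} from {node_a}")
```

The two-pass search works because, in a tree with positive branch lengths, the node farthest from any starting point is always one end of the longest path.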