Workflows

Nextstrain-TrepoGen implements two distinct workflows, each generating a different type of dataset with its own analytical objectives.

Genome

The Genome Workflow directly processes whole-genome variant calls in VCF format. It is intended to produce accurate phylogenies for tracing subspecies and population structure, and for deriving epidemiological insights, global classifications and geographic models. It provides rules for masking positions before phylogenetic tree building, as well as additional rules for annotating drug resistance mutations and assigning clades.
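
As a rough illustration of the masking idea, the sketch below drops VCF records that fall inside masked intervals while keeping header lines. It is not the workflow's actual rule: the function names, the interval representation (1-based, inclusive) and the example coordinates are all assumptions made for this example.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical masked interval: (chromosome, start, end), 1-based and inclusive.
Interval = Tuple[str, int, int]


def is_masked(chrom: str, pos: int, intervals: List[Interval]) -> bool:
    """Return True if the 1-based position lies inside any masked interval."""
    return any(c == chrom and start <= pos <= end for c, start, end in intervals)


def mask_vcf_lines(lines: Iterable[str], intervals: List[Interval]) -> Iterator[str]:
    """Yield header lines unchanged and skip records at masked positions."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        # The first two tab-separated VCF columns are CHROM and POS.
        chrom, pos = line.split("\t", 2)[:2]
        if not is_masked(chrom, int(pos), intervals):
            yield line


if __name__ == "__main__":
    vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t.\tPASS\t.",
        "chr1\t250\t.\tC\tT\t.\tPASS\t.",
        "chr1\t900\t.\tG\tA\t.\tPASS\t.",
    ]
    masked = [("chr1", 200, 300)]  # placeholder interval, not a real locus
    for kept in mask_vcf_lines(vcf, masked):
        print(kept)
```

Dropping whole records is only one possible masking strategy; replacing the affected genotypes with missing data would be an alternative that keeps the site list intact.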

For our datasets, we mask known recombinant loci in Treponema pallidum, thereby reducing the effect of homoplasies on the tree topology. Furthermore, we annotate macrolide resistance conferred by single nucleotide variants (SNVs) in the 23S ribosomal RNA (rRNA) of Treponema pallidum, and apply a prototypic clade assignment scheme based on hierarchical clustering.
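
The resistance annotation can be pictured as a lookup of each sample's variants against a table of known resistance sites. The following is a minimal sketch under stated assumptions: the genome positions are placeholders and not real Treponema pallidum coordinates, the labels follow the A2058G/A2059G naming commonly used for macrolide resistance in the literature, and `RESISTANCE_SITES` and `annotate_resistance` are hypothetical names that do not come from the workflow.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup: genome position -> (resistant base, label).
# The positions are placeholders, NOT real T. pallidum coordinates.
RESISTANCE_SITES: Dict[int, Tuple[str, str]] = {
    1000: ("G", "A2058G"),
    1001: ("G", "A2059G"),
}


def annotate_resistance(sample_variants: Dict[int, str]) -> List[str]:
    """Return the labels of resistance mutations observed in one sample.

    sample_variants maps a genome position to the alternate base called
    for that sample; positions without a variant call are simply absent.
    """
    labels = []
    for pos, (resistant_base, label) in sorted(RESISTANCE_SITES.items()):
        if sample_variants.get(pos) == resistant_base:
            labels.append(label)
    return labels


if __name__ == "__main__":
    print(annotate_resistance({1000: "G", 42: "T"}))
    print(annotate_resistance({42: "T"}))
```

The resulting labels could then be attached to each sample as metadata. A real implementation would also need to account for the strand of the annotated gene and for multiple gene copies, which this sketch ignores.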

Gene

The Gene Workflow focuses the analysis on a single gene of interest. This enables functional analyses, for example, the discovery of putative vaccine targets. The workflow is not applied directly to variant calls; instead, a preprocessing step is applied first, which involves the following:

  • Generating target gene sequence alignments from the variant calls using MUSIAL. Currently, all samples whose sequence is identical to the reference sequence are excluded.
  • Adapting the reference annotation to the target gene to allow the extraction of gene sub-features (i.e. biological regions of interest) if they are included in the annotation.
  • Annotating each sample with additional metadata based on the sequence composition of selected gene sub-features.
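
Two of the steps above, excluding reference-identical samples and typing samples by sub-feature sequence, can be sketched in plain Python. This is an illustration only and does not use MUSIAL: the function names, the 0-based end-exclusive column ranges and the `type_N` labels are assumptions made for this example.

```python
from typing import Dict


def drop_reference_identical(alignment: Dict[str, str], reference: str) -> Dict[str, str]:
    """Keep only samples whose aligned sequence differs from the reference."""
    return {name: seq for name, seq in alignment.items() if seq != reference}


def type_subfeature(alignment: Dict[str, str], start: int, end: int) -> Dict[str, str]:
    """Assign one type label per distinct sub-feature sequence.

    start and end are 0-based, end-exclusive alignment columns (an assumption
    of this sketch). Samples are visited in sorted name order so that the
    labels are reproducible across runs.
    """
    label_of_segment: Dict[str, str] = {}
    sample_types: Dict[str, str] = {}
    for name in sorted(alignment):
        segment = alignment[name][start:end]
        if segment not in label_of_segment:
            label_of_segment[segment] = f"type_{len(label_of_segment) + 1}"
        sample_types[name] = label_of_segment[segment]
    return sample_types


if __name__ == "__main__":
    reference = "ATGAAA"
    alignment = {"s1": "ATGCCA", "s2": "ATGAAA", "s3": "ATGCCT", "s4": "ATGAAA"}
    kept = drop_reference_identical(alignment, reference)
    print(sorted(kept))
    print(type_subfeature(kept, 3, 6))
```

The per-sample type labels are the kind of value that could be exported as an extra metadata column for each selected sub-feature.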

For our datasets, we annotated the protein topologies of selected outer membrane proteins (OMPs) of interest, using either manually curated data or the DeepTMHMM tool. Our focus is on sequence composition typing of extracellular loop (ECL) regions.