Datasources - Integrative-Transcriptomics/Nextstrain-TrepoGen GitHub Wiki
In Nextstrain-TrepoGen, a Datasource is the combination of variants (VCF files), the reference genome, and the reference genome annotation. This differs from a Nextstrain Dataset, which represents an actual instance that can be viewed. Variants are stored as subsets, i.e. files containing, for example, only single nucleotide variants (SNVs), both SNVs and insertions and deletions (InDels), or only a portion of regions due to masked positions. Each data source and its variant subsets can be part of multiple Nextstrain builds and, accordingly, used in multiple Nextstrain Datasets.
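The relationship described above (one reference genome, one annotation, several variant subsets) could be modelled roughly as follows. This is only an illustrative sketch: the `Datasource` class, its field names, the subset keys, and all file names are hypothetical and do not reflect how the repository actually stores or loads its data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Datasource:
    """Hypothetical container: variants + reference genome + annotation."""

    name: str
    reference_genome: str  # path to the reference sequence (assumed layout)
    annotation: str        # path to the reference annotation (assumed layout)
    # Maps a subset label to the VCF file holding that subset of variants.
    variant_subsets: dict = field(default_factory=dict)

    def subset(self, label: str) -> str:
        """Return the VCF path registered for the given subset label."""
        return self.variant_subsets[label]


# All names below are placeholders chosen for illustration only.
example = Datasource(
    name="example-datasource",
    reference_genome="reference.fasta",
    annotation="reference.gff3",
    variant_subsets={
        "snv": "example.snv.vcf",
        "snv_indel": "example.snv-indel.vcf",
    },
)
```

Under this sketch, several builds can share the same `Datasource` object while each selects a different subset, for example `example.subset("snv")` for one build and `example.subset("snv_indel")` for another.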
Our data sources are not included directly in this repository, but will likely be hosted on Zenodo or a comparable platform soon. Currently, we maintain the following data sources:
TPASS-2930
TPASS-2930 is a high-resolution dataset comprising 2,930 Treponema pallidum (ssp. pallidum, pertenue and endemicum) samples genotyped against the SS14 reference genome (NC_021508.1).
- For Nextstrain builds utilising this dataset, low-quality samples with a mean coverage below 3× or with ambiguous (no-call) positions in more than 20% of the genome are masked to reduce the bias introduced by low-quality data (n = 377).
- Accompanying metadata provides epidemiological context and quality metrics for each sample, including the sampling date, country and region, the designated subspecies, and for ssp. pallidum, the Nichols or SS14 lineage. It also includes the mean coverage and ambiguity (N or no-call positions) as a percentage.
- The data source comprises two subsets of variants: one containing only single nucleotide variants (SNVs), and the other containing both SNVs and insertions/deletions (InDels).
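The quality criterion for TPASS-2930 (mean coverage below 3× or more than 20% ambiguous positions) can be sketched as a simple filter over the metadata. This is a minimal sketch, not the project's actual pipeline: the tab-separated layout and the column names `sample`, `mean_coverage`, and `ambiguity_percent` are assumptions made for illustration, and the three example rows are invented.

```python
import csv
import io

MIN_MEAN_COVERAGE = 3.0       # samples below 3x mean coverage are low quality
MAX_AMBIGUITY_PERCENT = 20.0  # samples above 20% no-call positions are low quality


def is_low_quality(mean_coverage: float, ambiguity_percent: float) -> bool:
    """Apply the two thresholds stated in the text; either one suffices."""
    return (
        mean_coverage < MIN_MEAN_COVERAGE
        or ambiguity_percent > MAX_AMBIGUITY_PERCENT
    )


def low_quality_samples(metadata_tsv: str) -> list:
    """Return names of samples to mask, given TSV text with assumed columns."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    return [
        row["sample"]
        for row in reader
        if is_low_quality(
            float(row["mean_coverage"]), float(row["ambiguity_percent"])
        )
    ]


# Invented rows, used only to exercise the filter.
EXAMPLE_METADATA = (
    "sample\tmean_coverage\tambiguity_percent\n"
    "A\t12.5\t1.2\n"
    "B\t2.1\t4.0\n"
    "C\t30.0\t35.5\n"
)

masked = low_quality_samples(EXAMPLE_METADATA)
```

With the invented rows, sample `B` fails the coverage threshold and sample `C` fails the ambiguity threshold, so both would be flagged for masking while `A` is kept. A sample sitting exactly at 3× coverage and 20% ambiguity passes, following the "below" and "more than" wording.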
TPASS-308
TPASS-308 constitutes a representative dataset of 308 Treponema pallidum (ssp. pallidum, pertenue and endemicum) samples genotyped against the SS14 reference genome (NC_021508.1).
- Accompanying metadata provides epidemiological context and quality metrics for each sample, including the sampling date, country and region, the designated subspecies, and for ssp. pallidum, the Nichols or SS14 lineage. It also includes the mean coverage and ambiguity (N or no-call positions) as a percentage.
- The data source comprises two subsets of variants: one containing only single nucleotide variants (SNVs), and the other containing both SNVs and insertions/deletions (InDels).