About segregation statistics - zkstewart/psQTL GitHub Wiki
Euclidean Distance for QTL segregation
Euclidean distance (ED) is a measure of variant segregation that appears to outperform similar metrics including delta SNP or Gprime [1]. It has been mathematically formulated as:
The resulting value indicates how much the two bulks differ at each variant or CNV, with large ED values indicating more segregation and a value of 0 indicating identical allele frequencies in the two bulks. It is agnostic to parent sample genotyping as well as to whether variants are biallelic or multiallelic. It can hence work with datasets flexibly and consider variants that other software may filter out.
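As a minimal sketch of the behaviour described above, the snippet below computes ED as the square root of the summed squared differences in allele frequency between two bulks. This is the usual form of the statistic and is an assumption here, not a copy of psQTL's code; the function names and the flat-list input format are hypothetical.

```python
from collections import Counter
from math import sqrt

def allele_frequencies(alleles):
    """Return {allele: frequency} for a flat list of allele calls from one bulk."""
    if not alleles:
        raise ValueError("a bulk needs at least one allele call")
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def euclidean_distance(bulk1_alleles, bulk2_alleles):
    """Assumed ED form: root of summed squared allele-frequency differences."""
    f1 = allele_frequencies(bulk1_alleles)
    f2 = allele_frequencies(bulk2_alleles)
    # Iterating over the union of observed alleles handles multiallelic sites.
    return sqrt(sum((f1.get(a, 0.0) - f2.get(a, 0.0)) ** 2 for a in set(f1) | set(f2)))

# Identical allele frequencies in both bulks give 0.0.
identical = euclidean_distance(["A", "T"], ["A", "T", "A", "T"])

# Perfect segregation gives sqrt(2), roughly 1.414.
segregating = euclidean_distance(["A", "A", "A", "A"], ["T", "T", "T", "T"])
```

Because the sum runs over every allele seen in either bulk, no parental genotypes and no biallelic assumption are needed.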
When interpreting ED values, note that:
- The maximum value that ED can obtain is approximately 1.414213562 prior to any power transformations. When applying the default program behaviour of `--power 4` when plotting, this value can be a maximum of 4.
  - This is under circumstances where perfect allele segregation occurs.
- Under circumstances where one bulk is entirely one allele (e.g., A/A) and the other bulk consists entirely of other alleles, heterozygosity in that second bulk (e.g., T/C) will receive a lower ED value than homozygosity (e.g., T/T).
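The values in these notes can be checked by hand, assuming ED is the square root of the summed squared differences in allele frequency between the two bulks (an assumption here, since the formula is not restated in this section). Each squared term below is the frequency difference for one allele, in the order A, T, C:

$$
\begin{aligned}
\text{A/A vs. T/T:}\quad & \mathrm{ED} = \sqrt{(1-0)^2 + (0-1)^2} = \sqrt{2} \approx 1.414, & \mathrm{ED}^4 &= 4 \\
\text{A/A vs. T/C:}\quad & \mathrm{ED} = \sqrt{(1-0)^2 + (0-0.5)^2 + (0-0.5)^2} = \sqrt{1.5} \approx 1.225, & \mathrm{ED}^4 &= 2.25
\end{aligned}
$$

The heterozygous case scores lower because the second bulk's frequency is split across two alleles, and squaring two differences of 0.5 contributes less than squaring a single difference of 1.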
If you want to experiment with how the ED formula operates, see the Excel file in the `tests` folder of this repository.
sPLS-DA for QTL segregation
`psQTL_proc.py splsda` will run "local PLS-DA" within non-overlapping windows on each chromosome. The idea is akin to "local PCA", whose utility has previously been demonstrated [2] when assessing population structure within localised regions of a genome.
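To illustrate what non-overlapping windows mean in practice, the sketch below groups variant positions by chromosome and window index. It is not psQTL's implementation: the function name, the 100 kb window size, and the assumption of 1-based positions are all hypothetical choices made for the example.

```python
from collections import defaultdict

def bin_into_windows(variants, window_size):
    """Group (chromosome, position) pairs into non-overlapping windows.

    Positions are assumed to be 1-based, so positions 1..window_size
    fall into window 0, the next window_size positions into window 1, etc.
    """
    windows = defaultdict(list)
    for chrom, pos in variants:
        window_index = (pos - 1) // window_size
        windows[(chrom, window_index)].append(pos)
    return dict(windows)

# Hypothetical variants and window size, for illustration only.
variants = [("chr1", 1), ("chr1", 100000), ("chr1", 100001), ("chr2", 50)]
windows = bin_into_windows(variants, window_size=100000)
```

Each feature lands in exactly one window, so a per-window analysis sees only the variants or CNVs local to that region.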
This analysis involves unsupervised prediction of which bulk each sample belongs to, using only the variants or CNVs occurring within each window. This produces a Balanced Error Rate (BER) within each window, which ranges from 0.5 (i.e., prediction result is no better than random chance) to 0 (i.e., prediction result is perfectly correct). psQTL performs simple internal filtering to select features (i.e., SNPs, indels, or CNVs) from within each window based on how much information they contribute to the classification outcome.
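For readers unfamiliar with BER, the sketch below uses its standard definition: the average of the per-class error rates, so that each bulk counts equally regardless of its size. How psQTL obtains the predictions themselves is not shown here, and the function name and sample labels are hypothetical.

```python
def balanced_error_rate(true_labels, predicted_labels):
    """Mean of the per-class misclassification rates."""
    per_class_error = []
    for cls in sorted(set(true_labels)):
        # Predictions made for the samples that truly belong to this class.
        preds = [p for t, p in zip(true_labels, predicted_labels) if t == cls]
        errors = sum(1 for p in preds if p != cls)
        per_class_error.append(errors / len(preds))
    return sum(per_class_error) / len(per_class_error)

truth = ["bulk1"] * 4 + ["bulk2"] * 4

# Every sample assigned to its own bulk: BER is 0.0.
perfect = balanced_error_rate(truth, truth)

# Half of each bulk misassigned, as expected from guessing: BER is 0.5.
guesses = ["bulk1", "bulk1", "bulk2", "bulk2", "bulk1", "bulk1", "bulk2", "bulk2"]
chance_like = balanced_error_rate(truth, guesses)
```

Averaging per class rather than per sample keeps the statistic meaningful when the two bulks contain different numbers of samples.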
After going through each window, we then run a final sPLS-DA across the whole genome using only features that passed filtration from each window. This algorithm will select the minimal number of features which maximally contribute to classification outcome of sample groupings. Each feature selected will, in theory, contribute a different kind of information to the classification outcome and will not be redundant. We can hence interpret its results as pointing to multiple QTLs whose combined inheritance best explains the phenotype segregation.
A final stage can occur when assessing features selected from 'call' (SNPs and indels) and 'depth' (CNVs) analyses. Referred to as integration, this step involves running sPLS-DA using the features that were selected when run separately for the 'call' and 'depth' analyses. Features selected during this step are the most informative with respect to classifying samples, and can point to potential instances where combined inheritance of SNPs, indels, and CNVs best explains the phenotype segregation. In many instances, we might expect integration to select features primarily from the 'depth' analysis, which suggests that CNV occurrence may be more associated with the phenotype segregation than SNPs or indels; the opposite may be true in other instances.
When plotting, psQTL converts BER into Balanced Accuracy (BA), which ranges from 1 (i.e., prediction result is perfectly correct) to 0 (i.e., prediction result is no better than random chance). This allows for intuitive interpretation of plots at a glance: segregation statistics should peak in QTLs and decline to 0 in uninteresting regions. When variants segregate between bulks within a given window region, we expect the BA value to be elevated compared to regions where no segregation occurs.
References
2. Local PCA Shows How the Effect of Population Structure Differs Along the Genome