TSS Prediction

Summary of methods for TSS prediction

Predicting transcription start sites (TSS) can be seen as a version of an anomaly detection problem for serial data. We are trying to find areas of the Wiggle file that are significantly different from background noise, and within these areas we then try to predict where exactly the signal starts.

This decomposes TSS prediction into a twofold problem: predicting which areas of the Wiggle file are significantly different from the background noise, as well as predicting where within these areas a TSS is located.

For these reasons, we propose a three-step pipeline for predicting TSS from our data.

Step 1: Finding anomalous areas

This step is fairly standard and can be approached using a variety of classical anomaly detection algorithms for serial data. Many of these algorithms were designed to work on time series, but since our Wiggle file can be modelled as a time series over genome positions, these algorithms should also be applicable to finding areas containing transcribed genes.

Options include:

  • (modified) Z score: zscore = (x - avg) / stddev
    This score simply measures how many standard deviations a data point lies from the mean; the modified variant replaces the mean and standard deviation with the median and the median absolute deviation, making it robust to outliers. A threshold can then be selected or learned to determine when a sample point differs significantly enough from the background data to be considered part of a transcribed region (see the sketch after this list).

    Pro: simple, fast and robust
    Con: no possibility to adjust to regions of the Wiggle file with potentially different background noise, assumes a Gaussian error distribution
  • moving average Z score: similar to the Z score, but instead of comparing each data point to a global average, each point is compared against the mean and standard deviation of a sliding window around it (also covered in the sketch below)

    Pro: able to adjust to variable background noise
    Con: slower, harder to fine-tune, higher chance of amplifying noise, no way to incorporate global information -> only recognizes short anomalies
  • Isolation Forest: An unsupervised learning algorithm that isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature (see the scikit-learn sketch below).

    Pro: Works well for high dimensional data with complex dependencies, effective for small sample sizes.
    Con: can fail on very noisy data, computationally more intensive
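
A minimal sketch of the two Z-score variants, assuming the Wiggle track has already been parsed into a per-position NumPy coverage array; the function names, window size and threshold are illustrative placeholders rather than fixed parts of the proposal:

```python
import numpy as np

def global_zscore_flags(coverage, threshold=3.0):
    """Flag positions whose global z-score exceeds the threshold."""
    coverage = np.asarray(coverage, dtype=float)
    z = (coverage - coverage.mean()) / coverage.std()
    return z > threshold

def rolling_zscore_flags(coverage, window=500, threshold=3.0):
    """Flag positions that stand out against a sliding-window background."""
    coverage = np.asarray(coverage, dtype=float)
    kernel = np.ones(window) / window
    mu = np.convolve(coverage, kernel, mode="same")            # rolling mean
    var = np.convolve(coverage**2, kernel, mode="same") - mu**2
    sigma = np.sqrt(np.clip(var, 1e-12, None))                 # rolling stddev
    return (coverage - mu) / sigma > threshold

def flags_to_regions(flags):
    """Collapse a boolean mask into half-open (start, end) index pairs."""
    padded = np.concatenate(([False], flags, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(int)))
    return list(zip(edges[::2], edges[1::2]))

# Hypothetical usage on a parsed Wiggle track:
# regions = flags_to_regions(rolling_zscore_flags(coverage))
```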
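
The Isolation Forest option could be prototyped with scikit-learn; treating fixed-size, non-overlapping windows as observations and the contamination rate are assumptions made here purely for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def isolation_forest_regions(coverage, window=50, contamination=0.05):
    """Score fixed-size windows of the coverage track with an Isolation Forest."""
    coverage = np.asarray(coverage, dtype=float)
    n = len(coverage) // window
    # Each non-overlapping window is one observation; its raw values are the features.
    X = coverage[: n * window].reshape(n, window)
    forest = IsolationForest(contamination=contamination, random_state=0)
    labels = forest.fit_predict(X)  # -1 = anomalous window, 1 = inlier
    return [(i * window, (i + 1) * window) for i in np.flatnonzero(labels == -1)]
```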

Step 2: Finding the TSS

This step involves detecting where exactly within these areas a TSS could be located. There are multiple strategies for detecting TSS. The simplest solution would be to always pick the first position of the anomalous region. This approach has the benefit of simplicity, but it relies heavily on the boundary accuracy of the previous algorithm, something these algorithms might not be designed for.
A more involved approach would be to calculate the derivative of the graph and find the points of steepest increase inside the previously defined areas (see the sketch below). This has the benefit of being able to find secondary TSS, as well as giving additional confidence beyond relying solely on the previous algorithm's prediction. The drawback is that such a method could generate false positives that would later need to be filtered out.
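
A sketch of the derivative idea, assuming coverage is the parsed array from step 1 and regions the (start, end) pairs it produced; the min_rise cutoff for secondary sites is a hypothetical parameter:

```python
import numpy as np

def tss_candidates(coverage, regions, min_rise=5.0):
    """Within each anomalous region, locate the steepest coverage increases."""
    coverage = np.asarray(coverage, dtype=float)
    slope = np.gradient(coverage)
    candidates = []
    for start, end in regions:
        if end <= start:
            continue
        local = slope[start:end]
        # Primary candidate: the single steepest rise in the region.
        primary = start + int(np.argmax(local))
        candidates.append(primary)
        # Secondary candidates: any other rise steeper than min_rise,
        # potentially corresponding to secondary TSS.
        for offset in np.flatnonzero(local > min_rise):
            if start + int(offset) != primary:
                candidates.append(start + int(offset))
    return sorted(set(candidates))
```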

Step 3: Scoring the TSS

Scoring of TSS and filtering for likely hits: in this step, potential TSS found by the previous pipeline stages need to be evaluated.
As the previous steps might produce false positives, we need a further step to tell us how certain we can be that a predicted TSS is actually significant.
Such scoring could be done in multiple ways. The simplest solution would be to calculate a score based on the steepness of the rise in the derivative, combined with a significance score for the difference to the background noise, taken from the prediction method of step 1 (sketched below).
A more involved scoring function could be a supervised machine learning model that receives these same parameters both for the predicted sites and for known true TSS (for example, the positions calculated by TSSpredator) and learns to distinguish true positives from false positives, the latter being all predicted points not found in the TSSpredator training set. The issue here is that this method would rely on TSSpredator as the gold standard and score against it.
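
A minimal sketch of the simple scoring variant, combining the local slope with a z-score against the background just upstream of the candidate; the equal weighting of the two terms is an arbitrary placeholder:

```python
import numpy as np

def score_tss(coverage, position, window=500, slope_weight=0.5):
    """Combine derivative steepness with a z-score against the local background."""
    coverage = np.asarray(coverage, dtype=float)
    rise = np.gradient(coverage)[position]
    lo = max(0, position - window)
    background = coverage[lo:position] if position > lo else coverage[:1]
    sigma = background.std() or 1.0  # guard against a perfectly flat background
    z = (coverage[position] - background.mean()) / sigma
    # Higher score = steeper rise and stronger deviation from the background.
    return slope_weight * rise + (1.0 - slope_weight) * z
```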

Roadblocks and the way forward

Current roadblocks to implementation include the fact that it has not been clearly specified whether precision or recall is more important for TSS prediction, as well as the fact that there is no true gold-standard data set of ground truths that can be used to evaluate the performance of the individual component algorithms or of the full pipeline.

Possible remedies include finding databases of experimentally verified TSS and the accompanying Wiggle files.

Another challenge is that not all TSS might be predictable from the provided Wiggle files. Single replicates or conditions do not contain all the TSS predicted by TSSpredator, since not all genes are active at once. This makes judging algorithmic performance difficult: false negatives cannot be assessed effectively, because the gene might simply not be active in the given condition. The algorithm therefore needs to be run on all supplied conditions, under the assumption that the conditions together cover every gene, so that each gene is expressed in at least one of them.
One solution to all of these problems could be the simulation of artificial data: starting from a data set modelled to look like a noiseless Wiggle file with known transcription start sites, we add noise and test the algorithms on this data set. This has the benefit of not relying on experimental data, but it could fail to capture the intricacies of actual experimental data.
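
A sketch of how such artificial data could be generated; the signal shape (a step at the TSS followed by a slow decay) and the Gaussian noise model are deliberate simplifications:

```python
import numpy as np

def simulate_wiggle(length=10_000, n_tss=5, height=50.0, noise_sd=3.0, seed=0):
    """Generate a noisy coverage track with known TSS positions as ground truth."""
    rng = np.random.default_rng(seed)
    coverage = np.zeros(length)
    tss_positions = np.sort(rng.choice(length - 1000, size=n_tss, replace=False))
    for tss in tss_positions:
        transcript_len = int(rng.integers(300, 1000))
        # Coverage jumps at the TSS and decays slowly along the transcript.
        decay = np.exp(-np.arange(transcript_len) / transcript_len)
        coverage[tss : tss + transcript_len] += height * decay
    # Add Gaussian background noise, clipped so coverage stays non-negative.
    noisy = np.clip(coverage + rng.normal(0.0, noise_sd, length), 0.0, None)
    return noisy, tss_positions
```

Predicted positions could then be compared against the returned tss_positions to estimate precision and recall for each pipeline component as well as for the full pipeline.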