Methods - molgenis/systemsgenetics GitHub Wiki

#Introduction:

This section describes how the genetic data and the sequencing data are combined (the ASreads section) and how ASE statistics are created (the ASE statistics sections).

For practical information on how the package can be run, please see the Basic usage section and the links above.

#Extracting ASreads

ASreads reads a .bam file and a genotype file and determines how many reads from the .bam file map to the reference allele or the alternative allele of a SNP.

This requires the following input:

1. Quantitative sequencing experiment results in the .bam format
2. A genotype file or directory in the TriTyper or vcf (slow) format
3. A coupling file with the sequenced sample name and the corresponding individual in the genotype file

The first is required to know exactly how many reads of the sequencing experiment contain a given allele. The second is required to know which genotype an individual has, so that this can be combined with the number of reference and alternative reads. The third links each sequenced sample to the right individual, because genotype files usually contain multiple individuals.
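As a rough illustration of how the three inputs fit together, the sketch below counts reference and alternative reads for a single biallelic SNP. It is not the ASreads implementation: the function name `count_allele_reads`, the sample and individual names, and the list of observed bases (which in practice would be extracted from the .bam file) are all hypothetical.

```python
from collections import Counter

def count_allele_reads(observed_bases, ref_allele, alt_allele):
    """Count reads supporting the reference or alternative allele of one biallelic SNP.

    observed_bases is the base each overlapping read shows at the SNP position
    (hypothetical input; a real run would take this from the .bam file).
    """
    counts = Counter(base.upper() for base in observed_bases)
    ref_reads = counts[ref_allele.upper()]
    alt_reads = counts[alt_allele.upper()]
    # Reads showing neither allele are reported separately instead of being dropped silently.
    other_reads = sum(counts.values()) - ref_reads - alt_reads
    return ref_reads, alt_reads, other_reads

# Coupling information: sequenced sample name -> individual in the genotype file (made up).
coupling = {"sample_01.bam": "individual_A"}
# Genotype of each individual at one SNP (made up).
genotypes = {"individual_A": ("C", "T")}

individual = coupling["sample_01.bam"]
ref_allele, alt_allele = "C", "T"
bases_at_snp = ["C", "C", "T", "c", "T", "C", "G"]

result = count_allele_reads(bases_at_snp, ref_allele, alt_allele)
print(individual, genotypes[individual], result)
```

The coupling dictionary plays the role of the coupling file: it selects which individual's genotype is combined with the read counts of a given sample.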

ASreads only uses biallelic SNPs, which limits the complexity of the program. ASreads does not take genotyping quality into account. This may change in the future, depending on the needs of the users.

#ASE statistics

ASE statistics are determined using maximum likelihood estimation in combination with a likelihood ratio test (LRT). Five types of ASE statistics are determined in this package:

1. Binomial LRT P value
2. Per sample dispersion estimate
3. Beta binomial LRT P value
4. Cell type specific binomial LRT P value
5. Cell type specific beta binomial LRT P value

These are described below.

All these methods use a likelihood function: a function that takes the data and a probability function with specific parameter values and returns a likelihood value, which is later used to determine a P value.

In all cases below, the likelihood is determined using the following steps:

1. Select a probability density function with fixed parameters
2. For all SNPs being tested, determine the log of the probability density based on the allelic information and the probability density function in (1)
3. Sum the results from (2); this is the log likelihood

Maximum likelihood estimation is done by varying the parameters of (1) until the value from (3) is maximal (in practice, an optimizer minimizes the negative of the log likelihood). Note that the data do not change.
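The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the package's code: it assumes a binomial probability function, uses made-up read counts, and replaces a proper numerical optimizer with a simple grid search over the parameter.

```python
import math

def binomial_log_pmf(ref_reads, total_reads, p):
    """Step 2: log probability of ref_reads out of total_reads for a binomial with parameter p."""
    log_choose = (math.lgamma(total_reads + 1)
                  - math.lgamma(ref_reads + 1)
                  - math.lgamma(total_reads - ref_reads + 1))
    return (log_choose
            + ref_reads * math.log(p)
            + (total_reads - ref_reads) * math.log(1 - p))

def log_likelihood(counts, p):
    """Step 3: sum the log probabilities over all (reference, alternative) count pairs."""
    return sum(binomial_log_pmf(ref, ref + alt, p) for ref, alt in counts)

# Hypothetical (reference, alternative) read counts.
counts = [(30, 10), (25, 15), (40, 20)]

# Vary the parameter while the data stay fixed and keep the value with the
# highest log likelihood (the same as the lowest negative log likelihood).
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: log_likelihood(counts, p))
print(best_p)
```

For the binomial, the grid search lands next to the overall proportion of reference reads, which is the analytical solution.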

##Binomial LRT P value

A binomial test can be done based on allele specific reads. The binomial test has a single parameter 'p' (the chance of finding a reference read in a sample), which is optimized by an analytical maximum likelihood solution and compared to a fixed p that represents no allelic imbalance (p = 0.5).

The analytical maximum likelihood solution is the proportion of reference reads: (total reference reads) / (total AS reads)

A P value is determined by doing a likelihood ratio test, comparing the likelihood of the maximum likelihood estimated p as alternative hypothesis and the fixed p = 0.5 as the null hypothesis.
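A minimal sketch of this test, under two assumptions that the text above does not state: the test statistic is twice the difference in log likelihood, and the P value comes from a chi-square distribution with one degree of freedom (the usual choice when the alternative has one extra free parameter). The function names and read counts are made up for illustration.

```python
import math

def binomial_log_likelihood(counts, p):
    """Sum of binomial log probabilities over (reference, alternative) count pairs."""
    ll = 0.0
    for ref, alt in counts:
        ll += (math.lgamma(ref + alt + 1) - math.lgamma(ref + 1) - math.lgamma(alt + 1)
               + ref * math.log(p) + alt * math.log(1 - p))
    return ll

def binomial_lrt(counts):
    """Likelihood ratio test of a freely estimated p against p = 0.5."""
    total_ref = sum(ref for ref, _ in counts)
    total = sum(ref + alt for ref, alt in counts)
    # Analytical maximum likelihood estimate: proportion of reference reads.
    p_hat = total_ref / total
    # Keep p_hat strictly inside (0, 1) so the logarithms stay finite.
    p_hat = min(max(p_hat, 1e-12), 1 - 1e-12)
    ll_alt = binomial_log_likelihood(counts, p_hat)
    ll_null = binomial_log_likelihood(counts, 0.5)
    statistic = max(0.0, 2 * (ll_alt - ll_null))
    # Survival function of a chi-square distribution with 1 degree of freedom (assumed).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return p_hat, statistic, p_value

# Hypothetical (reference, alternative) read counts for one SNP across samples.
print(binomial_lrt([(30, 10), (25, 15), (40, 20)]))
print(binomial_lrt([(20, 20), (15, 15)]))
```

Balanced counts give a statistic of zero and a P value of 1, while a consistent excess of reference reads gives a small P value.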

##Per sample dispersion estimate.

Overdispersion is determined per sample by taking all the AS reads of a sample and doing a maximum likelihood estimate of the dispersion for a beta binomial, while p stays fixed.

A beta binomial is a binomial distribution with a dispersion estimate s added, so this distribution has two parameters: p and s. To determine the dispersion we set p = 0.5 and freely estimate s with 0 <= s <= 1. The maximum likelihood estimate is the value of s that maximizes the likelihood function (equivalently, minimizes the negative log likelihood).

The dispersion is saved per sample (or individual) and the dispersion values are then used in estimating beta binomial maximum likelihood.
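The per sample estimate can be sketched as follows. The exact way the package maps p and s onto the beta binomial is not described here, so this example assumes a common parameterization in which alpha + beta = (1 - s) / s, meaning s close to 0 behaves like a plain binomial and larger s means more overdispersion. The grid search stands in for a numerical optimizer, and the read counts are made up.

```python
import math

def log_beta(a, b):
    """Log of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_log_pmf(ref, alt, p, s):
    """Log probability of (ref, alt) reads under a beta binomial with mean p and dispersion s."""
    # Assumed parameterization (not taken from the package): alpha + beta = (1 - s) / s.
    scale = (1 - s) / s
    alpha, beta = p * scale, (1 - p) * scale
    log_choose = math.lgamma(ref + alt + 1) - math.lgamma(ref + 1) - math.lgamma(alt + 1)
    return log_choose + log_beta(ref + alpha, alt + beta) - log_beta(alpha, beta)

def estimate_dispersion(sample_counts, p=0.5, steps=999):
    """Grid search for the s that maximizes the log likelihood while p stays fixed."""
    # The end points s = 0 and s = 1 are excluded because this parameterization
    # is undefined there.
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    return max(grid, key=lambda s: sum(beta_binomial_log_pmf(ref, alt, p, s)
                                       for ref, alt in sample_counts))

# Hypothetical (reference, alternative) counts of all SNPs within one sample.
tight_sample = [(50, 50), (48, 52), (51, 49), (49, 51)]
noisy_sample = [(90, 10), (10, 90), (80, 20), (20, 80)]

print(estimate_dispersion(tight_sample))
print(estimate_dispersion(noisy_sample))
```

A sample whose SNPs all sit close to 50/50 ends up at the lower end of the grid, while a sample with strongly scattered allele ratios gets a clearly larger dispersion.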

##Beta Binomial LRT P value

As discussed in the previous section, the beta binomial has two parameters, of which s was estimated previously per sample. The beta binomial test estimates p by maximizing the beta binomial likelihood function (equivalently, minimizing the negative log likelihood) while varying p.

In practice, the beta binomial is the more appropriate model: the plain binomial assumes no overdispersion, which makes it very sensitive to the overdispersed data from a sequencing experiment.

The null hypothesis is, again, the likelihood of p = 0.5, while the alternative hypothesis is the likelihood of the freely estimated p.
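The sketch below puts this test together for one SNP. It rests on assumptions that are not confirmed by the text: the beta binomial uses alpha + beta = (1 - s) / s, p is found by a grid search rather than a numerical optimizer, and the P value comes from a chi-square distribution with one degree of freedom. The per sample dispersion values and read counts are made up.

```python
import math

def log_beta(a, b):
    """Log of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_log_pmf(ref, alt, p, s):
    """Log probability of (ref, alt) reads under a beta binomial with mean p and dispersion s."""
    # Assumed parameterization (not taken from the package): alpha + beta = (1 - s) / s.
    scale = (1 - s) / s
    alpha, beta = p * scale, (1 - p) * scale
    log_choose = math.lgamma(ref + alt + 1) - math.lgamma(ref + 1) - math.lgamma(alt + 1)
    return log_choose + log_beta(ref + alpha, alt + beta) - log_beta(alpha, beta)

def beta_binomial_lrt(samples, steps=1000):
    """samples holds (ref, alt, s) per sample; s is that sample's fixed dispersion."""
    def log_likelihood(p):
        return sum(beta_binomial_log_pmf(ref, alt, p, s) for ref, alt, s in samples)

    # Alternative hypothesis: p is freely estimated (grid search as a stand-in optimizer).
    grid = [i / steps for i in range(1, steps)]
    p_hat = max(grid, key=log_likelihood)
    # Null hypothesis: no allelic imbalance, p = 0.5.
    statistic = max(0.0, 2 * (log_likelihood(p_hat) - log_likelihood(0.5)))
    # Chi-square survival function with 1 degree of freedom (assumed).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return p_hat, statistic, p_value

# The same hypothetical read counts with a low and a high per sample dispersion.
low_dispersion = [(30, 10, 0.001), (25, 15, 0.001), (40, 20, 0.001)]
high_dispersion = [(30, 10, 0.2), (25, 15, 0.2), (40, 20, 0.2)]
print(beta_binomial_lrt(low_dispersion))
print(beta_binomial_lrt(high_dispersion))
```

With identical read counts, the higher dispersion yields the larger P value, which is the intended effect of accounting for overdispersion.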

##Cell type specific binomial LRT P value

To determine cell type specificity, we add cell proportions per individual (cellProp) to the test. This changes how p is parameterized.

We vary p per individual by taking into account two proportions: the celltype proportion and some residual proportion:

p = (p_celltype * cellProp) + p_residual

where 0 <= p <= 1.

Now maximum likelihood is determined by varying two parameters: p_celltype and p_residual.

The null likelihood is determined by estimating p in the non cell type specific manner, while the alternative likelihood is determined by estimating p with the cell proportion added as described above.
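A minimal sketch of the cell type specific binomial test is given below. It is an illustration under assumptions, not the package's code: the two parameters are found with a coarse grid search instead of a numerical optimizer, combinations that push any individual's p outside (0, 1) are rejected, and the P value is taken from a chi-square distribution with one degree of freedom because the alternative has one parameter more than the null. All counts and cell proportions are made up.

```python
import math

def log_likelihood(samples, p_celltype, p_residual):
    """Binomial log likelihood with p = p_celltype * cellProp + p_residual per individual.

    The binomial coefficient is left out: it does not depend on p and cancels
    in the likelihood ratio.
    """
    ll = 0.0
    for ref, alt, cell_prop in samples:
        p = p_celltype * cell_prop + p_residual
        if not 0.0 < p < 1.0:
            return -math.inf  # parameter combination outside the allowed range
        ll += ref * math.log(p) + alt * math.log(1 - p)
    return ll

def cell_type_binomial_lrt(samples, steps=100):
    # Null: one p for everyone, estimated analytically (p_celltype fixed at 0).
    total_ref = sum(ref for ref, _, _ in samples)
    total = sum(ref + alt for ref, alt, _ in samples)
    p_null = min(max(total_ref / total, 1e-12), 1 - 1e-12)
    ll_null = log_likelihood(samples, 0.0, p_null)

    # Alternative: grid search over both parameters.
    best = (ll_null, 0.0, p_null)
    for i in range(-steps + 1, steps):
        for j in range(1, steps):
            p_celltype, p_residual = i / steps, j / steps
            ll = log_likelihood(samples, p_celltype, p_residual)
            if ll > best[0]:
                best = (ll, p_celltype, p_residual)

    ll_alt, p_celltype, p_residual = best
    statistic = max(0.0, 2 * (ll_alt - ll_null))
    p_value = math.erfc(math.sqrt(statistic / 2))  # chi-square, 1 degree of freedom (assumed)
    return p_celltype, p_residual, statistic, p_value

# Hypothetical (ref, alt, cellProp) per individual.
dependent = [(10, 30, 0.1), (20, 20, 0.4), (28, 12, 0.7), (36, 4, 0.9)]
independent = [(20, 20, 0.1), (20, 20, 0.4), (20, 20, 0.7), (20, 20, 0.9)]
print(cell_type_binomial_lrt(dependent))
print(cell_type_binomial_lrt(independent))
```

In the first data set the reference fraction rises with the cell proportion, so the cell type term improves the fit and the P value is small. In the second it adds nothing and the P value is 1.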

##Cell type specific beta binomial LRT P value

The cell type specific beta binomial likelihood ratio test is determined in the same fashion as the cell type specific binomial test. The difference is that dispersion is added to the test, so the distribution is a beta binomial.