Introduction to ASE - molgenis/systemsgenetics GitHub Wiki
Why allele specific expression?
Nearly all animals and some plants are diploid. This means that the mother and the father of an animal each provide a copy of (almost) identical genetic information to the offspring. The offspring therefore carries its genetic information in duplicate in all its normal cells. This duplicate information is interesting, because differences between the genetic information from each parent can change the inner workings of the offspring's cells, and subsequently the rest of the organism.
This software framework tries to determine to what extent small differences in genetic information between the parental alleles (different versions of the same gene) change the inner workings of a cell. The framework uses quantitative sequencing data from multiple individuals to identify which allele each sequence maps to, and then to determine the balance between the alleles.
Biological reasoning behind ASE
Homologous chromosomes are usually separated in the nucleus, and the genetic information on a chromosome is widely thought to act only on its immediate surroundings. Thus, a variant that differs between the two chromosomes (a heterozygous variant) will only have an effect on the proteins and sequence in the surrounding area, if it has any effect at all.
An example would be a variant in a transcription factor binding site:
Transcription factors bind to specific DNA motifs and control the expression of genes. If a variant changes such a motif, it changes the binding affinity of the transcription factor for that binding site. The changed binding affinity then causes an increase or decrease in the expression of the gene on which the transcription factor acts.
Because the variant sits on only one of the two alleles, expression differs between the alleles, and one can correlate the expression of each allele with the variant. This is what this software package does.
### Difference between ASE and read depth eQTL
Read depth eQTLs are based on the read depth of a gene, compared across multiple individuals. ASE eQTLs are based only on the reads that overlap heterozygous variants in the gene, within the same individual. ASE eQTLs therefore use less data, but because the comparison is made within an individual rather than across individuals, the data are less prone to cofactors influencing the outcome.
Cell type specific allele specific expression
When taking a sample from an organism, it may be difficult to retrieve homogeneous tissue samples. To correct for this cofactor, this software package is able to integrate cell count ratios to determine cell type specific ASE effects.
# Examples of allele specific expression detection
To illustrate allele specific expression, two examples are given here:
Example #1: ASE of a single SNP in a gene:
We want to know whether the gene "examplase" is under expression control of the genetic variant "rs3x4mpl3". In this example, rs3x4mpl3 is located in a transcribed region of examplase, so we can measure how many sequencing reads overlap rs3x4mpl3.
rs3x4mpl3 is a single nucleotide polymorphism (SNP) with two alleles: adenine (A) and guanine (G).
Expression was measured in three individuals using RNA-seq and read counts over the SNP were determined. Reads overlapping rs3x4mpl3 are shown in the following table:
Individual | Reads with allele A | Reads with allele G | Genotype |
---|---|---|---|
Suzie | 22 | 2 | [A, G] |
Peter | 25 | 3 | [A, G] |
Walt | 43 | 0 | [A, A] |
Here we see that two individuals have both alleles (heterozygous), while Walt does not have the G allele.
We can only identify allele specific expression when an individual is heterozygous for a variant, as the balance between the two alleles cannot be determined if only one allele is present.
Based on this table, there is evidence that the A allele increases expression: in the heterozygous individuals many more reads map to the A allele than to the G allele. Statistics can then be used to increase our confidence that this is a real imbalance; this is discussed in the methods section.
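The table above can be turned into a small calculation. The following is a minimal sketch, not the framework's own code: it keeps only the heterozygous individuals, computes the fraction of reads carrying the A allele, and applies an exact two-sided binomial test against a 50/50 expectation. The binomial test is an assumption chosen for illustration; the statistics actually used are described in the methods section. The names `READS`, `is_heterozygous`, `allelic_ratio` and `binomial_two_sided_p` are hypothetical.

```python
from math import comb

# Read counts over rs3x4mpl3, copied from the table above:
# individual -> (reads with A, reads with G, genotype)
READS = {
    "Suzie": (22, 2, ("A", "G")),
    "Peter": (25, 3, ("A", "G")),
    "Walt": (43, 0, ("A", "A")),
}


def is_heterozygous(genotype):
    """A genotype is heterozygous when its two alleles differ."""
    return genotype[0] != genotype[1]


def allelic_ratio(a_reads, g_reads):
    """Fraction of reads that carry the A allele."""
    return a_reads / (a_reads + g_reads)


def binomial_two_sided_p(successes, total, p=0.5):
    """Exact two-sided binomial p-value (illustrative stand-in test).

    Sums the probability of every outcome that is at most as likely
    as the observed one under the null hypothesis of allelic balance.
    """
    pmf = [comb(total, k) * p**k * (1 - p) ** (total - k) for k in range(total + 1)]
    observed = pmf[successes]
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


# Only heterozygotes are informative; Walt [A, A] is dropped.
informative = {
    name: (a, g) for name, (a, g, gt) in READS.items() if is_heterozygous(gt)
}

for name, (a, g) in informative.items():
    print(name, round(allelic_ratio(a, g), 3), binomial_two_sided_p(a, a + g))
```

A ratio far from 0.5 together with a small p-value points to an imbalance in that individual. This sketch tests each individual separately and does not show how evidence is combined across individuals.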
Example #2: ASE of a gene based on a test SNP:
Consider a gene examplase2, similar to the gene in example #1, that may be under the influence of a variant (named "rsX") outside the gene region, as shown in the following diagram:
    -----------------------Some test region--------------------------
             Test SNP         |///////////gene region////////////|
                rsX                        examplase2
                                           gene SNP #
                                  1        2        3        4
    Suzie:      Het              Het      Het      Het      Hom
    Allele 1: ~~~T~~~~~~~~~~~~|___A________T________G________C___|~~~~~~~~~~~~
    Allele 2: ~~~C~~~~~~~~~~~~|___G________A________A________C___|~~~~~~~~~~~~
    Peter:      Het              Het      Hom      Hom      Hom
    Allele 1: ~~~C~~~~~~~~~~~~|___A________A________G________A___|~~~~~~~~~~~~
    Allele 2: ~~~T~~~~~~~~~~~~|___G________A________G________A___|~~~~~~~~~~~~
    Walt:       Hom              Het      Hom      Het      Het
    Allele 1: ~~~C~~~~~~~~~~~~|___A________A________A________C___|~~~~~~~~~~~~
    Allele 2: ~~~C~~~~~~~~~~~~|___G________A________G________A___|~~~~~~~~~~~~
    -------------------------------Legend:--------------------------------------
    Het: Heterozygote                 Hom: Homozygote
    "X": Position of the test SNP     [A,G,T,C]: Base at SNP position
    "~": Non SNP in region            "_": Non SNP in gene region
In this example we use the same individuals as in example #1, but instead of a single SNP we now use multiple SNPs that are on the same allele. For simplicity, no read counts are shown.
We call rsX the test SNP, and the SNPs in the gene region gene SNPs. When considering a test SNP, we only take into account the individuals that are heterozygous for it. We do not count the reads over the test SNP itself, only the reads that overlap the gene SNPs. Note that a test SNP can also be a gene SNP; in that case the reads over the test SNP are counted, because it is also a gene SNP.
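The Het/Hom labels in the diagram follow directly from the two alleles of each individual. Below is a minimal sketch, assuming the phased alleles are encoded as strings in the order test SNP, gene SNP 1 to 4. The `HAPLOTYPES` layout and the function name `zygosity` are hypothetical and not part of the framework.

```python
# Phased alleles copied from the diagram: (allele 1, allele 2).
# Position 0 is the test SNP rsX, positions 1-4 are gene SNPs 1-4.
HAPLOTYPES = {
    "Suzie": ("TATGC", "CGAAC"),
    "Peter": ("CAAGA", "TGAGA"),
    "Walt": ("CAAAC", "CGAGA"),
}


def zygosity(allele_1, allele_2):
    """Label each position Het when the two bases differ, Hom otherwise."""
    return ["Het" if a != b else "Hom" for a, b in zip(allele_1, allele_2)]


for name, (allele_1, allele_2) in HAPLOTYPES.items():
    labels = zygosity(allele_1, allele_2)
    # An individual is usable for the test SNP only if position 0 is Het.
    usable = labels[0] == "Het"
    print(name, labels, "usable for rsX:", usable)
```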
To determine how rsX influences the expression of examplase2, we determine all the reads that are on the same allele as each of the rsX alleles. We do the following:
First, Walt is excluded from the analysis because he is homozygous for the test SNP, so there can be no allele specific difference in expression attributable to it.
Then, determine per individual the number of reads that are on each test SNP allele.
We do this by taking all heterozygous gene SNPs on the test SNP allele. For Suzie and test SNP allele T this means: all reads over the A allele of gene SNP 1, the reads over the T allele of gene SNP 2, and the reads over the G allele of gene SNP 3.
Do the same for the other allele of Suzie, and for all other individuals that are heterozygous for test SNP rsX.
Finally, combine the allele specific (AS) reads of all individuals, and determine how large the imbalance is using the statistical methods described in the methods section.
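The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the framework's implementation: the alleles are taken from the diagram, but the read counts in `READS` are entirely hypothetical, because the example shows no read numbers. The names `HAPLOTYPES`, `READS` and `reads_per_test_allele` are also hypothetical.

```python
# Phased alleles from the diagram: (allele 1, allele 2).
# Index 0 is the test SNP rsX, indices 1-4 are gene SNPs 1-4.
HAPLOTYPES = {
    "Suzie": ("TATGC", "CGAAC"),
    "Peter": ("CAAGA", "TGAGA"),
    "Walt": ("CAAAC", "CGAGA"),
}

# HYPOTHETICAL read counts per gene SNP index and base; they exist only
# to make the sketch runnable and are not taken from any real data.
READS = {
    "Suzie": {1: {"A": 10, "G": 4}, 2: {"T": 8, "A": 3}, 3: {"G": 12, "A": 5}, 4: {"C": 20}},
    "Peter": {1: {"A": 3, "G": 9}, 2: {"A": 15}, 3: {"G": 11}, 4: {"A": 7}},
    "Walt": {1: {"A": 6, "G": 6}, 2: {"A": 9}, 3: {"A": 4, "G": 5}, 4: {"C": 2, "A": 3}},
}


def reads_per_test_allele(hap1, hap2, reads):
    """Return {test SNP base: read count}, or None if the test SNP is homozygous."""
    if hap1[0] == hap2[0]:
        return None  # e.g. Walt: no test SNP allele to assign reads to
    totals = {hap1[0]: 0, hap2[0]: 0}
    for pos in range(1, len(hap1)):
        if hap1[pos] == hap2[pos]:
            continue  # homozygous gene SNP: reads cannot be assigned to an allele
        totals[hap1[0]] += reads[pos].get(hap1[pos], 0)
        totals[hap2[0]] += reads[pos].get(hap2[pos], 0)
    return totals


# Combine the allele specific reads of all informative individuals.
combined = {}
for name, (hap1, hap2) in HAPLOTYPES.items():
    totals = reads_per_test_allele(hap1, hap2, READS[name])
    if totals is None:
        continue
    for base, count in totals.items():
        combined[base] = combined.get(base, 0) + count

print(combined)
```

Note that Peter carries the test SNP alleles in the opposite order to Suzie (C on allele 1, T on allele 2), which is why the reads are keyed by the test SNP base rather than by "allele 1" or "allele 2". The combined counts would then be passed to the statistical test described in the methods section.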