Plotting Mitochondrial Genetic Association Results - janeshdev/ggplot2 GitHub Wiki
In the field of genetic epidemiology, we use genetic association studies to determine if there is a genetic basis for a given disease or trait. These association studies consist of genotyping the DNA of a number of individuals at many sites in the genome which vary between people. Each of these sites of variation is then statistically tested for association with the trait, essentially looking for a trend in which individuals with a disease possess a given genotype more than would be expected by chance. This test statistic can then determine the P-value of the association. The P-value is the probability of observing a test statistic as high or higher than the one which was observed assuming that there is actually no association. Based on an arbitrary cut-off defined by R.A. Fisher, a P-value of 0.05 is deemed significant. In many genetic association studies, particularly genome-wide association studies (GWAS), the results of the association study are plotted as the P-value on the y-axis and the base-pair position of the association on the x-axis. Because GWAS can have 500,000 or 1 million statistical test results, the resulting plot can look like a city-scape and is therefore termed the Manhattan plot.
While a Manhattan plot makes sense for nuclear DNA, in which the DNA is actually linear, the genome of the Human mitochondria is circular and it would therefore be more intuitive to plot the association results as such so that they can be overlaid on the actual mitochondrial genome. The results which are plotted in this example are generated by sampling due to the inability to provide actual data or results, however the results I wished to visualize when creating this plot come from a genetic association study looking at the affect of genetic variation in the mitochondrial genome on the probability of experiencing peripheral neuropathy while taking antiretroviral drugs as therapy for HIV infection. Instead of plotting the P-values as is, they are negative log transformed to better visualize and take advantage of the fact that smaller P-values are more significant. A polar plot is used to visualize the association P-values in which the theta value of each point is the base-pair location in the mitochondrial genome of the genetic variant being tested and the r value of each point is the negative log P-value of the test of association for that variant. Subplots are used to view the results of associations separately for each race strata for which analysis was conducted.
The data consists of 4 columns:
snp - The mitochondrial base-pair position of the genetic variant
p -> The P-value of the statistical association between the genetic variant and Peripheral Neuropathy (The P-value indicates the probability of finding a test statistic as great or greater than the one found for this test given that the null hypothesis of no association is indeed true and a value of 0.05 is deemed significant)
race -> The race stratum in which this association was tested. Race stratification was used to divide the sample into White and African American.
Before generating the plot, a column describing the mitochondrial gene in which the genetic variant is found will be added; addgenelabel is the function which will add this column. Next, I will add on a column for the negative log p-value. Because a smaller P value is better, it is easier to visualize a P-value using a negative log transformation. A set of points to define p = 0.05, the point of significance, will be added. I will also include an extra line to force the y axis to expand because changing the y range is not yet supported for polar plots. A color palette defined by gene will be included to help distinguish neighboring points. A dataframe containing the mitochondrial gene boundaries will be added for reference. A data frame is next generated to define all SNP positions and the genes associated with them.
I have deemed this new plot the Solar Plot.
The interpretation from this plot is that there are a few genetic variants which appear to be associated with risk for peripheral neuropathy and that these risk factors vary by race. In particular, it seems like there are associations with variants in genes coding for subunits of the NADH dehydrogenase (ND1 and ND5) only in the African American cohort while there are associations with variation in genes coding for ribosomal RNA (rRNA) and parts of the Cytochrome c oxidase complex (CO1 in Whites and CO2 in African Americans). Although the implications of this are not readily apparent with respect to biology and are irrelevant in this case due to the use of simulated data, the use of the Solar Plot does assist in the visualization of the results of genetic association studies performed on mitochondrial sequence. In particular for those scientists accustomed to mitochondria but not to genetics, this would be a more intuitive manner to view the data.
library(ggplot2)
mitodata <- data.frame(snp = seq(1,16601,by = 100),p = sample(seq(0,1,by = 0.001),167,replace = TRUE),race = sample(c(1,2),167,replace = TRUE))
addgenelabel <- function(snp,gene) { gene <- ifelse(snp < 577,gene <- "Control-Region", ifelse(snp < 648,gene <- "tRNA", ifelse(snp < 1602,gene <- "rRNA", ifelse(snp < 1671,gene <- "tRNA", ifelse(snp < 3230,gene <- "rRNA", ifelse(snp < 3305,gene <- "tRNA", ifelse(snp < 3307,gene <- "Non-Coding", ifelse(snp < 4263,gene<- "ND1", ifelse(snp < 4332,gene <- "tRNA", ifelse(snp < 4401,gene <- "tRNA", ifelse(snp < 4402,gene <- "Non-Coding", ifelse(snp < 4470,gene <- "tRNA", ifelse(snp < 5512,gene <- "ND2", ifelse(snp < 5580,gene <- "tRNA", ifelse(snp < 5587,gene <- "Non-Coding", ifelse(snp < 5656,gene <- "tRNA", ifelse(snp < 5657,gene <- "Non-Coding", ifelse(snp < 5730,gene <- "tRNA", ifelse(snp < 5826,gene <- "tRNA", ifelse(snp < 5892,gene <- "tRNA", ifelse(snp < 5904,gene <- "Non-Coding", ifelse(snp < 7446,gene <- "CO1", ifelse(snp < 7515,gene <- "tRNA", ifelse(snp < 7518,gene <- "Non-Coding", ifelse(snp < 7586,gene <- "tRNA", ifelse(snp < 8270,gene <- "CO2", ifelse(snp < 8295,gene <- "Non-Coding", ifelse(snp < 8365,gene <- "tRNA", ifelse(snp < 8366,gene <- "Non-Coding", ifelse(snp < 8573,gene <- "ATP8", ifelse(snp < 9208,gene <- "ATP6", ifelse(snp < 9991,gene <- "CO3", ifelse(snp < 10059,gene <- "tRNA", ifelse(snp < 10405,gene <- "ND3", ifelse(snp < 10470,gene <- "tRNA", ifelse(snp < 10767,gene <- "ND4L", ifelse(snp < 12138,gene <- "ND4", ifelse(snp < 12207,gene <- "tRNA", ifelse(snp < 12266,gene <- "tRNA", ifelse(snp < 12337,gene <- "tRNA", ifelse(snp < 14149,gene <- "ND5", ifelse(snp < 14674,gene <- "ND6", ifelse(snp < 14743,gene <- "tRNA", ifelse(snp < 14747,gene <- "Non-Coding", ifelse(snp < 15888,gene <- "CYB", ifelse(snp < 15954,gene <- "tRNA", ifelse(snp < 15956,gene <- "Non-Coding", ifelse(snp < 16024,gene <- "tRNA", ifelse(snp < 17000,gene <- "Control-Region") ))))))))))))))))))) ))))))))))))))))))) ))))))))) ) }
mitodata$gene <- addgenelabel(mitodata$snp,mitodata$gene)
str(mitodata)
mitodata$neglogp <- -1*log10(mitodata$p)
mitodata$neglogpline <- -1*log10(0.05)
mitodata$extraline <- -3
mitodata$race2 <- ifelse(mitodata$race == 1, mitodata$race2 <- "White", "African American")
colours <- c("Control-Region" <- "lightblue4", "tRNA" <- "magenta4", "rRNA" <- "mediumaquamarine", "Non-Coding" <- "sienna4", "ND1" <- "magenta", "ND2" <- "mediumblue", "CO1" <- "olivedrab", "CO2" <- "orange2", "ATP8" <- "orchid4", "ATP6" <- "red3", "CO3" <- "royalblue2", "ND3" <- "palegreen4", "ND4L" <- "grey0", "ND4" <- "pink4", "ND5" <- "yellow4", "ND6" <- "steelblue4", "CYB" <- "tan","red")
visibleboundaries <- c(1,576,1601,3229,4262,5511,7445,8269,9207,9990,10404,10766,12137,14148,14673,15887)
bdries <- data.frame(x = visibleboundaries,y=-.5)
bdries$gene <- addgenelabel(bdries$x,bdries$gene)
lines <- data.frame(x = seq(0,16567,by=1),y = 0)
lines$gene <- addgenelabel(lines$x,lines$gene)
p <- ggplot(mitodata, aes(x = snp,y = neglogp,color = gene)) + geom_point()+ coord_polar(direction = -1) + geom_line(aes(x,1.30,color = "red"),data = lines) + facet_grid(.~race2) + geom_line(aes(y=extraline)) + geom_point(aes(x,y,color = gene),data=lines) + scale_colour_manual(values = colours,"Genes",breaks = c("Control-Region","tRNA","rRNA","Non-Coding","ND1","ND2","CO1","CO2","ATP8","ATP6","CO3","ND3","ND4L","ND4","ND5","ND6","CYB"),labels = c("Control Region","tRNA","rRNA","Non-Coding","ND1","ND2","CO1","CO2","ATP8","ATP6","CO3","ND3","ND4L","ND4","ND5","ND6","CYB"))+ opts(title = "Negative Log P-value of Mitochondrial Hits", axis.text.x = theme_blank(), axis.title.y = theme_blank(), axis.title.x=theme_blank()) + layer(geom="text",mapping =aes(x,y,label = x),data = bdries,size=2.5)
p