VIII. DNA Methylation - abishpius/R-for-Computational-Biology GitHub Wiki
DNA Methylation Data Analysis in R
In this unit we will show an example of analyzing methylation data. We will use colon cancer data from TCGA. The data was created with the Illumina 450K array, and we have already processed the raw data to create a matrix with methylation measurements. The script that creates these objects is here: https://github.com/genomicsclass/labs/blob/master/Rscripts/read_tcga_meth.R
Let's begin by loading the data:
# devtools::install_github("genomicsclass/coloncancermeth")
library(S4Vectors)
library(coloncancermeth)
data(coloncancermeth)
We now have three tables: one containing the methylation data, one with information about the samples (the columns of the data matrix), and a GRanges object with the genomic locations of the CpGs represented in the rows of the data matrix:
dim(meth) ##this is the methylation data
dim(pd) ##this is sample information
length(gr)
The pd object includes clinical information. One of the columns tells us whether the sample is from colon cancer or from normal tissue:
colnames(pd)
table(pd$Status)
normalIndex <- which(pd$Status=="normal")
cancerlIndex <- which(pd$Status=="cancer")
Let's start by taking a quick look at the distribution of methylation measurements for the normal samples:
i=normalIndex[1]
plot(density(meth[,i],from=0,to=1),main="",ylim=c(0,3),type="n")
for(i in normalIndex){
lines(density(meth[,i],from=0,to=1),col=1)
}
### Add the cancer samples
for(i in cancerlIndex){
lines(density(meth[,i],from=0,to=1),col=2)
}
We are interested in finding regions of the genome that differ between cancer and normal samples. Furthermore, we want regions that are consistently different, so we can treat this as an inference problem. We can compute a t-statistic for each CpG:
library(limma)
X<-model.matrix(~pd$Status)
fit<-lmFit(meth,X)
eb <- eBayes(fit)
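To make the per-CpG model concrete, here is a minimal sketch on simulated data. The objects sim_meth, sim_status and X_sim are made up for illustration and are not the TCGA data. With a two-group design matrix, the second least-squares coefficient is the difference in group means (second factor level minus the reference level), which is the kind of quantity stored in fit$coef[,2]. Note that eBayes additionally moderates the variance estimates, which this sketch does not attempt.

```r
set.seed(1)
# Simulated beta values: 5 CpGs (rows) x 8 samples (columns); purely illustrative
sim_status <- factor(rep(c("normal", "cancer"), each = 4),
                     levels = c("normal", "cancer"))
sim_meth <- matrix(runif(5 * 8, 0.2, 0.8), nrow = 5)
# Plant a difference in the first CpG
sim_meth[1, sim_status == "cancer"] <- sim_meth[1, sim_status == "cancer"] + 0.15

X_sim <- model.matrix(~ sim_status)

# Ordinary least squares for every row at once: (X'X)^{-1} X'y
coefs <- t(solve(crossprod(X_sim), t(X_sim) %*% t(sim_meth)))

# The second coefficient equals mean(cancer) - mean(normal) for each CpG
diff_means <- rowMeans(sim_meth[, sim_status == "cancer"]) -
  rowMeans(sim_meth[, sim_status == "normal"])

# Ordinary (unmoderated) pooled-variance t-statistic for the first CpG
t1 <- t.test(sim_meth[1, sim_status == "cancer"],
             sim_meth[1, sim_status == "normal"],
             var.equal = TRUE)$statistic

cbind(coefficient = coefs[, 2], difference = diff_means)
```

The two columns printed at the end agree, which is why the x-axis of a volcano plot built from these coefficients can be read as an effect size on the methylation scale.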
A volcano plot reveals many differences:
library(rafalib)
splot(fit$coef[,2],-log10(eb$p.value[,2]),xlab="Effect size",ylab="-log10 p-value")
If we believe that DNA methylation affects gene expression only when a region of the genome is altered, not just a single CpG, we should look beyond individual CpGs. Here is a plot of the region surrounding the top hit:
library(GenomicRanges)
i <- which.min(eb$p.value[,2])
middle <- gr[i,]
Index<-gr%over%(middle+10000)
cols=ifelse(pd$Status=="normal",1,2)
chr=as.factor(seqnames(gr))
pos=start(gr)
plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference")
matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location")
We can search for these regions explicitly instead of searching for single points, as explained by Jaffe and Irizarry (2012) [http://www.ncbi.nlm.nih.gov/pubmed/22422453].
If we are going to perform a regional analysis, we first have to define a region. One issue is that we not only have to separate the analysis by chromosome, but within each chromosome there are usually large gaps between CpGs, which create subgroups of CpGs to be analyzed separately.
chr1Index <- which(chr=="chr1")
hist(log10(diff(pos[chr1Index])),main="",xlab="log10 distance between consecutive CpGs (bp)")
We can create groups in the following way.
# BiocManager::install("bumphunter")
library(bumphunter)
cl=clusterMaker(chr,pos,maxGap=500)
table(table(cl)) ##shows the number of regions with 1,2,3, ... points in them
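The grouping idea can be sketched in a few lines of base R. The helper gap_clusters below is hypothetical and written only for illustration; it is not the bumphunter implementation. It assumes that a new group starts whenever the chromosome changes or the distance to the previous CpG exceeds the maximum gap.

```r
# Hypothetical helper, not the bumphunter implementation: start a new cluster
# whenever the gap to the previous CpG on the same chromosome exceeds max_gap.
gap_clusters <- function(chr, pos, max_gap = 500) {
  stopifnot(length(chr) == length(pos))
  ord <- order(chr, pos)                 # sort by chromosome, then position
  chr_s <- as.character(chr)[ord]
  pos_s <- pos[ord]
  new_cluster <- c(TRUE,
                   chr_s[-1] != chr_s[-length(chr_s)] |  # chromosome changes
                     diff(pos_s) > max_gap)              # or the gap is too large
  ids <- integer(length(pos))
  ids[ord] <- cumsum(new_cluster)        # map cluster ids back to input order
  ids
}

# Toy positions on two chromosomes (made up)
toy_chr <- c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2")
toy_pos <- c(100, 300, 1200, 1500, 150, 5000)
toy_cl <- gap_clusters(toy_chr, toy_pos)
toy_cl
table(table(toy_cl))  # number of clusters containing 1, 2, ... CpGs
```

In this toy example the first two CpGs form one group, the next two form another, and the two chr2 CpGs are each on their own because they are 4850 bp apart.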
Now let's consider two example regions:
###Select the region with the smallest value
Index<- which(cl==cl[which.min(fit$coef[,2])])
matplot(pos[Index],meth[Index,],col=cols,pch=1,xlab="genomic location",ylab="methylation")
x1=pos[Index]
y1=fit$coef[Index,2]
plot(x1,y1,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=0,lty=2)
abline(h=c(-.1,.1),lty=2)
This region shows only a single CpG as different. In contrast, notice this region:
Index=which(cl==72201) ##we know this is a good example from analysis we have already performed
matplot(pos[Index],meth[Index,],col=cols,pch=1,xlab="genomic location",ylab="methylation")
x2=pos[Index]
y2=fit$coef[Index,2]
plot(x2,y2,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=0,lty=2)
abline(h=c(-.1,.1),lty=2)
If we are interested in prioritizing regions over single points, we need an alternative approach. If we assume that the real signal is smooth, we could use statistical smoothing techniques such as loess. Here is an example using the two regions above:
lfit <- loess(y1~x1,degree=1,family="symmetric",span=1/2)
plot(x1,y1,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=c(-.1,0,.1),lty=2)
lines(x1,lfit$fitted,col=2)
lfit <- loess(y2~x2,degree=1,family="symmetric",span=1/2)
plot(x2,y2,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=c(-.1,0,.1),lty=2)
lines(x2,lfit$fitted,col=2)
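The same contrast can be reproduced on simulated data, which may help if the colon cancer data is not at hand. This is only a sketch: the positions, noise level and bump shape below are made up, and the loess settings are copied from the calls above.

```r
set.seed(2)
sim_pos <- seq(1, 2000, length.out = 20)   # made-up genomic positions
noise <- rnorm(20, sd = 0.02)

# Region A: flat except for one outlying CpG
y_single <- noise
y_single[10] <- 0.5

# Region B: a smooth bump spanning several neighbouring CpGs
y_bump <- 0.5 * exp(-((sim_pos - 1000) / 400)^2) + noise

fit_single <- loess(y_single ~ sim_pos, degree = 1, family = "symmetric", span = 1/2)
fit_bump   <- loess(y_bump ~ sim_pos, degree = 1, family = "symmetric", span = 1/2)

# Smoothed value at the outlier versus the smoothed peak of the bump
c(single = unname(fit_single$fitted[10]), bump = max(fit_bump$fitted))
```

The smoother largely ignores the isolated point but follows the bump, which is the behavior we want when ranking regions rather than single CpGs.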
The bumphunter function automates this procedure:
res<-bumphunter(meth,X,chr=chr,pos=pos,cluster=cl,cutoff=0.1,B=0)
tab<-res$table
We now have a list of regions instead of single points. Here we look at the region with the highest rank if we order by area:
Index=(tab$indexStart[1]-3):(tab$indexEnd[1]+3) ##columns 7 and 8 of the table; pad the region by 3 CpGs on each side
matplot(pos[Index],meth[Index,,drop=TRUE],col=cols,pch=1,xlab="genomic location",ylab="Methylation",ylim=c(0,1))
plot(pos[Index],res$fitted[Index,1],xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=c(-0.1,0,.1),lty=2)
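To illustrate what a candidate region is, here is a minimal sketch of the thresholding step on a made-up vector of estimated differences. The helper find_bumps is hypothetical; we assume a region is a run of consecutive CpGs whose estimate is beyond the cutoff in the same direction, and that its area is the sum of the absolute estimates. The real function also respects cluster boundaries and reports more columns.

```r
# Hypothetical helper for illustration only
find_bumps <- function(est, cutoff = 0.1) {
  # +1 above the cutoff, -1 below minus the cutoff, 0 otherwise
  state <- ifelse(est > cutoff, 1L, ifelse(est < -cutoff, -1L, 0L))
  runs <- rle(state)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L
  keep <- which(runs$values != 0L)       # runs that are beyond the cutoff
  data.frame(
    indexStart = starts[keep],
    indexEnd   = ends[keep],
    L          = runs$lengths[keep],     # number of CpGs in the region
    area       = vapply(keep,            # sum of |estimate| over the region
                        function(k) sum(abs(est[starts[k]:ends[k]])),
                        numeric(1))
  )
}

toy_est <- c(0.02, 0.15, 0.30, 0.25, 0.05, -0.20, -0.12, 0.01, 0.40)
bumps <- find_bumps(toy_est)
bumps[order(-bumps$area), ]              # rank regions by area
```

Ranking by area favors the three-CpG region (area 0.70) over the single CpG with the largest individual difference (area 0.40).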
The function also allows for smoothing and permutation-based inference for the regions. However, we do not recommend running the function with these options without the ability to parallelize.
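The idea behind permutation-based inference can be sketched on a small simulated data set. Everything here is an assumption made for illustration: the data, the helper max_area, and the choice of comparing the observed largest area with the largest area seen in each permutation. It is meant to convey why the computation is expensive (the whole region search is repeated B times), not to reproduce the function's exact procedure.

```r
set.seed(3)
n_cpg <- 30
n_per_group <- 6
group <- rep(c(0, 1), each = n_per_group)   # 0 = reference, 1 = comparison (made up)
sim <- matrix(rnorm(n_cpg * 2 * n_per_group, mean = 0.5, sd = 0.05), nrow = n_cpg)
sim[11:15, group == 1] <- sim[11:15, group == 1] + 0.3   # planted 5-CpG region

# Largest area (sum of |difference|) among runs of consecutive CpGs beyond the cutoff
max_area <- function(mat, grp, cutoff = 0.1) {
  d <- rowMeans(mat[, grp == 1, drop = FALSE]) -
    rowMeans(mat[, grp == 0, drop = FALSE])
  state <- sign(d) * (abs(d) > cutoff)           # -1, 0 or +1 per CpG
  run_id <- cumsum(c(TRUE, diff(state) != 0))    # label consecutive runs
  areas <- tapply(abs(d) * (state != 0), run_id, sum)
  max(areas)
}

observed <- max_area(sim, group)

# Shuffle the sample labels B times and redo the whole search each time
B <- 200
null_max <- replicate(B, max_area(sim, sample(group)))

# Proportion of permutations whose largest region is at least as large as observed
p_perm <- mean(null_max >= observed)
p_perm
```

Each permutation repeats the full genome-wide search, so with hundreds of thousands of CpGs and B in the hundreds or thousands, spreading the permutations across cores makes a large practical difference.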
Reading Raw 450k Array Data with minfi
In this unit we will demonstrate how to read idat files from the Illumina 450K DNA methylation array. We make use of the Bioconductor minfi package [cite 24478339].
# BiocManager::install(c("minfi","IlluminaHumanMethylation450kmanifest","IlluminaHumanMethylation450kanno.ilmn12.hg19"))
library(minfi)
The first step is to determine the basename of the idat files. Note that for each sample we have two files: one for the red channel and one for the green channel. These files are found here: https://github.com/genomicsclass/rawdata/tree/master/idats
path <- "idats"
list.files(path)
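As a small illustration of how a basename relates to the two files, here is a sketch using a made-up basename (the identifier 1234567890_R01C01 is hypothetical and does not refer to a file in the repository above). The file names for the two channels share the basename and differ only in the _Grn.idat and _Red.idat suffixes.

```r
# Hypothetical basename in the usual <array ID>_<array position> form
basename_example <- file.path("idats", "1234567890_R01C01")

# Each sample is stored as a pair of files, one per colour channel
idat_files <- paste0(basename_example, c("_Grn.idat", "_Red.idat"))
idat_files

# Recover the shared basename from the file names
recovered <- unique(sub("_(Grn|Red)\\.idat$", "", idat_files))
recovered
```

The same substitution can be applied to the output of list.files to list the basenames present in a directory.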
Let's start by reading in the csv file, which contains clinical information. This has one row for each sample and one of the columns includes the "basenames" for the files.
targets<-read.csv("idats/targets.csv",as.is=TRUE)
names(targets)
targets$Basename
To make this script work in any working directory we can edit that column to contain the absolute paths. Then we are ready to read in the raw data with read.metharray:
targets$Basename <- file.path(path,targets$Basename)
rgset <- read.metharray(targets$Basename,verbose=TRUE)
pData(rgset)<-as(targets, "DataFrame")
We now have the raw data: the red and green intensities, which we can access as follows:
dim(getRed(rgset))
dim(getGreen(rgset))
If you are not interested in developing preprocessing algorithms, you can use the built-in preprocessing algorithm and go straight to an object that gives you access to methylation estimates:
mset <- preprocessIllumina(rgset)
This performs the default preprocessing algorithm developed by Illumina. However, for this to be useful we want the location of each CpG, and to get that we need to map the CpGs to the genome. minfi keeps this information modular so that when the genome annotation is updated, one can easily change the mapping.
mset <- mapToGenome(mset)
Now we are ready to obtain the methylation values and CpG locations.
dim(getBeta(mset,type="Illumina")) ##the argument type="Illumina" gives us default procedure
head(granges(mset))
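For readers who want to see what a methylation value is numerically, here is a minimal sketch of the beta value computed from methylated and unmethylated intensities. The intensities are made up and beta_value is a hypothetical helper; to our understanding, type="Illumina" corresponds to adding an offset of 100 to the denominator, so treat that default as an assumption to check against the minfi documentation.

```r
# Made-up methylated (M) and unmethylated (U) intensities for four CpGs
meth_signal   <- c(9000,  500, 4000,   0)
unmeth_signal <- c(1000, 9500, 4000, 100)

# Beta value with an offset in the denominator (assumed Illumina convention: 100)
beta_value <- function(m, u, offset = 100) m / (m + u + offset)

beta_illumina <- beta_value(meth_signal, unmeth_signal)              # offset 100
beta_plain    <- beta_value(meth_signal, unmeth_signal, offset = 0)  # M / (M + U)
round(cbind(beta_illumina, beta_plain), 3)
```

The offset keeps the ratio stable when both intensities are close to zero, at the cost of pulling the values slightly toward zero.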
We can also use functions such as getSex and getQC on the mset object:
colData(mset)<-getSex(mset)
plotSex(mset)
plot(as.matrix(getQC(mset)))
Inference for DNA Methylation Data
library(minfi) ##Bioc
library(IlluminaHumanMethylation450kmanifest) ##Bioc
library(doParallel) ##CRAN
library(pkgmaker)
library(rafalib)
path="/Users/ririzarr/myDocuments/teaching/HarvardX/tcgaMethylationSubset" # use your own path to downloaded data
targets=read.delim(file.path (path,"targets.txt"),as.is=TRUE)
table(targets$Tissue,targets$Status)
For illustration we will read in the normal colon and lung samples:
index = which( targets$Status=="normal" & targets$Tissue%in%c("colon","lung") )
targets = targets[index,]
dat = read.metharray.exp(base=path,targets = targets, verbose=TRUE)
dat = preprocessIllumina(dat)
dat = mapToGenome(dat)
dat = ratioConvert(dat,type="Illumina")
library(doParallel)
detectCores()
registerDoParallel(cores = 4)
tissue =pData(dat)$Tissue
X= model.matrix(~tissue)
index = which(seqnames(dat)=="chr22")
dat = dat[index,] ## for illustrative purposes
res=bumphunter(dat,X,cutoff=0.1,B=1000)
head(res$tab)
library(rafalib)
library(AnnotationHub)
cgi = AnnotationHub()[["AH5086"]]
tab = res$tab[res$tab$fwer <= 0.05,]
tab = makeGRangesFromDataFrame(tab,keep.extra.columns = TRUE)
map=distanceToNearest(tab,cgi)
d = mcols(map)$distance
prop.table( table( cut(as.numeric(d),c(0,1,2000,5000,Inf),include.lowest=TRUE,right=FALSE) ))
null = granges(dat)
nulltab = makeGRangesFromDataFrame(null,keep.extra.columns = TRUE)
nullmap=distanceToNearest(nulltab,cgi)
nulld = mcols(nullmap)$distance
prop.table( table( cut(nulld,c(0,1,2000,5000,Inf),include.lowest=TRUE,right=FALSE) ))
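The binning used in the two prop.table calls above can be checked on a toy vector. The distances below are made up; the breaks and the cut arguments are the ones used above.

```r
# Made-up distances (in bp) from regions to their nearest CpG island
toy_d <- c(0, 0, 150, 1999, 2000, 4800, 12000)

breaks <- c(0, 1, 2000, 5000, Inf)
bins <- cut(toy_d, breaks, include.lowest = TRUE, right = FALSE)

# right = FALSE makes the intervals closed on the left, so a distance of 0
# (overlapping an island) falls in the first bin and 2000 falls in the third
toy_counts <- table(bins)
prop.table(toy_counts)
```

Comparing the proportions for the significant regions with those for all CpGs on the array shows whether the regions are enriched near CpG islands relative to what the array design alone would produce.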
beta = getBeta(dat)
cols = as.factor(pData(dat)$Tissue)
tab = tab[order(-mcols(tab)$area)]
tab = tab+3000 ##add 3000 to each side
mypar(1,1)
i=17
dataIndex = which(granges(dat)%over%tab[i])
cgiIndex = which(cgi%over%tab[i])
thecgi = cgi[cgiIndex]
pos = start(dat)[dataIndex]
xlim=range(c(pos,start(thecgi),end(thecgi)) )
y = beta[dataIndex,]
matplot(pos,y,col=as.numeric(cols) , xlim=xlim, ylim=c(0,1),ylab="Methylation")
apply(cbind(start(thecgi),end(thecgi)),1,function(x) segments(x[1],0,x[2],0,lwd=4,col=3))
plot(pos,res$fitted[dataIndex],xlim=xlim,ylim=c(-0.4,0.4))
abline(h=0)
apply(cbind(start(thecgi),end(thecgi)),1,function(x) segments(x[1],0,x[2],0,lwd=4,col=3))
table(getIslandStatus(dat))