VII. RNA Seq - abishpius/R-for-Computational-Biology GitHub Wiki
Introduction to RNA Sequencing
RNA-Seq Overview
Topics on RNA-seq technology and data analysis:
- Introduction to RNA-seq: basic concepts of RNA-seq (lectures)
- RNA-seq bioinformatics: first look at FASTQ files; aligning RNA-seq reads; visualizing alignments (screencasts)
- Normalization of read counts: counting RNA-seq reads or fragments in genes; properties of Poisson and ratios, exploratory data analysis (EDA) and variance stabilization (lecture and screencasts)
- Differential expression at the gene-level: count-based differential expression; surrogate variable analysis: screencasts
- Differential expression across isoforms: differential exon usage; inferring expression of transcripts (screencasts)
library(knitr)
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))
Introduction
RNA-seq is a valuable experiment for quantifying both the types and the amount of RNA molecules in a sample. We've covered the basic idea of the protocol in lectures, but some early references for RNA-seq include Mortazavi (2008) and Marioni (2008).
In this lab, we will focus on comparing the expression levels of genes across different samples, by counting the number of reads which overlap the exons of genes defined by a known annotation. As described in the lecture, this analysis sets aside the task of estimating the different kinds of RNA molecules, and the different isoforms for genes with multiple isoforms. One advantage of looking at these matrices of counts is that we can use statistical distributions to model how the variance of counts will change when the counts are low vs high. We will explore the relationship of the variance of counts to the mean later in this lab.
Counting reads in genes
In this lab we will examine 8 samples from the airway package, which are from the paper by Himes et al: "RNA-seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells".
This lab will focus on a summarized version of an RNA-seq experiment: a count matrix, which has genes along the rows and samples along the columns. The values in the matrix are the number of reads which could be uniquely aligned to the exons of a given gene for a given sample. We will demonstrate how to build a count matrix for a subset of reads from an experiment, and then use a pre-made count matrix, to avoid having students download the multi-gigabyte BAM files containing the aligned reads. A new pipeline for building count matrices, which skips the alignment step, is to use fast pseudoaligners such as Sailfish, Salmon and kallisto, followed by the tximport package. See the package vignette for more details. Here, we will continue with the read counting pipeline.
First, make variables for the different BAM files and GTF file. Use the sample.table
to contruct the BAM file vector, so that the count matrix will be in the same order as the sample.table
.
library(airway)
dir <- system.file("extdata", package="airway", mustWork=TRUE)
csv.file <- file.path(dir, "sample_table.csv")
sample.table <- read.csv(csv.file, row.names=1)
bam.files <- file.path(dir, paste0(sample.table$Run, "_subset.bam"))
gtf.file <- file.path(dir, "Homo_sapiens.GRCh37.75_subset.gtf")
Next we create an Rsamtools variable which wraps our BAM files, and create a transcript database from the GTF file. We can ignore the warning about matchCircularity
. Finally, we make a GRangesList which contains the exons for each gene.
library(Rsamtools)
bam.list <- BamFileList(bam.files)
library(GenomicFeatures)
# for Bioc 3.0 use the commented out line
# txdb <- makeTranscriptDbFromGFF(gtf.file, format="gtf")
txdb <- makeTxDbFromGFF(gtf.file, format="gtf")
exons.by.gene <- exonsBy(txdb, by="gene")
The following code chunk creates a SummarizedExperiment containing the counts for the reads in each BAM file (columns) for each gene in exons.by.gene
(the rows). We add the sample.table
as column data. Remember, we know the order is correct, because the bam.list
was constructed from a column of sample.table
.
library(GenomicAlignments)
se <- summarizeOverlaps(exons.by.gene, bam.list,
mode="Union",
singleEnd=FALSE,
ignore.strand=TRUE,
fragments=TRUE)
colData(se) <- DataFrame(sample.table)
A similar function in the Rsubread library can be used to construct a count matrix:
# not available for Windows - use above method instead
library(Rsubread)
fc <- featureCounts(bam.files, annot.ext=gtf.file,
isGTFAnnotationFile=TRUE,
isPaired=TRUE)
names(fc)
unname(fc$counts) # hide the colnames
Plot the first column from each function against each other (after matching the rows of the featureCounts matrix to the one returned by summarizeOverlaps.
plot(assay(se)[,1],
fc$counts[match(rownames(se),rownames(fc$counts)),1])
abline(0,1)
Visualizing sample-sample distances
We now load the full SummarizedExperiment object, counting reads over all the genes.
library(airway)
data(airway)
airway
colData(airway)
rowRanges(airway)
The counts matrix is stored in assay
of a SummarizedExperiment.
head(assay(airway))
This code chunk is not necessary, but helps to make nicer plots below with large axis labels (mypar(1,2)
can be substituted with par(mfrow=c(1,2))
below).
# library(devtools)
# install_github("ririzarr/rafalib")
library(rafalib)
mypar()
Note that, on the un-transformed scale, the high count genes have high variance. That is, in the following scatter plot, the points start out in a tight cone and then fan out toward the top right. This is a general property of counts generated from sampling processes, that the variance typically increases with the expected value. We will explore different scaling and transformations options below.
plot(assay(airway)[,1:2], cex=.1)
Creating a DESeqDataSet object
We will use the DESeq2
package to normalize the sample for sequencing depth. The DESeqDataSet object is just an extension of the SummarizedExperiment object, with a few changes. The matrix in assay
is now accessed with counts
and the elements of this matrix are required to be non-negative integers (0,1,2,...).
We need to specify an experimental design here, for later use in differential analysis. The design starts with the tilde symbol ~
, which means, model the counts (log2 scale) using the following formula. Following the tilde, the variables are columns of the colData
, and the +
indicates that for differential expression analysis we want to compare levels of dex
while controlling for the cell
differences.
library(DESeq2)
dds <- DESeqDataSet(airway, design= ~ cell + dex)
We can also make a DESeqDataSet from a count matrix and column data.
dds.fc <- DESeqDataSetFromMatrix(assay(se), # or fc$counts from Rsubread
colData=sample.table,
design=~ cell + dex)
Normalization for sequencing depth
The following estimates size factors to account for differences in sequencing depth, and is only necessary to make the log.norm.counts
object below.
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
colSums(counts(dds))
plot(sizeFactors(dds), colSums(counts(dds)))
abline(lm(colSums(counts(dds)) ~ sizeFactors(dds) + 0))
Size factors are calculated by the median ratio of samples to a pseudo-sample (the geometric mean of all samples). In other words, for each sample, we take the exponent of the median of the log ratios in this histogram.
loggeomeans <- rowMeans(log(counts(dds)))
hist(log(counts(dds)[,1]) - loggeomeans,
col="grey", main="", xlab="", breaks=40)
The size factor for the first sample:
exp(median((log(counts(dds)[,1]) - loggeomeans)[is.finite(loggeomeans)]))
sizeFactors(dds)[1]
Make a matrix of log normalized counts (plus a pseudocount):
log.norm.counts <- log2(counts(dds, normalized=TRUE) + 1)
Another way to make this matrix, and keep the sample and gene information is to use the function normTransform
. The same matrix as above is stored in assay(log.norm)
.
log.norm <- normTransform(dds)
Examine the log counts and the log normalized counts (plus a pseudocount).
rs <- rowSums(counts(dds))
mypar(1,2)
boxplot(log2(counts(dds)[rs > 0,] + 1)) # not normalized
boxplot(log.norm.counts[rs > 0,]) # normalized
Make a scatterplot of log normalized counts against each other. Note the fanning out of the points in the lower left corner, for points less than $2^5 = 32$.
plot(log.norm.counts[,1:2], cex=.1)
Stabilizing count variance
Now we will use a more sophisticated transformation, which is similar to the variance stablizing normalization method taught in Week 3 of Course 4: Introduction to Bioconductor. It uses the variance model for count data to shrink together the log-transformed counts for genes with very low counts. For genes with medium and high counts, the rlog
is very close to log2
. For further details, see the section in the DESeq2 paper.
rld <- rlog(dds)
plot(assay(rld)[,1:2], cex=.1)
Another transformation for stabilizing variance in the DESeq2 package is varianceStabilizingTransformation
. These two tranformations are similar, the rlog might perform a bit better when the size factors vary widely, and the varianceStabilizingTransformation is much faster when there are many samples.
vsd <- varianceStabilizingTransformation(dds)
plot(assay(vsd)[,1:2], cex=.1)
We can examine the standard deviation of rows over the mean for the log plus pseudocount and the rlog. Note that the genes with high variance for the log come from the genes with lowest mean. If these genes were included in a distance calculation, the high variance at the low count range might overwhelm the signal at the higher count range.
library(vsn)
meanSdPlot(log.norm.counts, ranks=FALSE)
For the rlog:
meanSdPlot(assay(rld), ranks=FALSE)
For the VST:
meanSdPlot(assay(vsd), ranks=FALSE)
The principal components (PCA) plot is a useful diagnostic for examining relationships between samples:
plotPCA(log.norm, intgroup="dex")
Using the rlog:
plotPCA(rld, intgroup="dex")
Using the VST:
plotPCA(vsd, intgroup="dex")
We can make this plot even nicer using custom code from the ggplot2 library:
library(ggplot2)
(data <- plotPCA(rld, intgroup=c("dex","cell"), returnData=TRUE))
(percentVar <- 100*round(attr(data, "percentVar"),2))
makeLab <- function(x,pc) paste0("PC",pc,": ",x,"% variance")
ggplot(data, aes(PC1,PC2,col=dex,shape=cell)) + geom_point() +
xlab(makeLab(percentVar[1],1)) + ylab(makeLab(percentVar[2],2))
In addition, we can plot a hierarchical clustering based on Euclidean distance matrix:
mypar(1,2)
plot(hclust(dist(t(log.norm.counts))), labels=colData(dds)$dex)
plot(hclust(dist(t(assay(rld)))), labels=colData(rld)$dex)
Differential gene expression
Modeling raw counts with normalization
We will now perform differential gene expression on the counts, to try to find genes in which the differences in expected counts across samples due to the condition of interest rises above the biological and technical variance we observe.
We will use an overdispersed Poisson distribution -- called the negative binomial -- to model the raw counts in the count matrix. The model will include the size factors into account to adjust for sequencing depth. The formula will look like:
$$ K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i ) $$
where $K_{ij}$ is a single raw count in our count table, $s_{ij}$ is a size factor or more generally a normalization factor, $q_{ij}$ is proportional to gene expression (what we want to model with our design variables), and $\alpha_i$ is a dispersion parameter.
Why bother modeling raw counts, rather than dividing out the sequencing depth and working with the normalized counts? In other words, why put the $s_{ij}$ on the right side of the equation above, rather than dividing out on the left side and modeling $K_{ij} / s_{ij}$. The reason is that, with the raw count, we have knowledge about the link between the expected value and its variance. So we prefer the first equation below to the second equation, because with the first equation, we have some additional information about the variance of the quantity on the left hand side.
$$ K_{ij} \sim \text{NB}(\mu_{ij} = s_{ij} q_{ij} ) $$
$$ \frac{K_{ij}}{s_{ij}} \sim \mathcal{L}(\mu_{ij} = q_{ij}) $$
When we sample cDNA fragments from a pool in a sequencing library, we can model the count of cDNA fragments which originated from a given gene with a binomial distribution, with a certain probability of picking a fragment for that gene which relates to factors such as the expression of that gene (the abundance of mRNA in the original population of cells), its length and technical factors in the production of the library. When we have many genes, and the rate for each gene is low, while the total number of fragments is high, we know that the Poisson is a good model for the binomial. And for the binomial and the Poisson, there is an explicit link between on observed count and its expected variance.
Below is an example of what happens when we divide or multiply a raw count. Here we show three distributions which all have the expected value of 100, although they have different variances. The first is a raw count with mean 100, the second and third are raw counts with mean 1000 and 10, which were then scaled by 1/10 and 10, respectively.
mypar(3,1)
n <- 10000
brks <- 0:300
hist(rpois(n,100),main="",xlab="",breaks=brks,col="black")
hist(rpois(n,1000)/10,main="",xlab="",breaks=brks,col="black")
hist(rpois(n,10)*10,main="",xlab="",breaks=brks,col="black")
So, when we scale a raw count, we break the implicit link between the mean and the variance. This is not necessarily a problem, if we have 100s of samples over which to observe within-group variance, however RNA-seq samples can often have only 3 samples per group, in which case, we can get a benefit of information from using raw counts, and incorporating normalization factors on the right side of the equation above.
Counts across biological replicates and over-dispersion
For the negative binomial, the variance parameter is called disperison, and it links the mean value with the expected variance. The reason we see more dispersion than in a Poisson is mostly due to changes in the proportions of genes across biological replicates -- which we would expect due to natural differences in gene expression.
mypar(3,1)
n <- 10000
brks <- 0:400
hist(rpois(n,lambda=100),
main="Poisson / NB, disp=0",xlab="",breaks=brks,col="black")
hist(rnbinom(n,mu=100,size=1/.01),
main="NB, disp = 0.01",xlab="",breaks=brks,col="black")
hist(rnbinom(n,mu=100,size=1/.1),
main="NB, disp = 0.1",xlab="",breaks=brks,col="black")
The square root of the dispersion is the coefficient of variation -- SD/mean -- after subtracting the variance we expect due to Poisson sampling.
disp <- 0.5
mu <- 100
v <- mu + disp * mu^2
sqrt(v)/mu
sqrt(v - mu)/mu
sqrt(disp)
A number of methods for assessing differential gene expression from RNA-seq counts use the negative binomial distribution to make probabilistic statements about the differences seen in an experiment. A few such methods are edgeR, DESeq2, and DSS. Other methods, such as limma+voom find other ways to explicitly model the mean of log counts and the observed variance of log counts. A very incomplete list of statistical methods for RNA-seq differential expression is provided in the footnotes.
DESeq2 performs a similar step to limma as discussed in PH525x Course 3, in using the variance of all the genes to improve the variance estimate for each individual gene. In addition, DESeq2 shrinks the unreliable fold changes from genes with low counts, which will be seen in the resulting MA-plot.
Experimental design and running DESeq2
Remember, we had created the DESeqDataSet object earlier using the following line of code (or alternatively using DESeqDataSetFromMatrix)
dds <- DESeqDataSet(airway, design= ~ cell + dex)
First, we setup the design
of the experiment, so that differences will be considered across time and protocol variables. We can read and if necessary reset the design using the following code.
design(dds)
design(dds) <- ~ cell + dex
The last variable in the design is used by default for building results tables (although arguments to results
can be used to customize the results table), and we make sure the "control" or "untreated" level is the first level, such that log fold changes will be treated over control, and not control over treated.
levels(dds$dex)
dds$dex <- relevel(dds$dex, "untrt")
levels(dds$dex)
The following line runs the DESeq2 model. After this step, we can build a results table, which by default will compare the levels in the last variable in the design, so the dex treatment in our case:
dds <- DESeq(dds)
res <- results(dds)
Examining results tables
head(res)
table(res$padj < 0.1)
A summary of the results can be generated:
summary(res)
For testing at a different threshold, we provide the alpha
to results, so that the mean filtering is optimal for our new FDR threshold.
res2 <- results(dds, alpha=0.05)
table(res2$padj < 0.05)
Visualizing results
The MA-plot provides a global view of the differential genes, with the log2 fold change on the y-axis over the mean of normalized counts:
plotMA(res, ylim=c(-4,4))
We can also test against a different null hypothesis. For example, to test for genes which have fold change more than doubling or less than halving:
res.thr <- results(dds, lfcThreshold=1)
plotMA(res.thr, ylim=c(-4,4))
A p-value histogram:
hist(res$pvalue[res$baseMean > 1],
col="grey", border="white", xlab="", ylab="", main="")
A sorted results table:
resSort <- res[order(res$padj),]
head(resSort)
Examine the counts for the top gene, sorting by p-value:
plotCounts(dds, gene=which.min(res$padj), intgroup="dex")
A more sophisticated plot of counts:
library(ggplot2)
data <- plotCounts(dds, gene=which.min(res$padj), intgroup=c("dex","cell"), returnData=TRUE)
ggplot(data, aes(x=dex, y=count, col=cell)) +
geom_point(position=position_jitter(width=.1,height=0)) +
scale_y_log10()
Connecting by lines shows the differences which are actually being tested by results given that our design includes cell + dex
ggplot(data, aes(x=dex, y=count, col=cell, group=cell)) +
geom_point() + geom_line() + scale_y_log10()
A heatmap of the top genes:
library(pheatmap)
topgenes <- head(rownames(resSort),20)
mat <- assay(rld)[topgenes,]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(dds)[,c("dex","cell")])
pheatmap(mat, annotation_col=df)
Getting alternate annotations
We can then check the annotation of these highly significant genes:
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
anno <- select(org.Hs.eg.db, keys=topgenes,
columns=c("SYMBOL","GENENAME"),
keytype="ENSEMBL")
anno[match(topgenes, anno$ENSEMBL),]
# for Bioconductor >= 3.1, easier to use mapIds() function
Looking up different results tables
The contrast
argument allows users to specify what results table should be built. See the help and examples in ?results
for more details:
results(dds, contrast=c("cell","N61311","N052611"))
Surrogate variable analysis for RNA-seq
If we suppose that we didn't know about the different cell-lines in the experiment, but noticed some structure in the counts, we could use surrograte variable analysis (SVA) to detect this hidden structure (see PH525x Course 3 for details on the algorithm).
library(sva)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
Do the surrogate variables capture the cell difference?
plot(svseq$sv[,1], svseq$sv[,2], col=dds$cell, pch=16)
Using the surrogate variables in a DESeq2 analysis:
dds.sva <- dds
dds.sva$SV1 <- svseq$sv[,1]
dds.sva$SV2 <- svseq$sv[,2]
design(dds.sva) <- ~ SV1 + SV2 + dex
dds.sva <- DESeq(dds.sva)
The DEXSeq package offers differential testing of exon usage within each gene. Here we will explore the R code used in a DEXSeq analysis. We omit the python calls for preparing the annotation and count tables, but these can be found in the vignette at the above link. The python calls are generally along the lines of:
python dexseq_prepare_annotation.py gtffile.gtf dexseq.gff
python dexseq_count.py dexseq.gff sample1.sam sample1.txt
Once we have repeated the dexseq_count
script for each sample, we can read the data into R using the code chunks below. As we are working with pre-prepared data, we first point to these files which live within the pasilla package.
The pasilla package contains counts from an experiment by Brooks et al
We will run DEXSeq on a subset of the genes, for demonstration purposes.
library("pasilla")
inDir = system.file("extdata", package="pasilla", mustWork=TRUE)
countFiles = list.files(inDir, pattern="fb.txt$", full.names=TRUE)
flattenedFile = list.files(inDir, pattern="gff$", full.names=TRUE)
genesForSubset = read.table(file.path(inDir, "geneIDsinsubset.txt"),
stringsAsFactors=FALSE)[1](/abishpius/R-for-Computational-Biology/wiki/1)
As in DESeq2 we use a sampleTable
to define the samples:
sampleTable = data.frame(
row.names = c( "treated1", "treated2", "treated3",
"untreated1", "untreated2", "untreated3", "untreated4" ),
condition = c("knockdown", "knockdown", "knockdown",
"control", "control", "control", "control" ),
libType = c( "single-end", "paired-end", "paired-end",
"single-end", "single-end", "paired-end", "paired-end" ) )
sampleTable
We now read the data into a DEXSeqDataSet
object:
library("DEXSeq")
dxd = DEXSeqDataSetFromHTSeq(
countFiles,
sampleData=sampleTable,
design= ~ sample + exon + condition:exon,
flattenedfile=flattenedFile )
Subset the genes, for demonstration purposes:
dxd = dxd[geneIDs( dxd ) %in% genesForSubset,]
Now we run the estimation and testing functions:
dxd = estimateSizeFactors( dxd )
dxd = estimateDispersions( dxd )
dxd = testForDEU( dxd )
dxd = estimateExonFoldChanges( dxd, fitExpToVar="condition")
The following code extracts a results table, makes an MA-plot, and draws the expression levels over the exons to highlight differential exon usage:
dxr = DEXSeqResults( dxd )
plotMA( dxr, cex=0.8 )
plotDEXSeq( dxr, "FBgn0010909", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
Again, drawing the expression levels, now showing the annotated transcripts below:
plotDEXSeq( dxr, "FBgn0010909", displayTranscripts=TRUE, legend=TRUE,
cex.axis=1.2, cex=1.3, lwd=2 )
For more details on the DEXSeq software, see the vignette and the paper, which is linked from the vignette page:
browseVignettes("DEXSeq")
Here we show the exploratory plots offered by the cummeRbund package. These plots require loading in a directory in which results from a Cufflinks analysis has been run. Follow the vignette in the above link in order in order to perform a Cufflinks gene- and isoform-level analysis. From the vignette:
CummeRbund begins by re-organizing output files of a cuffdiff analysis, and storing these data in a local SQLite database. CummeRbund indexes the data to speed up access to specific feature data (genes, isoforms, TSS, CDS, etc.), and preserves the various relationships between these features.
library(cummeRbund)
myDir <- system.file("extdata", package="cummeRbund")
gtfFile <- system.file("extdata/chr1_snippet.gtf",package="cummeRbund")
Read in the prepared Cufflinks files from the directory:
cuff <- readCufflinks(dir=myDir,gtfFile=gtfFile,genome="hg19",rebuild=TRUE)
Boxplots of expression (FPKM) at the gene and isoform level:
csBoxplot(genes(cuff))
csBoxplot(genes(cuff),replicates=TRUE)
csBoxplot(isoforms(cuff),replicates=TRUE)
Scatterplot matrix of gene and isoform level expression:
csScatterMatrix(genes(cuff))
csScatterMatrix(isoforms(cuff))
Sample dendrograms using Jensen-Shannon distances:
csDendro(genes(cuff),replicates=TRUE)
csDendro(isoforms(cuff),replicates=TRUE)
MA-plot comparing two conditions:
MAplot(genes(cuff),"hESC","Fibroblasts")
MAplot(isoforms(cuff),"hESC","Fibroblasts")
A "volcano plot" matrix. Each volcano plot is the -log10(p-value) over the log fold change.
csVolcanoMatrix(genes(cuff))
csVolcanoMatrix(isoforms(cuff))
For all of these functions, see the help pages in the cummeRbund package for more details, and check the vignette for a sample workflow. The Cufflinks homepage has details about running the pipeline upstream of producing these figures.
browseVignettes("cummeRbund")