EdgeR Multivariate Analysis - ccsstudentmentors/tutorials GitHub Wiki

If you performed your RNA-Seq experiment with more than one treatment group or control group, then you will need to analyze your data slightly differently than we did in the simple EdgeR analysis section. This section describes a method to analyze an RNA-Seq data set with two treatment groups (treatments A and B) and one control group to which they are each compared.

You can either enter the following lines into the R console directly or make an R script:

library(edgeR)
directory = 'C:/RNASeq/ReadCounts/'
setwd(directory)

# load each sample individually
Control1 = read.table('Control1.geneIDCounts.txt',sep='\t',header=FALSE)
Control2 = read.table('Control2.geneIDCounts.txt',sep='\t',header=FALSE)
Control3 = read.table('Control3.geneIDCounts.txt',sep='\t',header=FALSE)
TreatmentA1 = read.table('TreatmentA1.geneIDCounts.txt',sep='\t',header=FALSE)
TreatmentA2 = read.table('TreatmentA2.geneIDCounts.txt',sep='\t',header=FALSE)
TreatmentA3 = read.table('TreatmentA3.geneIDCounts.txt',sep='\t',header=FALSE)
TreatmentB1 = read.table('TreatmentB1.geneIDCounts.txt',sep='\t',header=FALSE)
TreatmentB2 = read.table('TreatmentB2.geneIDCounts.txt',sep='\t',header=FALSE)
TreatmentB3 = read.table('TreatmentB3.geneIDCounts.txt',sep='\t',header=FALSE)

# combine data sets into a matrix
geneCounts = data.frame(Control1[,2],Control2[,2],Control3[,2],TreatmentA1[,2],TreatmentA2[,2],TreatmentA3[,2],TreatmentB1[,2],TreatmentB2[,2],TreatmentB3[,2])
row.names(geneCounts) = Control1[,1]
sizeGeneCounts = dim(geneCounts)
geneCounts = geneCounts[1:(sizeGeneCounts[1]-5),]
condition = c(rep('Control',3),rep('TreatmentA',3),rep('TreatmentB',3))
sampleNames = c('Control1','Control2','Control3','TreatmentA1','TreatmentA2','TreatmentA3','TreatmentB1','TreatmentB2','TreatmentB3')
colnames(geneCounts) = sampleNames

# build the generalized linear model that will be used for differential expression testing
dge <- DGEList(counts=geneCounts, group=condition)
design <- model.matrix(~condition+0, data=dge$samples)
colnames(design) = gsub("condition","",colnames(design))
disp <- estimateGLMCommonDisp(dge, design)
disp <- estimateGLMTrendedDisp(disp, design)
disp <- estimateGLMTagwiseDisp(disp, design)
fit <- glmFit(disp, design)

# tell edgeR what comparisons you want to perform
TreatmentAVsControl = makeContrasts(TreatmentA-Control, levels=design)
TreatmentBVsControl = makeContrasts(TreatmentB-Control, levels=design)

# perform the differential expression testing for those comparisons
lrt.TreatmentAVsControl = glmLRT(fit, contrast=TreatmentAVsControl)
res1 = as.data.frame(lrt.TreatmentAVsControl$table)
lrt.TreatmentBVsControl = glmLRT(fit, contrast=TreatmentBVsControl)
res2 = as.data.frame(lrt.TreatmentBVsControl$table)

# extract the significantly differentially expressed genes
res1Ordered = res1[order(res1$PValue),]
res1Sig = subset(res1Ordered, PValue<0.05)
res2Ordered = res2[order(res2$PValue),]
res2Sig = subset(res2Ordered, PValue<0.05)

# Convert Gene IDs to Gene Names (if applicable)
conversionTable=read.table('C:\RNASeq\GeneIDtoNameConversionTable.txt',sep='\t',header=FALSE,row.names=1)
colnames(conversionTable) = c('GeneName')
nameRes1=merge(conversionTable,res1,by.x=0,by.y=0)
nameRes1Sig=merge(conversionTable,res1Sig,by.x=0,by.y=0)
nameRes2=merge(conversionTable,res2,by.x=0,by.y=0)
nameRes2Sig=merge(conversionTable,res2Sig,by.x=0,by.y=0)

# Print results to file
setwd('C:/RNASeq/EdgeROutput/')
write.table(nameRes1, file='TreatmentAvsControl_DEResults.txt',sep='\t',quote=FALSE)
write.table(nameRes2, file='TreatmentBvsControl_DEResults.txt',sep='\t',quote=FALSE)
setwd('C:/RNASeq/EdgeRDEGenes/')
write.table(nameRes1Sig, file='TreatmentAvsControl_DE_pVal0.05.txt',sep='\t',quote=FALSE)
write.table(nameRes2Sig, file='TreatmentBvsControl_DE_pVal0.05.txt',sep='\t',quote=FALSE)

Line #1 loads EdgeR into memory. After EdgeR has been installed (which we did in the preparing to run EdgeR section), we need to load it each time we start R.
Line #2 creates a variable called 'directory' that we will use several times in this script. The directory specified here should be the one that contains the readCount files
Line #3 moves us to that directory specified in line 2

load each sample individually

Lines 6-14 load each of the example files into memory as data frames.
Each of these data frames will have two columns (called V1 and V2, by default) where the first column is the gene name and the second is the count of reads aligned to that gene in this sample.

combine data sets into a matrix

Lines 17-23 take this loaded data and roll it together into a data frame.
Line #17 takes the second columns (the read counts) of each of the sample data frames and concatenates them together into one big data frame, so each column of this new matrix has the read counts from a single sample.
Line #18 labels the rows of that data frame with the correct gene names.
Line #19 simply creates a new variable that knows how many rows and columns are in this big matrix.
Line #20 removes the last five rows from this matrix of gene counts. Why do we want to do that?
Well, the last five rows contain 'housekeeping' information about the sample such as 'the number of reads that didn't align to a feature', 'the number of reads that aligned to multiple features ambigously', etc.
While this kind of peripheral information could theoretically be useful in constructing an accurate model of the data for each sample, EdgeR doesn't use it (neither does any other program that I know of).

Line #21 creates a list of the sample group (or condition) to which each sample belongs. It is important to make this list in the same order as the order of the columns in the geneCounts matrix.
Line #22 creates a list of the sample names. Again, make sure it is in the same order as the columns in geneCounts.
Line #23 names the columns of the geneCounts matrix according to the names provided in Line 22.

build the generalized linear model that will be used for differential expression testing

Lines 26-32 construct the generalized linear model that will be used for differential expression testing
Line #26 puts the data from the geneCounts matrix into a list-based object used for edgeR input
Line #27 crafts a 'design' matrix that assigns each of our samples to a particular sample group (Experimental or Control, in this case)
Line #28 cleans up the naming of the sample groups in the design matrix (just compensating for a quirk in line 24's execution that I haven't figured out how to get around).
Lines 29-31 use the design matrix and the list-based geneCounts object to estimate the dispersion of the samples Read through the edgeR users manual if you want a thorough explanation of what is happening in those three lines (http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf).
For everyone else, just know that we are trying to compensate for the factors that decrease the fit of the data to the generalized linear model.
Line #32 uses the calculated dispersion estimates to fit our data to a negative binomial generalized linear model

tell edgeR what comparisons you want to perform

Lines 35 and 36 lets us tell edgeR what comparison(s) we want to perform.
Line #35 tells edgeR that we want to perform a comparison of TreatmentA against Control, using Control as the experimental control.
Line #36 tells edgeR that we want to perform a comparison of TreatmentB against Control, using Control as the experimental control.

perform the differential expression testing for those comparisons

Lines 39-42 perform the actual differential expression testing for the comparisons specified in lines 35 and 36
Line #39 does the actual work of testing for differential expression for the comparison of TreatmentA vs Control by using a generalized linear model likelihood ratio test (lrt)
Line #40 puts the results of that LRT into a tabular format as a data frame
Line #41 does the actual work of testing for differential expression for the comparison of TreatmentB vs Control by using a generalized linear model likelihood ratio test (lrt)
Line #42 puts the results of that LRT into a tabular format as a data frame

This results table now tells us the log fold change and false discovery rate adjusted p-value (among other, less important things) of this Experimental vs Control comparison for each gene.
But, we are lazy and don't want to look at all of these genes all the time. Often, we just want to see the ones that are 'significant' in their differential expression. Lines 39 and 40 allow us to decide on our threshold for significance in our differential expression testing and then pull out only the genes that pass that filter.

extract the significantly differentially expressed genes

Line #45 simply sorts the whole list of genes by the value of their false discovery rate adjusted p-value (lowest values first) from the TreatmentA vs Control comparison -- this isn't necessary and can be omitted if you don't want your results ordered this way.
Line #46 filters the results down to only those that have a FDR-adjusted p-value of less than 0.05 (which is what I chose as the threshold for significance for this dataset). You can choose whatever value you want here -- you will just have to justify it one day when you present your results or publish them! The most common values here are 0.05 and 0.1.
Line #47 does the same thing as line 45, but for the TreatmentB vs Control comparison
Line #48 does the same thing as line 46, but for the TreatmentB vs Control comparison

Now that you have the subset of your data that have FDR-adjusted p-values of less than 0.05, we can go ahead and print them to a file.

Convert Gene IDs to Gene Names (if applicable)

If you are using the uselessly enigmatic Ensembl Gene IDs, before we print this data you should add in the gene name data for each of the Gene IDs. Lines 43-46 perform the name conversion and add in that information.
If you used the '-i gene_name' option when running HTSeq-count, then just skip this section and go straight to line 49.
Line #51 loads the conversion table file we made earlier into memory. This file has two columns -- the first is the Gene ID, the second is the corresponding gene name (the official gene symbol)
Line #52 renames the column of data related to the gene names (so that it will be labeled as gene names when it is eventually printed out to file).
Line #53 merges this conversion table with our results table from the TreatmentA vs Control comparison using the gene IDs as the matching key. Because of the order in which we entered these two tables to the 'merge' command, we will get the Gene IDs put as the second column of the now eight column results files.
Line #54 does the same thing as line 53 but only for the significantly differentially expressed genes.
Line #55 does the same thing as line 53 but for the TreatmentB vs Control comparison.
Line #56 does the same thing as line 55 but only for the significantly differentially expressed genes.

Print results to file

Line #59 moves us to the directory called 'EdgeROutput'
Line #60 prints a tab separated text file in that directory containing the information produced by EdgeR about the TreatmentA vs Control comparison for every gene
Line #61 does the same thing as line 60 but for the TreatmentB vs Control comparison.
Line #62 moves us to the directory called 'EdgeRDEGenes'
Line #63 prints a text file similar to the one made in line 60, except that it is filtered to only include the genes that passed our p-value cutoff that we applied in line 46.
Line #64 does the same thing as line 63 but for the TreatmentB vs Control comparison.

Congratulations! These files are now your RNA-Seq results and you may use them to see how well your RNA-Seq experiment results matched your hypothesis!