DESeq2 Simple Analysis Gene Name - ccsstudentmentors/tutorials GitHub Wiki
If you counted reads over gene_name instead of gene_id, then your DESeq2 analysis will be a little different than what we discussed in the section on Simple DESeq2 Analysis.
The code for this situation will look like this:
library(DESeq2)
directory = 'C:/RNASeq/ReadCounts/'
setwd(directory)
# load each sample individually
Control1 = read.table('Control1.geneNameCounts.txt',sep='\t',header=FALSE)
Control2 = read.table('Control2.geneNameCounts.txt',sep='\t',header=FALSE)
Control3 = read.table('Control3.geneNameCounts.txt',sep='\t',header=FALSE)
Experimental1 = read.table('Experimental1.geneNameCounts.txt',sep='\t',header=FALSE)
Experimental2 = read.table('Experimental2.geneNameCounts.txt',sep='\t',header=FALSE)
Experimental3 = read.table('Experimental3.geneNameCounts.txt',sep='\t',header=FALSE)
# combine data sets into a matrix
geneCounts = data.frame(Control1[,2],Control2[,2],Control3[,2],Experimental1[,2],Experimental2[,2],Experimental3[,2])
row.names(geneCounts) = Control1[,1]
sizeGeneCounts = dim(geneCounts)
geneCounts = geneCounts[1:(sizeGeneCounts[1]-5),]
condition = c(rep('Control',3),rep('Experimental',3))
sampleNames = c('Control1','Control2','Control3','Experimental1','Experimental2','Experimental3')
colnames(geneCounts) = sampleNames
colData = data.frame(condition,row.names=sampleNames)
dds = DESeqDataSetFromMatrix(countData = geneCounts, colData=colData, design = ~ condition)
# Differential expression analysis
dds = DESeq(dds)
dds$condition = relevel(dds$condition, 'Control')
res = results(dds)
# Extract the significantly differentially expressed genes
resOrdered = res[order(res$padj),]
resSig = subset(resOrdered, padj<0.05)
# Print results to file
setwd('C:/RNASeq/DESeq2Output/')
write.table(res, file='ExperimentalvsControl_DEResults.txt',sep='\t',quote=FALSE)
setwd('C:/RNASeq/DESeq2DEGenes/')
write.table(resSig, file='ExperimentalvsControl_DE_pAdj0.05.txt',sep='\t',quote=FALSE)
Line #1 loads DESeq2 into memory. After DESeq2 has been installed (which we did here), we need to load it each time we start R.
Line #2 creates a variable called 'directory' that we will use several times in this script. The directory specified here should be the one that contains the readCount
files.
Line #3 moves us to that directory specified in line 2.
load each sample individually
Lines 6-11 load each of the example files into memory as data frames.
Each of these data frames will have two columns (called V1 and V2, by default) where the first column is the gene name and the second is the count of reads aligned to that gene in this sample.
combine data sets into a matrix
Lines 14-22 take this loaded data and roll it together into a data object made for DESeq analysis.
Line #14 takes the second columns (the read counts) of each of the sample data frames and concatenates them together into one big data frame, so each column of this new matrix has the read counts from a single sample.
Line #15 labels the rows of that data frame with the correct gene names.
Line #16 simply creates a new variable that knows how many rows and columns are in this big matrix.
Line #17 removes the last five rows from this matrix of gene counts. Why do we want to do that?
Well, the last five rows contain 'housekeeping' information about the sample such as 'the number of reads that didn't align to a feature', 'the number of reads that aligned to multiple features ambigously', etc.
While this kind of peripheral information could theoretically be useful in constructing an accurate model of the data for each sample, DESeq doesn't use it (neither does any other program that I know of).
In the other example scripts, when we loaded the data in using geneIDs, we were able to use DESeq functions built specifically for loading data from HTSeq which automatically removed this last five lines from the resulting data object.
Line #18 creates a list of the sample group (or condition) to which each sample belongs. It is important to make this list in the same order as the order of the columns in the geneCounts matrix.
Line #19 creates a list of the sample names. Again, make sure it is in the same order as the columns in geneCounts.
Line #20 names the columns of the geneCounts matrix according to the names provided in Line 19.
Line #21 rolls together the information about which samples belong to which sample groups and puts that information into a data frame.
Line #22 puts all of the information about the gene counts, sample names, and sample groups into a special data object made to be used as input for DESeq.
Differential expression analysis
Lines 25-27 perform the differential expression testing.
Line #25 takes the 'dds' object created in line 22 and performs the actual differential expression testing on its data. It returns an updated version of the same object, dds
Line #26 ensures that the results of the differential expression testing will be reported with the 'control' condition being treated as the control group.
An example seems like it would be helpful here. Imagine that you have a gene whose expression is twice as high in the experimental condition than in the control condition. We would want the result of the comparison of this expression to be a Log(2) Fold Change of 1, not -1. By asserting here in Line 26 that the condition called 'Control' is the control group, we ensure that our data comes out in this way which makes intuitive sense when we interpret our results later.
Line #27 just pulls the results from the differential testing into a table of results.
This results table now tells us the log fold change and false discovery rate adjusted p-value (among other, less important things) of this Experimental vs Control comparison for each gene.
But, we are lazy and don't want to look at all of these genes all the time. Often, we just want to see the ones that are 'significant' in their differential expression. Lines 30 and 31 allow us to decide on our threshold for significance in our differential expression testing and then pull out only the genes that pass that filter.
Extract the significantly differentially expressed genes
Line #30 simply sorts the whole list of genes by the value of their false discovery rate adjusted p-value (lowest values first) -- this isn't necessary and can be omitted if you don't want your results ordered this way.
Line #31 filters the results down to only those that have a FDR-adjusted p-value of less than 0.05 (which is what I chose as the threshold for significance for this dataset). You can choose whatever value you want here -- you will just have to justify it one day when you present your results or publish them! The most common values here are 0.05 and 0.1.
Print results to file
Now that you have the subset of your data that have FDR-adjusted p-values of less than 0.05, we can go ahead and print them to a file.
Line #34 moves us to the directory called 'DESeq2Output'
Line #35 prints a tab separated text file in that directory containing the information produced by DESeq2 about this comparison for every gene
Line #36 moves us to the directory called 'DESeq2DEGenes'
Line #37 prints a text file similar to the one made in line 35, except that it is filtered to only include the genes that passed our p-value cutoff that we applied in Line 31.
Congratulations! These files are now your RNA-Seq results and you may use them to see how well your RNA-Seq experiment results matched your hypothesis!