Lab: Identify Differentially Expressed Genes - statonlab/EPP_575_RNASeq_Workshop GitHub Wiki

Copy Counts Files and Set Up RStudio

As mentioned in the slides, there are multiple uses for DESeq2, but this section will focus on differential expression. More information can be found through DESeq2's online manual:

Analyzing RNA-seq data with DESeq2

First, we will need to copy the results of HTSeq to your computer. The counts files for all samples will be needed: copy the files with this command:

scp <your_username>@sphinx.ag.utk.edu:/pickett_shared/teaching/EPP575_Jan2022/final_outputs/4_featureCount/*.counts.txt .

Make sure the target directory is one you know, as you will need to navigate to the directory to load the files.

From this point on we will be using R; we prefer to use RStudio.

RStudio link.

First, set up the working directory. This should be identical to where you ran scp.

setwd("~/Downloads/EPP575_geneCounts/")

Install DESeq2

Since you likely will not have DESeq2 installed, run the following command:

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("DESeq2")
# Load DESeq2 package
library(DESeq2)

From now on you can load DESeq2 with the command library(DESeq2).

Build Sample Table

Our goal is to build a sample table that looks like this:

   sampleName               fileName     mut rep
1 SRR17062759 SRR17062759.counts.txt    max2   1
2 SRR17062760 SRR17062760.counts.txt    max2   2
3 SRR17062762 SRR17062762.counts.txt control   1
4 SRR17062763 SRR17062763.counts.txt control   2

The sample table should always contain the sample names, the file names, and columns for each factor in the experimental design. The factor columns are important for developing the generalized linear model.

Our strategy is going to be to build a vector of elements for each column, then bind them together into a data frame. (You can think of a vector as a simple list).

First, let's get the filenames as a character vector. We can read them in from the folder.

fileName <- list.files(path = ".", pattern = "*.counts.txt")
fileName

Next, we need to get the sample names associated with each file. It is important that these match the file names of the previous list we created. The command I am running will split the file names by ., then take the first string from the split; this will be the sample name, and it should match the order of the fileName object.

sampleName <- sapply(strsplit(fileName, "[.]"), "[[", 1)
sampleName

Finally, we need to create the factors. This will be done for the two factors I have chosen for this experiment: Mutation (mut) and Replicate (rep). As with the file names, it is important that these match the samples.

mut <- factor(c("max2", "max2", "control", "control"))
mut
rep <- factor(c(1, 2, 1, 2))
rep

Now we are able to build the sample table by binding the vectors into a single table.

MaxSampleTable <- data.frame(sampleName, fileName, mut, rep)
MaxSampleTable

DESeq2 Analysis - Mutant vs Control

DESeq comes with a method to use HTSeqCount output data. This one method reads in all the important information - the experimental design from our data frame, the counts from the files, and your GLM model - and creates a new large data object.

DESeqData_Max <- DESeqDataSetFromHTSeqCount(sampleTable = MaxSampleTable, directory = ".", design = ~ rep + mut)

Now we can run the DESeq function on the DESeq Data Set:

DESeqData_Max.Res <- DESeq(DESeqData_Max)

We can now look at how the data clusters in a Principal Component cluster. What information about this data can you gain from this type of chart?

rld <- rlog(DESeqData_Max, blind = FALSE)
plotPCA(rld, intgroup = c("mut"))

Now we can look at specific, differentially expressed genes. First, we will need to get the results of DESeq2.

MaxRes <- results(DESeqData_Max.Res, alpha = 0.05, contrast = c("mut", "max2", "control"))
summary(MaxRes)

Next, let's look at the MA plot.

plotMA(MaxRes, ylim=c(-6,6))

During the Results step, we set the alpha parameter to 0.05. We can isolate the genes of significance by subsetting the list for only those genes with an adjusted p-value of 0.05 or less. Why use the adjusted p-value over the regular p-value?

MaxRes_sig <- as.data.frame(MaxRes[ which(MaxRes$padj < 0.05),])
MaxRes_sig

Run the next command to sort the remaining genes by adjusted p-value.

MaxRes_sig <- MaxRes_sig[order(MaxRes_sig$padj),]

Use the head command to see, by default, the 6 most differentially expressed genes based on the max2 mutant. You can change the number with the n=<number> option:

head(MaxRes_sig)
head(MaxRes_sig, n=10)

Challenge 1 - Filtering

Are there any genes we would not be interested in including in this analysis? What methods do we have to exclude the genes we are not interested in?

Challenge 2 - 2nd Factor

Rerun this code, but this time look at how the replicate number affects gene expression, if at all. How do you go about this?

Assignment

Print the 12 genes with the most significant differential expression driven by the mutant gene, and send these genes to [email protected] and [email protected].

Example code

An example R script can be found here.

⚠️ **GitHub.com Fallback** ⚠️