Step 6: Differential Expression Analysis - srkoppolu/SK_RNA-Seq GitHub Wiki

Once the raw counts per genomic/transcriptomic features are computed, the next likely step is to do a differential expression analysis which fits the raw counts to the negative binomial (NB) model and performs rigorous statistical testing to identify the differentially expressed features. Essentially, we want to determine whether the mean expression levels of each gene among the different sample groups are significantly different. The following figure illustrates this well [source]:

Different methods have been used for differential expression analysis such as edgeR and DESeq that are based on NB distributions or baySeq and EBSeq which are Bayesian approaches based on a negative binomial model. It is important to consider the experimental design when choosing an analysis methods. While some of the differential expression tools can only perform pair-wise comparisons, others such as edgeR, limma-voom, DESeq and maSigPro can perform multiple comparisons [source].

DESeq2

DESeq2 builds on good ideas for dispersion estimation and use of Generalized Linear Models (GLM) from the DSS and edgeR methods. Differential expression analysis with DESeq2 involves multiple steps. Briefly, DESeq2 will model the raw counts, using normalization factors (size factors) to account for differences in library depth. Then, it will estimate the gene-wise dispersions and shrink these estimates to generate more accurate estimates of dispersion to model the counts. Finally, DESeq2 will fit the negative binomial model and perform hypothesis testing using the Wald test or Likelihood Ratio Test.

These steps are listed below:

Estimate size factors
Estimate gene-wise dispersions
Fit curve to gene-wise dispersion estimates
Shrink gene-wise dispersion estimates
GLM fir for each gene.

Note: Before performing the DE analysis, it is important to understand the factors affecting the variation in the data. These sources of variation can be identified during the QC process or through prior knowledge. These factors need to either removed or controlled during satistical modeling as part of the design formula.

Design formula:
A design formula is an important method to define the known sources of variation to control or to define the factors of interest during differential expression testing. It is important that the design formula contain all factors in the metadata that account for major sources of variation. The last factor should be the condition of interest.

For example, if the metadata contains 4 factors (viz., factor_1, factor_2, factor_3 and condition) and you wish to examine the expression differences between conditions, but know that factor_1 and factor_2 include major sources of variation, then a simple design formula would be: design <- ~ factor_1 + factor_2 + condition

If the interactions (or 'the difference of differences') between various factors also need to be explored, then it needs to be specified in the design formula as well. It is suggested in the DESeq2 vignette to create a new factor variable in the metadata to account for the interaction factors. However, it is important to identify any confounding factors and remove them accordingly. For example, if the interaction between factor_1 and condition is to be included as well, it is better to create a new factor variable factor_1_condition that encapsulated both factors, but remove them as independent factors in the design. The new design formula would be: design <- ~ factor_2 + factor_1_condition

Once a design formula is established, it is important to normalize the counts to account for the unintended factors. This ensures that the expression levels are more comparable between and/or within the conditions of interest.

The main factors often considered during normalization are [source]:

Sequencing depth: Variation in the sequencing depth often result in the variations across the expression levels. However, this is only an unintended consequence of the sequence depth and needs to be corrected through normalization. The following illustration describes the effect of sequencing depth.

Gene length: Similar to the sequencing depth, the gene length can also influence the counts mapped to that gene as illustrated below. It is important to account for this unintended factor before DE analysis as well.

RNA composition: Before applying any normalization method, it is important to account for the RNA composition because a few highly differentially expressed genes, differences in the number of genes expressed between samples, or the presence of any contamination may skew some types of normalization methods. Accounting for RNA composition is recommended for accurate comparison of expression between samples, and is particularly important when performing differential expression analyses [ref]. The following illustration describes this effect.

The table below describes some of the common normalization methods to account for such differences in the sequencing data:

Normalization method	Description	Accounted factors	Recommendations for use
CPM (counts per million)	counts scaled by total number of reads	sequencing depth	gene count comparisons between replicates of the same samplegroup; NOT for within sample comparisons or DE analysis
TPM (transcripts per kilobase million)	counts per length of transcript (kb) per million reads mapped	sequencing depth and gene length	gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis
RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped)	similar to TPM	sequencing depth and gene length	gene count comparisons between genes within a sample; NOT for between sample comparisons or DE analysis
DESeq2’s median of ratios	counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene	sequencing depth and RNA composition	gene count comparisons between samples and for DE analysis; NOT for within sample comparisons
EdgeR’s trimmed mean of M values (TMM)	uses a weighted trimmed mean of the log expression ratios between samples	sequencing depth, RNA composition, and gene length	gene count comparisons between and within samples and for DE analysis

.
To run DESeq2 to perform the differential expression analysis,

# Create a `DESeqDataSet` object using the count matrix and the metadata table, and specify the design formula
dds <- DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ condition)

# Normalizing the counts data using DESeq2's median of ratios method can be done using estimateSizeFactors() function, however it automatically performed by the DESeq() function. 

# Run the differential expression analysis using a single call to the function DESeq(): use the same variable name to fill in the slots of the DESeqDataSet object.
dds <- DESeq(dds)

# Store the results in a different variable
res <- results(dds)

# Order results by padj value (most significant to least) after thresholding
res <- subset(res, padj < 0.05)
res <- res[order(res$padj),]

The DESeq() function performs all the steps from normalization to linear modeling.

Some further steps to save the data and clean for outliers (adapted from source):

# save data results and normalized reads to csv
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds,normalized =TRUE)), by = 'row.names', sort = FALSE)
names(resdata)[1] <- 'gene'
write.csv(resdata, file = "Sample_DESEq2-results-with-normalized.csv")

# send normalized counts to tab delimited file for GSEA, etc.
write.table(as.data.frame(counts(dds),normalized=T), file = "Sample_DESEq2_normalized_counts.txt")

# produce DataFrame of results of statistical tests
mcols(res, use.names = T)
write.csv(as.data.frame(mcols(res, use.name = T)),file = "Sample_DESEq2-test-conditions.csv"))

# replacing outlier value with estimated value as predicted by distrubution using
# "trimmed mean" approach. recommended if you have several replicates per treatment
# DESeq2 will automatically do this if you have 7 or more replicates

ddsClean <- replaceOutliersWithTrimmedMean(dds)
ddsClean <- DESeq(ddsClean)
tab <- table(initial = results(dds)$padj < 0.05,
             cleaned = results(ddsClean)$padj < 0.05)
addmargins(tab)
write.csv(as.data.frame(tab),file = "Sample_DESEq2-replaceoutliers.csv"))
resClean <- results(ddsClean)
resClean = subset(res, padj<0.05)
resClean <- resClean[order(resClean$padj),]
write.csv(as.data.frame(resClean),file = "Sample_DESEq2-replaceoutliers-results.csv"))