NGS V: RNAseq - bcfgothenburg/HT24 GitHub Wiki

Course: HT24 Analysis of next generation sequencing data (SC00024)


The purpose of this exercise for you to become familiar with the workflow used to analyze RNAseq expression sequencing data.



The Data

The dataset is a treatment comparison in parathyroid tumors and it is publicly available from the article by Felix Haglund et al., “Evidence of a Functional Estrogen Receptor in Parathyroid Adenomas”, J Clin Endocrin Metab, 2012.

The purpose of the experiment was to investigate the role of the estrogen receptor in parathyroid tumors. The investigators derived primary cultures of parathyroid adenoma cells from 4 patients. These primary cultures were treated with diarylpropionitrile (DPN), an estrogen receptor β agonist, or with 4-hydroxytamoxifen (OHT). RNA was extracted at 24 hours and 48 hours from cultures under treatment and control. We will only focus on one time point (48 hrs).

Treatment sample (P1) sample (P2) sample (P3) sample (P4)
Control SRS308866 SRS308872 SRS308878 SRS308883
DPN SRS308868 SRS308874 SRS308880 SRS308885
OHT SRS308870 SRS308876 SRS308882 SRS308887

The server

  • Connect to the cluster:
ssh -Y your_account@remote_server
  • Load the following modules
blast+/2.14
fastqc/0.12.1
igv/2.16.1
multiqc/1.14
rseqc/5.0.1
R/4.3.1
samtools/1.17
subread/2.0.6

Alignment QC

As you have already performed the quality filtering and alignment steps in previous exercises, we are taking on the analysis from the alignment files. So, just to refresh our minds:

Q. What are the typical steps you perform to obtain the alignment file?

Note: Due to time limit, we will be working with a subset of the data, that corresponds to chr21. Besides, we will be practicing using the for loop to automate tasks since we are dealing with 12 samples and it would be a lot of typing if you do one sample at the time!

  • Create an RNAseq folder in your home directory
    You can directly create a soft link to the actual folders, but let's practice the for loop

  • Within the directory, create the folders db, Alignment and Counts.

  • Create links to:

  1. Alignment files from /home/courses/NGS/RNAseq/Alignment_chr21/ to your Alignment directory. You can use a for loop for this:
for i in /home/courses/NGS/RNAseq/Alignment_chr21/*; do ln -s $i Alignment/; done
  1. Database files from /home/courses/NGS/RNAseq/db to your db directory. Use a for loop as in the example above.

Alignment QC with samtools

Samtools is a great tool to gather the alignment statistics.

  • Go into your Alignment directory and index your BAM files:

    NOTE: Remember that we can use a loop to run a command on multiple samples. See example on how to do it below

for i in *bam; do samtools index $i; done
  • Create statistics files using flagstat and idxstats for each sample:
for i in  *.bam; do samtools flagstat $i > ${i%.bam}.flagstat; done

NOTE: ${i%.bam}.flagstat removes the suffix .bam from the output name and replaces it with the suffix ".flagstat"

  • Run fastqc on each sample
  • Merge the results with multiqc

Have a look at the multiqc HTML report.

Q. Which sample has the largest amount of total reads? How many reads?

Q. What is the overall sequencing quality?

The header of the alignment file contains information of used chromosomes, as well as command lines used to generate the BAM file among others. This can be handy if you don't know how the alignment was created.

  • Use samtools view with the -H flag to display the header
samtools view -H YOUR_BAM_FILE

Q. Which aligner was used to map the reads, which tool version and what organism are we using?

Alignment QC using RSeQC

RSeQC is a program consisting of multiple python scripts that can give you information about the quality of your data.

The script infer_experiment.py from RSeQC provides information about what kind of library protocol has been used (paired end, single end, stranded, non stranded).

Before running RSeQC, let's prepare the reference gene model file (-r):

  • hg38_RefSeq.bed has the annotation for all genes, but since we are just focusing on chr21, it will be quicker to have a subset of the bed file.

    • Create a new file with only genes belonging to chr21 (use grep or awk)
    • Save it as hg38_RefSeq_chr21.bed under your db directory
  • Run infer_experiment.py. Try to use a for loop to automate your analysis:

infer_experiment.py -r LOCATION_OF_YOUR_FILE/hg38_RefSeq_chr21.bed -i YOUR_BAM_FILE > YOUR_BAM_FILE.infer_experiment

Q. What RNA-seq experiment design did you infer? You can check the documentation if you need help to interpret the results

Sometimes you can have problems with the RNA being fragmented (e.g degraded). You can detect this by having more coverage over the one end of the gene compared to the other. You can visualize this by using geneBody_coverage.py:

geneBody_coverage.py -i BAM_FILE1,BAM_FILE2,BAM_FILE3,... -r LOCATION_OF_YOUR_FILE/hg38_RefSeq_chr21.bed -o geneBodyCoverage 

The fragmentation analysis is really slow (it will take around 30 mins to complete on our small dataset), so queue the job using qstat script_name. Revisit NGS-III:-Computer-clusters if needed.

Q. Do you think we have a problem with fragmentation? Have a look at the curves.pdf plot

Gene counts

The next step is to count the amount of reads aligned towards the genes in the reference genome so we can assess gene expression. Let's extract the gene counts for your alignment files of chr21.

There are multiple tools to extract the gene counts, here we will be using featureCounts, one program from the Subread package which comprises a suite of software programs for processing next-gen sequencing read data.

  • Run featureCounts on your bam files.
    • Use Homo_sapiens.GRCh38.109.chr21.gtf as the reference gtf file, which contains the feature (exon/genes) coordinates in the reference genome.
    • Don't forget to set the flags for paired-end data and for unstranded data
    • Save the results in your Counts directory
featureCounts -p --countReadPairs -s FLAG_FOR_UNSTRANDED_DATA -t exon -a LOCATION_OF_YOUR_FILE/Homo_sapiens.GRCh38.109.chr21.gtf -o COUNTS_DIRECTORY/21.counts YOUR_BAM_FILES_SEPARATED_BY_SPACE
  • Familiarize yourself with the 21.counts and 21.counts.summary.

Q. Which Sample had the highest amount of assigned reads? How could the unassiged ambiguity reads have been lowered (linked to the experiment design). How many genes have no expression across all samples?

Paired DE analysis (DESeq2)

DESeq2, one of the statistical packages to perform DE analysis, needs a count matrix of the raw counts as input. A count matrix is a table where the columns represents the samples and the rows represents the genes. Create such a matrix from the count file from the featureCounts step.

Q. What are the dimensions of your matrix?

Now you will use DESeq2, an R package that estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution. In this exercise you will try some different design options.

You have created a count matrix that contains only chr21. For this part of the analysis you will use the entire dataset.

  • Copy the files all_counts.txt and sample_description.txt from /home/courses/NGS/RNAseq/forDE/ to your local computer. Remember you can use scp <origin> <target>.

  • Start R in your local computer.

  • Select your working directory (where you have the all_counts.txt file) by clicking:

    Session -> Set Working Directory -> Choose directory

  • Load the DESeq2 package

  • Read the count data (all_counts.txt) and information about the samples (sample_description.txt)

# loading library

library(DESeq2)
# Not installed? Run:
# install.packages("ggplot2")
library(ggplot2)
# Not installed? Run: 
# if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
#BiocManager::install("DESeq2")

# reading data
counts <- read.delim("all_counts.txt", row.names=1)
# look at the data
head(counts)

# read metadata
samples <-read.delim("sample_description.txt")

# check what information you have about the samples
samples

# make sure patient number is not numeric
samples$patient <-as.factor(samples$patient)

### Single factor design  

* Run ``DESeqDataSetFromMatrix()`` using ``design = ~treatment``, so we only take the treatment into account for comparing the samples 

ds <- DESeqDataSetFromMatrix(countData = counts, 
                             colData   = samples, 
                             design    = ~treatment)
# check data
data.frame(colData(ds))                                  

# run the DE analysis
dst <- DESeq(ds)                                         
dst
  • Transform the data using vst() (variance stabilizing transformation). This will make the data more suitable for clustering and visualization
  • Plot the results of the Principal Component Analysis,using plotPCA()
# blind means the dispersion will not be recalculated
vsd <- vst(dst, blind=FALSE)
# Plotting the PCA
plotPCA(vsd, intgroup=c("patient", "treatment"))

Q. What does the PCA plot tell you? What type of data cluster together?

You can customize the PCA plot, to better visualize the different samples:

# obtaining the PCA values
pcaData <- plotPCA(vsd,
                   intgroup   = c("treatment", "patient"),
                   returnData = TRUE)

# Calculating the variance
percentVar <- round(100 * attr(pcaData, "percentVar"))

# Plotting the PCA
ggplot(pcaData, aes(PC1, PC2, color=treatment, shape=patient)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) +
    coord_fixed()

Another way to visualize the clustering of the samples is by applying the dist() function to the transpose of the transformed count matrix to get sample-to-sample distances, and then provide a hierarchical clustering (hc) to the heatmap function based on the sample distances:

# Calculating per sample not per gene
sampleDists <- dist(t(assay(vsd)))

suppressMessages(library("RColorBrewer"))
# Not installed? Run:
#install.packages("RColorBrewer")

# Calculating distances
sampleDistMatrix <- as.matrix(sampleDists)

# Prettifying sample names
rownames(sampleDistMatrix) <- paste(dst$treatment, dst$patient, sep="-")
colnames(sampleDistMatrix) <- rownames(sampleDistMatrix)

# SELECT ONE PALETTE: Reds, Blues, Greens, Purples, Oranges or Greys
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)

suppressMessages(library(pheatmap))
# Not installed? Run:
# install.packages("pheatmap")

# printing heatmap
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

Q. Is the hierarchical clustering in accordance with the PCA?

  • Have a look at the results from the DE analysis:
# results generated during the DE analysis
resultsNames(dst)

# extracting DPN_vs_Control
res_DPN <-results(dst, name="treatment_DPN_vs_Control")

# look at 10 most associated for DPN treatment
head(res_DPN[order(res_DPN$padj),], 10)

Q. Do you find any significantly (padj < 0.05) differentially expressed genes?

Another way to visualize the amount of significant genes is by plotting a histogram of the adjusted pvalues:

# Creating histogram of padj
hist(res_DPN$padj, 
     breaks = 100, 
     col    = "skyblue", 
     border = "slateblue", 
     main   = "res_DPN", 
     xlab   = "padj", 
     xlim   = c(0,1))

res_DPN$padj

Do the same visualization, but now extracting OHT_vs_Control. Hint: You can just replace each DPN in the code above with OHT

res_OHT <-results(dst, name="treatment_OHT_vs_Control")
head(res_OHT[order(res_OHT$padj),], 10)

hist(res_OHT$padj, 
     breaks = 100, 
     col    = "skyblue", 
     border = "slateblue", 
     main   = "res_OHT", 
     xlab   = "padj", 
     xlim   = c(0,1))

Q. Do you find any significantly (padj < 0.05) differentially expressed genes in res_DPN and res_OHT?

Multifactor design

Now we will try to do a paired analysis instead, since we have information for the same patient for different treatments.

# setting the design
dsp <- DESeqDataSetFromMatrix(countData = counts, 
                              colData   = samples, 
                              design    = ~patient + treatment)
# running the analysis
dsp <- DESeq(dsp)

# Comparisons automatically generated
resultsNames(dsp)

Q. Which comparisons are available through this design?

Let's extract DPN_vs_Control:

# extracting results
res_DPN_paired <-results(dsp, name="treatment_DPN_vs_Control")  

# selecting with 0.01 as threshold
res_DPN_paired_sig <- res_DPN_paired[ which(res_DPN_paired$padj < 0.01), ]

# checking data
head(res_DPN_paired[order(res_DPN_paired$padj),], 10)

dim(res_DPN_paired[which(res_DPN_paired$padj < 0.05), ])

Q. Do you find any significant DEG now when using this paired design instead?

Q. Generate a histogram of the adjusted values for this paired design and compare it to the one from the previous design. Why are they different?

  • Check the statistically significant genes:
# selecting with 0.01 as threshold
res_DPN_paired_sig <- res_DPN_paired[ which(res_DPN_paired$padj < 0.01), ]

# checking data
res_DPN_paired_sig

Q. How many genes have an adjusted p-value < 0.01?

Follow the same procedure, and extract DPN_vs_Control (you can just replace each DPN in the code above with OHT)

res_OHT_paired <-results(dsp, name="treatment_OHT_vs_Control")  

# selecting with 0.01 as threshold
res_OHT_paired_sig <- res_OHT_paired[ which(res_OHT_paired$padj < 0.01), ]

# checking data
res_OHT_paired_sig

Q. How many genes have a adjusted p-value < 0.01?

  • Investigate if there are common genes:
intersect(rownames(res_DPN_paired_sig),rownames(res_OHT_paired_sig))
  • And make a graphical representation:
# Not installed? 
# install.packages("ggvenn") # install via CRAN

library("ggvenn")

# gene names
ggvenn(list(DPN=rownames(res_DPN_paired_sig),
            OHT=rownames(res_OHT_paired_sig)),
       fill_color = c("red", "blue"))

Q. How many genes are statistically significant and are shared among the treatments? Briefly describe the function of these genes.

Not all pairwise comparisons are rendered in the default analysis. Therefore, if you are interested in one comparison that hasn't been calculated, you just need to use the contrast argument. For instance, to check differences between the two treatments DPN and OHT :

# calculating extra comparison
res_DPN_OHT_paired<-results(dsp,contrast=c("treatment","DPN","OHT"))

# checking data
head(res_DPN_OHT_paired[order(res_DPN_OHT_paired$padj),], 10)

Checking the statistically significant genes:

# filtering based on 0.01 
res_DPN_OHT_paired_sig <- res_DPN_OHT_paired[ which(res_DPN_OHT_paired$padj < 0.01), ]

# checking data
res_DPN_OHT_paired_sig

Q. How many genes have a adjusted p-value < 0.01 between treatments?

  • Generate an MA plot to visualize the log2 fold changes vs the mean normalized counts of all genes (dots in the plot), where a red dot shows statistically significant genes:
plotMA(res_DPN_OHT_paired, 
       main="res_DPN_OHT_paired",
       cex=0.8, 
       ylim=c(-3,3))

Q. What does the triangles in the graph mean?

  • Modify the ylim parameter so you include all the values. Hint: one way is to find the min/max values of the foldchanges: summary(res_DPN_OHT_paired$log2FoldChange)
  • Make a list ordered by padj and then select the top 50 genes:
top=50

# ordering based on padj
res_DPN_OHT_paired_order <- res_DPN_OHT_paired[order(res_DPN_OHT_paired$padj),]

# selecting the top 50
res_DPN_OHT_paired_top <- res_DPN_OHT_paired_order[1:top,]

#checking data
res_DPN_OHT_paired_top
  • Generate a heatmap of the transformed normalized counts:
# transforming data 
vsd <- vst(dsp, blind = FALSE)

# selecting data
res_DPN_OHT_paired_top2heatmap <- assay(vsd)[rownames(assay(vsd))%in%rownames(res_DPN_OHT_paired_top),]

# normalizing to the mean value
res_DPN_OHT_paired_top2heatmap_Mean <- res_DPN_OHT_paired_top2heatmap - rowMeans(res_DPN_OHT_paired_top2heatmap)

library(pheatmap)

# printing heatmap
pheatmap(res_DPN_OHT_paired_top2heatmap_Mean, 
         main         = "res_DPN_OHT_paired top 50 genes (padj)",
         cluster_cols = FALSE)
  • Add some annotation for an easier interpretation:
# Defining the annotation
annotation_col <- data.frame(
  Treatment = rep(c("Ctrl", "DPN", "OHT"),4),
  Patient = rep(c("A1", "A2", "A3","A4"),3)
)

# Defining the colors by group for plotting 
annotation_colors <- list(
  Treatment = c(Ctrl= "red", 
                DPN = "blue", 
                OHT = "green"),
  Patient = c(A1 = "yellow", 
              A2 = "orange", 
              A3 = "black", 
              A4 = "grey")
)

rownames(annotation_col) = samples$sample

pheatmap(res_DPN_OHT_paired_top2heatmap_Mean,
         main              = "res_DPN_OHT_paired top 50 genes (padj)",
         annotation_col    = annotation_col,
         annotation_colors = annotation_colors, 
         cluster_cols      = TRUE)
  • Repeat the visualization using the top 20 genes instead.
  • Visualize the counts of a specific gene grouped by treatment. In this case we are visualizing the gene with smallest padj:
# retrieving data with smallest padj
d<-plotCounts (dsp, 
               gene       = which.min(res_DPN_OHT_paired$padj),
               intgroup   = c("treatment","patient"), 
               returnData = TRUE)

# plotting counts from the different treatments
ggplot(d, aes(x=treatment, y=count, color=patient, shape=treatment)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) +
  scale_y_log10(breaks=c(25,100,400))

Remember that you can save these results for processing outside R:

write.table(res_DPN_OHT_paired, 
            file      = "res_DPN_OHT_paired.txt", 
            col.names = TRUE, 
            row.names = TRUE, 
            sep       = "\t", 
            quote     = FALSE,
            na        = "NA")

Functional analysis

A common downstream analysis is to find pathways linked to our significant gene list. Here we will perform an over-representation analysis using the package ReactomePA on the gene list where we compared DPN to OHT treatment.

We start by importing the result table from DESeq2 (in case you do the analysis at another time):

# reading the data from a file
res <- read.csv("res_DPN_OHT_paired.txt", 
                header = T, 
                sep    = "\t")
head(res)

Then we load the packages:

library(ReactomePA)
# Not installed? Run:
#if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
#BiocManager::install("ReactomePA")

library(AnnotationDbi)
# Not installed? Run: 
#if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")

#BiocManager::install("AnnotationDbi")

library(org.Hs.eg.db)
# Not installed? Run:  
#if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")

#BiocManager::install("org.Hs.eg.db")

library(DOSE)
#if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
#BiocManager::install("DOSE")

We select the significant genes (padj < 0.05) and save them to new dataframe called res_sig:

# Selecting genes based on padjusted
res_sig<-res[ which(res$padj < 0.05), ]
nrow(res_sig)

Reactome does not work with ENSEMBL ids, thus we need to convert them to ENTREZ before performing the pathway analysis:

# Mapping identifiers
res_sig$Entrez <- mapIds(org.Hs.eg.db, 
                         keys    = row.names(res_sig), 
                         column  = "ENTREZID", 
                         keytype = "ENSEMBL")
head(res_sig)

We extract the fold changes of all the significant genes and we set the ENTREZ ids as name, saving them to a variable called ressigFC. There will be genes that lack and ENTREZ id, these are annotatied as NA. For the analysis to run properly, it is necessary to remove these genes.

# Selecting fold changes
ressigFC<-res_sig$log2FoldChange

# Adding the gene name to the list of fold changes
names(ressigFC)<-res_sig$Entrez

# Removing genes without an ENTREZ id
ressigFC<-ressigFC[!is.na(names(ressigFC))]
length(ressigFC)

Then we run an over-representation analysis using enrichPathway(). This function uses a hypergeometric model to test if the amount of genes from the input list associated to the pathway is larger then expected. Read more here.

# Running the test
x <- enrichPathway(gene         = names(ressigFC),
                   pvalueCutoff = 0.05, 
                   readable     = T)

# Looking at the results
dfx<-data.frame(x)
knitr::kable(dfx)

Output columns are described as follows

  • ID - Reactome identifier
  • Description - Pathway description
  • GeneRatio - How many genes in our list hit this pathway of the entire input list
  • BgRatio - How many genes does the pathway contain of the total background
  • Pvalue - significant level
  • p.adj - Adjusted Pvalue, default Benjamini hochberg
  • qvalue - Adjusted Pvalue (FDR)
  • geneID - the genes from the input list that hits the pathway
# Saving the results
write.table(x         = dfx,
            file      = "DPN_OHT_ReactomePA.txt", 
            sep       = "\t",  
            quote     = FALSE, 
            col.names = NA)

Let's plot the most significant pathways:

edox <- setReadable(x, 'org.Hs.eg.db', 'ENTREZID')

p1 <- cnetplot(x, color.params=list(foldChange=ressigFC))
p2 <- cnetplot(x, color.params=list(foldChange=ressigFC), circular = TRUE)

pdf("DPN_OHT_ReactomePA_Plots.pdf", height = 15, width= 15)
p1
p2
dev.off()

Investigate the outputs DPN_OHT_ReactomePA.txt and DPN_OHT_ReactomePA_Plots.pdf.

Q. Do you find any interesting pathways that can be linked to this study?



Developed by Sanna Abrahamsson and Katarina Truvé, 2018 Modified by Marcela Davila, 2019. Modified by Sanna Abrahamsson and Marcela Dávila, 2021 Modified by Sanna Abrahamsson, 2023

⚠️ **GitHub.com Fallback** ⚠️