Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)

The aim of this practical is to introduce you to a general workflow to perform a functional analysis based on the results of an RNAseq experiment.

The Data

You will be working with the result from the data from the Expression analysis of pediatric and adult patients with COVID-19 practical. In brief, Pierce et. al. compared bulk RNA sequencing from pediatric and adult patients with COVID-19.

Use the file covid_res.tsv containing the results from the DE analysis (BMA231 V: Expression analysis). The file contains the list of genes with fold changes and pvalues, among other columns.

Load the data

Open R
Set your working directory by clicking on Session -> Set Working Directory -> Choose directory. Remember this directory is the one where you have your results file.
Read your results file:

# Reading DE results data
res<-read.table("covid_res.tsv", 
                sep       = "\t",
                header    = T, 
                row.names = 1, 
                as.is     = T)
head(res)

Click here for output

            baseMean log2FoldChange     lfcSE
WASH7P     67.203529     0.04007269 0.2978377
MIR6859-1   9.853639     0.10833982 0.7756060
AL627309.7 42.703556    -0.04152576 0.7635039
AL627309.8 36.545260    -0.19776336 0.8294208
AL627309.5 19.409275    -0.19343763 0.6184701
FO538757.1 87.863796    -0.26871845 0.3189023
                  stat    pvalue      padj
WASH7P      0.13454539 0.8929713 0.9770008
MIR6859-1   0.13968410 0.8889096 0.9766949
AL627309.7 -0.05438841 0.9566257 0.9896494
AL627309.8 -0.23843551 0.8115433 0.9581304
AL627309.5 -0.31276796 0.7544570 0.9422520
FO538757.1 -0.84263572 0.3994322 0.8016662

Q1. How many genes were loaded. Hint: type str(res)

Q2. What is padj?

Just to visualize (and remember) how the data looks like, let's create a volcano plot. You will:

Select a threshold for the fold change and save it in a variable called foldChange
Define a threshold for the pvalue and save it in a variable called alfa
Create a mycolors vector, containing 3 different colors for down-regulated, no_change, and up-regulated genes
Use ggplot to create a volcano plot

# Loading library
library(ggplot2)
#install.packages('knitr', dependencies = TRUE)
library(knitr)

# defining thresholds
foldChange <- WRITE_YOUR_THRESHOLD_HERE
alfa <- WRITE_YOUR_PADJ_THRESHOLD_HERE
mycolors <- WRITE_YOUR_THREE_COLORS_HERE_AS_A_VECTOR

# labeling DE by creating an extra column
res$de<-"no change"
res$de[res$log2FoldChange > foldChange & res$pvalue < alfa] <- "up-regulated"
res$de[res$log2FoldChange < -foldChange & res$pvalue < alfa] <- "down-regulated"

# Plotting fold changes vs pvalues, highlighting up and down-regulated
ggplot(data=res, 
       aes(x=log2FoldChange, y=-log10(pvalue), col=de, label=rownames(res))) + 
       geom_point() + 
       scale_colour_manual(values = mycolors)

Q3. Which foldchange did you select? Why?. Include the graph in your report.

Gene Ontology analysis

In this part of the practical we will use our significant gene list to perform a gene set enrichment analysis to find affected GO terms using topGO. Let´s load the library (install it if you don't have it already):

#if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
# Installing topGO and human database
# BiocManager::install("topGO")
# BiocManager::install("org.Hs.eg.db")

# Loading topGO and human database
library(topGO)
library(org.Hs.eg.db)

We need to generate an object that can be read by the topGO functions. In this step we need to specify which ontology we want to analyze, let's start with the Biological Process:

# Defining the ontology to use
ONT="BP"

# Creating the dataset
res_GO <- new("topGOdata", 
              description = "Covid-19 project",                 # any name
              ontology    = ONT,                                # ontology to analyze
              allGenes    = setNames(res$padj, rownames(res)),  # list of all genes
              geneSel     = function(x) x<0.01,                 # our significant genes
              nodeSize    = 10,                                 # number of proteins so the node is included
              mapping     = "org.Hs.eg.db",                     # organism
              ID          = "symbol",                           # type of identifier
              annot       = annFUN.org)                         # annotation database

# Displaying the info
res_GO

Q4. Looking at the object how many significant genes are used in the analysis, why are not all 1227 significant genes used?

Then we perform the enrichment test. As usual there are several statistical tests, here we will be using Kolmogorv Smirnov (KS).

This test is used when you have "manipulated" data as input, such as pvalues. KS test compares the distribution of P-values expected at random to the observed distribution of P-values.
We will complement if by using the weigth01 method, which goes through the GO hierarchy discarding significant descendant GO-terms if it hits further up, and then compares the significant scores to detect the most significant term in the hierarchy:

# Running the statistical test
resultKS <- runTest(res_GO, 
                    algorithm = "weight01", 
                    statistic = "ks")       

# Creating a table with the results
res_GO_table_BP <- GenTable(res_GO,
                            Weigth1_KS = resultKS, 
                            topNodes   = 20)            #top 20 GO to a table 

# Displaying the results
knitr::kable(res_GO_table_BP, caption = "Biological Process top 20")

The different columns in the table are described below:

GO.ID - Gene ontology identifier
Term - Description of the GO term
Annotated - Number of genes containing the specific GO term
Significant - Number of genes that are significantly enriched in our gene list for the specific GO term
Expected - Number of genes that are expected to be significant by chance
Weigth1_KS - P-value from the KS test with the weight01 method

Q5. How many GOs would you say are related to immune response. Include the table in your report

We can generate the GO graph of these results and directly save it to a pdf file:

# printing the graph into a file, 
# saved in the location you set as working directory

printGraph(res_GO, 
           resultKS,  
           firstSigNodes = 20, 
           fn.prefix     = "GO_graph", 
           useInfo       = "all", 
           pdfSW         = TRUE)

Open the graph (look under your working directory). Squares are the significant GO terms, while the shade of red will depict the degree of significance.

Repeat the same analysis focusing on the Cellular Component (CC). Hint: you just need to replace ONT="BP" to ONT="CC" and when running the GeneTable() function, save it as res_GO_table_CC

Q6. What is the most significant term and it's p-value? Are these ontologies also related to immune response?. Include the table in your report

Now, do the same analysis with the final GO class Molecular Function (MF). Hint: you just need to set ONT="MF" and when running the GeneTable() function, save it as res_GO_table_MF

Q7. Which ontologies relate to immune response? Include the table in your report

ID mapping

Different programs prefer different annotation identifiers. Our data is annotated with the gene symbol. For the following exercises we need to change that to the entrez ID so let´s use the AnnotationDbi to do the annotation:

# Installing AnnotationDbi
# BiocManager::install("AnnotationDbi")

# Loading AnnotationDbi
library(AnnotationDbi)

# Adding entrez id as an extra column
res$entrez <- mapIds(org.Hs.eg.db,        
                     keys    = row.names(res), 
                     column  = "ENTREZID", 
                     keytype = "SYMBOL")

head(res)

Click here for output

            baseMean log2FoldChange     lfcSE        stat
WASH7P     67.203529     0.04007269 0.2978377  0.13454539
MIR6859-1   9.853639     0.10833982 0.7756060  0.13968410
AL627309.7 42.703556    -0.04152576 0.7635039 -0.05438841
AL627309.8 36.545260    -0.19776336 0.8294208 -0.23843551
AL627309.5 19.409275    -0.19343763 0.6184701 -0.31276796
FO538757.1 87.863796    -0.26871845 0.3189023 -0.84263572
              pvalue      padj        de nolabel    entrez
WASH7P     0.8929713 0.9770008 no change      NA    653635
MIR6859-1  0.8889096 0.9766949 no change      NA 102466751
AL627309.7 0.9566257 0.9896494 no change      NA      <NA>
AL627309.8 0.8115433 0.9581304 no change      NA      <NA>
AL627309.5 0.7544570 0.9422520 no change      NA      <NA>
FO538757.1 0.3994322 0.8016662 no change      NA      <NA>

Q8. What percentage of genes were you able to annotate with an ENTREZID?

ReactomePA

Another way to look at the function of our genes, is by using ReactomePA. This R package provides functions for pathway analysis based on the REACTOME pathway database. It implements enrichment analysis, gene set enrichment analysis and several functions for visualization.

First we need to select the significant genes and then run the analysis:

# Selecting significant genes 
res_sig <- res[ which(res$padj < 0.05), ]
res_sig

# Installing ReactomePA
# BiocManager::install("ReactomePA")

# Loading ReactomePA
library("ReactomePA")

# Enrichment test
res_sig_ePA <- enrichPathway(gene         = res_sig$entrez, 
                             pvalueCutoff = 0.1,
                             readable     = TRUE)

res_sig_ePA

Click here for output

#
# over-representation test
#
#...@organism 	 human 
#...@ontology 	 Reactome 
#...@keytype 	 ENTREZID 
#...@gene 	 chr [1:332] "55966" "3604" "26270" "55092" NA "1188" ...
#...pvalues adjusted by 'BH' with cutoff <0.1 
#...6 enriched terms found
'data.frame':	6 obs. of  9 variables:
 $ ID         : chr  "R-HSA-877300" "R-HSA-3000157" "R-HSA-1474244" "R-HSA-381426" ...
 $ Description: chr  "Interferon gamma signaling" "Laminin interactions" "Extracellular matrix organization" "Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs)" ...
 $ GeneRatio  : chr  "9/202" "5/202" "15/202" "9/202" ...
 $ BgRatio    : chr  "92/10899" "30/10899" "300/10899" "125/10899" ...
 $ pvalue     : num  4.97e-05 2.03e-04 4.63e-04 5.17e-04 6.84e-04 ...
 $ p.adjust   : num  0.0315 0.0646 0.0821 0.0821 0.0868 ...
 $ qvalue     : num  0.0291 0.0595 0.0757 0.0757 0.08 ...
 $ geneID     : chr  "GBP3/GBP1/GBP4/GBP5/TRIM31/HLA-DRA/HLA-DRB1/CIITA/SOCS1" "COL7A1/LAMB2/LAMA4/LAMB1/COL4A6" "COL7A1/LAMB2/CD47/KDR/ADAM19/DST/LAMA4/ELN/LAMB1/COL4A6/COL22A1/ITGB7/LTBP2/FBN1/BMP7" "LAMB2/ENAM/LAMB1/PRSS23/IGF1/GAS6/FBN1/C3/APOL1" ...
 $ Count      : int  9 5 15 9 3 8
#...Citation
  Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for
  reactome pathway analysis and visualization. Molecular BioSystems
  2016, 12(2):477-479

Q9. How many significant genes are we selecting? Hint: use nrow(NAME_OF_THE_SIGNIFICANT_GENES_LIST)

Q10. there is only one significant pathway, which one is it? Hint: use head(NAME_OF_THE_ENRICHMENT_RESULTS)

We can visualize these results in many ways. One is the barplot:

Let's start with the barplot:

barplot(res_sig_ePA)

A similar plot is the dotplot, it accepts all the arguments from the barplot:

dotplot(res_sig_ePA, showCategory=20)

The cnetplot is a gene-concept network that show potentially biological complexities (when a gene may belong to multiple annotation categories). The foldChange argument will add coloring of expression to our genes, while the showCategory argument will show you the number of enriched terms to display:

# Selecting the fold changes, the function needs a list
res_sig_fc = res_sig$log2FoldChange
names(res_sig_fc) <- res_sig$entrez
head(res_sig_fc)

cnetplot(res_sig_ePA, 
         foldChange = res_sig_fc)

When we have several categories, the emapplot can get quite busy. An alternative visualization is the heatplot, a heatmap plot showing the different functional classifications:

heatplot(res_sig_ePA,
         foldChange = res_sig_fc)

Q11. Are the results from the GO analysis in agreement with the results from ReactomePA? Include all the graphs from ReactomePA in your report

Great! you now have an idea on what kind of functional analysis and visualization can be done!

Home: Next Generation Sequencing Analysis with clinical applications

Developed by Marcela Dávila, 2021. Updated by Marcela Dávila, 2023.

BMA231 VIII: Pathway analysis - bcfgothenburg/HT24 GitHub Wiki

The Data

Load the data

Gene Ontology analysis

ID mapping

ReactomePA

Home: Next Generation Sequencing Analysis with clinical applications

⚠️ GitHub.com Fallback ⚠️

BMA231 VIII: Pathway analysis - bcfgothenburg/HT24 GitHub Wiki

The Data

Load the data

Gene Ontology analysis

ID mapping

ReactomePA

Home: Next Generation Sequencing Analysis with clinical applications

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️