Section Practical 4: Downstream Analysis - TarifenoLab/Hands-on-RNA-seq-Data-Analysis GitHub Wiki

So far, we have got the table with our differential expressed genes and also we could explore several global analyses like the MA-plot, PCA, and Heatmap. However, even if we have now a global idea about what is going on with the data...we have no clue about the biological meaning of these results. So, let´s jump on the "making biological sense of your data" session! 🤓🤓🤓

First, let´s make your result table more friendly!

The results table from DESeq2 has more than 32.000 lines!!!!. This is basically the number of genes that pass the initial filters and were statistically tested...Of course, and as well bioinformaticians that we are, we will not go and look at the table line-by-line.

Customizing Results table:

Extract Gene name from BioMart hub at Ensembl WebSite

To have a better idea about the genes modulated, we will incorporate the gene name into the result table. To do so, we need to get a database where we have the Ensembl ID + Gene names.

To do so, we will use the Hub tool called BioMart, which allows crossing different databases in order to have the most complete annotation possible.

Go to https://www.ensembl.org/biomart/martview/b4bbeb94c2b3bafeeead79eca27cf925
CHOOSE DATABASE --> Ensembl Genes 106
CHOOSE DATASET --> Zebrafish genes (GRCz11)
Now, in the menu at the left select Filters, and attributes
- Click in Filters --> GENE --> Input external references ID list --> Gene stable ID

Click in Attributes --> GENE --> Select Gene stable ID and Gene name
Click in Results
To Export the results, select: Export all Results to --> File --> TSV --> Click in Unique results only (save the file in your computer, we will imported to Galaxy)

Importing Gene Names to Galaxy

Go to Galaxy and click on Upload Data
- Choose local files --> browse to the file we have downloaded using biomart (Set as name: Gene Names) --> Start

Incorporate Gene Names into your table

Go to the tools menu in Galaxy
- GENERAL TEXT TOOLS --> Join, Subtract, and Group --> Join two Datasets
- Join: Select table extracted "Gene_Names"
- Using column: Column:1
- with: Select DESeq2 result file
- and column: Column:1
- Keep all the rest of options in "NO"
- Click Execute
Once the output is ready in you history, you can rename it (e.i. DE_table)

Get your differential expressed genes and add the normalized count

Go to GENERAL TEXT TOOLS --> Filter and Sort --> Filter
- Filter --> DE_Table
- With following condition --> c9 < 0.05 (column 9 correspond to the p-adj value)
- Number of header lines to skip 0
Execute
Change the name of your table to DE_Genes
Go to the tools menu in Galaxy
- GENERAL TEXT TOOLS --> Join, Subtract, and Group --> Join two Datasets
- Join: Select table extracted "DE_Genes"
- Using column: Column:1
- with: Select Normalized counts
- and column: Column:1
- Keep all the rest of options in "NO"
- Click Execute
Change the name of your table to DE_Genes_final table and download the file to your computer
Open excel in your computer and perform the following actions:
- Click on Open a file and look for the table
  - file origin --> 20127: US-ASCII --> Next
  - Select Delimiters --> Tab --> Next
  - Adavanced... --> Advanced Text Import Settings
    - Decimal Separator --> . (change for a dot)
    - Thousands separator --> , (change for a comma) --> OK --> Finish
With this the decimal will be imported correctly

Remove C and J columns in the excel file
Add a header line following:
- Column A --> Ensembl ID
- Column B --> Gene Name
- Column C --> Base mean
- Column D --> Log2FC
- Column E --> StdErr
- Column F --> Wald-Stats
- Column G --> P-values
- Column H --> P-adj
- Column I --> WT1
- Column J --> WT2
- Column K --> WT3
- Column L --> MUT1
- Column M --> MUT2
- Column N --> MUT3
Save the table as excel file

Based on the column Log2FC sort genes from largest to Smallest
Copy the positive and Paste it into a new sheet called Upregulated
Do the same for the negative values, this time the new sheet is called Downregulated
Now you have divided your DE genes by Up and Downregulated

How many Up-regulated genes do we have?
How many Down-regulated genes do we have?
Identify the Top 10 Upregulated and Downregulated genes

Find out the biological function of the top modulated genes

To find out about the biology of our top modulated genes, we will use the webtool panther. This web tool uses the database GENEONTOLOGY from the GO consortium (http://geneontology.org/) which is the largest source about gene function.

Go to http://pantherdb.org/
- Copy the Ensembl ID for the top 10 Downregulated
- Paste on the Ensembl ID
- Select are organism "Danio Rerio"
- Select Analysis --> Functional Classification (either viewed by gene list or graphic charts)
Submit
Now, do the same with the Upregulated

Panther's Outputs

The gene list will provide a table containing information about the protein class and family. We can also display that as a "graphic chart", where we will get the protein groups by functional categories.

Exploring Pathways

Additional to retrieving Gene Ontology information, we can also reconstruct the pathways affected by our treatment. To do so, we can use de webtool DAVID

Go to https://david.ncifcrf.gov/tools.jsp
- On the upload menu (left side):
  - Step 1: Enter Gene List --> Paste all Ensembl ID from downregulated genes
  - Step 2: Select Identifier --> ENSEMBL_GENE_ID
  - Step 3: List Type --> Gene List
  - Step 4: Submit List --> Click on Submit List
- If everything runs OK, another window titled "Analysis Wizard"
  - Click on functional Annotation Tool
    --> Explore the "Annotation Summary Result" where you will find several categories:
- Functional Annotation --> click in "chart" of the Biological process
  - To know which genes are driving an specific enrichment --> Click on the blue bar in the column "genes"
- Gene_Ontology --> click on chart of the terms in red "GOTERM_BP_DIRECT", "GOTERM_CC_DIRECT", "GOTERM_MF_DIRECT"
  - To know which genes are driving an specific enrichment --> Click on the blue bar in the column "genes"
- Pathways --> Click in KEGG_Pathways "chart"
- If your click in the name of the pathway (e.i. "protein export") a schema of the pathway with the genes highlighted with be displayed.

Do the GO and the pathways analysis makes sense with the biology??
Now, you can try to use as background a gene list including only the ones expressed in your sample (download the background gene set here https://usegalaxy.org/datasets/bbd44e69cb8906b5033da1915c8841a6/display?to_ext=tabular)

Gene Set Enrichment Analysis

This analysis focused on a collection of genes belonging to two classes (e.i. Mutant vs. WT), which are ordered in a rank according to their differential expression between the classes. This method will evaluate enrichment at the level of the gene sets (which are defined by prior knowledge, such as published data). The method GSEA determines whether the members of a gene set occur toward the top (or bottom) of the ranked list, in which case the gene set is correlated with the phenotypic class.

Preparing the data for GSEA Analysis:

To run GSEA analysis, you will need to prepare the input data according to the requirement of the program you plan to use. Today, we will run GSEA analysis with fgsea on GALAXY which requires a ranked list of genes and the Gene sets.

We will provide a list will all the expressed genes and log2FC ranked in descending order.

We use all genes expressed instead of the DE because GSEA chacks if the genes in the gene set accumulate on either side of the ranked list, meaning that they tend to be globally more up or downregulated in your data (for more information look at https://www.biostars.org/p/429563/).

Upload to your history the list of ranked genes (the identifiers I select this time is the Entrez ID) > Don't forget to select Tabular as Type of file
> https://usegalaxy.org/datasets/bbd44e69cb8906b5a88abd1e79013e18/display?to_ext=tabular
Upload the gene set, in this case, we will import the MSigDB - All gene Sets with Entrez Gene ID > Don't forget to select Tabular as Type of file
> https://usegalaxy.org/datasets/bbd44e69cb8906b55590ea31a85fc760/display?to_ext=tabular
Go the GALAXY and in the tools menu look for fgsea (just write the name on the search bar)
Then, select the following options:
* Ranked Genes --> Select DE_Entrez_final
* FIle has header --> Yes
* Gene Sets --> msigdb.v7.5.1.entrez.gmt
* Change to Yes Output Plots and let all the other options on default
* Click in execute

GSEA RESULTS

From the two outputs, open the "plots" (clicking on the eye icon) and discovered the pathways that were found significatively enriched in your data. To help you read the GSEA plot, I let you the user manual of GSEA software (https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html)

Google the pathways ID and make some conclusions about your results

Well....with this activity we have finished the practical sections. I hope you like the course and that all we have learned during the course will be useful for you in the future!