Section Practical 4: Downstream Analysis - TarifenoLab/Hands-on-RNA-seq-Data-Analysis GitHub Wiki

So far, we have got the table with our differential expressed genes and also we could explore several global analyses like the MA-plot, PCA, and Heatmap. However, even if we have now a global idea about what is going on with the data...we have no clue about the biological meaning of these results. So, let´s jump on the "making biological sense of your data" session! 🤓🤓🤓

First, let´s make your result table more friendly!

The results table from DESeq2 has more than 32.000 lines!!!!. This is basically the number of genes that pass the initial filters and were statistically tested...Of course, and as well bioinformaticians that we are, we will not go and look at the table line-by-line.

Customizing Results table:

Extract Gene name from BioMart hub at Ensembl WebSite

To have a better idea about the genes modulated, we will incorporate the gene name into the result table. To do so, we need to get a database where we have the Ensembl ID + Gene names.

To do so, we will use the Hub tool called BioMart, which allows crossing different databases in order to have the most complete annotation possible.

  • Click in Attributes --> GENE --> Select Gene stable ID and Gene name
  • Click in Results
  • To Export the results, select: Export all Results to --> File --> TSV --> Click in Unique results only (save the file in your computer, we will imported to Galaxy)

Importing Gene Names to Galaxy

  • Go to Galaxy and click on Upload Data
    • Choose local files --> browse to the file we have downloaded using biomart (Set as name: Gene Names) --> Start

Incorporate Gene Names into your table

  • Go to the tools menu in Galaxy

    • GENERAL TEXT TOOLS --> Join, Subtract, and Group --> Join two Datasets
    • Join: Select table extracted "Gene_Names"
    • Using column: Column:1
    • with: Select DESeq2 result file
    • and column: Column:1
    • Keep all the rest of options in "NO"
    • Click Execute
  • Once the output is ready in you history, you can rename it (e.i. DE_table)

Get your differential expressed genes and add the normalized count

  • Go to GENERAL TEXT TOOLS --> Filter and Sort --> Filter

    • Filter --> DE_Table
    • With following condition --> c9 < 0.05 (column 9 correspond to the p-adj value)
    • Number of header lines to skip 0
  • Execute

  • Change the name of your table to DE_Genes

  • Go to the tools menu in Galaxy

    • GENERAL TEXT TOOLS --> Join, Subtract, and Group --> Join two Datasets
    • Join: Select table extracted "DE_Genes"
    • Using column: Column:1
    • with: Select Normalized counts
    • and column: Column:1
    • Keep all the rest of options in "NO"
    • Click Execute
  • Change the name of your table to DE_Genes_final table and download the file to your computer

  • Open excel in your computer and perform the following actions:

    • Click on Open a file and look for the table
      • file origin --> 20127: US-ASCII --> Next
      • Select Delimiters --> Tab --> Next
      • Adavanced... --> Advanced Text Import Settings
        • Decimal Separator --> . (change for a dot)
        • Thousands separator --> , (change for a comma) --> OK --> Finish
  • With this the decimal will be imported correctly

  • Remove C and J columns in the excel file

  • Add a header line following:

    • Column A --> Ensembl ID
    • Column B --> Gene Name
    • Column C --> Base mean
    • Column D --> Log2FC
    • Column E --> StdErr
    • Column F --> Wald-Stats
    • Column G --> P-values
    • Column H --> P-adj
    • Column I --> WT1
    • Column J --> WT2
    • Column K --> WT3
    • Column L --> MUT1
    • Column M --> MUT2
    • Column N --> MUT3
  • Save the table as excel file

  • Based on the column Log2FC sort genes from largest to Smallest
  • Copy the positive and Paste it into a new sheet called Upregulated
  • Do the same for the negative values, this time the new sheet is called Downregulated
    Now you have divided your DE genes by Up and Downregulated

How many Up-regulated genes do we have?
How many Down-regulated genes do we have?
Identify the Top 10 Upregulated and Downregulated genes

Find out the biological function of the top modulated genes

To find out about the biology of our top modulated genes, we will use the webtool panther. This web tool uses the database GENEONTOLOGY from the GO consortium (http://geneontology.org/) which is the largest source about gene function.

  • Go to http://pantherdb.org/

    • Copy the Ensembl ID for the top 10 Downregulated
    • Paste on the Ensembl ID
    • Select are organism "Danio Rerio"
    • Select Analysis --> Functional Classification (either viewed by gene list or graphic charts)
  • Submit

  • Now, do the same with the Upregulated

Panther's Outputs

  • The gene list will provide a table containing information about the protein class and family. We can also display that as a "graphic chart", where we will get the protein groups by functional categories.

Exploring Pathways

Additional to retrieving Gene Ontology information, we can also reconstruct the pathways affected by our treatment. To do so, we can use de webtool DAVID

  • Go to https://david.ncifcrf.gov/tools.jsp
    • On the upload menu (left side):

      • Step 1: Enter Gene List --> Paste all Ensembl ID from downregulated genes
      • Step 2: Select Identifier --> ENSEMBL_GENE_ID
      • Step 3: List Type --> Gene List
      • Step 4: Submit List --> Click on Submit List
    • If everything runs OK, another window titled "Analysis Wizard"

      • Click on functional Annotation Tool
        --> Explore the "Annotation Summary Result" where you will find several categories:
    • Functional Annotation --> click in "chart" of the Biological process

      • To know which genes are driving an specific enrichment --> Click on the blue bar in the column "genes"
    • Gene_Ontology --> click on chart of the terms in red "GOTERM_BP_DIRECT", "GOTERM_CC_DIRECT", "GOTERM_MF_DIRECT"

      • To know which genes are driving an specific enrichment --> Click on the blue bar in the column "genes"
    • Pathways --> Click in KEGG_Pathways "chart"

    • If your click in the name of the pathway (e.i. "protein export") a schema of the pathway with the genes highlighted with be displayed.

Do the GO and the pathways analysis makes sense with the biology??
Now, you can try to use as background a gene list including only the ones expressed in your sample (download the background gene set here https://usegalaxy.org/datasets/bbd44e69cb8906b5033da1915c8841a6/display?to_ext=tabular)

Gene Set Enrichment Analysis

This analysis focused on a collection of genes belonging to two classes (e.i. Mutant vs. WT), which are ordered in a rank according to their differential expression between the classes. This method will evaluate enrichment at the level of the gene sets (which are defined by prior knowledge, such as published data). The method GSEA determines whether the members of a gene set occur toward the top (or bottom) of the ranked list, in which case the gene set is correlated with the phenotypic class.

Preparing the data for GSEA Analysis:

To run GSEA analysis, you will need to prepare the input data according to the requirement of the program you plan to use. Today, we will run GSEA analysis with fgsea on GALAXY which requires a ranked list of genes and the Gene sets.

  • We will provide a list will all the expressed genes and log2FC ranked in descending order.

We use all genes expressed instead of the DE because GSEA chacks if the genes in the gene set accumulate on either side of the ranked list, meaning that they tend to be globally more up or downregulated in your data (for more information look at https://www.biostars.org/p/429563/).

GSEA RESULTS

From the two outputs, open the "plots" (clicking on the eye icon) and discovered the pathways that were found significatively enriched in your data. To help you read the GSEA plot, I let you the user manual of GSEA software (https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html)

Google the pathways ID and make some conclusions about your results

Well....with this activity we have finished the practical sections. I hope you like the course and that all we have learned during the course will be useful for you in the future!