Section Practical 3: Expression, Normalization and DE - TarifenoLab/Hands-on-RNA-seq-Data-Analysis GitHub Wiki

Once we have the mapping done using Hisat2, the next steps are to estimate expression, normalization, and run the differential expression analysis. During this section, we will perform the abovementioned analysis to identify genes regulated on the Pax6b mutant pancreatic cells. To do so, we will implement an approach that allows us to count the number of mapped reads by gene (htseq-count). Then, with the count files and using DESeq2 we will normalize the expression and perform a statistical test to identify differential expression.

Step 1. Estimate expression per-gene

In order to compare gene abundance between two conditions, first, we need to quantify expression using htseq-count. In our case, we will use the alignment files created by the HISAT2 (you must have, one BAM file by replicate...6 files in total).

Before start review the documentation of htseq-count here: https://htseq.readthedocs.io/en/release_0.11.1/count.html

Preparing the BAM files for expression estimation

Before starting the analysis, we need to sort the Alignment files (by coordinates) using Samtools. Sort Bam files are required for

Go to GENOMIC FILE MANIPULATION --> SAM/BAM --> Samtools sort
- Running parameters:
  - BAM File --> Multiple Datasets --> Select all the BAM files
  - Primary sort key --> coordinate
  - Let all other options as default
  - Click on execute

You will get one output (sorted BAM) by BAM file

REMEMBER TO CHANGE THE NAME OF THE OUTPUT TO SOMETHING LIKE HISAT2_MUT1_sorted

Running `htseq-count`

At this point, we need to perform one count by BAM file, however, we you can run all of them at once using batch mode.

Go to RNA-Seq --> htseq-count
- Set the following parameters:
  - Aligned SAM/BAM File: Select all sorted bam it will runs on batch mode
  - GFF File: Select the zebrafish annotation in GTF file "Zebrafish_Annotation"
  - Mode: Union
  - Stranded: No
  - Minimun Aligment quality: 10
  - Feature type: exon > This means that among all features present in the GTF annotation, we will use only exons for the count
  - ID Attribute: gene_id > The count will be reported by gene
  - Set advanced options: Default options
  - Click on Execute

Why did you select Mode: Union ??

Step 2. Perform normalization and Differential Gene expression with `DESeq2`

We have estimated per-gene expression based on the read count, so now, we can perform statistical analysis to identify modulated genes due to Pax b mutation.
DESeq2 is a gold software (open source...of course) that applies a normalization called "size factor". In brief, this normalization works as follows:

Calculates geometric mean of read count among the samples
Divide read count by geometric mean
Mean of calculated values in point 2 correspond to the size factor. As the method of size factor, The 'poscounts' estimator deals with a gene with some zeros, by calculating a modified geometric mean by taking the n-th root of the product of the non-zero counts.

Last but not least, we need to define the condition used to perform the comparisons. This parameter is set in the Factor section, more precisely in "1: Factor" the sample in "factor level 1" will be the reference for comparison. In our case, we want to compare expression in relation to the MUT samples. This will be equivalent to say --> MUT Expression/ WT expression (as WT is the control).

Go to RNA-Seq --> DESeq2
- Set the following parameters:
  - Select dataset per level
  - Factor:
    - 1:Factor
      - Specify a factor name: Pax6b_Mutant
      - Factor level
        
        1: Factor Level
        
        Specify a factor level, typical values could be 'tumor', 'normal', 'treated' or 'control' --> MUT
        
        Count files: Select all 3 replicates for MUT condition
        
        2: Factor Level
        
        Specify a factor level, typical values could be 'tumor', 'normal', 'treated' or 'control' --> WT
        
        Count files: Select all 3 replicates for WT condition
  - (Optional) provide a tabular file with additional batch factors to include in the model --> Nothing Selected
  - Files have header? --> Yes
  - Choice of Input data --> Count data
  - Advanced Options --> "Click on the eye icon"
    * Method for estimateSizeFactors: postcounts
    * Fit type: parametric
    * Turn off outliers replacement (only affects with >6 replicates): NO
    * Turn off outliers filtering (only affects with >2 replicates): YES (will take out highly expressed genes, link insulin) * Turn off independent filtering: NO * Advanced Options --> "Output potions"
    * Output SelectorClick the boxes: Generate plots for visualizing the analysis results
    Output sample size factors
    Output normalised counts
    Output VST normalized table
    * Alpha value for MA-plot: 0.05 (p-adjusted value considered as significant, then, we will assume 5 false positive in 100...don't worry because this value is adjusted by the FDR 💃💃💃)
- Click Execute

DE Outputs

According to the set parameters, we will get 5 outputs:

Result file
DESeq2 plots
Size Factors (the normalization factors)
Normalized counts file
VST-Normalized counts (data transformation to plot global data)

Result File

This file is a Tab-separated table including the following parameters, which are the result of the statistical test. Here, you will find the differentially expressed genes. As we used the annotation coming from Ensembl genome browser, the gene ID are on the Ensembl Stable Gene ID format.

Column explanation here:

Normalized count file

The normalized counts will help to compare the expression between conditions. It is very useful to incorporate these values to the result table, so we can see the expression along with the significance of the DE test.

DESeq2 plots

In this pdf file, you will find several plots telling about QC of the data and results. We will focus on, PCA and Heatmap describing similarities among replicates and samples. Then, we will focus on the MA plot 🤓🤓

Evaluating similarity and reproducibility

Principal component Analysis (PCA): PCA is visualized 2-Dimension plots in order to evaluate similarity levels among the samples. Each point corresponds to one sample; the closer the points are the more similar the samples will be. We expect to have replicates clustering together (meaning replicates are similar), and conditions should separate on the Principal component 1 (PCA1) which always explains the most variability (this means that: if conditions cluster away on the PCA1, then the most variability on your data is due to the treatment applied).

Interpretation: Both conditions cluster separately on the PCA1 and this parameter explains 76% variance among samples. Then, the treatment (here Pax6b mutation) is responsible for the most variance identified in the samples).

HeatMap: Display the sample distance of the sample, it is a complementary result to the PCA. Here, we will observe which samples are more similar. The data is transformed into a z-score in which (and according to the legend), dark blue corresponds to 0 distance (identical) and light blue corresponds to highly distant (very different samples).

Interpretation: As we saw on PCA, replicates by a condition are very similar and distant from the other condition. We can also see, that among the replicates there are some samples more similar than others. The overall result supports the conclusions made by the PCA plot, which means that replicates are reproducible.

Visualizing Differential expression test

To have a global idea about gene modulation, we can perform a MA-plot. By doing so, we will visualize expression change (Log2FC or M) and average expression among samples (A from MA-plot). Significantly modulated genes will be colored blue according to the alpha threshold set during DESEq2 running (p-adj value 0.05).