Differential analysis - uic-ric/uic-ric.github.io GitHub Wiki

Overview of results

When you are provided a set of differential results, these may come as an Excel spreadsheet with one tab per output for a single set of comparisons. NOTE: The first tab will often be a README that describes the results. Descriptions are also included below:

  • _diff_summary - Summary of the number of differentially expressed features for each differential test run at QValue ≤ 0.05.
  • _counts - Counts per feature: the number of observations in raw reads.
  • _norm - Normalized expression per feature. Units are CPM (counts per million). Normalization accounts for differences in sequencing depth across libraries, allowing expression levels to be directly compared between samples.
    • Additional normalization with TMM (trimmed mean of M-values) scaling may be performed. TMM normalization is more robust to outlier features, and seeks to ensure that the average log-fold-change across samples is 0.
    • In some cases, these values may be in log2 scale; please check the README tab to confirm. Negative values are a definite sign that the values are log-scaled.
  • _avg - Average normalized expression per sample group. Units are CPM (counts per million). If the _norm tab is in log-scale, these values will be too.
  • _diff - Differential expression statistics. See section below for more details.
  • _lib - Normalization and dispersion estimates based on different differential comparisons. Normalization factors include the library size and TMM factor estimated by edgeR. Dispersion estimates include the common dispersion estimated for each comparison, and the corresponding biological coefficient of variation (BCV), equal to the square-root of the dispersion.
    • NOTE: we typically expect BCVs ~10%-30% (0.1 to 0.3) for model experimental systems (e.g., cell lines, mouse models). Technical replicates would be smaller (<1%). Clinical or environmental data sets may be much bigger (>50%). Comparing your BCV to these expectations is a useful way to judge how variable your data set is.
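The relationship between dispersion and BCV described for the _lib tab can be checked by hand. The following is a minimal sketch; the comparison names and dispersion values are hypothetical and only illustrate the arithmetic.

```python
import math

def bcv_from_dispersion(dispersion):
    """Biological coefficient of variation = square root of the dispersion."""
    return math.sqrt(dispersion)

# Hypothetical common dispersion values, one per comparison (not real results)
dispersions = {"Disease/Control": 0.04, "Treated/Untreated": 0.25}

for name, disp in dispersions.items():
    # Print the BCV as a percentage so it can be compared to the 10%-30% guideline
    print(f"{name}: dispersion={disp}, BCV={bcv_from_dispersion(disp):.0%}")
```

A dispersion of 0.04 corresponds to a BCV of 20%, which falls inside the range expected for model experimental systems.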

Normalization

When performing differential analysis, the counts of the features need to be comparable between samples. An increase in sequencing depth should result in an increase in the raw counts for each feature. Thus, the raw counts need to be normalized before analysis. The standard unit is CPM (counts per million); TPM (transcripts per million) or RPM (reads per million) are also sometimes used.

In essence, CPM is the fraction of a sample's total counts that belongs to a feature, multiplied by 1 million (equivalently, percent × 10,000).
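As a rough illustration of the CPM calculation, here is a minimal Python sketch. The function name and the counts are made up for the example; real pipelines (e.g., edgeR) also apply the library-size adjustments described in the next section.

```python
def cpm(counts):
    """Convert the raw counts of one sample to counts per million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]

# Hypothetical raw counts for four features.
# sample_b was sequenced twice as deeply as sample_a.
sample_a = [50, 150, 300, 500]
sample_b = [100, 300, 600, 1000]

print(cpm(sample_a))
print(cpm(sample_b))
```

Although the raw counts in the second sample are twice as large, the CPM values of the two samples are identical, which is the point of normalizing for sequencing depth.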

TMM normalization

Sometimes CPM values can be skewed if there are a few very highly expressed/abundant features. TMM normalization calculates an extra scaling factor under the assumption that most features have a log-fold-change of 0 (no change). TMM stands for trimmed mean of M-values, where the M-values are per-feature log-fold-changes between samples. "Trimmed mean" indicates that a fixed percentage of features with the highest and lowest values is excluded when calculating the normalization factor. However, these features are still included when performing the statistical comparisons.

In this example, CPM normalization results in extremely highly expressed genes in some samples driving down the mean expression for all other genes. TMM normalization corrects for this effect – compare the median values across samples with CPM+TMM normalization.
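The skewing effect can be reproduced with a toy calculation. The sketch below is a deliberately simplified, unweighted version of the trimmed-mean idea and is not edgeR's implementation (edgeR also trims on average abundance and uses precision weights). The function name, the trim fraction of 0.3, and the counts are assumptions made for illustration.

```python
import math

def simple_tmm_factor(sample, reference, trim=0.3):
    """Simplified TMM-style scaling factor for one sample against a reference."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    n_s, n_r = sum(sample), sum(reference)
    # M-value: log2 ratio of a feature's proportion in the sample vs. the reference
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    if not m_values:
        raise ValueError("no features with counts in both samples")
    k = int(len(m_values) * trim)  # number of features trimmed from each end
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

# Hypothetical counts: ten features, identical except for one
# extremely abundant feature in `sample`.
reference = [100] * 10
sample = [100] * 9 + [10000]

factor = simple_tmm_factor(sample, reference)
effective_size = sum(sample) * factor  # library size scaled by the factor

plain_cpm = sample[0] * 1_000_000 / sum(sample)
tmm_cpm = sample[0] * 1_000_000 / effective_size
reference_cpm = reference[0] * 1_000_000 / sum(reference)
print(round(plain_cpm), round(tmm_cpm), round(reference_cpm))
```

With plain CPM, the nine unchanged features look roughly ten times lower in `sample` than in `reference`, purely because one feature absorbs most of the reads. Dividing by the effective library size instead brings them back in line with the reference.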

Differential expression statistics

The differential expression statistics may have the following details, depending on how the analysis was performed.

  • Statistics from generalized linear models, modeling the effect of multiple factors or sample groups simultaneously. These results are similar to the results from a multi-way ANOVA. There will be 2 columns per factor test, one for PValue, and one for QValue (FDR-corrected p-value). You may see tests both for individual factors (e.g., Treatment: QValue), and for interaction terms between factors (e.g., Treatment::Genotype: QValue). Interaction terms test whether one factor's effect depends on the other factor (e.g., if the effect of a treatment depends on genotype).

  • Pair-wise comparisons between sample groups. You will see four columns per comparison, all prefixed with the names of the groups being compared, e.g., Disease/Control. The four columns are as follows.

    • log2FC – the log2 fold change comparing the named groups. A positive value would indicate that the feature (gene) was higher in the group before the slash (/) and a negative value would indicate the reverse. The following reference points can be used when interpreting log2 fold changes.
      log2FC   Interpretation
      -3       8 fold decrease
      -2       4 fold decrease
      -1       2 fold decrease
       0       no difference
      +1       2 fold increase
      +2       4 fold increase
      +3       8 fold increase
    • logCPM – The log2 of the average counts per million (CPM) across all the samples.
    • PValue – the nominal p-value of the statistical test.
    • QValue – the FDR-corrected p-value.
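The log2FC reference points follow from the fact that a log2 fold change of x corresponds to a fold change of 2 to the power x. A small helper (hypothetical, for illustration only) makes the conversion explicit for values that are not whole numbers:

```python
def describe_log2fc(log2fc):
    """Translate a log2 fold change into a plain-language fold change."""
    if log2fc == 0:
        return "no difference"
    fold = 2 ** abs(log2fc)
    direction = "increase" if log2fc > 0 else "decrease"
    return f"{fold:g} fold {direction}"

for value in (-3, -1, 0, 1, 1.5, 3):
    print(value, "->", describe_log2fc(value))
```

For a column named Disease/Control, an "increase" means the feature is higher in Disease than in Control.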

When looking at the results, you can sort by PValue. However, you should always use the FDR-corrected p-value (QValue) to determine significance, e.g., QValue ≤ 0.05.
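To show why the nominal p-value and the QValue can disagree, the sketch below applies the Benjamini-Hochberg procedure, a widely used FDR correction. This page does not state which FDR method produced the QValue column, so treat the choice of Benjamini-Hochberg as an assumption; the p-values are invented for the example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg FDR adjustment; returns values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never decrease as the raw p-values increase.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical nominal p-values for five features
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
qvalues = bh_adjust(pvalues)

for p, q in zip(pvalues, qvalues):
    print(f"PValue={p:<6} QValue={q:.5f} significant={q <= 0.05}")
```

Four of the five nominal p-values are below 0.05, but only two features remain significant after correction.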

Thresholds on log2FC values may also be used, but you should ALWAYS consider the QValue. Filtering solely on log2FC will generally yield a large number of very low-expressed features with large apparent fold-changes but no statistical significance. This occurs when comparing very small counts: a change from 1 count to 10 counts looks like a 10-fold difference, but is very unlikely to be meaningful.
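A combined filter might look like the following sketch. The rows, feature names, and the log2FC cutoff of 1 are hypothetical; the column names mirror the ones described above without the group prefix.

```python
import math

# Hypothetical rows from a _diff tab (values invented for illustration)
rows = [
    # 1 count vs. 10 counts: large fold change, not significant
    {"feature": "geneA", "log2FC": math.log2(10 / 1), "QValue": 0.62},
    {"feature": "geneB", "log2FC": 1.4, "QValue": 0.003},
    # Significant, but below the fold-change cutoff
    {"feature": "geneC", "log2FC": 0.3, "QValue": 0.01},
]

def filter_hits(rows, max_q=0.05, min_abs_log2fc=1.0):
    """Keep features that pass BOTH the QValue and the log2FC threshold."""
    return [
        r for r in rows
        if r["QValue"] <= max_q and abs(r["log2FC"]) >= min_abs_log2fc
    ]

hits = filter_hits(rows)
print([r["feature"] for r in hits])
```

Filtering on log2FC alone would have kept geneA, the low-count feature with a large but non-significant fold change.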

The logCPM values are included as a quick reference for mean abundance: smaller values indicate lower abundance. Note that the sign of the logCPM is not meaningful in itself (negative values arise with log-scaling a value less than 1), but the general magnitude is.

Interaction terms

The results from a generalized linear model may include interaction terms between pairs of factors. An interaction term captures cases where the effect of one factor depends on the level of another factor. For example, the effect of treatment may depend on genotype (treatment causes up-regulation of a gene in WT, but down-regulation in mutant).

Independent vs. interacting factors

Expression increases with treatment and in the mutant, and the effects are additive. The effects of treatment and genotype are independent of each other.

Expression increases with treatment for WT, but decreases for Mut. To know the effect of treatment, one must also know the genotype.
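The two scenarios can be expressed numerically as a "difference of differences". The sketch below uses made-up group means on a log2 scale and is only meant to convey the idea; a generalized linear model estimates the interaction from all samples at once rather than from four group means.

```python
def treatment_effect(control, treated):
    """Change in (log2) expression caused by treatment within one genotype."""
    return treated - control

def interaction(wt_ctrl, wt_trt, mut_ctrl, mut_trt):
    """How much the treatment effect in Mut differs from the effect in WT."""
    return treatment_effect(mut_ctrl, mut_trt) - treatment_effect(wt_ctrl, wt_trt)

# Hypothetical mean log2 expression per group
additive = dict(wt_ctrl=5, wt_trt=6, mut_ctrl=7, mut_trt=8)
interacting = dict(wt_ctrl=5, wt_trt=6, mut_ctrl=7, mut_trt=6)

print("additive:", interaction(**additive))
print("interacting:", interaction(**interacting))
```

In the additive case the treatment effect is the same in both genotypes, so the interaction is 0. In the interacting case treatment raises expression in WT and lowers it in Mut, giving a non-zero interaction.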
