Week 6: edgeR - bcb420-2025/Izumi_Ando GitHub Wiki

edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets

This paper introduces edgeR v4 and explains how it's been improved to better handle small counts, large datasets, and complex experimental designs in sequencing analysis.

Citation

Chen, Y., Chen, L., Lun, A. T. L., Baldoni, P. L., & Smyth, G. K. (2024). edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. bioRxiv. https://doi.org/10.1101/2024.01.21.576131

Notes

A bit of background on edgeR

  • Has been around since 2008—popular for RNA-seq and ChIP-seq differential analysis.
  • Models read counts using negative binomial GLMs, with empirical Bayes moderation for small sample sizes.
  • One of the first tools to bring rigorous stats to low-replicate NGS data.
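
To make the empirical Bayes idea concrete, here is a minimal sketch of limma-style variance moderation: each gene's own estimate is pulled toward a shared prior value, weighted by degrees of freedom. The function name and the prior values are assumptions chosen by hand for illustration; real packages estimate the prior from all genes together.

```python
def moderate_variance(s2, df, prior_s2, prior_df):
    """Shrink a per-gene variance estimate toward a prior value.

    The result is a weighted average of the gene's own estimate (weight df)
    and the prior (weight prior_df).
    """
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)

# Hypothetical per-gene variance estimates, each from only 2 residual df.
gene_variances = [0.2, 1.0, 5.0]
moderated = [moderate_variance(s2, df=2, prior_s2=1.0, prior_df=4)
             for s2 in gene_variances]
print(moderated)
```

Extreme estimates move furthest toward the prior, which is why moderation helps most when there are few replicates per group.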

edgeR v4: what's new?

  • Fractional counts support: No need to round anymore—good news for transcript-level analyses or probabilistic assignments.
  • C++ backend: Much faster, handles big datasets better.
  • Better handling of small counts: Major rewrite of the quasi-likelihood (QL) pipeline for improved accuracy and speed.
  • Also added features for differential methylation, transcript/exon usage, fold-change threshold testing, and pathway analysis.

Statistical improvements

  • New continuous version of the NB distribution means smoother modeling when counts are tiny.
  • Uses divided counts to strip out technical variation (σ²), leaving just biological variation (ψ)—cool trick to boost interpretability.
  • Enhanced empirical Bayes strategies:
    • Weighted likelihood for NB dispersion
    • limma-style moderation for QL dispersion
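
The divided-count idea can be checked with a little arithmetic. The sketch below assumes a simple variance model that is my reading of the notes, not a formula quoted from the paper: technical variation inflates the Poisson part by σ², and biological variation contributes a squared coefficient of variation ψ, so var(y) = σ²μ + ψμ². All function names are hypothetical.

```python
def count_variance(mu, psi, sigma2):
    """Assumed variance of a raw count: sigma2 * mu + psi * mu**2."""
    return sigma2 * mu + psi * mu ** 2

def nb_shaped_variance(mean, dispersion):
    """Negative binomial mean-variance relationship."""
    return mean + dispersion * mean ** 2

mu, psi, sigma2 = 50.0, 0.1, 2.5

# Variance of the divided count z = y / sigma2.
var_divided = count_variance(mu, psi, sigma2) / sigma2 ** 2

# NB-shaped variance with mean mu / sigma2 and the original dispersion psi.
var_nb = nb_shaped_variance(mu / sigma2, psi)

print(var_divided, var_nb)
```

Under this assumed model the two numbers agree, so the divided count behaves like an NB variable whose dispersion is ψ alone. Divided counts are generally not integers, which is one reason a continuous version of the NB distribution is useful.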

Flexibility is a theme

  • You can now assign different dispersions to individual observations, not just features.
  • It supports observation-specific library sizes—useful for non-linear normalization or transcript-length corrections.
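
One way to picture observation-specific library sizes is as a matrix of log offsets with one entry per feature and sample, instead of a single value per sample. The sketch below is an illustration under that assumption; the function name and the use of effective lengths as the per-observation correction are hypothetical, not edgeR's API.

```python
import math

def log_offsets(lib_sizes, eff_lengths):
    """Build a features x samples matrix of offsets log(N_j * L_gj).

    lib_sizes[j]      : library size of sample j
    eff_lengths[g][j] : effective length of feature g in sample j (hypothetical)
    """
    return [
        [math.log(n * length) for n, length in zip(lib_sizes, row)]
        for row in eff_lengths
    ]

lib_sizes = [1.0e6, 2.0e6]
eff_lengths = [[1.5, 1.5],   # same length in both samples
               [2.0, 1.0]]   # length differs between samples
offsets = log_offsets(lib_sizes, eff_lengths)
print(offsets)
```

In a log-link GLM the offset is added to the linear predictor, so the second feature ends up with the same offset in both samples even though the library sizes differ.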

Visualization + real data

  • Examples include TCGA breast cancer data and splicing plots for Foxp1.
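  • Reiterates that gene-length normalization is not needed when testing for differences in expression between conditions, because a gene's length is the same in every sample and cancels out of the comparison. It only matters when comparing absolute expression levels between genes.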
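
The gene-length point is easy to verify numerically. This is a minimal sketch with made-up counts for one gene in two conditions, assuming equal library sizes; the function name is hypothetical.

```python
import math

def log2_fold_change(count_a, count_b, gene_length_kb=None):
    """log2 ratio of condition B over condition A, optionally per kilobase."""
    if gene_length_kb is not None:
        count_a = count_a / gene_length_kb
        count_b = count_b / gene_length_kb
    return math.log2(count_b / count_a)

raw = log2_fold_change(200, 800)
per_kb = log2_fold_change(200, 800, gene_length_kb=3.7)
print(raw, per_kb)
```

Both calls give the same fold change up to floating-point rounding, since the length divides both counts and drops out of the ratio.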

Why this is useful

  • Better FDR control even in small-n scenarios—critical when doing real-world bio work with limited replicates.
  • Very general contrast support (e.g., ANOVA-style tests, group averages, etc.).
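
A contrast is just a vector of weights applied to the fitted coefficients. The sketch below uses hypothetical log2 expression levels for one gene in three groups to show a pairwise comparison and a comparison against a group average; it illustrates the arithmetic only, not edgeR's testing machinery.

```python
def apply_contrast(coefficients, contrast):
    """Return the linear combination sum(contrast[i] * coefficients[i])."""
    if len(coefficients) != len(contrast):
        raise ValueError("contrast must have one weight per coefficient")
    return sum(w * b for w, b in zip(contrast, coefficients))

# Hypothetical fitted log2 expression levels for groups A, B and C.
group_log2_expr = [5.0, 6.0, 8.0]

b_vs_a = apply_contrast(group_log2_expr, [-1, 1, 0])            # B - A
c_vs_avg_ab = apply_contrast(group_log2_expr, [-0.5, -0.5, 1])  # C - (A + B) / 2
print(b_vs_a, c_vs_avg_ab)  # 1.0 2.5
```

An ANOVA-style test corresponds to testing several such contrasts at once.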

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data

A foundational paper that introduces edgeR's core statistical framework for analyzing count-based gene expression data using negative binomial models.

Citation

Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616

Notes

  • not required reading

motivation + model

  • developed for RNA-seq, SAGE, and other digital expression datasets with count-based output
  • models gene counts with the negative binomial distribution to account for both biological and technical variability
  • dispersion parameter is key—reflects overdispersion compared to Poisson
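
The role of the dispersion is easiest to see in the NB mean-variance relationship, var = μ + φμ². The sketch below compares a few dispersion values at a fixed mean; the values are arbitrary examples, and the square root of the dispersion is reported as the biological coefficient of variation (BCV).

```python
import math

def nb_variance(mu, dispersion):
    """NB variance: Poisson part (mu) plus overdispersion (dispersion * mu**2)."""
    return mu + dispersion * mu ** 2

def bcv(dispersion):
    """Biological coefficient of variation implied by an NB dispersion."""
    return math.sqrt(dispersion)

mu = 100.0
for phi in (0.0, 0.01, 0.16):
    print(phi, nb_variance(mu, phi), bcv(phi))
```

With a dispersion of zero the variance equals the mean, which is the Poisson case. Any positive dispersion adds variance that grows with the square of the mean.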

key features of edgeR (at the time)

  • exact test for two-group comparisons based on NB distribution
  • empirical Bayes methods to estimate dispersion across genes, improving stability in small-n datasets
  • generalized linear model (GLM) framework supports complex experimental designs
  • support for normalization via trimmed mean of M-values (TMM)
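
As a rough illustration of the TMM idea, the sketch below computes a scaling factor from a trimmed mean of log ratios (M-values) between a sample and a reference. It is deliberately simplified and is not the published algorithm: it trims on M-values only and averages them without precision weights. The function name and the counts are made up.

```python
import math

def simple_tmm_factor(sample, reference, trim=0.3):
    """Simplified TMM-style scaling factor of `sample` relative to `reference`.

    Simplifications (assumptions of this sketch): only M-values are trimmed,
    and the kept M-values are averaged without precision weights.
    """
    n_s, n_r = sum(sample), sum(reference)
    # M-value: log2 ratio of library-size-scaled counts, for genes seen in both.
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    if not m_values:
        raise ValueError("no genes with nonzero counts in both libraries")
    cut = int(len(m_values) * trim)  # number of genes dropped from each tail
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))

reference = [100, 200, 300, 400, 50]
sample = [100, 200, 300, 400, 5000]  # one gene is massively up-regulated
factor = simple_tmm_factor(sample, reference)
print(factor, sum(sample) * factor)
```

Here the raw library size of the sample is inflated by a single gene. Trimming removes that gene's M-value, and the resulting effective library size (raw size times the factor) comes out close to the reference's.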

strengths

  • handles low-replicate datasets well (especially important for early RNA-seq studies)
  • GLM formulation allows differential expression testing beyond simple case-control
  • implemented in R, integrates with Bioconductor—transparent and scriptable

thoughts

  • very statistical at its core, designed with flexibility in mind
  • one of the first tools that explicitly tackled the overdispersion problem in RNA-seq
  • useful even now for understanding the assumptions behind DE analysis of counts