Paper 6 & 7: DESeq & edgeR - bcb420-2025/Keren_Zhang GitHub Wiki

Paper 6: "Count-based differential expression analysis of RNA sequencing data using R and Bioconductor"

Introduction

  • RNA-seq supports many applications, including expression analysis, alternative splicing, novel transcript discovery, RNA editing, and transcriptomes of non-model organisms.
  • A common first analysis goal is to identify genes whose expression level changes between conditions, using tools such as DESeq and edgeR.

Development of the Protocol

  • Workflow sequence: from raw sequence reads, through feature counting, to differential expression discovery.
  • Quality checks are emphasized at every stage of the process.
  • The statistical methods operate on a table of feature counts, which receives further quality checks before statistical modeling.
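
To make the idea of a feature count table concrete, here is a minimal sketch in Python (the protocol itself works in R). It assumes per-sample count files in the two-column, tab-separated layout that htseq-count writes, with summary rows such as `__no_feature` at the end. The sample names, gene names, and counts are made up for illustration.

```python
import io

# Summary counters that htseq-count appends after the gene rows.
# Newer releases prefix them with "__"; older ones do not.
SPECIAL = {"no_feature", "ambiguous", "too_low_aQual",
           "not_aligned", "alignment_not_unique"}

def read_counts(handle):
    """Parse one gene<TAB>count table into a dict, skipping summary rows."""
    counts = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        gene, value = line.split("\t")
        if gene.lstrip("_") in SPECIAL:
            continue  # summary rows are not features
        counts[gene] = int(value)
    return counts

def build_count_table(per_sample):
    """Combine {sample: {gene: count}} into (genes, samples, rows)."""
    samples = sorted(per_sample)
    genes = sorted(set().union(*(per_sample[s].keys() for s in samples)))
    rows = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows

# Hypothetical toy inputs standing in for two per-sample count files.
raw = {
    "ctrl_1": "geneA\t10\ngeneB\t0\n__no_feature\t5\n",
    "treat_1": "geneA\t25\ngeneB\t3\n__no_feature\t7\n",
}
per_sample = {s: read_counts(io.StringIO(text)) for s, text in raw.items()}
genes, samples, rows = build_count_table(per_sample)

# One simple check before modeling: total counts (library size) per sample.
library_sizes = [sum(column) for column in zip(*rows)]
```

Each row of `rows` is one gene and each column one sample, which is the shape that count-based tools expect as input.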

Procedure Steps

  1. Assess Sequence Quality Control
    • Use the ShortRead package to evaluate sequence quality.
    • Generate a quality assessment report in HTML format for review.
  2. Collect Metadata of Experimental Design
    • Create a metadata table named samples that includes sample identifiers, experimental conditions, blocking factors, and file names.
  3. Map Reads to Reference Genome
    • Use tophat2 to map reads to the reference genome. Supply the annotation as a GTF file to assist mapping across exon-exon junctions.
  4. Organize, Sort and Index BAM Files
    • Sort and index the BAM files using samtools to prepare them for downstream tools such as htseq-count.
  5. Count Reads Using htseq-count
    • Use htseq-count to assign reads to genes based on the alignment and annotation data, then integrate the read counts into the metadata table.
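
The mapping, sorting, and counting steps above can be sketched as a script that builds one set of shell commands per row of the metadata table. This is an illustration in Python, not the protocol's own R code. The file names, the index prefix, and the sample sheet are hypothetical. The commands use the current samtools syntax (`sort -o`), which differs from the syntax in use when the protocol was published. The commands are only assembled as strings here, not executed.

```python
import shlex

# Hypothetical sample sheet; identifiers, conditions and file names are made up.
samples = [
    {"sample": "ctrl_1", "condition": "control", "block": "batch1", "fastq": "ctrl_1.fastq"},
    {"sample": "treat_1", "condition": "treated", "block": "batch1", "fastq": "treat_1.fastq"},
]

GTF = "genes.gtf"        # assumed annotation file name
INDEX = "genome_index"   # assumed Bowtie index prefix

def commands_for(row):
    """Return the shell commands for one sample, in order, as strings."""
    out_dir = row["sample"] + "_tophat"
    bam = out_dir + "/accepted_hits.bam"   # tophat2 writes its alignments here
    by_pos = row["sample"] + "_sorted.bam"
    by_name = row["sample"] + "_byname.bam"
    count_file = row["sample"] + ".count"
    steps = [
        # Map reads; -G supplies the GTF annotation, -o the output directory.
        ["tophat2", "-G", GTF, "-o", out_dir, INDEX, row["fastq"]],
        # Position-sort and index, e.g. for viewing in a genome browser.
        ["samtools", "sort", "-o", by_pos, bam],
        ["samtools", "index", by_pos],
        # Name-sort for counting (matters for paired-end data).
        ["samtools", "sort", "-n", "-o", by_name, bam],
        # Count reads per gene; -f bam reads BAM, -r name matches the sort order.
        ["htseq-count", "-f", "bam", "-r", "name", by_name, GTF],
    ]
    cmds = [" ".join(shlex.quote(arg) for arg in step) for step in steps]
    cmds[-1] += " > " + shlex.quote(count_file)  # htseq-count prints to stdout
    row["countfile"] = count_file  # record the count file in the metadata table
    return cmds

all_commands = [cmd for row in samples for cmd in commands_for(row)]
```

Keeping the commands in a list makes it easy to print them for review, save them alongside the results, or hand them to a job scheduler.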

Software Implementation

  • Analysis largely conducted within R and Bioconductor for ease of maintenance, training, and portability.
  • Discusses the integration of Unix commands within the R environment to streamline processes.

Reproducible Research

  • Emphasizes the importance of recording all commands and software versions used in the analysis to ensure reproducibility.
  • Recommends using tools like Sweave or knitr for creating executable documents that combine code and narrative.
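
As a small illustration of recording commands and versions, the sketch below writes both into a single JSON record. It is not Sweave or knitr, just a plain log in Python. The function name `session_record` is hypothetical, and the tool version is left as a placeholder to be filled in from the tool itself.

```python
import json
import platform
import sys

def session_record(commands, tool_versions):
    """Bundle the commands run and the versions used into one record."""
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "tools": dict(tool_versions),   # external tool name -> version string
        "commands": list(commands),     # in the order they were run
    }

# Hypothetical inputs; the version string is deliberately a placeholder.
record = session_record(
    commands=["samtools index sample.bam"],
    tool_versions={"samtools": "<fill in from `samtools --version`>"},
)
log_text = json.dumps(record, indent=2)
```

Saving `log_text` next to the results gives a later reader enough information to rerun the same steps with the same software.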

Equipment Setup

  • Describes how to install the required software and download the example data.
  • Gives specific instructions for preparing the computational environment.

Conclusion

  • The protocol provides a foundation for RNA-seq data analysis, emphasizing reproducibility and the adaptability of the workflow to specific project needs.

Paper 7: edgeR 4.0 Overview

  • Title: edgeR 4.0: Powerful Differential Analysis of Sequencing Data with Expanded Functionality and Improved Support for Small Counts and Larger Datasets.
  • Authors: Yunshun Chen, Lizhong Chen, Aaron T. L. Lun, Pedro L. Baldoni, Gordon K. Smyth.
  • Affiliations: Includes the Bioinformatics Division at WEHI, Parkville, VIC, Australia, and Computational Sciences at Genentech Inc., USA.
  • Correspondence: Gordon K. Smyth.

Abstract

  • edgeR is an R/Bioconductor software package designed for differential analysis of sequencing data using read counts.
  • Over 15 years of use, edgeR has evolved significantly. It models replicated read counts with the negative binomial distribution and uses generalized linear models to handle complex experimental designs.
  • The new version, edgeR 4.0, introduces infrastructure improvements like support for fractional counts, C++ model fitting, and new functionalities for a variety of analyses including methylation, transcript expression, and more.
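
To show why the negative binomial distribution suits replicated counts, the sketch below evaluates its standard mean-variance relationship, variance = mu + dispersion × mu². With a dispersion of zero this reduces to the Poisson case, where the variance equals the mean. The numbers are illustrative only and do not come from the paper.

```python
import math

def nb_variance(mu, dispersion):
    """Variance of a negative binomial count with mean mu."""
    return mu + dispersion * mu ** 2

def bcv(dispersion):
    """Biological coefficient of variation: the square root of the dispersion."""
    return math.sqrt(dispersion)

mu = 100.0
poisson_var = nb_variance(mu, 0.0)    # no extra-Poisson variability
nb_var = nb_variance(mu, 0.04)        # adds a term that grows with mu squared
variation = bcv(0.04)                 # about 0.2, i.e. roughly 20% variation
```

The quadratic term is what lets the model absorb variability between biological replicates that a Poisson model would underestimate.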

Introduction

  • Next-generation sequencing (NGS) technologies like RNA-seq and ChIP-seq have revolutionized biomedical research, with edgeR providing robust analytical methods.
  • edgeR 4.0 adapts to current technological advancements and user feedback, improving functionality especially for complex data types and large datasets.

Key Developments

  • Support for Fractional Counts: Allows for more precise handling of data, avoiding the need to round fractional counts.
  • Model Fitting in C++: Enhances computational efficiency, particularly beneficial for large datasets.
  • Statistical Enhancements: Improved accuracy in the quasi-likelihood pipeline for small counts and integration of empirical Bayes moderation methods.
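
The general idea behind empirical Bayes moderation can be sketched as a weighted average that pulls each gene's variance estimate toward a shared prior value. This is the generic limma-style form, shown for intuition only. It does not reproduce edgeR 4.0's own procedure. The prior value and prior degrees of freedom are assumed to be given here, whereas in practice they are estimated from all genes.

```python
def moderate_variance(s2, df, prior_s2, prior_df):
    """Shrink a per-gene variance estimate toward a shared prior value.

    The result is a weighted average, with the degrees of freedom as weights.
    """
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)

# Hypothetical numbers: three genes, each with 2 residual degrees of freedom,
# and a prior worth 4 degrees of freedom centred on 1.0.
gene_s2 = [0.5, 1.0, 4.0]
moderated = [
    moderate_variance(s2, df=2, prior_s2=1.0, prior_df=4)
    for s2 in gene_s2
]
```

Estimates below the prior are pulled up and estimates above it are pulled down, which stabilizes inference when each gene has only a few replicates.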

New Functionalities

  • Differential Methylation Analysis: edgeR's scope now extends to the analysis of differential methylation patterns.
  • Transcript and Exon Usage: New tools for analyzing differential usage at the transcript and exon levels.
  • Pathway Analysis: Incorporation of tools to examine gene sets and pathways affected by experimental conditions.

Conclusion

  • The updates in edgeR 4.0 address both foundational statistical methods and expand the scope of applicable analyses.
  • The enhancements in computational infrastructure ensure edgeR remains a top choice for researchers needing detailed and accurate analysis of genomic data.

References

  • Anders, S., McCarthy, D.J., Chen, Y., Okoniewski, M., Smyth, G.K., Huber, W., & Robinson, M.D. (2013). Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols, 8(9), 1765-1786. DOI: 10.1038/nprot.2013.099.
  • Chen, Y., Chen, L., Lun, A. T. L., Baldoni, P. L., & Smyth, G. K. (2024). edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. bioRxiv. https://doi.org/10.1101/2024.01.21.576131