Paper 6 & 7: DESeq & edgeR
Paper 6: "Count-based differential expression analysis of RNA sequencing data using R and Bioconductor"
- RNA-seq supports many applications, including expression analysis, alternative splicing, novel transcript discovery, RNA editing, and transcriptome profiling of non-model organisms.
- Initial analysis goal: identify genes whose expression levels change between conditions, using tools such as DESeq and edgeR.
- Workflow steps: from raw sequence reads, through feature counting, to differential expression discovery.
- Emphasis on quality checks throughout the process.
- The statistical methods operate on a table of feature counts, which undergoes further quality checks before statistical modeling.
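
To make the idea of a feature count table concrete, here is a minimal Python sketch (the protocol itself works in R). The gene names, sample names, and counts are invented for illustration, and the all-zero filter is only one simple example of a pre-modeling check, not the paper's specific recommendation.

```python
# Toy feature count table: rows are features (genes), columns are samples.
# All names and numbers are made up for illustration.
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
counts = {
    "geneA": [120, 98, 310, 295],
    "geneB": [0, 0, 0, 0],
    "geneC": [15, 22, 9, 14],
}


def library_sizes(table, n_samples):
    """Total reads counted per sample (column sums of the table)."""
    return [sum(row[j] for row in table.values()) for j in range(n_samples)]


def drop_all_zero(table):
    """Remove features that received no reads in any sample."""
    return {gene: row for gene, row in table.items() if any(c > 0 for c in row)}


lib_sizes = library_sizes(counts, len(samples))
filtered = drop_all_zero(counts)

for name, size in zip(samples, lib_sizes):
    print(f"{name}: {size} reads counted")
print("features kept:", sorted(filtered))
```

Very different library sizes between samples, or a large share of features with no reads, are the kind of issue worth catching before any model is fitted.
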
- Assess sequence quality
  - Use the ShortRead package to evaluate sequence quality.
  - Generate a quality assessment report in HTML format for review.
- Collect metadata of the experimental design
  - Create a metadata table named samples, containing sample identifiers, experimental conditions, blocking factors, and file names.
- Map reads to the reference genome
  - Use tophat2 to map reads to the reference genome. Supply the annotation as a GTF file to assist in mapping across exon-exon junctions.
- Organize, sort, and index BAM files
  - Sort and index the BAM files using samtools, preparing them for downstream tools like htseq-count.
- Count reads using htseq-count
  - Use htseq-count to assign reads to genes based on the alignment and annotation data, then integrate the read counts into the metadata table.
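
The steps above can be sketched as a small Python script that builds a metadata table and derives one set of command lines per sample. This is an illustration under assumptions, not the protocol's own code (which drives these tools from R): the sample names, the `batch` column, the file names `genes.gtf` and `genome_index`, and the output naming scheme are all hypothetical. The samtools and htseq-count flags follow current releases and may differ from the versions used in the paper. The commands are only constructed and printed, never executed.

```python
import shlex

# Hypothetical metadata table: one row per sample, with the columns the notes
# describe (identifier, condition, blocking factor, file name).
samples = [
    {"sample": "ctrl_1", "condition": "control", "batch": "A", "fastq": "ctrl_1.fastq"},
    {"sample": "treat_1", "condition": "treated", "batch": "A", "fastq": "treat_1.fastq"},
]


def pipeline_commands(row, gtf="genes.gtf", index="genome_index"):
    """Return the per-sample commands as argument lists (nothing is run)."""
    s = row["sample"]
    out_dir = f"{s}_tophat"
    bam = f"{out_dir}/accepted_hits.bam"  # tophat2 writes its alignments here
    return [
        # Map reads; -G supplies the GTF annotation to help with junctions.
        ["tophat2", "-G", gtf, "-o", out_dir, index, row["fastq"]],
        # Name-sorted copy, used here as input for htseq-count.
        ["samtools", "sort", "-n", "-o", f"{s}_byname.bam", bam],
        # Coordinate-sorted copy, which is the form that can be indexed.
        ["samtools", "sort", "-o", f"{s}_sorted.bam", bam],
        ["samtools", "index", f"{s}_sorted.bam"],
        # htseq-count prints counts to stdout; redirect to a file when running.
        ["htseq-count", "-f", "bam", "-r", "name", "-s", "no", f"{s}_byname.bam", gtf],
    ]


for row in samples:
    for cmd in pipeline_commands(row):
        print(shlex.join(cmd))
```

Keeping the commands as data derived from the metadata table means every sample is processed identically, and the printed lines double as a record of what was run.
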
- Analysis largely conducted within R and Bioconductor for ease of maintenance, training, and portability.
- Discusses the integration of Unix commands within the R environment to streamline processes.
- Emphasizes the importance of recording all commands and software versions used in the analysis to ensure reproducibility.
- Recommends using tools like Sweave or knitr to create executable documents that combine code and narrative.
- Provides specific instructions for installing the necessary software, preparing the computational environment, and downloading example data.
- The protocol provides a foundation for RNA-seq data analysis, emphasizing reproducibility and the adaptability of the workflow to specific project needs.
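
The advice on recording commands and software versions can be illustrated with a short Python sketch. In R, the usual way to capture this information is `sessionInfo()`. The function name `session_record`, the record layout, and the example command are assumptions made for illustration, and the tool version string is a placeholder to be filled in from the tool itself.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def session_record(commands, tool_versions):
    """Bundle the commands run with interpreter, OS, and tool versions."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "tool_versions": dict(tool_versions),
        "commands": list(commands),
    }


record = session_record(
    commands=["samtools index ctrl_1_sorted.bam"],
    # Placeholder: replace with the version reported by the tool itself.
    tool_versions={"samtools": "<output of samtools --version>"},
)
print(json.dumps(record, indent=2))
```

Saving such a record next to the results makes it possible to rerun the analysis later with the same commands and software versions.
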
Paper 7: "edgeR 4.0: Powerful Differential Analysis of Sequencing Data with Expanded Functionality and Improved Support for Small Counts and Larger Datasets"
- Authors: Yunshun Chen, Lizhong Chen, Aaron T. L. Lun, Pedro L. Baldoni, Gordon K. Smyth.
- Affiliations: Includes the Bioinformatics Division at WEHI, Parkville, VIC, Australia, and Computational Sciences at Genentech Inc., USA.
- Correspondence: Gordon K. Smyth.
- edgeR is an R/Bioconductor software package designed for differential analysis of sequencing data using read counts.
- Over the past 15 years, edgeR has evolved significantly. It models replicated read counts with the negative binomial distribution and uses generalized linear models to handle complex experimental designs.
- The new version, edgeR 4.0, introduces infrastructure improvements such as support for fractional counts and model fitting in C++, along with new functionality for analyses including differential methylation and transcript-level expression.
- NGS technologies like RNA-seq and ChIP-seq have revolutionized biomedical research, with edgeR providing robust analytical methods.
- edgeR 4.0 adapts to current technological advancements and user feedback, improving functionality especially for complex data types and large datasets.
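
To show why the negative binomial distribution matters for replicated count data, the sketch below compares its variance with the Poisson variance. It uses the standard parameterization in which the variance is `mu + dispersion * mu**2`. The dispersion value and the means are hypothetical, chosen only to make the comparison visible, and this is plain Python arithmetic rather than edgeR code.

```python
def nb_variance(mu, dispersion):
    """Negative binomial variance for mean mu: mu + dispersion * mu**2."""
    return mu + dispersion * mu ** 2


dispersion = 0.04  # hypothetical value, not estimated from data

for mu in (10.0, 100.0, 1000.0):
    poisson_var = mu  # Poisson variance equals the mean
    nb_var = nb_variance(mu, dispersion)
    print(f"mean={mu:7.1f}  Poisson var={poisson_var:8.1f}  NB var={nb_var:9.1f}")
```

The extra quadratic term grows with the mean, so highly expressed genes show far more variability between replicates than a Poisson model would allow. With a dispersion of zero, the negative binomial variance reduces to the Poisson variance.
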
- Support for Fractional Counts: Allows for more precise handling of data, avoiding the need to round fractional counts.
- Model Fitting in C++: Enhances computational efficiency, particularly beneficial for large datasets.
- Statistical Enhancements: Improved accuracy in the quasi-likelihood pipeline for small counts and integration of empirical Bayes moderation methods.
- Differential Methylation Analysis: Adds support for detecting differences in methylation patterns between conditions.
- Transcript and Exon Usage: New tools for analyzing differential usage at the transcript and exon levels.
- Pathway Analysis: Incorporation of tools to examine gene sets and pathways affected by experimental conditions.
- The updates in edgeR 4.0 address both foundational statistical methods and expand the scope of applicable analyses.
- The improved computational infrastructure keeps edgeR practical for researchers who need detailed and accurate analysis of large genomic datasets.
- Anders, S., McCarthy, D.J., Chen, Y., Okoniewski, M., Smyth, G.K., Huber, W., & Robinson, M.D. (2013). Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols, 8(9), 1765-1786. DOI: 10.1038/nprot.2013.099.
- Chen, Y., Chen, L., Lun, A. T. L., Baldoni, P. L., & Smyth, G. K. (2024). edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. bioRxiv. https://doi.org/10.1101/2024.01.21.576131