Why not support GRCh38? - ChromatinCloud/SeqForge GitHub Wiki
Date: June 6, 2025
1. Context and Problem Statement
Mutational signature analysis, the core function of BaseBuddy, is intrinsically tied to a reference genome. The two dominant human reference builds in modern genomics are GRCh37 (hg19) and GRCh38 (hg38). While GRCh38 is the more current and accurate assembly, a vast and foundational body of cancer genomics research, including the landmark PCAWG (Pan-Cancer Analysis of Whole Genomes) and initial TCGA (The Cancer Genome Atlas) projects, was conducted using GRCh37.
For BaseBuddy to be a useful and reliable scientific tool, a clear and deliberate strategy for handling these different genome builds is required. The primary challenge is to provide a stable, functional tool that is also compatible with the most relevant datasets, without compromising scientific integrity.
2. Core Decision
For its initial stable release, BaseBuddy will exclusively support native GRCh37 analysis. Support for GRCh38 is designated as a high-priority, post-launch major feature enhancement and will be implemented as a separate, native pipeline.
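One way to enforce a GRCh37-only policy at input time is to fingerprint the build from the VCF header before any analysis runs. The sketch below is illustrative only and is not taken from BaseBuddy's code: the function names `detect_build` and `require_grch37` are hypothetical, and it assumes the input VCF carries `##contig` header lines with a `length` field, which not every VCF does. It relies on the fact that chromosome 1 has a different length in each assembly (249,250,621 bp in GRCh37 and 248,956,422 bp in GRCh38).

```python
# Chromosome 1 length for each assembly, used purely as a build fingerprint.
CHR1_LENGTHS = {
    249250621: "GRCh37",
    248956422: "GRCh38",
}


def detect_build(header_lines):
    """Return "GRCh37", "GRCh38", or None if the build cannot be inferred.

    Hypothetical helper: inspects ##contig header lines for chromosome 1.
    """
    for line in header_lines:
        line = line.strip()
        if not line.startswith("##contig=<") or not line.endswith(">"):
            continue
        body = line[len("##contig=<"):-1]
        # Simple key=value parsing; assumes no quoted commas in the fields.
        fields = dict(
            part.split("=", 1) for part in body.split(",") if "=" in part
        )
        length = fields.get("length", "")
        if fields.get("ID") in ("1", "chr1") and length.isdigit():
            return CHR1_LENGTHS.get(int(length))
    return None


def require_grch37(header_lines):
    """Raise if the header looks like GRCh38; otherwise return the detected build."""
    build = detect_build(header_lines)
    if build == "GRCh38":
        raise ValueError(
            "Input appears to be GRCh38; only native GRCh37 analysis is supported."
        )
    return build
```

A check like this fails loudly on GRCh38 input instead of producing results against the wrong reference. When the header has no usable `##contig` line, the sketch returns `None`, and the caller would still need to decide whether to warn or stop.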
3. Rationale for the Decision
We recognize that GRCh37 is a worse build than GRCh38 (e.g. ~1200 gaps versus ~600 at the end of its versioning), which is in turn far worse than the assemblies from the HPRC (Human Pangenome Reference Consortium). Nonetheless, this "GRCh37-first" approach was chosen for three primary reasons: ensuring compatibility, managing project scope, and prioritizing scientific validity above all else.
The COSMIC downloads page (https://cancer.sanger.ac.uk/signatures/downloads/) offers the most "popular" signature files for several versions of COSMIC and several genome builds. For several variation types, though, the only combination available is GRCh37 with COSMIC 3.3, so that is what BaseBuddy supports at the moment.
3.1. Priority: Compatibility with Foundational Datasets
The most immediate priority for BaseBuddy is to be useful to researchers working with existing, large-scale cancer datasets. By aligning with GRCh37, BaseBuddy allows for direct comparison of results with the foundational literature and data from PCAWG and other major consortiums. This compatibility is critical for the tool's adoption and relevance.
3.2. Priority: Managing Scope for a Stable Initial Release
From a software engineering perspective, supporting both genome builds correctly is a significant undertaking. It involves sourcing and managing two distinct sets of signature matrices, implementing conditional logic throughout the application, and creating a user interface for build selection.
By focusing on a single, well-supported build, the development team can concentrate on delivering a robust, thoroughly tested, and stable core product for the initial release.
3.3. Priority: Scientific Integrity and Why Automated Liftover Was Rejected
The most critical factor in this decision was the commitment to scientific accuracy. A "shortcut" to supporting GRCh38 was considered and explicitly rejected.
Rejected Alternative: Automated Liftover
The Idea: An alternative was proposed to accept GRCh38 variant files (e.g., VCFs), automatically convert ("liftover") their coordinates to GRCh37, and then process them through the existing GRCh37 pipeline.
Reason for Rejection: While this would provide a veneer of GRCh38 compatibility, it was deemed scientifically unsound for the following reasons:
Risk of Inaccurate Results: The liftover process is imperfect. A portion of genomic coordinates consistently fails to map between builds. This would lead to the silent dropping of user variant data, which could significantly skew the final signature analysis and produce incorrect results.
Risk of Misleading Users: An automated conversion process hides a crucial layer of data manipulation from the user. They might reasonably but incorrectly assume that the results are a direct, native analysis of their GRCh38 data, when in fact it is an analysis of a compromised, converted file. For a scientific tool, this is an unacceptable risk.
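The effect of silently dropped variants can be illustrated with a toy example. The sketch below does not use a real chain file or a real liftover tool: `TOY_CHAIN`, `lift_position`, `lift_variants`, and `substitution_spectrum` are hypothetical names, the coordinates are made up, and the spectrum is reduced to simple REF>ALT classes rather than the full trinucleotide context used in signature analysis. It only shows the mechanism: variants that fall outside every mappable block vanish, and the relative mutation spectrum shifts as a result.

```python
from collections import Counter

# Hypothetical toy "chain": (chrom, start, end, offset) blocks mapping
# GRCh38 positions onto GRCh37. Positions outside every block cannot be mapped.
TOY_CHAIN = [
    ("chr1", 1_000, 2_000, +50),
    ("chr1", 5_000, 6_000, -120),
]

# Made-up GRCh38 variants as (chrom, pos, ref, alt).
EXAMPLE_VARIANTS = [
    ("chr1", 1_100, "C", "T"),
    ("chr1", 1_500, "C", "T"),
    ("chr1", 3_000, "C", "A"),  # falls in an unmappable region
    ("chr1", 3_500, "C", "A"),  # falls in an unmappable region
    ("chr1", 5_500, "T", "G"),
]


def lift_position(chrom, pos):
    """Return the mapped (chrom, pos), or None if no block covers the position."""
    for block_chrom, start, end, offset in TOY_CHAIN:
        if block_chrom == chrom and start <= pos < end:
            return chrom, pos + offset
    return None


def lift_variants(variants):
    """Split variants into (lifted, dropped) so that losses stay visible."""
    lifted, dropped = [], []
    for chrom, pos, ref, alt in variants:
        target = lift_position(chrom, pos)
        if target is None:
            dropped.append((chrom, pos, ref, alt))
        else:
            lifted.append((target[0], target[1], ref, alt))
    return lifted, dropped


def substitution_spectrum(variants):
    """Fraction of variants in each REF>ALT class."""
    counts = Counter(f"{ref}>{alt}" for _, _, ref, alt in variants)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


before = substitution_spectrum(EXAMPLE_VARIANTS)
lifted, dropped = lift_variants(EXAMPLE_VARIANTS)
after = substitution_spectrum(lifted)
```

In this toy input, two of the five variants are dropped, and the C>A class, which made up 40% of the original spectrum, disappears entirely after conversion. A pipeline that discarded `dropped` without reporting it would give the user no indication that this had happened.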
4. The Path Forward: A Roadmap for Native GRCh38 Support
The decision to omit GRCh38 from the initial release is not a permanent exclusion but a strategic sequencing of work. The implementation of native GRCh38 support is the next major planned feature, to be executed as follows:
Acquire GRCh38-Specific Data: Source and bundle the official COSMIC/SigProfiler signature matrices that are based on the GRCh38 reference genome.
Implement Build-Selection Logic: Update the CLI and GUI to allow users to explicitly specify the genome build of their input files.
Develop a Parallel Analysis Pipeline: Enhance the backend to route GRCh38 inputs to the corresponding GRCh38 signature matrices, keeping the analysis entirely separate from the GRCh37 pipeline to ensure there is no data cross-contamination.
Expand Test Coverage: Update all smoke and integration tests to include dedicated test cases for the complete GRCh38 workflow.
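The build-selection and routing steps above could be sketched roughly as follows. This is an assumption-laden illustration rather than BaseBuddy's actual interface: the `--genome-build` flag, the program name `basebuddy-sketch`, the file paths, and the function names are all hypothetical. The key design point is that each build maps to its own signature matrix through an explicit lookup, and a build without a registered matrix fails with a clear error instead of falling back to another build.

```python
import argparse
from pathlib import Path

# Hypothetical bundle locations; one entry per natively supported build.
SIGNATURE_MATRICES = {
    "GRCh37": Path("data/signatures/GRCh37/sbs_signatures.tsv"),
    # "GRCh38": Path("data/signatures/GRCh38/sbs_signatures.tsv"),  # planned
}


def build_parser():
    """CLI sketch: the user must state the genome build explicitly."""
    parser = argparse.ArgumentParser(prog="basebuddy-sketch")
    parser.add_argument("vcf", type=Path, help="input variant file")
    parser.add_argument(
        "--genome-build",
        required=True,
        choices=["GRCh37", "GRCh38"],
        help="reference build the input was called against",
    )
    return parser


def resolve_signature_matrix(build):
    """Route a build to its own matrix; never substitute another build's data."""
    try:
        return SIGNATURE_MATRICES[build]
    except KeyError:
        raise NotImplementedError(
            f"Native {build} analysis is not available yet."
        ) from None


def main(argv=None):
    args = build_parser().parse_args(argv)
    matrix = resolve_signature_matrix(args.genome_build)
    return args.vcf, args.genome_build, matrix
```

With this layout, adding GRCh38 support amounts to registering a second matrix and adding test cases that run `main` with `--genome-build GRCh38`. Until then, a GRCh38 request raises `NotImplementedError` rather than being routed through GRCh37 data.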