ATAC-seq Analysis Recipe
The ATAC-seq Analysis Recipe takes you from publicly available raw uBAM files derived from a bulk ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) experiment to enrichment analysis results of differentially accessible genomic regions, while providing genome browser tracks for quality control and comprehensive reporting on processing and unsupervised analyses. Specifically, it walks you through the workflow described in `workflow/rules/CorcesATAC.smk`.
Note
Before you begin, please review our general guide on How to use Recipes and switch your browser to light mode; otherwise, you might have trouble reading some of the figures.
Tip
Explore the full interactive report:
This wiki page highlights key analyses, decisions, and results. For more results, explore the complete interactive Snakemake report, where all results specific to this recipe are prefixed with `CorcesATAC_*`.
In this tutorial, we want to identify the genomic regions and biological pathways that characterize 13 distinct cell types within the human hematopoietic system. We will re-analyze a published ATAC-seq dataset from Corces et al. (2016), which we call `CorcesATAC`, to demonstrate a reproducible and modular analysis workflow. Below we provide a complete guide for analyzing this dataset, starting from retrieving raw sequence files and culminating in the identification of cell-type-specific genomic regions, enrichments, and the computational reconstruction of the hematopoietic lineage.
Important
Why: Identifying genomic regions that characterize distinct cell types or biological states is a cornerstone of modern molecular biology, specifically of epigenetics, the study of gene regulation. This process is fundamental for understanding complex systems like development, immunity, and disease. Furthermore, the resulting regions can serve basic research, or translational research for diagnostics, therapeutic targeting, or engineering new cellular functions.
At the end of this tutorial, you will have a clear understanding of the scientific goals and the biological context of ATAC-seq analysis.
To achieve our goal, we combine a series of interconnected MrBiomics Modules into a Recipe. Each module performs a specific task, creating a powerful and flexible analysis pipeline. The overall strategy is to first process the raw data, perform quality control, identify differentially accessible genomic regions (DARs) for each cell type, interpret these region lists biologically, and finally reconstruct the hematopoietic lineage computationally using a custom rule. The recipe is structured as follows (*):
```mermaid
flowchart LR;
fetch_ngs-->atacseq_pipeline;
atacseq_pipeline-->genome_tracks;
atacseq_pipeline-->spilterlize_integrate;
spilterlize_integrate-->unsupervised_analysis;
spilterlize_integrate-->dea_limma;
dea_limma-->enrichment_analysis;
spilterlize_integrate-->crossprediction_analysis;
```
We use the following MrBiomics modules and one custom script:
- Fetch NGS: Downloads public bulk ATAC-seq data.
- ATAC-seq pipeline: Quantifies accessible genomic regions and determines a set of consensus regions to produce count matrices and annotations.
- Genome Browser Track Visualization: Creates genome tracks for visual quality control and analysis of genomic regions of interest.
- Split, Filter, Normalize & Integrate: Filters, normalizes, and integrates data for downstream analysis.
- Unsupervised Analysis: Explores data structure and sample similarity using PCA and UMAP.
- Differential Analysis with limma: Identifies statistically significant chromatin accessibility differences between cell types using linear modeling.
- Enrichment Analysis: Provides biomedical interpretation of differential analysis results using prior knowledge.
- Crossprediction analysis: A custom script to computationally reconstruct the human hematopoietic lineage.
Compared to our RNA-seq Analysis Recipe, we use two extra custom rules in `CorcesATAC.smk`. They facilitate the analysis by (i) fetching required resources from Zenodo (`CorcesATAC_get_resources`) for the ATAC-seq pipeline and (ii) converting lists of differentially accessible consensus regions to the `BED` file format (`CorcesATAC_convert_features2bed`) required for enrichment analysis. We will highlight them at the corresponding steps.
Templates of corresponding Methods sections for a scientific publication can be found in each module's `README`.
Note
Checkpoint: You have reviewed the function of each module by visiting their respective GitHub pages. You should have a conceptual map of how raw data is transformed into biological insight.
(*) The Recipe's real rulegraph visualizes all steps and their dependencies. While its scale may seem daunting, it illustrates the complexity and sheer number of analytical steps that this Recipe fully automates. Treat it as a navigable blueprint: zoom in to trace the flow of data and dependencies for a detailed, step-by-step understanding of the entire workflow.
A key advantage of this recipe is that it requires no custom coding. The entire analysis is orchestrated through Snakemake using our pre-existing modules and one custom script, with all parameters controlled by simple configuration files. This approach ensures maximum reproducibility and accessibility for users with varying levels of expertise.
Important
Prerequisites: The total disk space required is approximately 745GB, and the workflow takes ~5 hours to run on a high-performance computing (HPC) cluster. Experience in Snakemake workflow usage is highly recommended.
The main components are:
- Main configuration (`config/config.yaml`): The central file that points to all other analysis configurations in the section `CorcesATAC`.
- Dataset/analysis-specific files (`config/CorcesATAC/`): A dedicated directory containing all annotation and configuration files for the `CorcesATAC` analysis. This modular organization makes it easy to adapt the recipe for other datasets.
- Workflow logic (`workflow/rules/CorcesATAC.smk`): The dedicated Snakemake workflow file that defines the rules and connects the modules.
- Custom script (`workflow/scripts/crossprediction.py`): A Python script for lineage reconstruction using crossprediction.
Tip
Why: Separating the code (modules) from the configuration (YAML/CSV files) is a core principle of reproducible research (separation of concerns). It allows anyone to re-run the exact analysis by simply using the same configuration files, and it allows others to easily apply the same workflow to new data by creating a new set of configuration files.
Note
Checkpoint: You have located all the configuration, annotation and Snakemake files mentioned above within the project structure.
This section details each stage of the analysis, explaining the methods, key decisions, and results. We are re-analyzing bulk ATAC-seq data from Corces et al., "Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution" (Nature Genetics, 2016), containing 47 healthy donor samples across 13 cell types of human hematopoiesis.
Figure 1 from Corces et al. introducing the dataset and underlying biology
Note
Checkpoint: You have looked at the original Corces et al. paper and understand the data's context and origin.
4.1 • Data Acquisition with fetch_ngs
Configuration: config/CorcesATAC/CorcesATAC_fetch_ngs_config.yaml
Tip
How to execute each step: Each step of this recipe is performed by a dedicated MrBiomics Module. For detailed instructions on its features and usage, please refer to the module's own documentation page, which is always linked in the section's headline.
We begin by downloading the raw sequencing data for the `CorcesATAC` project from the Gene Expression Omnibus (GEO) under accession GSE74912, the SuperSeries for the ATAC-seq profiles. We first fetch the metadata for all samples in the SuperSeries (using the `metadata_only` feature) but selectively download only the 77 runs corresponding to 47 healthy donor samples, excluding leukemia samples (i.e., configuring the selected samples based on the downloaded metadata). The data is downloaded and converted to unaligned BAM (uBAM) files.
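The run selection itself is configured in the fetch_ngs YAML, but conceptually it is a simple metadata filter. Below is a minimal pandas sketch of the idea, assuming hypothetical column names (`source_name`, `run`) that would need to be matched against the actual `metadata.csv` schema:

```python
import pandas as pd

# Metadata fetched for the GSE74912 SuperSeries (path used in this recipe).
metadata = pd.read_csv("results/CorcesATAC/fetch_ngs/metadata.csv")

# Keep only healthy-donor runs by excluding leukemia samples.
# NOTE: "source_name" and "run" are assumed column names for illustration.
healthy = metadata[~metadata["source_name"].str.contains("leukemia", case=False, na=False)]
print(f"selected {healthy['run'].nunique()} of {metadata['run'].nunique()} runs")
```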
Note
Checkpoint: The `results/CorcesATAC/fetch_ngs/` directory contains a `metadata.csv` and 47 subdirectories, each with a `*.bam` file.
4.2 • Alignment, Consensus Region Set Generation and Quantification with atacseq_pipeline
Configuration: config/CorcesATAC/CorcesATAC_atacseq_pipeline_config.yaml
Annotation: config/CorcesATAC/CorcesATAC_atacseq_pipeline_annotation.csv
Before we start, we download all required resources from Zenodo using a custom rule called `CorcesATAC_get_resources`. Using the generated metadata file and the 77 uBAM files of the 47 selected samples, we quantify chromatin accessibility. The `atacseq_pipeline` module aligns the reads to the human genome (GRCh38) with the Bowtie2 aligner, calls peaks with MACS2, and generates a set of consensus regions, thereby creating a data-specific feature space that captures any potential signal (e.g., across cell types); finally, it quantifies read counts per consensus region. This process generates a count matrix (consensus regions × samples), which is the primary input for all subsequent analyses. It also produces a comprehensive consensus region annotation file (using HOMER, UROPA, and bedtools) and a sample annotation file enriched with quality control (QC) metrics.
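For orientation, the three key outputs can be loaded directly with pandas. This is a sketch, assuming the first column of each CSV holds the region/sample identifiers:

```python
import pandas as pd

base = "results/CorcesATAC/atacseq_pipeline/counts/"
counts = pd.read_csv(base + "consensus_counts.csv", index_col=0)       # regions x samples
regions = pd.read_csv(base + "consensus_annotation.csv", index_col=0)  # per-region annotation
samples = pd.read_csv(base + "sample_annotation.csv", index_col=0)     # per-sample metadata + QC

print(counts.shape)  # consensus regions (rows) x samples (columns)
```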
Important
Differences to RNA-seq
- For all downstream steps we will use the data-driven feature space, i.e., the consensus region set, with the respective count matrix and annotation. This represents the assay's inherent measurement, which many find hard to analyze and interpret.
- In addition to the consensus region count matrix, a promoter and a TSS count matrix are generated. Both are "gene-level" quantifications, which are easier to interpret (for details see the module docs). For their analysis we refer to our RNA-seq Analysis Recipe. During result interpretation, keep in mind how the matrices were generated and that chromatin accessibility was quantified (not gene expression).
- When writing "regions" we always refer to "genomic consensus regions".
- Because the consensus region set (i.e., feature space) is data-driven, we recommend a two-step usage of the `atacseq_pipeline`: first assess and select high-quality samples to be included, and only then continue with the consensus region set generation and all other downstream steps as described here.
During a QC step, sample `NKcell_1022` (SRR2920495) caused an error because all of its reads had a quality score of 0 and were filtered out by `fastp`. Therefore, we removed it from the sample annotation sheet and thereby from our analysis. This leaves a final dataset of 46 high-quality samples for downstream analysis.
```
samtools view results/CorcesATAC/fetch_ngs/SRR2920495/SRR2920495.bam | head -n 5
SRR2920495.1 77 * 0 0 * * 0 0 CATGANTTACTTGAAGCTGCAAGTAGCAAGNCCAGAAATATCTATAACAGTNNNNNNNNNNNNNNNNNNNNANNAG !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! RG:Z:A
SRR2920495.1 141 * 0 0 * * 0 0 NAGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! RG:Z:A
SRR2920495.10 77 * 0 0 * * 0 0 GCACCNGACTAATCTCTGAACTTTTGATTCCCCATGTGTGAAATGGGATAATAATAATACCTGTCTCTTATACACA !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! RG:Z:A
SRR2920495.10 141 * 0 0 * * 0 0 NTATTATTATTNTNCNANNTNNNANANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACA !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! RG:Z:A
SRR2920495.100 77 * 0 0 * * 0 0 CACTGGAAAAGCTCAGTGGTCCAGGAGAATGAGAGCCTGTCACATATACACAACACCGAGCACCCCAGCCTGGCTA !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! RG:Z:A
```
A MultiQC report aggregating QC metrics from all tools is generated, together with a heatmap visualizing those metrics alongside matching metadata for all samples. Given that this is a re-analysis of a published, high-quality dataset, we retain all 46 samples, although some would not pass our standard QC thresholds (e.g., >90% alignment rate, >20M mapped reads). Clustering on QC metrics reveals a small donor-specific batch effect that must be accounted for downstream.
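Such threshold checks are easy to script against the QC-enriched sample annotation. A sketch, assuming hypothetical metric column names (`alignment_rate`, `mapped_reads`) that would need to be mapped to the actual annotation columns:

```python
import pandas as pd

samples = pd.read_csv(
    "results/CorcesATAC/atacseq_pipeline/counts/sample_annotation.csv", index_col=0
)

# Flag samples below our standard thresholds (>90% alignment, >20M mapped reads).
# NOTE: the column names here are assumptions for illustration.
flagged = samples[
    (samples["alignment_rate"] < 0.90) | (samples["mapped_reads"] < 20_000_000)
]
print("would not pass standard QC:", flagged.index.tolist())  # retained here anyway
```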

Hierarchically clustered QC heatmap with matching metadata annotation
Note
Checkpoint: The `results/CorcesATAC/atacseq_pipeline/counts/` directory contains three key files: `consensus_counts.csv`, `sample_annotation.csv`, and `consensus_annotation.csv`. The MultiQC report in `results/CorcesATAC/atacseq_pipeline/report/` shows QC metrics across all 46 samples. The removed sample `NKcell_1022` is absent from the sample annotation file and results.
4.3 • Visual Quality Control with genome_tracks
Configuration: config/CorcesATAC/CorcesATAC_genome_tracks_config.yaml
Annotation: config/CorcesATAC/CorcesATAC_genome_tracks_annotation.csv
For an intuitive quality check, we visualize the aligned data as genome browser tracks. We use the `genome_tracks` module to generate coverage plots for 22 known hematopoietic marker genes (e.g., CD34 for stem cells, MS4A1, aka CD20, for B cells). Samples are grouped by cell type, so all replicates for a given cell type are merged into a single track.
This visualization serves as a critical sanity check. For example, we expect to see high signal (i.e., accessibility) for the MS4A1 gene in the `Bcell` track and low signal elsewhere. Confirming these known patterns gives us confidence in the cell type labels and the overall quality of our data.
CD34 (stem-ness) | MS4A1 (aka CD20; B cells) |
---|---|
Note
Checkpoint: The directory `results/CorcesATAC/genome_tracks/tracks/` contains SVG images for all 22 marker genes. Visual inspection confirms that accessibility patterns align with known hematopoietic biology.
4.4 • Filtering, Normalization and Integration with spilterlize_integrate
Configuration: config/CorcesATAC/CorcesATAC_spilterlize_integrate_config.yaml
Annotation: config/CorcesATAC/CorcesATAC_spilterlize_integrate_annotation.csv
Raw counts from ATAC-seq, similar to RNA-seq, require filtering of lowly accessible regions, normalization to account for technical artifacts like library size or region length/GC content, and integration to address potential batch effects. The `spilterlize_integrate` module is used for these crucial pre-processing steps.
Tip
General analysis strategy for large data sets with complex design
- Generate a consensus region set for all QC'd samples to capture all potential signals.
- Split your data accordingly for each analysis (e.g., cell type A analysis -> subset to cell type A samples).
- Remove lowly accessible regions in each subset to improve statistical power and reduce statistical estimation issues.
- Perform the desired analysis (e.g., differential accessibility analysis) on the processed subset datasets.
-> This approach facilitates integrating common signals across subsets while not losing subset-specific signals.
Region filtering is controlled by multiple parameters, of which `min.count` has the biggest impact. We filter quite harshly to reduce the number of regions from 294,386 to a more manageable 61,732. A key diagnostic, the mean-variance plot, subsequently reveals no upward trend among lowly accessible regions, confirming that the filtering effectively removed noise prior to normalization, a critical prerequisite for reliable linear modeling downstream.
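The `min.count` parameter stems from edgeR's `filterByExpr`, which keeps a region only if its count, rescaled to counts per million (CPM), reaches the cutoff in at least as many samples as the smallest group. Here is a rough Python sketch of that core idea, not the module's actual (R-based) implementation:

```python
import numpy as np

def filter_regions(counts, smallest_group_size, min_count=10):
    """counts: regions x samples array. Keep regions with a CPM equivalent
    of >= min_count in at least `smallest_group_size` samples
    (sketch of edgeR::filterByExpr's main criterion)."""
    lib_sizes = counts.sum(axis=0)                       # per-sample library sizes
    cpm = counts / lib_sizes * 1e6                       # counts per million
    cpm_cutoff = min_count / np.median(lib_sizes) * 1e6  # min_count on the CPM scale
    keep = (cpm >= cpm_cutoff).sum(axis=1) >= smallest_group_size
    return counts[keep]
```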
Our confounding factor analysis (CFA) reveals a significant association between Principal Component 1 (PC1) and multiple technical covariates (e.g., `peaks`) as well as a biological covariate (`donor`). We do not see a strong effect of GC content, but the consensus regions exhibit strongly varying lengths. We apply multiple normalization methods, compare their results, and identify Conditional Quantile Normalization (CQN) as yielding the least confounded results. CQN is specifically designed to correct for length biases, in addition to general normalization (e.g., by library size). CQN often works well for ATAC-seq data, but we recommend trying multiple methods and comparing their results with each other.
Tip
Why CQN? Standard normalization methods like CPM or TMM correct for library size but not for feature-specific biases. CQN addresses this directly, leading to a cleaner signal for biological discovery. Here, our TMM results are comparable, but CQN removes slightly more unwanted effects.
After CQN, we use `limma`'s `removeBatchEffect` function to integrate the data, modeling `cell_type` as the variable of interest and removing the `donor` effect as an unwanted batch covariate and `peaks` as an unwanted numerical covariate. The post-integration CFA confirms that `cell_type` is now the dominant source of variation.
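`removeBatchEffect` fits one joint linear model containing both the design of interest and the unwanted covariates, then subtracts only the covariate component. A minimal NumPy sketch of that logic, for intuition only (the real function is R code with additional safeguards):

```python
import numpy as np

def remove_batch_effect(Y, design, covariates):
    """Y: features x samples; design: samples x k (e.g., cell_type dummies);
    covariates: samples x m (e.g., donor dummies and scaled peak counts).
    Fit Y ~ [design, covariates] jointly, then remove the covariate part."""
    X = np.hstack([design, covariates])
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)  # (k+m) x features coefficients
    beta_cov = beta[design.shape[1]:]               # coefficients of the covariates only
    return Y - (covariates @ beta_cov).T            # data minus the unwanted component
```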
Important
Always inspect diagnostic plots after each step: After every data transformation (splitting, filtering, normalization, or integration), you must inspect the generated diagnostic plots in the report. Pay special attention to the mean-variance plots, PCA plots, and the Confounding Factor Analysis (CFA). This is the only way to visually confirm the impact of each method and to make informed decisions for subsequent analyses.
filtered | normCQN_integrated
---|---
![]() | ![]()

filtered | normCQN_integrated
---|---
![]() | ![]()
Note
Checkpoint: All configured processing step result files in `results/CorcesATAC/spilterlize_integrate/all/` have been generated. The associated diagnostic plots show that sample distributions are well-aligned and that confounding factors, such as the `peaks` or `donor` effects, have been successfully mitigated.
4.5 • Exploratory Data Analysis with unsupervised_analysis
Configuration: config/CorcesATAC/CorcesATAC_unsupervised_analysis_config.yaml
Annotation: config/CorcesATAC/CorcesATAC_unsupervised_analysis_annotation.csv
With a clean, normalized data matrix, we use the `unsupervised_analysis` module to explore its structure. Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) are applied to visualize the relationships between samples.
The resulting plots show that samples cluster cleanly by their annotated cell type, and the PCA plots already clearly show the lineage. We observe that progenitor cells (e.g., HSC, MPP) cluster closely together, while more differentiated cell types (e.g., Erythrocytes, B cells) form distinct, separate groups. This confirms that our processing has preserved the underlying biological structure. Additionally, we can compare our results to the published results shown here. For efficiency in later steps, we also generated, in the previous step, a dataset containing only the top 10% most highly variable features (HVFs), which also sufficiently captures the cell type characteristics and lineage.
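For reference, comparable embeddings can be computed outside the module with scikit-learn and umap-learn. This is a sketch; the module's parameters and the exact file name/layout may differ:

```python
import pandas as pd
from sklearn.decomposition import PCA
from umap import UMAP  # pip install umap-learn

# Assumed layout: regions x samples CSV -> transpose to samples x regions.
data = pd.read_csv(
    "results/CorcesATAC/spilterlize_integrate/all/normCQN_integrated.csv", index_col=0
).T

pcs = PCA(n_components=10).fit_transform(data)                     # linear structure
embedding = UMAP(n_neighbors=15, metric="correlation").fit_transform(data)
```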
normCQN_integrated | normCQN_integrated_HVF
---|---
![]() | ![]()

normCQN_integrated | normCQN_integrated_HVF
---|---
![]() | ![]()
Having confirmed that the CQN-normalized data provides the clearest biological signal, with samples clustering distinctly by cell type, we will use this dataset for the next step: identifying cell-type-specific genomic regions. The previously observed `donor` and `peaks` effects will be explicitly modeled as covariates within the `limma` differential accessibility analysis.
Note
Checkpoint: HTML plots in `results/CorcesATAC/unsupervised_analysis/normCQN_integrated{_HVF}/{PCA|UMAP}/plots/` allow for interactive exploration of the UMAP/PCA embeddings. The clustering of samples by `cell_type` is visually evident.
4.6 • Differential Accessibility Analysis with dea_limma
Configuration: config/CorcesATAC/CorcesATAC_dea_limma_config.yaml
Annotation: config/CorcesATAC/CorcesATAC_dea_limma_annotation.csv
To find cell-type-specific genomic regions, we perform a One-vs-All (OvA) differential accessibility analysis (DAA) using the `dea_limma` module. For each of the 13 cell types, we compare it against the average of all other cell types. This is achieved automatically by the module using `limma`'s contrast functionality based on the linear model `~ 0 + cell_type + donor + peaks`. Because we use the CQN-normalized data, i.e., data from which technical artifacts were already removed, as input to our modeling, we leverage the `limma-trend` workflow (instead of the `limma-voom` workflow for raw counts).
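To make the OvA contrasts concrete: each contrast assigns +1 to the target cell type's coefficient and -1/12 to each of the other twelve (the `donor` and `peaks` coefficients get a weight of 0), i.e., target minus the average of the rest. A small sketch of how such a contrast vector looks, mirroring what `limma`'s contrast machinery builds:

```python
import numpy as np

cell_types = ["Bcell", "CD4Tcell", "CD8Tcell", "CLP", "CMP", "Ery", "GMP",
              "HSC", "LMPP", "MEP", "MPP", "Mono", "NKcell"]

def ova_contrast(target):
    """One-vs-All contrast over the cell_type coefficients:
    +1 for the target, -1/(n-1) for every other cell type."""
    contrast = np.full(len(cell_types), -1.0 / (len(cell_types) - 1))
    contrast[cell_types.index(target)] = 1.0
    return contrast

print(ova_contrast("Bcell"))  # [1.0, -0.0833, -0.0833, ...]
```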
Note
Why do we model normalized counts but investigate integrated data?
You might notice a seeming contradiction: we use the CQN-normalized counts as input for the `limma` model (which explicitly models the `donor` and `peaks` effects), yet we spent the previous sections investigating and visualizing the fully integrated data (where the `donor` and `peaks` effects were explicitly removed).
This is intentional and represents a best practice:
- Linear modeling: Statistical models like `limma` are most powerful when they work with data that is as close to the raw counts as possible, while accounting for covariates (e.g., `donor` or `peaks`) mathematically within the model itself.
- Investigation & visualization: To "see" what the model sees and to confirm that our correction strategy is effective, we must explicitly remove the confounding effects to create an integrated dataset. This is the only way to investigate (with CFA) and visually inspect the data's structure (e.g., with PCA/UMAP) after accounting for all known confounders.
In short, we investigate & visualize the integrated data to validate our approach, but we model the normalized data to achieve the most robust statistical results.
This approach identifies regions that are more open or closed in a specific cell type relative to the rest of the hematopoietic system. The results are visualized as volcano plots per cell type and heatmaps across all cell types, where we can confirm that known marker genes are among the significantly differentially accessible regions (DARs) for their respective cell types. Genomic regions are annotated with their respective gene; hence, duplicate gene labels can exist in the visualizations, referring to distinct consensus regions located within the same gene.

Bar plot of significantly differentially accessible regions split by directionality

Heatmap of marker genes' DAA statistics across cell types
Note
Checkpoint: The directory `results/CorcesATAC/dea_limma/normCQN_OvA_cell_type/feature_lists/` contains 26 files (e.g., `Mono_{up|down}_features.txt`), one for each comparison (i.e., cell type) and direction (up/down), listing all significantly differentially accessible regions that passed the filtering criteria, for downstream enrichment analysis.
4.7 • Biological Interpretation with enrichment_analysis
Configuration: config/CorcesATAC/CorcesATAC_enrichment_analysis_config.yaml
Annotation: config/CorcesATAC/CorcesATAC_enrichment_analysis_annotation.csv
Before we continue, we need to convert our filtered lists of differentially accessible consensus regions to the required `BED` format using a dedicated rule called `CorcesATAC_convert_features2bed`. To characterize and understand the biological functions underlying our cell-type-specific regions, we perform enrichment analysis. Using the genomic region feature lists from the previous step, the `enrichment_analysis` module performs an over-representation analysis (ORA) on the "more open" (i.e., `up`) regions of each cell type using GREAT, a tool to test for functional enrichment of genomic regions. We query the Azimuth and Reactome databases to confirm cell type specificity and find biological processes or pathways that are significantly enriched among the differentially open regions for each cell type.
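The `CorcesATAC_convert_features2bed` rule boils down to parsing coordinates out of the region identifiers. Here is a sketch assuming IDs of the form `chrom:start-end` (check the actual feature lists for the exact format):

```python
def features_to_bed(feature_file, bed_file):
    """Convert a feature list of region IDs like 'chr1:100-500'
    into BED format (chrom<TAB>start<TAB>end), one region per line."""
    with open(feature_file) as fin, open(bed_file, "w") as fout:
        for line in fin:
            region = line.strip()
            if not region:
                continue  # skip empty lines
            chrom, coords = region.split(":")
            start, end = coords.split("-")
            fout.write(f"{chrom}\t{start}\t{end}\n")

features_to_bed("Mono_up_features.txt", "Mono_up.bed")
```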
The results are aggregated and visualized as a bubble-heatmap, showing the top enriched terms for each cell type comparison. This provides a high-level biological interpretation of what differentiates or unites cell types. Interestingly, the enrichment results for differentiated cells are as expected, but the progenitors are enriched in neuronal terms and do not show distinctive patterns. This might have multiple reasons, such as missing (modality- or biology-specific) information in the databases, which is an inherent limitation of enrichment analyses.
Azimuth 2023 | Reactome
---|---
![]() | ![]()

Azimuth 2023 | Reactome
---|---
![]() | ![]()
Note
Checkpoint: Summary plots and tables are available in `results/CorcesATAC/enrichment_analysis/cell_types_up/`. The enrichment results are biologically plausible and align with the known functions of differentiated hematopoietic cell types, but are not as clear for progenitors.
4.8 • Lineage Reconstruction with crossprediction
Custom script: workflow/scripts/crossprediction.py
Finally, to demonstrate the power of adding custom code to a modular workflow, we perform machine learning-based lineage reconstruction. We use a custom Python script (`crossprediction.py`) that trains a multinomial logistic regression classifier with elastic-net regularization for each cell type on the normalized and integrated highly variable feature dataset `normCQN_integrated_HVF`.
The script employs a Leave-One-Group-Out (LOGO) cross-prediction strategy to identify similarities between groups: to predict relationships for one cell type, it is held out, and a model is trained on all other cell types. The prediction probabilities of the held-out samples are aggregated into a similarity matrix. This (adjacency) matrix, when visualized as a graph, computationally reconstructs the human hematopoietic lineage, closely matching the known biological hierarchy.
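A condensed scikit-learn sketch of this LOGO cross-prediction idea (the actual `crossprediction.py` additionally handles configuration, preprocessing, and plotting):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def logo_similarity(X, y):
    """X: samples x features array; y: cell type labels array.
    For each held-out cell type, train a multinomial elastic-net model
    on all other cell types and average the held-out samples'
    predicted class probabilities into a similarity (adjacency) matrix."""
    classes = np.unique(y)
    similarity = pd.DataFrame(0.0, index=classes, columns=classes)
    for held_out in classes:
        train, test = y != held_out, y == held_out
        model = LogisticRegression(
            penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
        ).fit(X[train], y[train])
        probs = model.predict_proba(X[test]).mean(axis=0)  # mean over held-out samples
        similarity.loc[held_out, model.classes_] = probs
    return similarity
```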

Crossprediction graph reconstruction of human hematopoietic lineage by similarity
We additionally leveraged the interpretability of logistic regression to identify the top five most positively predictive features (here, regions annotated with genes) for every class (here, cell type).
Bcell | CD4Tcell | CD8Tcell | CLP | CMP | Ery | GMP | HSC | LMPP | MEP | MPP | Mono | NKcell |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CBLB | CCR4 | CD8A | GARIN6 | LOC401478 | LINC01762 | LINC00607 | GRIN2A | PRDM5 | CNTNAP2 | PRG4 | LINC01370 | TRD-AS1 |
LINC00158 | WDR41 | TPST2 | VAV3 | SEMA4D | TMT1A | KCNK13 | TMEM131L | VSX1 | TYRP1 | PLXNA4 | MIR125B1 | STARD4 |
BHLHE41 | ADAM23 | KLRK1 | OTOL1 | DNAJC1 | DENND2B-AS1 | MIR28 | THAP3P1 | LINC01374 | GALNT17 | KIFAP3 | KCNK13 | LINC02760 |
SNORA70F | NLN | MAGI2-AS3 | BAALC-AS1 | NEDD9 | HECTD4 | LOC401478 | LINC02737 | MIR8054 | LINC02501 | RINL | CCL20 | FAM133B |
LINC01350 | CHRM3 | ROBO1 | AKAP12 | MIR3680-2 | LINC01845 | OSBPL8 | FNDC3A | LOC101927082 | NFKB1 | MIR6507 | MIR4790 | SLC9A9-AS1 |
Note
Checkpoint: The adjacency matrix is saved to `results/CorcesATAC/special_analyses/crossprediction/adjacency_matrix.csv`. When visualized, the resulting graph largely reflects the hierarchy of the human hematopoietic lineage.
Important
Explore the full interactive report:
This wiki page highlights key findings, but many more results are available for exploration. For a deeper dive, please see the complete interactive Snakemake report, where all outputs specific to this recipe are prefixed with `CorcesATAC_*`.
Even with a robust workflow, several issues can arise. Awareness of these potential problems can save significant time and prevent incorrect conclusions.
- Poor Data Quality: As seen with run `SRR2920495`, poor sample quality is not uncommon in biology. Low-quality reads can severely impact results. Always closely investigate QC results and remove outlier samples before proceeding with downstream analysis.
- Resource Management: This analysis requires significant computational resources (~750GB disk space, ~5 hours on an HPC). Ensure adequate resources are available before starting. Running out of disk space mid-workflow can lead to downstream problems.
- Incorrect Normalization or Model: Choosing the wrong normalization method or model can fail to remove technical artifacts or, worse, remove real biological signal. The confounding factor analysis is a critical diagnostic for choosing an appropriate method (like CQN) and for informing the modeling approach.
- Limitation of ML Interpretation: The computational lineage reconstruction demonstrates how easy it is to build custom analyses on top of a modular workflow's results, but its simple interpretability approach has limitations. We identify features (here, genomic regions annotated with genes) that are positive predictors, but will miss features whose importance lies in being absent (i.e., more closed).
Note
Checkpoint: Before starting a new analysis, verify data quality, resource availability, and critically question the rationale of your analysis choices.
This section provides answers to common questions.
- Why was Conditional Quantile Normalization (CQN) used? CQN performed best with regard to the removal of technical covariates, as seen in the confounding factor analysis results. It is specifically designed to correct for feature-level biases such as GC content and feature length, in addition to library size; feature length in particular varies strongly across genomic consensus regions.
- What is the purpose of the One-vs-All (OvA) differential accessibility analysis? The OvA strategy is ideal for finding a signature for each group in a multi-class experiment. By comparing each cell type to the average of all others, we identify the regions that make that cell type distinct within the context of the entire system.
- Why were the `donor` and `peaks` variables included in the `limma` model? The data comes from multiple human donors, which can introduce biological and technical variability (a "batch effect"). Depending on sample quality, a different number of peaks can be called per sample. Both were found to significantly confound the data. Therefore, including `donor` and `peaks` in the model (`~ 0 + cell_type + donor + peaks`) allowed `limma` to estimate and account for their effects, increasing the statistical power to detect differences between cell types.
Important
Feedback, questions, or comments are very welcome, you can reach me via LinkedIn.