Figure 2 - RegnerM2015/scBreast_scRNA_scATAC_2024 GitHub Wiki

scRNA-seq processing

To analyze the Basal-like BC cells from Patients 5 and 6 as well as the luminal progenitor and basal epithelial cells from true normal Patients 1-4, we subsetted these cells from the multi-patient scRNA-seq dataset shown in main Figure 1. The resulting subset was re-processed (normalization, feature selection, dimensionality reduction, etc.) and clustered.

The Slurm and R scripts below were used to perform this subsetting operation followed by re-processing and clustering:

scATAC-seq processing and integration with scRNA-seq

To analyze the Basal-like BC cells from Patients 5 and 6 as well as the luminal progenitor and basal epithelial cells from true normal Patients 1-4, we subsetted these cells from the multi-patient scATAC-seq dataset shown in main Figure 1. After subsetting, we performed dimensionality reduction to carry out the integration with its matching scRNA-seq dataset. More specifically, we again used Seurat's cross-modality integration approach to not only transfer cluster labels from scRNA to scATAC but to also infer gene expression profiles for each cell in scATAC-seq. After the integration procedure, we called peaks in the scATAC-seq data to identify possible regulatory elements located in regions of accessible chromatin.

The Slurm and R scripts below were used to perform subsetting, dimensionality reduction, integration with scRNA-seq, and peak calling:

Differential peak-to-gene association analysis

To perform the differential peak-to-gene association analysis between Basal-like BC and normal luminal progenitor cells, we first created metacells for each patient (i.e. aggregates of similar scATAC-seq cells) to use as more informative observations for the association analyses. As performed in the ArchR R package, peak accessibility and gene expression were summarized for each metacell by summing peak counts and inferred gene expression values across the constituent scATAC-seq cells within each metacell. The resulting metacell gene expression and peak accessibility matrices were normalized to counts per 10,000 and log2-transformed with a pseudo-count of 1.

The first phase of the differential peak-to-gene association analysis involved performing independent peak-to-gene association analyses for each condition using the following linear mixed-effects model for every peak located within 500 kb of each gene:

gene expression ~ peak accessibility + (1|patient)

In the second phase, the intronic/distal peak-to-gene associations with a significant effect size in at least one condition were carried forward to quantify the change in effect size between conditions using an interaction term in the same linear mixed-effects model:

gene expression ~ peak accessibility + condition + peak accessibility:condition + (1|patient)

The Slurm and R scripts below were used to generate the metacells for each patient and perform both phases of the differential peak-to-gene association analysis:

Differential gene expression, differential peak accessibility, and Figure 2 plotting

To identify differentially expressed genes in scRNA-seq and differentially accessible peaks in scATAC-seq, within the Basal-like subtype analysis, we performed these tests on a pseudo-bulk scale (to overcome pseudoreplication bias) using the DESeq2 R package.

To visualize the scRNA-seq and scATAC-seq cells in the Basal-like subtype analysis, we plotted the UMAP plots for each, color-coded by cell type cluster and patient sample of origin.

Next, we visualized the results from the first phase of the differential peak-to-gene association analysis by generating proportion bar charts showing the genomic distribution and ENCODE annotation status for normal-specific, cancer-specific, and shared peak-to-gene associations.

To visualize the results from the second phase of the differential peak-to-gene association analysis, we plotted the significant differential peak-to-gene associations (significant change in effect size between conditions) in a scatter plot of their effect sizes in the cancer condition by their effect sizes in the normal condition.

The Slurm and R scripts below were used to perform the differential gene expression and peak accessibility testing as well as generate the UMAP plots, proportion bar charts, and scatter plots: