Figure 1 - RegnerM2015/scBreast_scRNA_scATAC_2024 GitHub Wiki
scRNA-seq processing
QC, doublet removal, and pre-processing for each scRNA-seq sample:
For each sample, the filtered feature barcode matrix (generated by Cell Ranger) was read into the Seurat R package for QC, doublet removal, and pre-processing (normalization, feature selection, dimensionality reduction, etc.).
The Slurm and R scripts below were used to run each sample through the same QC, doublet removal, and pre-processing pipeline:
- scripts/Sumbit-Individual_Samples_scRNA-QC_DoubletRemoval_Preprocessing.sh ("Sumbit" should read "Submit")
- scripts/Individual_Samples_scRNA-QC_DoubletRemoval_Preprocessing.R
Clustering cells within each scRNA-seq sample:
After QC, doublet removal, and pre-processing for each sample, clustering was performed using the MultiK R package. Briefly, this method applies Seurat's clustering methods over multiple cluster resolution parameters.
The Slurm and R scripts below were used to cluster the cells within each sample:
- scripts/Submit-Individual_Samples_scRNA-MultiKClustering.sh
- scripts/Individual_Samples_scRNA-MultiKClustering.R
Note that for some samples, MultiK clustering could not run to completion due to a convergence issue. These samples were therefore re-run with a different seed, and the procedure was repeated until MultiK clustering had run successfully for all remaining samples.
The Slurm and R scripts below were used to cluster the cells within each remaining sample (using different seeds):
- scripts/Submit-Individual_Samples_scRNA-MultiKClustering_AlternateSeed.sh
- scripts/Individual_Samples_scRNA-MultiKClustering_AlternateSeed.R
- scripts/Submit-Individual_Samples_scRNA-MultiKClustering_AlternateSeed_SecondAttempt.sh
- scripts/Individual_Samples_scRNA-MultiKClustering_AlternateSeed_SecondAttempt.R
- scripts/Submit-Individual_Samples_scRNA-MultiKClustering_AlternateSeed_ThirdAttempt.sh
- scripts/Individual_Samples_scRNA-MultiKClustering_AlternateSeed_ThirdAttempt.R
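The re-run procedure described above (retry only the samples that failed, with a new seed each time) can be sketched in a few lines of Python. This is an illustration of the control flow only, not the repository's R code: `ConvergenceError`, `cluster_with_retries`, `toy_run`, the sample names, and the seed values are all hypothetical.

```python
class ConvergenceError(RuntimeError):
    """Signals that one clustering run did not converge (hypothetical)."""


def cluster_with_retries(sample_ids, seeds, run_clustering):
    """Re-run only the samples that still fail, one seed per attempt.

    run_clustering(sample_id, seed) is a caller-supplied function that either
    returns a result or raises ConvergenceError. It stands in for the real
    clustering step and is not part of any library.
    """
    results = {}
    remaining = list(sample_ids)
    for seed in seeds:
        if not remaining:
            break
        still_failing = []
        for sample_id in remaining:
            try:
                results[sample_id] = run_clustering(sample_id, seed)
            except ConvergenceError:
                still_failing.append(sample_id)
        remaining = still_failing
    return results, remaining


# Toy stand-in: each sample converges only once the seed reaches a made-up value.
FIRST_WORKING_SEED = {"sample_A": 1, "sample_B": 3}


def toy_run(sample_id, seed):
    if seed < FIRST_WORKING_SEED[sample_id]:
        raise ConvergenceError(f"{sample_id} failed with seed {seed}")
    return {"sample": sample_id, "seed": seed}


results, unresolved = cluster_with_retries(["sample_A", "sample_B"], [1, 2, 3], toy_run)
print(results)     # per-sample result, including the seed that worked
print(unresolved)  # samples that failed for every seed tried
```

Recording the seed alongside each result keeps the successful run reproducible, which is presumably why the alternate-seed attempts are kept as separate scripts.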
Finding cluster marker genes within each scRNA-seq sample:
After clustering cells in each scRNA-seq sample, we performed differential gene expression analysis within each dataset using Seurat's FindAllMarkers function to identify upregulated genes within each cluster.
The Slurm and R scripts below were used to perform the differential gene expression analysis for each scRNA-seq dataset:
- scripts/Submit-Individual_Samples_scRNA-FindClusterMarkerGenes.sh
- scripts/Individual_Samples_scRNA-FindClusterMarkerGenes.R
Re-processing the Patient 11 scRNA-seq sample:
Three clusters in the Patient 11 scRNA-seq sample contained cells with an abnormally low read mapping rate (flagged by Cell Ranger). These clusters were therefore removed from this sample's scRNA-seq dataset.
The Slurm and R scripts below were used to remove the low mapping rate clusters, followed by re-processing (normalization, feature selection, dimensionality reduction, etc.) and re-clustering of the Patient 11 scRNA-seq dataset:
- scripts/Individual_Samples_scRNA-RemoveLowMappingRatePopulation_Reprocess_3FCDEL.sh
- scripts/Individual_Samples_scRNA-RemoveLowMappingRatePopulation_Reprocess_3FCDEL.R
Cell type annotation within each scRNA-seq patient sample:
To help annotate the clusters identified within each scRNA-seq patient sample (excluding the cell line samples), we used Seurat's CCA-based label transfer procedure with each patient sample as the 'query' dataset and a large breast cancer scRNA-seq dataset as the 'reference' dataset.
First, we downloaded this reference scRNA-seq dataset from GSE176078 (Wu et al.) using the following Slurm script:
We next processed the reference scRNA-seq data in Seurat with and without CCA:
- scripts/Wu_etal_2021_BRCA_scRNA-CreateSeuratObjectWithCCA.sh
- scripts/Wu_etal_2021_BRCA_scRNA-CreateSeuratObjectWithOutCCA.sh
Next, for each scRNA-seq patient dataset, we performed the CCA-based label transfer to annotate the clusters based on their majority predicted label. Cluster annotations were verified by calculating cell type-specific gene signature enrichment scores (sourced from PanglaoDB) using Seurat's AddModuleScore function.
The Slurm and R scripts below were used to carry out these cell type annotation procedures for each scRNA patient dataset:
- scripts/Submit-Individual_Samples_scRNA-CellTypeAnnotation.sh
- scripts/Individual_Samples_scRNA-CellTypeAnnotation.R
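The "majority predicted label" step amounts to a per-cluster vote over the labels transferred to individual cells. A minimal Python sketch of that vote follows; it assumes the per-cell cluster IDs and predicted labels are already available as two parallel lists, and the function name and example labels are hypothetical rather than taken from the repository.

```python
from collections import Counter, defaultdict


def annotate_clusters(cluster_ids, predicted_labels):
    """Return {cluster: (majority_label, fraction_of_cells_with_that_label)}.

    Ties are resolved in favour of the label seen first within the cluster.
    """
    if len(cluster_ids) != len(predicted_labels):
        raise ValueError("need exactly one predicted label per cell")
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, predicted_labels):
        votes[cluster][label] += 1
    annotation = {}
    for cluster, counts in votes.items():
        label, n_cells = counts.most_common(1)[0]
        annotation[cluster] = (label, n_cells / sum(counts.values()))
    return annotation


# Toy input: five cells in two clusters, with made-up predicted labels.
clusters = [0, 0, 0, 1, 1]
labels = ["T cell", "T cell", "B cell", "Epithelial", "Epithelial"]
print(annotate_clusters(clusters, labels))
```

Returning the winning fraction alongside the label makes weakly supported clusters easy to spot, which is where a second check such as signature enrichment scoring is most useful.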
Cancer cell prediction within each scRNA-seq breast cancer (BC) patient sample:
As performed by Wu et al., we used inferCNV to estimate CNV scores for the epithelial cells identified in each BC patient sample. Essentially, this procedure classified each epithelial cell into one of three groups: 'inferCNV high', 'ambiguous', or 'inferCNV low'. Cells classified as 'inferCNV high' were deemed putative cancer cells.
The Slurm and R scripts below were used to run inferCNV for each scRNA-seq BC patient sample and classify epithelial cells as 'inferCNV high', 'ambiguous', or 'inferCNV low':
- scripts/Submit-Individual_Samples_scRNA-inferCNV_CancerCellDetection.sh
- scripts/Individual_Samples_scRNA-inferCNV_CancerCellDetection.R
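The text above does not state the rule used to sort cells into the three groups, so the Python sketch below assumes one plausible scheme purely for illustration: each epithelial cell carries two summary values (an overall CNV signal and a correlation with a reference CNV profile), cells above both cutoffs are called 'inferCNV high', cells below both are 'inferCNV low', and the rest are 'ambiguous'. The two metrics, the function name, and the cutoff values are assumptions, not the repository's settings.

```python
def classify_cnv(signal, correlation, signal_cutoff, correlation_cutoff):
    """Three-way call from two per-cell summary values (assumed scheme)."""
    high_signal = signal > signal_cutoff
    high_correlation = correlation > correlation_cutoff
    if high_signal and high_correlation:
        return "inferCNV high"
    if not high_signal and not high_correlation:
        return "inferCNV low"
    return "ambiguous"


# Placeholder cutoffs and per-cell values, chosen only to exercise each branch.
SIGNAL_CUTOFF = 0.5
CORRELATION_CUTOFF = 0.5
cells = {
    "cell_1": (0.9, 0.8),
    "cell_2": (0.1, 0.2),
    "cell_3": (0.9, 0.2),
}
calls = {
    name: classify_cnv(sig, corr, SIGNAL_CUTOFF, CORRELATION_CUTOFF)
    for name, (sig, corr) in cells.items()
}
print(calls)
```

Only the cells called 'inferCNV high' would be carried forward as putative cancer cells.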
Molecular subtype prediction for putative cancer cells identified in each scRNA-seq BC patient sample:
We next predicted the molecular subtype (Basal, Her2-enriched, Luminal A, or Luminal B) of each putative cancer cell using the SCSubtype method described by Wu et al. Essentially, this procedure calculates subtype-specific gene signature enrichment scores for each putative cancer cell and assigns the cell to the subtype (Basal, Her2-enriched, Luminal A, or Luminal B) with the highest signature enrichment score.
The Slurm and R scripts below were used to source the SCSubtype training data and run the SCSubtype algorithm on the putative cancer cells identified in each scRNA-seq BC patient sample:
- scripts/get_SCSubtype_training_data.sh
- scripts/Submit-Individual_Samples_scRNA-SCSubtype_Classification.sh
- scripts/Individual_Samples_scRNA-SCSubtype_Classification.R
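The "highest score wins" rule can be illustrated with a short Python sketch. The scoring function here is a deliberately simple stand-in (mean expression of the signature genes) and should not be read as the SCSubtype scoring itself; the gene names, signatures, and expression values are hypothetical.

```python
SUBTYPES = ("Basal", "Her2-enriched", "Luminal A", "Luminal B")


def signature_score(expression, genes):
    """Stand-in enrichment score: mean expression over the signature genes.

    Genes absent from the expression dict are treated as having zero expression.
    """
    if not genes:
        raise ValueError("signature must contain at least one gene")
    return sum(expression.get(gene, 0.0) for gene in genes) / len(genes)


def call_subtype(expression, signatures):
    """Score every subtype signature and return (best_subtype, all_scores).

    Ties go to whichever subtype appears first in SUBTYPES.
    """
    scores = {s: signature_score(expression, signatures[s]) for s in SUBTYPES}
    best = max(SUBTYPES, key=lambda s: scores[s])
    return best, scores


# Toy cell and toy signatures (placeholder gene names and values).
cell_expression = {"gene_a": 2.0, "gene_b": 0.0, "gene_c": 1.0, "gene_d": 4.0}
toy_signatures = {
    "Basal": ["gene_a", "gene_b"],
    "Her2-enriched": ["gene_c"],
    "Luminal A": ["gene_d"],
    "Luminal B": ["gene_b"],
}
subtype, subtype_scores = call_subtype(cell_expression, toy_signatures)
print(subtype, subtype_scores)
```

Keeping the full score dictionary, rather than only the winning label, makes it possible to inspect how close the runner-up subtype was for any given cell.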
Merging all patient scRNA-seq samples into one multi-patient scRNA-seq dataset:
To analyze the scRNA-seq cells from all patient samples, we merged the Seurat objects from all patient samples into one multi-sample/multi-patient Seurat object. This multi-sample scRNA-seq dataset was re-processed (normalization, feature selection, dimensionality reduction, etc.) and clustered. The resulting clusters were annotated based on their majority cell type label (derived from the cell type cluster labels annotated in each patient dataset).
The Slurm and R scripts below were used to merge Seurat objects from all patient samples, re-process (normalization, feature selection, dimensionality reduction, etc.), and perform clustering of the multi-sample scRNA-seq dataset:
- scripts/Patient_Samples_scRNA-Merge_And_ReCluster.sh
- scripts/Patient_Samples_scRNA-Merge_And_ReCluster.R
scATAC-seq processing and integration with scRNA-seq
QC and doublet removal for each scATAC-seq sample:
The ATAC fragments files (generated by Cell Ranger ATAC) from all samples were read into the ArchR R package for QC and doublet removal.
The Slurm and R scripts below were used to run each sample through the same QC and doublet removal pipeline:
- scripts/All_Samples_scATAC-QC_DoubletRemoval_Preprocessing.sh
- scripts/All_Samples_scATAC-QC_DoubletRemoval_Preprocessing.R
Creating the multi-patient scATAC-seq dataset and performing label transfer from scRNA-seq:
To analyze the scATAC-seq cells from all patient samples, we screened for scATAC-seq cells derived from patient samples (excluding cell line samples).
We then carried out dimensionality reduction and gene scoring before transferring cell type cluster labels from the multi-patient scRNA-seq dataset using Seurat's CCA-based cross-modality integration approach. Note that this strategy also enabled us to annotate the inferCNV status and predicted subtype of each scATAC-seq cell based on the annotations of its nearest neighboring cell in scRNA-seq.
The Slurm and R scripts below were used to screen for scATAC-seq patient cells, followed by dimensionality reduction, gene scoring, and label transfer from scRNA-seq:
- scripts/Patient_Samples_scATAC-Subset.sh
- scripts/Patient_Samples_scATAC-Subset.R
- scripts/Patient_Samples_scATAC-DimReduc_GeneScoring.sh
- scripts/Patient_Samples_scATAC-DimReduc_GeneScoring.R
- scripts/Patient_Samples_scATAC-Transfer_Labels_from_scRNA.sh
- scripts/Patient_Samples_scATAC-Transfer_Labels_from_scRNA.R
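The nearest-neighbor annotation idea can be sketched independently of Seurat and ArchR. The Python example below assumes both modalities have already been placed in a shared low-dimensional space (in the workflow above, that is the job of the CCA-based integration) and then copies each scRNA-seq cell's annotations onto its closest scATAC-seq cells by brute-force Euclidean distance. The coordinates, cell names, and annotation fields are hypothetical.

```python
import math


def transfer_annotations(atac_embedding, rna_embedding, rna_annotations):
    """Copy annotations from the nearest scRNA-seq cell to each scATAC-seq cell.

    Both embeddings map cell name -> coordinate tuple in the same space.
    """
    if not rna_embedding:
        raise ValueError("need at least one scRNA-seq cell")
    transferred = {}
    for atac_cell, atac_coords in atac_embedding.items():
        nearest = min(
            rna_embedding,
            key=lambda rna_cell: math.dist(atac_coords, rna_embedding[rna_cell]),
        )
        # Copy so later edits to one cell's annotations do not leak into others.
        transferred[atac_cell] = dict(rna_annotations[nearest])
    return transferred


# Toy shared embedding with two scRNA-seq and two scATAC-seq cells.
rna_cells = {"rna_1": (0.0, 0.0), "rna_2": (10.0, 10.0)}
rna_info = {
    "rna_1": {"cell_type": "T cell", "inferCNV": None, "subtype": None},
    "rna_2": {"cell_type": "Epithelial", "inferCNV": "inferCNV high", "subtype": "Basal"},
}
atac_cells = {"atac_1": (1.0, 1.0), "atac_2": (9.0, 8.0)}
atac_info = transfer_annotations(atac_cells, rna_cells, rna_info)
print(atac_info)
```

Because the whole annotation record travels with the nearest neighbor, cell type, inferCNV status, and predicted subtype all stay mutually consistent for each scATAC-seq cell.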
Figure 1 plotting
To visualize the multi-patient scRNA-seq and scATAC-seq datasets, we generated UMAP plots for each dataset, color-coded by cell type cluster and by patient sample of origin.
In addition, we generated the proportion bar charts showing the composition of each cell type cluster in scRNA-seq and each inferred cell type cluster in scATAC-seq (in terms of patient makeup, inferCNV status, and predicted subtype).
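The numbers behind such proportion bar charts are per-cluster fractions of some categorical annotation. A minimal Python sketch of that tabulation follows, assuming one cluster label and one category value per cell; the function name and the example values are hypothetical, and plotting is left out.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_labels, categories):
    """Return {cluster: {category: fraction of the cluster's cells}}."""
    if len(cluster_labels) != len(categories):
        raise ValueError("need exactly one category value per cell")
    counts = defaultdict(Counter)
    for cluster, category in zip(cluster_labels, categories):
        counts[cluster][category] += 1
    return {
        cluster: {cat: n / sum(tally.values()) for cat, n in tally.items()}
        for cluster, tally in counts.items()
    }


# Toy input: five cells, two clusters, category = patient of origin.
cell_clusters = ["Epithelial", "Epithelial", "Epithelial", "Epithelial", "T cell"]
cell_patients = ["Patient_1", "Patient_1", "Patient_2", "Patient_1", "Patient_2"]
fractions = cluster_composition(cell_clusters, cell_patients)
print(fractions)
```

The same function applies unchanged when the category is inferCNV status or predicted subtype instead of patient.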
Moreover, we performed a pseudo-bulk transcriptome clustering analysis to infer the nearest normal mammary epithelial cell types for the Basal and Luminal BC subtypes (shown in Supplement).
The Slurm and R scripts below were used to perform the pseudo-bulk clustering analysis, shown in Supplement, and generate the UMAP plots as well as the proportion bar charts shown in main Figure 1: