Figure 2 - RegnerM2015/scENDO_scOVAR_2020 GitHub Wiki
Part I: Peak-to-gene correlation analysis with empirically-derived false discovery rate (eFDR)
In order to link variation in chromatin accessibility to differences in gene expression, we carried out a large-scale peak-to-gene linkage analysis, available in the ArchR suite of programs, and introduced a sophisticated empirical false discovery rate (eFDR) procedure for determining statistically significant peak-to-gene associations in single-cell data (Granja et al., 2021; Storey and Tibshirani, 2003). While this approach offers similar results to the standard Benjamini-Hochberg procedure, we are able to more clearly demonstrate that peak accessibility is a significant factor in determining inferred gene expression as shown in the correlation and p-value histograms below:
The starting input for this workflow is the multi-sample ArchR Project (including cells from all 11 patients) saved as an rds object that contains 1) a 500 bp genomic tile matrix, 2) an estimated gene activity matrix, 3) an inferred gene expression matrix, and 4) a peak matrix. We generated this multi-sample ArchR Project using /scATAC-seq Processing Scripts/Full_Cohort/Patients1-11_scATAC-seq.R presented on the Figure 1 page. The output of this script is a new ArchR project that contains all peak-to-gene associations including those that may be statistically insignificant along with a heatmap showing that distal peak accessibility is tightly linked to inferred gene expression. We also write out a peak-to-gene metadata table that lists the peak coordinates, peak type (distal, promoter, intronic,exonic), gene name, correlation value, p-value, variance measures, etc.
/PeaktoGeneLink_Analysis/Full_Cohort/FullCohort_P2G.R
- Run peak-to-gene correlation analysis under the permuted null condition 100 times record how many associations have a p-value <= alpha threshold
- Run peak-to-gene correlation analysis with the observed data and record how many associations have a p-value <= alpha threshold
- Screen for statistically significant distal peak-to-gene links <= alpha threshold with the estimated eFDR
- Plot distal peak-to-gene heatmap
Part II: Screening for cancer-specifc peak-to-gene links
The heatmap of distal peak-to-gene links allowed us to visualize which peak-to-gene links were enriched in the cancer cell populations. To determine which of the cancer-enriched distal peak-to-gene links are "cancer-specific," we carried an elaborate genomic coordinate overlap analysis against a series of normal reference epigenomic profiles. H3K27ac (putative enhancer mark) peaks measured from the normal ovarian surface epithelium and fallopian tube secretory epithelium were downloaded from GSE68104. We also overlapped the cancer-enriched peaks against all ENCODE regulatory elements (cCREs) which includes elements active in normal epithelial tissue. Cancer-enriched peaks that did not overlap with any of these profiles were deemed "cancer-specific" and their corresponding gene(s) established a set of putative cancer-specific regulatory relationships.
We used the following code to generate this set of reference epigenome profiles from the ovarian surface epithelium and Fallopian tube secretory epithelium:
library(GenomicRanges)
library(liftOver)
library(rtracklayer)
library(genomation)
library(plyranges)
# Make Ovarian epithelial peakset
files <- list.files(pattern = "iOSE")
grObject.1 <- readNarrowPeak(files[1])
grObject.2 <- readNarrowPeak(files[2])
grObject.3 <- readNarrowPeak(files[3])
grObject.4 <- readNarrowPeak(files[4])
grObject <- unlist(GenomicRanges::reduce(GRangesList(grObject.1,grObject.2,
grObject.3,grObject.4)))
# Liftover:
chainObject = import.chain("hg19ToHg38.over.chain")
results <- as.data.frame(liftOver(grObject, chainObject))
saveRDS(results,"Ovarian_Epithelial_Cell_line_Peaks.rds")
# Make fallopian tube peakset
files <- list.files(pattern = "iFT")
grObject.1 <- readNarrowPeak(files[1])
grObject.2 <- readNarrowPeak(files[2])
grObject.3 <- readNarrowPeak(files[3])
grObject.4 <- readNarrowPeak(files[4])
grObject <- unlist(GenomicRanges::reduce(GRangesList(grObject.1,grObject.2,
grObject.3,grObject.4)))
# Liftover:
chainObject = import.chain("hg19ToHg38.over.chain")
results <- as.data.frame(liftOver(grObject, chainObject))
saveRDS(results,"Fallopian_Tube_Cell_line_Peaks.rds")
The script introduced above, /PeaktoGeneLink_Analysis/Full_Cohort/FullCohort_P2G.R, also performs this overlap analysis after computing the peak-to-gene correlations.
Figure 2 plotting
The starting inputs for this script are 1) the multi-sample (full cohort) Seurat object and 2) the multi-Sample ArchR project with all peak-to-gene associations, and 3) the peak-to-gene link metadata table generated in the previous script. The outputs are the ATAC browser track per cluster for the gene locus of interest, the expression boxplot per cluster, the gene signature boxplot per cluster, the histograms showing the distributions of correlation and p-values, and the histograms showing the distribution of genes per peak and vice versa.
/Figure_2/Figure_2.R
- Plot histograms of correlation values and p-values for both observed and null condition
- Plot bar charts for peaks per gene and genes per peak information
- Plot scATAC-seq browser track for gene of interest and corresponding scRNA-seq expression in box plot
- Plot box plot for pathway expression signature calculated by Seurat's AddModuleScore()
Interested in more exciting research in cancer genomics? Visit https://www.thefrancolab.org/ to learn more!