Single cell ATAC seq - settylab/single-cell-primers GitHub Wiki
Introduction
Single-cell ATAC-seq like scRNA-seq is a great technological advance and empowers sophisticated modeling of gene regulatory relationships at an unprecedented resolution. As with scRNA-seq, we typically use the 10X Chromium system and a brief description of how single-cell ATAC-seq works can be found here.
Similar to scRNA-seq, a number of key issues have to be taken into iaccount while undertaking analysis. The pervasive dropout issue in scRNA-seq is even more exaggerated in scATAC-seq since (a) the number of accessible regions is orders of magnitude higher than the number of genes and (b) for any given locus, the dynamic range of accessibility is limited to 0, 1 or 2. As a result, scATAC data is significantly more sparse compared to scRNA-seq and necessitates use of different preprocessing and normalization techniques.
Another important factor to consider in scATAC analysis is the fragment size distribution. ATAC fragment distribution typically show multiple modes as shown below. The first mode represents nucleosome-free regions typically regions bound TFs, second modes show mono-nucleosome regions and so on. While TFs can bind to nucleosome occupied regions, the noise in the data means that peak calling and motif identification works best with using reads from nucleosome-free regions.
NOTE: Unlike scRNA-seq which produce cell x gene matrices, scATAC analysis produce cell X peak or cell X var feature matrices and thus multiple samples cannot be combined in a straightforward manner. Plan your analysis carefully when dealing with multiple samples!
ArchR preprocessing
We extensively use ArchR package for scATAC preprocessing, starting from the fragments file generated by CellRanger ATAC. Cellranger ATAC results for 10X PBMC ATAC dataset are here.
TF-IDF normalization
ArchR first tiles the reads into 500 base bins to create a cells X Bin binary count matrix. Data is then normalized using TF-IDF. TF-IDF is a normalization procedure developed for NLP and has two steps: First, the counts for each cell are normalized by the total counts (similar to scRNA-seq normalization), forming the Term Frequency (TF) component. The transformed counts in each bin are then weighed by inverse of number of cells the bin has a non-zero count. This step down weights genome bins that are detected across large number of cells and upweights rarely detected bins.
TF-IDF normalized count matrix is then used as input for clustering. ArchR then creates an aggregate ATAC profile for each cluster and performs a feature selection step and identifies the user specified number of highly variable features across the clusters. The subset of features selected now forms the working matrix which is used for downstream analysis. This iterative procedure is termed Iterative LSI.
SVD
The normalized matrix used as input to SVD, which serves as equivalent operation of the PCA in scRNA-seq. Note: despite normalization, the first SVD component typically is strongly correlated with read depth of the cell and should be excluded from the analysis. ArchR automatically excludes this component.
Practical application
We use a modified version of ArchR for our analysis:
- 100k variable features instead of the 25k recommended by ArchR: This can dramatically alter the results in many instances.
- We use only nucleosome-free region (NFR) fragments for peak calling (using MACS2).
- Variable features are used for gene activity computation. Gene activities summarize the accessibility to generate a cells X genes matrix The modified version of ArchR is available here.
Analysis
While ArchR also includes analysis methods, our experience in many instances shows that recomputing the visualizations and clustering using the SVD
output from ArchR gives better results. Downstream analysis and visualization of scATAC data follows the same steps as scRNA-seq post PCA.
UMAPs and force directed layouts colored by leiden and phenograph clusters for the 10X PBMC dataset is shown below
Motif identification
FIMO is used for motif identification from MACS2 peaks. cisBP is used for motif database.
SEACells
Modeling of regulation, integration with RNA and cell type annotation of scATAC all work better with meta cells. More details coming soon.
Sample notebook - Putting it all together
ArchR is written in R. We first run ArchR and export the relevant results. The R script for PBMC data is here. Jupyter notebook detailing all the analysis for the PBMC ATAC dataset is available here.