Demystifying SEURAT - arbones/teaching GitHub Wiki
The FindNeighbors()
and FindClusters()
functions are central to the graph-based clustering approach utilized by Seurat for single-cell RNA sequencing (scRNA-seq) data analysis.
This video shows how to perform the unsupervised identification of cell populations with similar gene expression profiles, which can then be annotated as cell types or states.
The FindNeighbors()
function is the first step in Seurat's graph-based clustering workflow, responsible for constructing a cell-cell similarity graph.
The primary purpose of FindNeighbors()
is to embed cells in a graph structure, typically a K-nearest neighbor (KNN) graph. Edges are drawn between cells that exhibit similar feature expression patterns, forming the basis for subsequent community detection.
- Initial KNN Graph Construction: The function initially constructs a KNN graph based on the Euclidean distance in Principal Component Analysis (PCA) space. PCA is commonly used to reduce the extensive technical noise in scRNA-seq data, with each PC representing a "metagene" that linearly combines information across a correlated gene set.
- Nearest-Neighbor Identification Method: In Seurat v4 and v5, the default method for identifying k-nearest neighbors has been set to annoy. This is an approximate nearest-neighbor approach that significantly improves the speed and memory requirements of neighbor discovery with negligible impact on downstream results. Users have the option to revert to the previous default, rann, if desired.
- Edge Weight Refinement (Jaccard Similarity): After the initial KNN graph is built,
FindNeighbors()
refines the edge weights between any two cells based on the shared overlap in their local neighborhoods. This refinement is typically performed using Jaccard similarity. The Jaccard similarity coefficient for two sets of neighbors (ν(i) and ν(j)) is calculated as the size of their intersection divided by the size of their union: W_ij = |ν(i) ∩ ν(j)| / |ν(i) ∪ ν(j)|. This metric incorporates the structure of the data distribution into the weights, strengthening edges in dense regions and penalizing edges that span sparse regions, and helps to resolve rare populations and exclude outliers.
- reduction: Specifies the dimensional reduction to use for building the graph, with "pca" as the default.
- dims: Determines the number of Principal Components (PCs) to include in the analysis, e.g., 1:10 as a common default. The number of PCs chosen depends on the heterogeneity of the data.
- k.param: Defines the number of neighbors (k) in the KNN graph (default 20 in Seurat). This parameter significantly influences the network structure and, consequently, the resolution of the clusters.
- prune.snn: A cutoff for pruning edges in the Shared Nearest Neighbor (SNN) graph (default 1/15 in Seurat).
Inspiration: Seurat's graph-based clustering approach, including the SNN graph construction, was heavily inspired by methods like SNN-Cliq and PhenoGraph. SNN-Cliq is designed to handle high-dimensional data, automatically determine cluster numbers, and identify clusters of different densities and shapes. PhenoGraph robustly partitions high-parameter single-cell data into phenotypically distinct subpopulations by building a nearest-neighbor graph and finding communities within it.
The FindClusters()
function takes the graph structure generated by FindNeighbors()
and partitions it into distinct cell clusters.
To group cells together into "communities" or "quasi-cliques" based on their interconnectedness within the graph, thereby identifying phenotypically coherent subpopulations.
- Modularity Optimization:
FindClusters()
applies modularity optimization techniques. Modularity is a scalar value between -1 and 1 that measures the density of links inside communities compared to links between communities. The goal is to find a partition of the graph that maximizes this modularity. - Community Detection Algorithms: Commonly used algorithms include the Louvain algorithm (default in Seurat) or the SLM (Smart Local Moving) algorithm. The Louvain method is particularly efficient for large graphs and is also used by PhenoGraph for community detection.
- Louvain Method: This method is hierarchical and agglomerative. It starts by placing each node (cell) into its own community. In iterative steps, neighboring nodes are merged into communities if their merger results in the largest increase in the overall modularity of the graph. This process is repeated hierarchically until a local maximum of modularity is achieved, producing a complete hierarchical community structure. It's known for its speed and ability to handle large networks.
- resolution: This is a crucial parameter that controls the "granularity" of the downstream clustering. Higher values typically lead to a greater number of clusters, while smaller values result in fewer, broader clusters. For datasets of around 3,000 cells, a resolution between 0.4 and 1.2 often yields good results, with optimal resolution increasing for larger datasets.
- algorithm: Allows users to specify different community detection algorithms, such as Louvain, Louvain with multilevel refinement, Leiden, or SLM.
- Parallelization:
FindClusters()
(specifically when clustering over multiple resolutions) has been "futurized" in Seurat, meaning it can take advantage of the future framework for parallelized computation, which can decrease runtimes for large datasets. Broader Context and Interplay. - Graph-based Clustering Advantages: This method is highly scalable, contrasting with hierarchical clustering methods that can have quadratic runtimes. It avoids strong assumptions about cluster shape or cell distribution, unlike k-means (which favors spherical clusters) or Gaussian mixture models. It also reduces the risk of generating many single-cell outlier clusters by forcibly connecting each cell to a minimum number of neighbors.
A drawback is that after graph construction, no information is retained beyond neighboring cells, which can make clustering resolution dependent on cell density. This "inflates" high-density regions, potentially leading to subclusters from internal substructure or noise.
Clustering in Seurat is commonly performed in PCA space, using highly variable genes (HVGs) selected earlier in the workflow.
Seurat also supports advanced techniques like Weighted Nearest Neighbor (WNN) analysis for multimodal data (e.g., RNA and protein), and spatial clustering algorithms like BANKSY for spatial omics data. BANKSY uses a neighbor-augmented embedding and can use the Leiden algorithm for clustering, demonstrating compatibility with Seurat's graph-based framework.
After an initial clustering, users can perform subclustering by repeating feature selection and clustering within a specific cluster to improve resolution for internal structures.
For very large datasets like Visium HD, Seurat v5 introduces a sketch clustering workflow. This approach "subsamples" large datasets to preserve rare populations, performs clustering on the subsampled cells, and then projects the cluster labels back to the full dataset, improving performance for rare and spatially restricted groups.
In essence, FindNeighbors()
establishes the interconnectedness of cells based on their gene expression similarities in a reduced dimensional space, and FindClusters()
then leverages this network to identify coherent cellular communities, forming the foundation of cell type and state identification in single-cell analysis.