Dimension Reduction and Clustering - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
6.1.5 Dimensionality Reduction & Clustering
After selecting highly variable genes (HVGs), we reduce dimensionality, build a cell–cell graph, cluster cells, and visualize in 2D.
A. Principal Component Analysis (PCA)
- Purpose: capture the main axes of variation in your data
- Scanpy (Python)
import scanpy as sc
# compute PCA (adata is the AnnData object from the HVG step)
sc.tl.pca(adata, svd_solver='arpack')
# visualize explained variance
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
- Seurat (R)
# compute PCA on variable features
seurat_obj <- RunPCA(seurat_obj, features=VariableFeatures(seurat_obj))
# elbow plot to choose # of PCs
ElbowPlot(seurat_obj, ndims=50)
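As a complement to reading the elbow plot by eye, the sketch below picks the smallest number of PCs whose cumulative variance reaches a chosen share of the variance captured by the computed PCs. The function name `n_pcs_for_variance`, the `target` threshold, and the toy numbers are illustrative assumptions, not part of Scanpy or Seurat. In Scanpy the per-PC ratios are stored in `adata.uns['pca']['variance_ratio']` after `sc.tl.pca`.

```python
from itertools import accumulate


def n_pcs_for_variance(variance_ratio, target=0.9):
    """Return the smallest number of leading PCs whose cumulative share
    of the variance retained by the computed PCs reaches `target`.

    Hypothetical helper: the ratios are normalized by their own sum,
    because PCA is usually truncated (e.g. to 50 components).
    """
    ratios = [float(r) for r in variance_ratio]
    total = sum(ratios)
    if total <= 0:
        raise ValueError("variance ratios must sum to a positive value")
    for n, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative / total >= target:
            return n
    return len(ratios)


# Toy variance ratios (illustrative numbers, not real data)
toy_ratios = [0.40, 0.20, 0.10, 0.05, 0.05]
print(n_pcs_for_variance(toy_ratios, target=0.7))
```

A threshold like this is only a heuristic; the value it suggests is worth checking against the elbow plot before passing it on as `n_pcs` or `dims`.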
B. Neighborhood Graph (k-Nearest Neighbors)
- Purpose: connect each cell to its k nearest neighbors in PC space
- Scanpy:
sc.pp.neighbors(adata,
                n_neighbors=15,  # number of neighbors
                n_pcs=30)        # number of PCs to use
- Seurat
seurat_obj <- FindNeighbors(
  seurat_obj,
  dims = 1:30,   # use top 30 PCs
  k.param = 15   # number of neighbors
)
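To make "connect each cell to its k nearest neighbors in PC space" concrete, here is a minimal brute-force sketch in plain Python. It is not how Scanpy or Seurat implement the step: both use faster neighbor searches and also weight the resulting edges. The function `knn_graph` and the toy coordinates are hypothetical.

```python
import math


def knn_graph(coords, k):
    """Map each cell index to the indices of its k nearest cells,
    using brute-force Euclidean distance in PC space."""
    if not 0 < k < len(coords):
        raise ValueError("k must be between 1 and n_cells - 1")
    graph = {}
    for i, point in enumerate(coords):
        # Distance from cell i to every other cell (self excluded)
        distances = [
            (math.dist(point, other), j)
            for j, other in enumerate(coords)
            if j != i
        ]
        distances.sort()
        graph[i] = [j for _, j in distances[:k]]
    return graph


# Toy PC coordinates: two well-separated groups of three cells each
toy_pcs = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
]
for cell, neighbors in knn_graph(toy_pcs, k=2).items():
    print(cell, neighbors)
```

With k=2, every cell in this toy example links only to cells in its own group, which is the structure the clustering step later exploits.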
C. UMAP & t-SNE Visualization
- UMAP: preserves local neighborhoods and often retains more global structure than t-SNE; fast on large datasets
- t-SNE: preserves local neighborhoods; may distort global distances
- Scanpy (Python)
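# UMAP (uses the neighbor graph built by sc.pp.neighbors)
sc.tl.umap(adata)
# coloring by 'leiden' requires cluster labels from sc.tl.leiden (Section D);
# 'n_genes' must exist in adata.obs (e.g. added by sc.pp.filter_cells)
sc.pl.umap(adata, color=['leiden','n_genes'], size=20)
# t-SNE (optional)
sc.tl.tsne(adata, n_pcs=30)
sc.pl.tsne(adata, color='leiden', size=20)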
- Seurat
# UMAP
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30)
DimPlot(seurat_obj, reduction = "umap", label = TRUE)
# t-SNE
seurat_obj <- RunTSNE(seurat_obj, dims = 1:30)
DimPlot(seurat_obj, reduction = "tsne", label = TRUE)
D. Clustering (Leiden / Louvain)
- Purpose: group cells into putative cell types or states
- Scanpy (Leiden)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color='leiden')
- Seurat (Louvain)
seurat_obj <- FindClusters(
  seurat_obj,
  resolution = 0.5,  # adjust to get more/fewer clusters
  algorithm = 1      # 1 = Louvain; 4 = Leiden
)
DimPlot(seurat_obj, label = TRUE)
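Why a higher `resolution` yields more clusters can be seen from the quality function these methods optimize. The sketch below assumes a resolution-scaled modularity, Q = Σ_c [e_c/m − γ(d_c/2m)²], on an unweighted toy graph; the real tools work on weighted cell–cell graphs and their exact objective may differ. The function `modularity` and the toy graph are hypothetical.

```python
def modularity(edges, labels, resolution=1.0):
    """Resolution-scaled modularity of a partition.

    edges:  list of (u, v) pairs, each undirected edge listed once
    labels: dict mapping node -> cluster label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = {}    # edges with both ends in the cluster
    degree_sum = {}  # summed node degrees per cluster
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - resolution * (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Two triangles joined by a single bridge edge (2-3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}

for gamma in (0.2, 1.0):
    print(gamma, modularity(edges, split, gamma), modularity(edges, merged, gamma))
```

At a low resolution the single merged cluster scores higher, while at resolution 1.0 the two-triangle split wins, mirroring how raising `resolution` pushes the optimizer toward finer partitions.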
E. Best Practices
- Choose the number of PCs by inspecting the elbow plot or variance ratio.
- Tune n_neighbors (10–30) and resolution (0.2–1.2) to match expected biological granularity.
- Always visualize both UMAP and t-SNE to confirm clustering consistency.
- Annotate clusters using known marker genes (see Section 6.1.6).
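When tuning these parameters, it can help to quantify how much the cluster labels change between two settings instead of relying on plots alone. The adjusted Rand index (ARI) is not part of the workflow above; it is one common agreement score, sketched here in plain Python. The function name and the toy labels are hypothetical. In practice the two label vectors might come from two `sc.tl.leiden` runs stored under different `key_added` names.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same cells
    (1.0 = identical partitions, about 0 = random agreement)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of cells co-clustered in both, in A only, and in B only
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / total_pairs
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (both - expected) / (maximum - expected)


# Toy labels for four cells at two hypothetical resolution settings
low_res = [0, 0, 1, 1]
high_res = [0, 0, 1, 2]  # cluster 1 splits in two at higher resolution
print(adjusted_rand_index(low_res, high_res))
```

A score that stays high across neighboring parameter values suggests the clusters are stable; a sharp drop flags a setting where clusters are being split or merged.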