Dimension Reduction and Clustering - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

6.1.5 Dimensionality Reduction & Clustering

After selecting highly variable genes (HVGs), we reduce dimensionality, build a cell–cell graph, cluster cells, and visualize in 2D.


A. Principal Component Analysis (PCA)
  • Purpose: capture the main axes of variation in your data
  • Scanpy (Python)
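  import scanpy as sc
  # compute PCA (Scanpy computes 50 components by default)
  sc.tl.pca(adata, svd_solver='arpack')
  # plot the variance ratio of each PC (log scale) to help choose how many to keep
  sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)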

  • Seurat (R)

# compute PCA on variable features
seurat_obj <- RunPCA(seurat_obj, features=VariableFeatures(seurat_obj))
# elbow plot to choose # of PCs
ElbowPlot(seurat_obj, ndims=50)
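
Reading an elbow plot is a judgment call, so it can help to pair it with a simple numeric rule. The sketch below finds the smallest number of leading PCs whose cumulative share of the variance (relative to all computed PCs) reaches a threshold. It is an illustration only, not part of Scanpy or Seurat: the function name `n_pcs_for_variance`, the 0.90 threshold, and the toy ratios are assumptions. In Scanpy, the real per-PC ratios are stored in `adata.uns['pca']['variance_ratio']` after `sc.tl.pca`.

```python
from itertools import accumulate


def n_pcs_for_variance(variance_ratio, threshold=0.90):
    """Smallest number of leading PCs whose share of the variance
    captured by all computed PCs reaches `threshold`."""
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    total = sum(variance_ratio)
    for n, cumulative in enumerate(accumulate(variance_ratio), start=1):
        if cumulative / total >= threshold:
            return n
    # Fallback (e.g. floating-point rounding at threshold=1.0): keep all PCs
    return len(variance_ratio)


# Toy values, made up for illustration (not from a real dataset)
toy_ratios = [0.30, 0.20, 0.10, 0.05, 0.03, 0.02]
n_pcs = n_pcs_for_variance(toy_ratios, threshold=0.90)
```

A rule like this is only a starting point; the value it returns should still be checked against the elbow plot before it is passed on as `n_pcs` or `dims`.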

B. Neighborhood Graph (k-Nearest Neighbors)

  • Purpose: connect each cell to its k nearest neighbors in PC space
  • Scanpy (Python)
sc.pp.neighbors(adata, 
                n_neighbors=15,   # number of neighbors
                n_pcs=30)         # number of PCs to use
  • Seurat (R)
seurat_obj <- FindNeighbors(
  seurat_obj,
  dims = 1:30,         # use top 30 PCs
  k.param = 15         # number of neighbors
)
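
To make the idea of a k-nearest-neighbor graph concrete, here is a minimal brute-force sketch in plain Python. It is not how Scanpy or Seurat implement the step: both use much faster neighbor search and then weight the edges (Seurat's `FindNeighbors` also builds a shared-nearest-neighbor graph). The function name `knn_graph` and the toy coordinates are assumptions made for illustration.

```python
import math


def knn_graph(points, k):
    """Return {cell index: [indices of its k nearest neighbors]} using
    Euclidean distance. Brute force, so suitable only for small inputs."""
    if not 0 < k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    graph = {}
    for i, p in enumerate(points):
        # Distance from cell i to every other cell (the cell itself is excluded)
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Toy coordinates in a 2-PC space, made up for illustration
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
neighbors = knn_graph(cells, k=2)
```

In this toy example the first three cells only list each other as neighbors, and likewise for the last three, which is the structure that graph-based clustering later picks up on.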

C. UMAP & t-SNE Visualization

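  • UMAP: preserves local neighborhoods and often retains more global structure than t-SNE, though distances between well-separated clusters should still be interpreted with caution; fast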

  • t-SNE: preserves local neighborhoods; may distort global distances

  • Scanpy (Python)

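# UMAP (uses the neighbor graph from step B)
sc.tl.umap(adata)
# 'leiden' exists only after sc.tl.leiden (step D) has been run;
# 'n_genes' exists only if sc.pp.filter_cells(adata, min_genes=...) was run
# (sc.pp.calculate_qc_metrics names it 'n_genes_by_counts' instead)
sc.pl.umap(adata, color=['leiden','n_genes'], size=20)

# t-SNE (optional)
sc.tl.tsne(adata, n_pcs=30)
sc.pl.tsne(adata, color='leiden', size=20)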

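  • Seurat (R)
# UMAP
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30)
# label = TRUE labels the active identities; run FindClusters (step D) first
# so that these labels are cluster IDs
DimPlot(seurat_obj, reduction = "umap", label = TRUE)

# t-SNE
seurat_obj <- RunTSNE(seurat_obj, dims = 1:30)
DimPlot(seurat_obj, reduction = "tsne", label = TRUE)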

D. Clustering (Leiden / Louvain)

  • Purpose: group cells into putative cell types or states

  • Scanpy (Leiden)

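# requires the neighbor graph from sc.pp.neighbors (step B);
# higher resolution gives more, smaller clusters
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color='leiden')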

  • Seurat (Louvain)
seurat_obj <- FindClusters(
  seurat_obj,
  resolution = 0.5,    # adjust to get more/fewer clusters
  algorithm = 1        # 1 = Louvain; 4 = Leiden
)
DimPlot(seurat_obj, label = TRUE)
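
To see why `resolution` changes the number of clusters, it helps to look at the kind of quality function that Louvain and Leiden implementations commonly optimize: a modularity score in which `resolution` scales the penalty for large clusters. The sketch below is an illustration under that assumption, not the code used by Scanpy or Seurat (the exact function depends on version and settings). The function name `modularity` and the toy graph are made up for the example.

```python
from collections import defaultdict


def modularity(edges, labels, resolution=1.0):
    """Resolution-weighted modularity of an undirected, unweighted graph.

    Q = sum over clusters c of  L_c / m  -  resolution * (d_c / (2m))**2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the summed degree of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)  # L_c
    degree = defaultdict(int)    # d_c
    for u, v in edges:
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    return sum(internal[c] / m - resolution * (degree[c] / (2 * m)) ** 2
               for c in degree)


# Toy graph: two triangles joined by a single edge (made up for illustration)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}

for gamma in (0.1, 1.0):
    split_wins = modularity(edges, split, gamma) > modularity(edges, merged, gamma)
    print(f"resolution={gamma}: {'split' if split_wins else 'merged'} partition scores higher")
```

On this toy graph the single merged cluster scores higher at resolution 0.1, while the two-cluster split scores higher at resolution 1.0, which mirrors the "more/fewer clusters" behavior noted in the comments above.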

E. Best Practices

  • Choose the number of PCs by inspecting the elbow plot or variance ratio.
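  • Tune n_neighbors (10–30) and resolution (0.2–1.2) to match expected biological granularity: larger n_neighbors smooths the graph and favors broader groups, while higher resolution splits cells into more, smaller clusters.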
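  • Compare clusters on both UMAP and t-SNE as a sanity check. Clusters are computed on the neighbor graph, not on the 2D embedding, so a cluster that looks split or merged in one plot is a prompt to revisit parameters or check marker genes, not proof of an error.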
  • Annotate clusters using known marker genes (see Section 6.1.6).
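
When tuning n_neighbors or resolution, one way to put a number on clustering consistency is to compare the labels from two runs with the adjusted Rand index (ARI). The sketch below is a plain-Python illustration; the function name `adjusted_rand_index` and the toy label lists are assumptions. In Scanpy, two runs can be kept side by side by passing different `key_added` values to `sc.tl.leiden`, and scikit-learn's `adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells
    (1.0 = identical partitions, about 0 = chance-level agreement)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same cells")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs of cells to disagree on
    # Count pairs of cells that share a cluster in both, in a, and in b
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:  # e.g. both runs put every cell in one cluster
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


# Toy labels for six cells from three hypothetical runs (made up for illustration)
run_1 = ["0", "0", "0", "1", "1", "1"]
run_2 = ["1", "1", "1", "0", "0", "0"]  # same partition, cluster names swapped
run_3 = ["0", "0", "0", "1", "1", "2"]  # one cluster split further

ari_same = adjusted_rand_index(run_1, run_2)
ari_split = adjusted_rand_index(run_1, run_3)
```

Because ARI ignores cluster names, `run_1` and `run_2` count as identical, while the extra split in `run_3` lowers the score without driving it to zero.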