Dimension Reduction and Clustering - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

6.1.5 Dimensionality Reduction & Clustering

After selecting highly variable genes (HVGs), we reduce dimensionality, build a cell–cell graph, cluster cells, and visualize in 2D.

A. Principal Component Analysis (PCA)

Purpose: capture the main axes of variation in your data
Scanpy (Python)

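import scanpy as sc

# compute PCA (adata is assumed to be normalized, with HVGs already selected)
sc.tl.pca(adata, svd_solver='arpack')
# visualize explained variance per PC (log scale)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)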

Seurat (R)

# compute PCA on variable features
seurat_obj <- RunPCA(seurat_obj, features=VariableFeatures(seurat_obj))
# elbow plot to choose # of PCs
ElbowPlot(seurat_obj, ndims=50)
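
If eyeballing the elbow feels too subjective, a cumulative-variance cutoff is one possible complement. The sketch below is not part of either library: `n_pcs_for_variance` and the `target` value are hypothetical, and the ratios are made-up numbers. In Scanpy the real ratios are stored in `adata.uns['pca']['variance_ratio']` after `sc.tl.pca`. Note that the function normalizes by the variance captured by the computed PCs only, which is an assumption of this sketch.

```python
from itertools import accumulate

def n_pcs_for_variance(variance_ratio, target=0.9):
    """Return the smallest number of leading PCs whose cumulative share of
    the variance captured by the computed PCs reaches `target`."""
    total = sum(variance_ratio)
    if total <= 0:
        raise ValueError("variance_ratio must contain positive values")
    for k, cum in enumerate(accumulate(variance_ratio), start=1):
        if cum / total >= target:
            return k
    return len(variance_ratio)

# Made-up variance ratios for illustration only (not real data).
example_ratio = [0.30, 0.20, 0.10, 0.05, 0.03, 0.02]
chosen = n_pcs_for_variance(example_ratio, target=0.85)
print(chosen)
```

Treat the result as a starting point and still check it against the elbow plot; a fixed threshold can keep too few PCs when rare populations contribute little variance.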

B. Neighborhood Graph (k-Nearest Neighbors)

Purpose: connect each cell to its k nearest neighbors in PC space
Scanpy (Python)

sc.pp.neighbors(adata, 
                n_neighbors=15,   # number of neighbors
                n_pcs=30)         # number of PCs to use

Seurat (R)

seurat_obj <- FindNeighbors(
  seurat_obj,
  dims = 1:30,         # use top 30 PCs
  k.param = 15         # number of neighbors
)
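
To make the idea of a k-nearest-neighbor graph concrete, here is a minimal brute-force sketch in plain Python. It is only an illustration of the concept: `knn_graph` and the toy coordinates are hypothetical, and Scanpy and Seurat use faster neighbor search and additionally weight the edges, which this sketch does not attempt.

```python
import math

def knn_graph(coords, k):
    """Brute-force kNN: map each cell index to the indices of its k nearest
    neighbors (Euclidean distance in PC space), excluding the cell itself."""
    if not 0 < k < len(coords):
        raise ValueError("k must be between 1 and n_cells - 1")
    graph = {}
    for i, point in enumerate(coords):
        dists = [(math.dist(point, other), j)
                 for j, other in enumerate(coords) if j != i]
        dists.sort()  # nearest first
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Toy PC coordinates: two well-separated groups (hypothetical values).
pcs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
neighbors = knn_graph(pcs, k=2)
print(neighbors)
```

With k = 2, every cell's neighbors stay inside its own group. Raising k toward the group size forces edges across groups, which is why a large `n_neighbors` tends to blur small populations.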

C. UMAP & t-SNE Visualization

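UMAP: preserves local structure and often retains more of the global layout than t-SNE; fast
t-SNE: preserves local neighborhoods; may distort global distances
In both embeddings, distances between well-separated clusters should be interpreted with caution.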
Scanpy (Python)

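# UMAP (uses the neighbor graph from sc.pp.neighbors)
sc.tl.umap(adata)
# color='leiden' requires sc.tl.leiden to have been run first (see D);
# 'n_genes' must exist in adata.obs (e.g. added by sc.pp.filter_cells)
sc.pl.umap(adata, color=['leiden','n_genes'], size=20)

# t-SNE (optional); also colored by the Leiden labels from D
sc.tl.tsne(adata, n_pcs=30)
sc.pl.tsne(adata, color='leiden', size=20)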

Seurat (R)

# UMAP
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30)
# label = TRUE labels the active identities; run FindClusters (see D) first to label clusters
DimPlot(seurat_obj, reduction = "umap", label = TRUE)

# t-SNE
seurat_obj <- RunTSNE(seurat_obj, dims = 1:30)
DimPlot(seurat_obj, reduction = "tsne", label = TRUE)

D. Clustering (Leiden / Louvain)

Purpose: group cells into putative cell types or states
Scanpy (Leiden)

# Leiden runs on the graph from sc.pp.neighbors; labels are stored in adata.obs['leiden']
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color='leiden')

Seurat (Louvain)

seurat_obj <- FindClusters(
  seurat_obj,
  resolution = 0.5,    # adjust to get more/fewer clusters
  algorithm = 1        # 1 = Louvain; 4 = Leiden
)
DimPlot(seurat_obj, label = TRUE)
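
Why does a higher resolution give more clusters? Louvain and Leiden are typically run on a modularity-type objective in which the resolution scales the penalty for large communities. The sketch below assumes the standard modularity with a resolution parameter on an unweighted graph. The real tools work on weighted graphs and may use a different variant, and `modularity` and the toy graph here are hypothetical.

```python
def modularity(edges, labels, resolution=1.0):
    """Modularity of a partition of an undirected, unweighted graph.

    edges:  list of (i, j) pairs, each undirected edge listed once, no self-loops.
    labels: dict mapping node -> cluster id.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    degree = {}
    within = {}  # number of edges that fall inside each cluster
    for i, j in edges:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
        if labels[i] == labels[j]:
            within[labels[i]] = within.get(labels[i], 0) + 1
    cluster_degree = {}  # summed degree of the nodes in each cluster
    for node, d in degree.items():
        cluster_degree[labels[node]] = cluster_degree.get(labels[node], 0) + d
    # Q = sum over clusters of: e_c / m - resolution * (d_c / 2m)^2
    return sum(
        within.get(c, 0) / m - resolution * (deg / (2 * m)) ** 2
        for c, deg in cluster_degree.items()
    )

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}

for gamma in (0.2, 1.0):
    print(gamma, modularity(edges, split, gamma), modularity(edges, merged, gamma))
```

On this toy graph the merged partition scores higher at resolution 0.2, while the two-cluster split scores higher at resolution 1.0. The same mechanism makes a higher `resolution` yield more, smaller clusters.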

E. Best Practices

Choose the number of PCs by inspecting the elbow plot or variance ratio.
Tune n_neighbors (10–30) and resolution (0.2–1.2) to match expected biological granularity.
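Visualize both UMAP and t-SNE to check that clusters look consistent across embeddings; treat this as a qualitative check, since clusters are computed on the neighbor graph rather than on the 2D coordinates.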
Annotate clusters using known marker genes (see Section 6.1.6).
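
When tuning resolution, it can help to quantify how much the labels change between two runs instead of judging by eye. One common measure is the adjusted Rand index (ARI), sketched below in plain Python. This is an optional check and not something the workflow above requires; `adjusted_rand_index` and the toy labels are hypothetical, and scikit-learn's `adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same cells (1.0 = identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Count pairs of cells that share a cluster in both, in A, and in B.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one single cluster
    return (pairs_both - expected) / (max_index - expected)

# Toy labels (made up): the finer run splits one cell off cluster "0".
coarse = ["0", "0", "0", "1", "1", "1"]
fine = ["0", "0", "2", "1", "1", "1"]
print(adjusted_rand_index(coarse, fine))
print(adjusted_rand_index(coarse, ["b", "b", "b", "a", "a", "a"]))
```

ARI ignores cluster names, so relabeled but identical partitions score 1.0. Values well below 1 between neighboring resolutions suggest the clustering is sensitive to that parameter.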