Normalization and Feature Selection - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

6.1.4 Normalization & Feature Selection

Proper normalization removes library‐size effects, and selecting highly variable genes (HVGs) focuses downstream analyses on the most informative features.


Tools & Installation
# Python / Scanpy
conda install -c conda-forge -c bioconda scanpy anndata

# R / Seurat + scran
conda install -c conda-forge r-base r-essentials
R -e "install.packages(c('Seurat','BiocManager'), repos='https://cloud.r-project.org'); BiocManager::install(c('SingleCellExperiment','scater','scran'))"

A. Library-Size Normalization & Log-Transformation

  • CPM + log1p: scales counts per cell to counts-per-million, then log-transforms.

Scanpy (Python)

import scanpy as sc

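# assume `adata` is your AnnData with raw counts in adata.X
adata.layers["counts"] = adata.X.copy()        # keep the raw counts for later steps
sc.pp.normalize_total(adata, target_sum=1e6)   # CPM normalization
sc.pp.log1p(adata)                             # log(1 + CPM)
# both calls modify adata in place: adata.X now holds log1p(CPM)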
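
To make the arithmetic behind "CPM + log1p" explicit, here is a minimal NumPy sketch of the same two steps on a toy matrix. The function name `cpm_log1p` and the toy data are illustrative assumptions, not part of Scanpy; cells are rows and genes are columns, as in AnnData.

```python
import numpy as np

def cpm_log1p(counts, target_sum=1e6):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)   # library size per cell
    totals[totals == 0] = 1.0                    # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)

# hypothetical toy matrix: 2 cells x 3 genes
toy_counts = np.array([[10, 0, 90],
                       [1, 1, 2]])
toy_norm = cpm_log1p(toy_counts)
```

After scaling, every non-empty cell sums to one million before the log, so differences in sequencing depth between cells no longer drive the values.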

Seurat (R)

library(Seurat)

# assume `seurat_obj` has raw counts in the "RNA" assay
seurat_obj <- NormalizeData(
  seurat_obj,
  normalization.method = "LogNormalize",   # log1p(counts / cell total * scale.factor)
  scale.factor = 1e6                       # 1e6 gives log1p(CPM); "RC" would skip the log
)
seurat_obj <- ScaleData(seurat_obj)  # z-score each gene (centering + scaling only, no log-transform)

B. Variance-Stabilizing & Deconvolution

  • SCTransform (Seurat): model-based normalization & variance stabilization
  • scran (R): pooling-based size factor estimation

SCTransform (Seurat)

library(Seurat)

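# run SCTransform instead of NormalizeData + FindVariableFeatures + ScaleData
# `percent.mt` must already be in the metadata (e.g. added with PercentageFeatureSet())
seurat_obj <- SCTransform(
  seurat_obj,
  vars.to.regress = "percent.mt",   # optional: regress out mito%
  verbose = FALSE
)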
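
SCTransform fits a regularized negative binomial model per gene and returns Pearson residuals. The sketch below is not SCTransform itself: it swaps in the simpler analytic Pearson residuals, with a fixed overdispersion `theta` shared by all genes, to show the general idea of model-based variance stabilization. The function name, the `theta=100` default, and the toy data are assumptions made for illustration.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Analytic Pearson residuals under a negative binomial model with fixed theta."""
    counts = np.asarray(counts, dtype=float)          # cells x genes
    cell_totals = counts.sum(axis=1, keepdims=True)   # sequencing depth per cell
    gene_totals = counts.sum(axis=0, keepdims=True)   # total counts per gene
    mu = cell_totals @ gene_totals / counts.sum()     # expected count per cell and gene
    sd = np.sqrt(mu + mu**2 / theta)                  # NB standard deviation
    resid = np.divide(counts - mu, sd, out=np.zeros_like(mu), where=sd > 0)
    clip = np.sqrt(counts.shape[0])                   # clip extreme values at sqrt(n_cells)
    return np.clip(resid, -clip, clip)

# hypothetical toy data: 50 cells x 20 genes
rng = np.random.default_rng(0)
toy_umis = rng.poisson(5.0, size=(50, 20))
toy_resid = pearson_residuals(toy_umis)
```

Genes whose counts are explained by sequencing depth alone get residuals near zero, while genes that vary beyond the model's expectation stand out.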

scran (R)

library(Seurat)                 # provides as.SingleCellExperiment()
library(SingleCellExperiment)
library(scran)

# convert Seurat to SingleCellExperiment, or load directly into `sce`
sce <- as.SingleCellExperiment(seurat_obj)

# compute size factors by deconvolution
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters=clusters)
sce <- logNormCounts(sce)    # log2(counts / size factor + 1), stored in assay "logcounts"
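
Once size factors are available, the final transformation is simple arithmetic. The NumPy sketch below shows that last step only: size factors are centered to a mean of 1, counts are divided by them, and a log2 with a pseudocount of 1 is applied. It does not implement the pooling-based deconvolution; plain library sizes stand in for the deconvolution size factors, and the function name and toy data are assumptions. Note that cells are rows here, whereas a SingleCellExperiment stores cells as columns.

```python
import numpy as np

def log_norm_counts(counts, size_factors, pseudo_count=1.0):
    """Divide each cell by its centered size factor, then take log2(x + pseudo_count)."""
    counts = np.asarray(counts, dtype=float)     # cells x genes
    sf = np.asarray(size_factors, dtype=float)
    sf = sf / sf.mean()                          # center size factors at 1
    return np.log2(counts / sf[:, None] + pseudo_count)

# hypothetical toy matrix: 3 cells x 3 genes; cell 1 is cell 0 sequenced half as deep
toy_sce_counts = np.array([[4, 0, 12],
                           [2, 0, 6],
                           [8, 4, 20]])
lib_size_factors = toy_sce_counts.sum(axis=1)    # stand-in for deconvolution size factors
toy_logcounts = log_norm_counts(toy_sce_counts, lib_size_factors)
```

The first two toy cells have the same expression profile at different depths, so they end up with identical normalized values.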

C. Identify Highly Variable Genes (HVGs)

Restricting the analysis to HVGs removes genes whose variation is mostly technical noise, which tends to give cleaner clustering and dimensionality reduction.

Scanpy (Python)

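# flavor="seurat" and "cell_ranger" expect log-normalized data, as produced above;
# flavor="seurat_v3" expects raw counts instead and requires the scikit-misc package
sc.pp.highly_variable_genes(
    adata,
    flavor='seurat',
    n_top_genes=2000,
    subset=True    # keep only HVGs in adata
)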
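
As a rough illustration of what dispersion-based HVG selection measures, the sketch below ranks genes by variance divided by mean on the normalized (un-logged) scale and keeps the top `n_top`. This is a deliberate simplification: the Scanpy and Seurat methods additionally correct for the mean-variance trend (by binning genes or fitting a curve), which this toy version skips. The function name and the toy data are assumptions.

```python
import numpy as np

def top_dispersion_genes(log_expr, n_top):
    """Return sorted column indices of the n_top genes with the highest variance/mean."""
    x = np.expm1(np.asarray(log_expr, dtype=float))   # undo log1p
    mean = x.mean(axis=0)
    var = x.var(axis=0, ddof=1)
    dispersion = np.full_like(mean, -np.inf)          # unexpressed genes rank last
    expressed = mean > 0
    dispersion[expressed] = var[expressed] / mean[expressed]
    top = np.argsort(dispersion)[::-1][:n_top]
    return np.sort(top)

# hypothetical toy data: 200 cells x 10 genes of Poisson noise,
# with gene 3 switched off in half the cells and high in the rest
rng = np.random.default_rng(1)
toy_expr = rng.poisson(5.0, size=(200, 10)).astype(float)
toy_expr[:100, 3] = 0.0
toy_expr[100:, 3] = 50.0
hvg_idx = top_dispersion_genes(np.log1p(toy_expr), n_top=1)
```

Here the bimodal gene is the one selected, because its variance far exceeds what its mean would predict for pure counting noise.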

Seurat (R)

# select top 2 000 variable features
seurat_obj <- FindVariableFeatures(
  seurat_obj,
  selection.method = "vst",
  nfeatures = 2000
)
# plot variance‐mean relationship
VariableFeaturePlot(seurat_obj)

Best Practices

  • Always inspect mean–variance plots before and after HVG selection.
  • The number of HVGs (1 000–3 000) can be tuned based on dataset size.
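  • For integration of multiple batches, prefer selecting HVGs within each batch and combining the results (e.g. `batch_key` in Scanpy's `highly_variable_genes`, or `SelectIntegrationFeatures` in Seurat); selecting on the pooled data alone can favor genes that differ only between batches.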
  • Store normalized data and HVG lists in your Seurat/AnnData object for reproducibility.
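
The per-batch recommendation above can be sketched in NumPy as follows: rank genes by variance within each batch, count how many batches select each gene, and break ties by the mean within-batch variance. This is a simplified stand-in for what the library routines do, not their implementation; the function name, the plain-variance ranking, and the toy data are assumptions.

```python
import numpy as np

def hvgs_across_batches(log_expr, batches, n_top):
    """Pick genes that are highly variable within batches, not just between them."""
    log_expr = np.asarray(log_expr, dtype=float)      # cells x genes
    batches = np.asarray(batches)
    n_genes = log_expr.shape[1]
    times_selected = np.zeros(n_genes, dtype=int)
    var_sum = np.zeros(n_genes)
    labels = np.unique(batches)
    for b in labels:
        var = log_expr[batches == b].var(axis=0, ddof=1)
        var_sum += var
        times_selected[np.argsort(var)[::-1][:n_top]] += 1
    mean_var = var_sum / len(labels)
    # primary key: number of batches selecting the gene; tie-break: mean within-batch variance
    order = np.lexsort((-mean_var, -times_selected))[:n_top]
    return np.sort(order), times_selected

# hypothetical toy values: gene 0 varies inside both batches,
# gene 5 only shifts between batches (a pure batch effect)
rng = np.random.default_rng(2)
toy_values = rng.normal(3.0, 0.1, size=(100, 6))
toy_values[:, 0] = rng.normal(3.0, 2.0, size=100)
toy_values[50:, 5] += 5.0
toy_batches = np.array(["a"] * 50 + ["b"] * 50)
selected, n_batches_hit = hvgs_across_batches(toy_values, toy_batches, n_top=1)
```

On the pooled data, gene 5 would have the largest variance because of the batch shift; ranking within batches selects gene 0 instead.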