Normalization and Feature Selection - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

6.1.4 Normalization & Feature Selection

Proper normalization removes library‐size effects, and selecting highly variable genes (HVGs) focuses downstream analyses on the most informative features.

Tools & Installation

# Python / Scanpy
conda install -c conda-forge scanpy anndata

# R / Seurat + scran
conda install -c conda-forge r-base r-essentials
R -e "install.packages(c('Seurat','BiocManager'), repos='https://cloud.r-project.org'); BiocManager::install(c('SingleCellExperiment','scater','scran'))"

A. Library-Size Normalization & Log-Transformation

CPM + log1p: scales each cell's counts to counts per million (CPM), then applies log(1 + x) so that zeros stay at zero.
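
To make the arithmetic concrete, here is a minimal sketch of CPM + log1p in plain Python, without Scanpy or Seurat. The function name `cpm_log1p` and the toy matrix are illustrative assumptions, not part of either library.

```python
import math

def cpm_log1p(counts, target_sum=1e6):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # an empty cell has nothing to rescale; keep it as zeros
            normalized.append([0.0] * len(cell))
            continue
        scale = target_sum / total
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized

# hypothetical toy matrix: 2 cells x 3 genes, same composition, 10-fold depth difference
toy_counts = [
    [10, 30, 60],
    [100, 300, 600],
]
toy_norm = cpm_log1p(toy_counts)
```

Both toy cells have the same relative composition, so they should end up with matching normalized values even though the second was sequenced ten times deeper.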

Scanpy (Python)

import scanpy as sc

# assume `adata` is your AnnData with raw counts
sc.pp.normalize_total(adata, target_sum=1e6)   # CPM normalization
sc.pp.log1p(adata)                             # log(1 + CPM)
# adata.X now holds the log-normalized values; copy raw counts to a layer beforehand if later steps need them

Seurat (R)

library(Seurat)

# assume `seurat_obj` has raw counts in the “RNA” assay
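seurat_obj <- NormalizeData(
  seurat_obj,
  normalization.method = "LogNormalize",   # log1p of scaled relative counts ("RC" would skip the log)
  scale.factor = 1e6                       # 1e6 gives log1p(CPM); Seurat's default is 1e4
)
seurat_obj <- ScaleData(seurat_obj)  # centers and z-score scales each gene; does not log-transform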
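
The scaling step is separate from normalization: it centers each gene and divides by its standard deviation. A minimal sketch in plain Python, assuming a cells-by-genes matrix of log-normalized values; `zscore_genes` is a hypothetical helper, and the cap of 10 is an assumption chosen to mirror the `scale.max` default of `ScaleData`.

```python
import statistics

def zscore_genes(matrix, max_value=10.0):
    """Center each gene (column) to mean 0, scale to unit variance, cap large values."""
    n_genes = len(matrix[0])
    scaled = [[0.0] * n_genes for _ in matrix]
    for g in range(n_genes):
        column = [row[g] for row in matrix]
        mean = statistics.fmean(column)
        sd = statistics.stdev(column)  # sample standard deviation (n - 1)
        for i, value in enumerate(column):
            # constant genes have sd == 0; leave them at 0 instead of dividing by zero
            z = (value - mean) / sd if sd > 0 else 0.0
            scaled[i][g] = min(z, max_value)
    return scaled

# hypothetical toy matrix: 3 cells x 2 genes; the second gene is constant
toy_log = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0]]
toy_scaled = zscore_genes(toy_log)
```

After scaling, every gene contributes on the same scale to PCA, which is why this step usually follows HVG selection rather than replacing normalization.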

B. Variance-Stabilizing & Deconvolution

SCTransform (Seurat): model-based normalization & variance stabilization
scran (R): pooling-based size factor estimation

SCTransform (Seurat)

library(Seurat)

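# run SCTransform instead of NormalizeData + ScaleData
# percent.mt must exist in the metadata first; "^MT-" matches human mitochondrial genes (use "^mt-" for mouse)
seurat_obj[["percent.mt"]] <- PercentageFeatureSet(seurat_obj, pattern = "^MT-")
seurat_obj <- SCTransform(
  seurat_obj,
  vars.to.regress = "percent.mt",   # optional: regress out mito%
  verbose = FALSE
)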

scran (R)

library(SingleCellExperiment)
library(scran)

# convert Seurat to SingleCellExperiment, or load directly into `sce`
sce <- as.SingleCellExperiment(seurat_obj)

# compute size factors by deconvolution
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters=clusters)
sce <- logNormCounts(sce)    # adds log2(count / size factor + 1) to assay “logcounts” (not log-CPM)
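
How size factors enter the log-transform can be sketched in plain Python. This is not the deconvolution algorithm: pooling cells and solving for per-cell factors is not reproduced here, and simple library-size factors stand in for the deconvolution estimates. The function names are hypothetical.

```python
import math

def library_size_factors(counts):
    """Per-cell size factors proportional to library size, scaled to a mean of 1."""
    totals = [sum(cell) for cell in counts]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]

def log_norm_counts(counts, size_factors, pseudo_count=1.0):
    """Divide each cell by its size factor, then take log2(x + pseudo_count)."""
    return [
        [math.log2(c / sf + pseudo_count) for c in cell]
        for cell, sf in zip(counts, size_factors)
    ]

# hypothetical toy matrix: 3 cells x 2 genes with identical composition, different depth
toy_counts = [[5, 15], [20, 60], [10, 30]]
size_factors = library_size_factors(toy_counts)
logcounts = log_norm_counts(toy_counts, size_factors)
```

Because the factors average to 1, the normalized values stay on roughly the scale of the original counts, unlike CPM, which rescales every cell to one million.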

C. Identify Highly Variable Genes (HVGs)

Focusing on HVGs improves clustering and dimensionality reduction.
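
The core idea can be sketched as ranking genes by how much they vary across cells and keeping the top few. The version below is deliberately simplified: it ranks by plain sample variance of log-normalized values, whereas Scanpy and Seurat correct for the mean-variance trend (binned dispersions or `vst`). `top_variable_genes` and the toy data are illustrative assumptions.

```python
import statistics

def top_variable_genes(log_expr, gene_names, n_top=2):
    """Rank genes (columns) by sample variance across cells; return the top `n_top` names."""
    scored = []
    for g, name in enumerate(gene_names):
        column = [row[g] for row in log_expr]
        scored.append((statistics.variance(column), name))
    # highest variance first; the sort is stable, so ties keep input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:n_top]]

# hypothetical toy matrix: 4 cells x 3 genes
toy_expr = [
    [0.0, 2.0, 1.0],
    [4.0, 2.1, 1.0],
    [0.5, 1.9, 1.0],
    [3.5, 2.0, 1.0],
]
toy_genes = ["geneA", "geneB", "geneC"]
hvgs = top_variable_genes(toy_expr, toy_genes, n_top=2)
```

Here `geneA` varies strongly between cells, `geneB` barely moves, and `geneC` is constant, so the constant gene is the one dropped.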

Scanpy (Python)

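# flavor="seurat" (default) and "cell_ranger" expect log-normalized data;
# "seurat_v3" expects raw counts instead
sc.pp.highly_variable_genes(
    adata,
    flavor='seurat',   # adata.X was log-normalized in step A
    n_top_genes=2000,
    subset=True    # keep only HVGs in adata
)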

Seurat (R)

# select top 2 000 variable features
seurat_obj <- FindVariableFeatures(
  seurat_obj,
  selection.method = "vst",
  nfeatures = 2000
)
# plot variance‐mean relationship
VariableFeaturePlot(seurat_obj)

Best Practices

Always inspect mean–variance plots before and after HVG selection.
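Tune the number of HVGs (typically 1 000–3 000) to the size and complexity of the dataset.
When integrating multiple batches, either select HVGs within each batch and keep genes that are variable in several batches (e.g. the `batch_key` argument of `sc.pp.highly_variable_genes`, or `SelectIntegrationFeatures` in Seurat), or select them on the combined data.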
Store normalized data and HVG lists in your Seurat/AnnData object for reproducibility.
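
The per-batch strategy mentioned above can be sketched in plain Python: select HVGs within each batch, then rank genes by how many batches selected them. This is only similar in spirit to what the integration helpers in Scanpy and Seurat do; the variance-based ranking, the function names, and the toy batches are all assumptions made for illustration.

```python
from collections import Counter
import statistics

def top_variable_genes(log_expr, gene_names, n_top):
    """Simplified HVG ranking: highest sample variance across cells."""
    scored = [
        (statistics.variance([row[g] for row in log_expr]), name)
        for g, name in enumerate(gene_names)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:n_top]]

def hvgs_across_batches(batches, gene_names, n_top):
    """Select HVGs within each batch, then rank genes by how many batches picked them."""
    votes = Counter()
    for log_expr in batches.values():
        votes.update(top_variable_genes(log_expr, gene_names, n_top))
    # most_common orders genes by the number of batches that selected them
    return [gene for gene, _ in votes.most_common(n_top)]

# hypothetical toy data: 2 batches, 3 cells each, 3 genes
toy_genes = ["geneA", "geneB", "geneC"]
toy_batches = {
    "batch1": [[0.0, 2.0, 1.0], [4.0, 2.0, 1.5], [1.0, 2.0, 1.0]],
    "batch2": [[0.0, 0.0, 1.0], [3.0, 5.0, 1.0], [1.0, 1.0, 1.0]],
}
shared_hvgs = hvgs_across_batches(toy_batches, toy_genes, n_top=2)
```

In the toy data, `geneA` is variable in both batches and is ranked first, while genes that vary in only one batch compete for the remaining slot. Ranking by batch count favours genes whose variability is reproducible rather than driven by a single batch.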