6.1.4 Normalization & Feature Selection
Proper normalization reduces library-size (sequencing-depth) effects, and selecting highly variable genes (HVGs) focuses downstream analyses on the most informative features.
Tools & Installation
# Python / Scanpy
conda install -c conda-forge -c bioconda scanpy anndata
# R / Seurat + scran
conda install -c conda-forge r-base r-essentials
R -e "install.packages(c('Seurat','BiocManager'), repos='https://cloud.r-project.org'); BiocManager::install(c('SingleCellExperiment','scater','scran'))"
A. Library-Size Normalization & Log-Transformation
- CPM + log1p: scales counts per cell to counts-per-million, then log-transforms.
Scanpy (Python)
import scanpy as sc
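# assume `adata` is your AnnData with raw counts in adata.X
adata.layers["counts"] = adata.X.copy()       # keep raw counts; the steps below overwrite adata.X
sc.pp.normalize_total(adata, target_sum=1e6)  # CPM normalization
sc.pp.log1p(adata)                            # log(1 + CPM)
# adata.X now holds the log-normalized values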
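To make the arithmetic behind CPM + log1p concrete, here is a minimal NumPy sketch on a made-up count matrix. The function name `cpm_log1p` and the toy values are illustrative assumptions, not part of Scanpy, and the sketch assumes every cell has at least one count.

```python
import numpy as np

# Toy raw-count matrix: 3 cells (rows) x 4 genes (columns); values are made up.
counts = np.array([
    [10,  0,  5, 85],
    [ 1,  1,  0,  8],
    [50, 25, 25,  0],
], dtype=float)


def cpm_log1p(counts, target_sum=1e6):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    lib_size = counts.sum(axis=1, keepdims=True)  # total counts per cell
    cpm = counts / lib_size * target_sum          # counts per million
    return np.log1p(cpm)


log_cpm = cpm_log1p(counts)

# Undoing the log shows that every cell now sums to the same total.
print(np.expm1(log_cpm).sum(axis=1))
```

Zero counts stay at zero after the transform, which is why log1p is preferred over a plain log here.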
Seurat (R)
library(Seurat)
# assume `seurat_obj` has raw counts in the “RNA” assay
seurat_obj <- NormalizeData(
  seurat_obj,
  normalization.method = "LogNormalize",  # log1p of scaled counts; "RC" would skip the log step
  scale.factor = 1e6                      # per-million scaling (CPM)
)
seurat_obj <- ScaleData(seurat_obj)  # center and scale each gene (z-score); does not log-transform
B. Variance-Stabilizing & Deconvolution
- SCTransform (Seurat): model-based normalization & variance stabilization
- scran (R): pooling-based size factor estimation
SCTransform (Seurat)
library(Seurat)
# run SCTransform instead of NormalizeData + ScaleData
seurat_obj <- SCTransform(
  seurat_obj,
  vars.to.regress = "percent.mt",  # optional: regress out mito%; needs a percent.mt metadata column (e.g. from PercentageFeatureSet)
  verbose = FALSE
)
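For intuition about what model-based variance stabilization produces, the sketch below computes analytic Pearson residuals under a negative-binomial null model. This is a swapped-in simplification, not SCTransform's actual procedure (which fits a regularized regression per gene); the function name `pearson_residuals`, the shared overdispersion `theta=100`, and the toy data are all assumptions made for illustration. It also assumes every gene has at least one count.

```python
import numpy as np


def pearson_residuals(counts, theta=100.0):
    """Analytic Pearson residuals for a cells x genes raw-count matrix.

    theta is an assumed overdispersion parameter shared by all genes.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    cell_sums = counts.sum(axis=1, keepdims=True)
    gene_sums = counts.sum(axis=0, keepdims=True)
    # Expected count given the cell's depth and the gene's overall abundance.
    mu = cell_sums * gene_sums / total
    resid = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    # Clip extreme residuals to +/- sqrt(number of cells).
    clip = np.sqrt(counts.shape[0])
    return np.clip(resid, -clip, clip)


rng = np.random.default_rng(0)
# 200 cells x 3 genes with very different mean expression (toy data).
toy = rng.poisson(lam=[1.0, 5.0, 20.0], size=(200, 3))
res = pearson_residuals(toy)

# Per-gene variance of the residuals, to compare against the raw-count variance.
print(toy.var(axis=0))
print(res.var(axis=0))
```

Comparing the two printed vectors shows how the residuals put genes with very different abundance on a more comparable scale than the raw counts.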
scran (R)
library(Seurat)                # provides as.SingleCellExperiment()
library(SingleCellExperiment)
library(scran)
# convert Seurat to SingleCellExperiment, or load directly into `sce`
sce <- as.SingleCellExperiment(seurat_obj)
# compute size factors by deconvolution
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters=clusters)
sce <- logNormCounts(sce) # adds log2 of size-factor-normalized counts (pseudocount 1) to assay “logcounts”
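The following NumPy sketch shows how size factors are applied once they have been estimated. It deliberately uses plain library-size factors (each cell's total divided by the mean total) as a stand-in; scran's pooling-based deconvolution is not reproduced here. The function names `library_size_factors` and `log_norm_counts` and the toy matrix are illustrative assumptions.

```python
import numpy as np


def library_size_factors(counts):
    """Per-cell size factors: library size divided by the mean library size."""
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1)
    return lib / lib.mean()  # centered so the factors average to 1


def log_norm_counts(counts, size_factors, pseudo_count=1.0):
    """log2(count / size_factor + pseudo_count) for a cells x genes matrix."""
    counts = np.asarray(counts, dtype=float)
    return np.log2(counts / size_factors[:, None] + pseudo_count)


# Cells 0 and 1 have the same composition, but cell 1 was sequenced twice as deeply.
toy_counts = np.array([
    [10, 0, 30],
    [20, 0, 60],
    [ 5, 5, 10],
])

size_factors = library_size_factors(toy_counts)
logcounts = log_norm_counts(toy_counts, size_factors)
print(size_factors)
print(logcounts)
```

After normalization the first two cells have identical values, which is the depth correction that size factors are meant to provide.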
C. Identify Highly Variable Genes (HVGs)
Focusing on HVGs improves clustering and dimensionality reduction.
Scanpy (Python)
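# flavor="seurat" and "cell_ranger" expect log-normalized data in adata.X;
# flavor="seurat_v3" expects raw counts instead (and needs scikit-misc)
sc.pp.highly_variable_genes(
    adata,
    flavor='seurat',
    n_top_genes=2000,
    subset=True  # keep only HVGs in adata
)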
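As a rough illustration of how dispersion-based HVG ranking works, the sketch below ranks genes by their variance-to-mean ratio on log-normalized values. It is a simplified stand-in: it skips the mean-binning and per-bin standardization that Scanpy's flavors apply, and the function name `top_dispersion_genes` and the toy data are assumptions for illustration only.

```python
import numpy as np


def top_dispersion_genes(log_expr, n_top):
    """Return sorted indices of the n_top genes with the highest variance/mean ratio."""
    log_expr = np.asarray(log_expr, dtype=float)
    mean = log_expr.mean(axis=0)
    var = log_expr.var(axis=0, ddof=1)
    # Genes with zero mean get -inf so they are never selected.
    dispersion = np.full_like(mean, -np.inf)
    expressed = mean > 0
    dispersion[expressed] = var[expressed] / mean[expressed]
    order = np.argsort(dispersion)[::-1]  # highest dispersion first
    return np.sort(order[:n_top])


rng = np.random.default_rng(0)
# 100 cells x 5 genes of near-constant expression (toy data).
expr = rng.normal(loc=2.0, scale=0.05, size=(100, 5))
expr[:50, 1] += 3.0  # gene 1 is high in the first half of the cells
expr[50:, 3] += 2.0  # gene 3 is high in the second half

hvg_idx = top_dispersion_genes(expr, n_top=2)
print(hvg_idx)
```

The two genes that differ between cell groups are selected, while the near-constant genes are dropped.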
Seurat (R)
# select top 2 000 variable features
seurat_obj <- FindVariableFeatures(
seurat_obj,
selection.method = "vst",
nfeatures = 2000
)
# plot variance‐mean relationship
VariableFeaturePlot(seurat_obj)
Best Practices
- Always inspect mean–variance plots before and after HVG selection.
- The number of HVGs (1 000–3 000) can be tuned based on dataset size.
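- For integration of multiple batches, find HVGs per batch and prefer genes that are variable in several batches (e.g. `batch_key` in Scanpy, `SelectIntegrationFeatures` in Seurat); HVGs found on pooled data can be dominated by batch effects.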
- Store normalized data and HVG lists in your Seurat/AnnData object for reproducibility.
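The per-batch advice can be sketched in a few lines of NumPy: rank genes within each batch, then count in how many batches each gene lands in the top set. The function name `hvgs_across_batches`, the use of plain variance as the ranking score, and the toy data are assumptions for illustration; Scanpy reports a comparable per-gene count in `adata.var['highly_variable_nbatches']` when `batch_key` is set.

```python
import numpy as np


def hvgs_across_batches(log_expr, batches, n_top):
    """Count, per gene, the number of batches in which it ranks among the n_top by variance."""
    log_expr = np.asarray(log_expr, dtype=float)
    batches = np.asarray(batches)
    n_batches_hvg = np.zeros(log_expr.shape[1], dtype=int)
    for b in np.unique(batches):
        var = log_expr[batches == b].var(axis=0, ddof=1)
        top = np.argsort(var)[::-1][:n_top]  # most variable genes in this batch
        n_batches_hvg[top] += 1
    return n_batches_hvg


rng = np.random.default_rng(0)
expr = rng.normal(loc=2.0, scale=0.05, size=(100, 4))  # 100 cells x 4 genes (toy data)
batches = np.array(["A"] * 50 + ["B"] * 50)

expr[:25, 0] += 2.0    # gene 0 varies within batch A ...
expr[50:75, 0] += 2.0  # ... and within batch B
expr[:25, 1] += 2.0    # gene 1 varies only within batch A
expr[50:75, 2] += 2.0  # gene 2 varies only within batch B
expr[50:, 3] += 3.0    # gene 3 is a pure batch shift, constant within each batch

n_batches_hvg = hvgs_across_batches(expr, batches, n_top=2)
pooled_top_gene = int(np.argmax(expr.var(axis=0, ddof=1)))
print(n_batches_hvg)
print(pooled_top_gene)
```

In this toy example the gene with the highest pooled variance is the pure batch shift, yet it is never selected within a batch, which is the reason for preferring per-batch selection before integration.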