6.1.2 Preprocessing & QC
After sequencing, the first step is to turn raw FASTQ files into a gene × cell count matrix, while performing basic QC to remove empty droplets or low-quality cells.
A. Demultiplexing & Alignment
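- 10x Genomics (Cell Ranger)
  - Input: FASTQ directory. Raw Illumina BCL output must first be demultiplexed to per-sample FASTQ in a separate step (for example cellranger mkfastq or Illumina BCL Convert); cellranger count itself does not read BCL files.
  - Command:

      cellranger count \
        --id=SampleA \
        --transcriptome=/path/to/refdata-cellranger \
        --fastqs=raw_data/SampleA/ \
        --sample=SampleA \
        --localcores=8 \
        --localmem=64

    Here --localmem is given in GB. Cell Ranger 8.0 and later also require --create-bam=true or --create-bam=false.
  - What it does:
    - Aligns reads (STAR internally) to the reference
    - Collapses UMIs, filters barcodes (empty-droplet removal)
    - Generates outs/filtered_feature_bc_matrix/ inside the SampleA/ run directory (named by --id)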
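To make the "collapses UMIs" step concrete, here is a minimal Python sketch of the core idea: reads that share the same cell barcode, gene, and UMI are counted once. It is a simplification and not Cell Ranger's implementation, which also corrects sequencing errors in barcodes and UMIs. The function name `collapse_umis` and the toy reads are made up for illustration.

```python
from collections import defaultdict


def collapse_umis(reads):
    """Count distinct UMIs per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples,
    one tuple per aligned read. Hypothetical helper for illustration.
    """
    umis_seen = defaultdict(set)
    for cell_barcode, umi, gene in reads:
        # A set keeps each UMI once, so PCR duplicates do not inflate the count.
        umis_seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


# Toy input: barcodes and UMIs are shortened, made-up sequences.
reads = [
    ("AAAC", "TTGA", "GeneA"),
    ("AAAC", "TTGA", "GeneA"),  # same cell, gene and UMI: a PCR duplicate
    ("AAAC", "CCGT", "GeneA"),
    ("AAAC", "TTGA", "GeneB"),
    ("GGGT", "TTGA", "GeneA"),
]
counts = collapse_umis(reads)
print(counts)
```

Each value in `counts` corresponds to one entry of the gene × cell matrix.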
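- Open-source: STARsolo
  - Input: paired trimmed FASTQ (for 10x libraries, R2 is the cDNA read and R1 carries the cell barcode + UMI; R1 must keep its leading 28 bases intact)
  - Command:

      mkdir -p align/STARsolo/SampleA
      STAR \
        --runThreadN 8 \
        --genomeDir ref/STAR_index \
        --readFilesIn trimmed/SampleA_R2.fastq.gz trimmed/SampleA_R1.fastq.gz \
        --readFilesCommand zcat \
        --soloType CB_UMI_Simple \
        --soloCBwhitelist /path/to/barcode_whitelist.txt \
        --soloCBstart 1 --soloCBlen 16 \
        --soloUMIstart 17 --soloUMIlen 12 \
        --soloFeatures Gene \
        --outFileNamePrefix align/STARsolo/SampleA/

    STARsolo expects the cDNA read first and the barcode read second in --readFilesIn. The whitelist path is a placeholder for the barcode list matching your chemistry (or None to skip whitelisting). A 12-base UMI corresponds to 10x v3 chemistry; v2 uses 10. The output location is set with --outFileNamePrefix, so results land in align/STARsolo/SampleA/Solo.out/.
  - What it does:
    - Aligns reads in splice-aware mode
    - Extracts cell barcodes (CB) & UMIs
    - Produces the Solo.out/Gene/filtered directory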
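The --soloCBstart/--soloCBlen and --soloUMIstart/--soloUMIlen values describe fixed, 1-based positions on the barcode read (R1 in 10x libraries). The following Python sketch shows what those coordinates mean by slicing a read the same way. It only illustrates the coordinate convention and is not STAR's code; `extract_cb_umi` and the example read are hypothetical.

```python
def extract_cb_umi(barcode_read, cb_start=1, cb_len=16, umi_start=17, umi_len=12):
    """Slice the cell barcode and UMI out of a barcode read.

    Start positions are 1-based, mirroring the STARsolo flags.
    Hypothetical helper for illustration only.
    """
    needed = max(cb_start + cb_len, umi_start + umi_len) - 1
    if len(barcode_read) < needed:
        raise ValueError(f"read is {len(barcode_read)} nt, need at least {needed}")
    # Convert 1-based starts to Python's 0-based slicing.
    cell_barcode = barcode_read[cb_start - 1 : cb_start - 1 + cb_len]
    umi = barcode_read[umi_start - 1 : umi_start - 1 + umi_len]
    return cell_barcode, umi


# Made-up 28-nt read: 16-nt barcode followed by a 12-nt UMI.
read1 = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC"
cb, umi = extract_cb_umi(read1)
print(cb, umi)
```

This is also why the barcode read should not be trimmed at its 5' end: removing leading bases shifts every position and yields wrong barcodes.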
B. Generating Gene × Cell Matrices
Whether using Cell Ranger or STARsolo, the final filtered matrix is in Matrix Market format:
Outputs (gene × cell count matrices):
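- Cell Ranger: outs/filtered_feature_bc_matrix/
- STARsolo: align/STARsolo/SampleA/Solo.out/Gene/filtered

Each directory contains:

├── barcodes.tsv.gz   # list of cell barcodes
├── features.tsv.gz   # gene IDs and names
└── matrix.mtx.gz     # sparse count matrix (genes × cells)

STARsolo writes these three files uncompressed (barcodes.tsv, features.tsv, matrix.mtx); gzip them if your loader expects the .gz names.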
- Inspect file contents:
zcat matrix.mtx.gz | head -n 5
zcat barcodes.tsv.gz | head -n 5
zcat features.tsv.gz | head -n 5
- Load into R (Seurat):
library(Seurat)
counts <- Read10X(data.dir = "outs/filtered_feature_bc_matrix/")
seurat_obj <- CreateSeuratObject(counts, project="SampleA")
- Load into Python (Scanpy):
import scanpy as sc
adata = sc.read_10x_mtx(
    "outs/filtered_feature_bc_matrix/",
    var_names='gene_symbols',
    cache=True
)
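Before loading a matrix into Seurat or Scanpy, it can help to confirm that the three files agree with each other. The sketch below reads the size line of matrix.mtx.gz and compares it with the line counts of features.tsv.gz and barcodes.tsv.gz, using only the Python standard library. The helper names are hypothetical, and the demo writes a tiny made-up matrix to a temporary directory so the example runs without real data.

```python
import gzip
import os
import tempfile


def read_mtx_header(path):
    """Return (n_rows, n_cols, n_nonzero) from a gzipped Matrix Market file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("%"):  # banner and comment lines
                continue
            n_rows, n_cols, n_nonzero = (int(x) for x in line.split())
            return n_rows, n_cols, n_nonzero
    raise ValueError("no size line found in " + path)


def count_lines(path):
    """Count lines in a gzipped text file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)


def check_matrix_dir(data_dir):
    """Compare matrix dimensions with the feature and barcode lists."""
    n_genes, n_cells, n_nonzero = read_mtx_header(os.path.join(data_dir, "matrix.mtx.gz"))
    n_features = count_lines(os.path.join(data_dir, "features.tsv.gz"))
    n_barcodes = count_lines(os.path.join(data_dir, "barcodes.tsv.gz"))
    return {
        "n_genes": n_genes,
        "n_cells": n_cells,
        "n_nonzero": n_nonzero,
        "consistent": n_genes == n_features and n_cells == n_barcodes,
    }


# Demo on a made-up 3-gene x 2-cell matrix (all IDs are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    files = {
        "matrix.mtx.gz": "%%MatrixMarket matrix coordinate integer general\n"
                         "%\n3 2 4\n1 1 5\n2 1 1\n3 2 7\n1 2 2\n",
        "features.tsv.gz": "ID1\tGeneA\tGene Expression\n"
                           "ID2\tGeneB\tGene Expression\n"
                           "ID3\tGeneC\tGene Expression\n",
        "barcodes.tsv.gz": "AAAC-1\nGGGT-1\n",
    }
    for name, text in files.items():
        with gzip.open(os.path.join(tmp, name), "wt") as handle:
            handle.write(text)
    summary = check_matrix_dir(tmp)

print(summary)
```

On real data, call `check_matrix_dir("outs/filtered_feature_bc_matrix/")`; a `consistent` value of `False` usually points to a truncated or mismatched file.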
QC Checks
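- Cell count: number of barcodes in barcodes.tsv.gz
- Gene count per cell: nFeature_RNA (Seurat) or adata.obs['n_genes'] (Scanpy; added by sc.pp.filter_cells, or named n_genes_by_counts when computed with sc.pp.calculate_qc_metrics)
- UMI count per cell: nCount_RNA (Seurat) or adata.obs['n_counts'] (Scanpy; total_counts when computed with sc.pp.calculate_qc_metrics)
- Mitochondrial fraction: percentage of a cell's UMI counts that come from mitochondrial genes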
Use these metrics to filter out low-quality cells before downstream analysis.
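As a library-free illustration of what these metrics measure, the sketch below computes genes per cell, UMIs per cell, and mitochondrial percentage from (gene, cell, count) triplets, the same layout as the entries of a Matrix Market file, and then applies simple thresholds. The function names, the toy data, and the cutoffs are assumptions for demonstration only; suitable thresholds depend on tissue and chemistry, and in practice Seurat or Scanpy compute these metrics for you.

```python
from collections import defaultdict


def per_cell_qc(triplets, mito_genes):
    """Compute QC metrics per cell from (gene, cell, count) triplets.

    Assumes each (gene, cell) pair appears at most once, as in a
    Matrix Market file. Hypothetical helper for illustration.
    """
    raw = defaultdict(lambda: {"n_genes": 0, "n_counts": 0, "mt_counts": 0})
    for gene, cell, count in triplets:
        if count <= 0:
            continue
        metrics = raw[cell]
        metrics["n_genes"] += 1        # one more detected gene
        metrics["n_counts"] += count   # total UMIs in the cell
        if gene in mito_genes:
            metrics["mt_counts"] += count
    return {
        cell: {
            "n_genes": m["n_genes"],
            "n_counts": m["n_counts"],
            "pct_mt": 100.0 * m["mt_counts"] / m["n_counts"],
        }
        for cell, m in raw.items()
    }


def passes_qc(metrics, min_genes, max_pct_mt):
    """Keep cells with enough detected genes and a low mitochondrial share."""
    return metrics["n_genes"] >= min_genes and metrics["pct_mt"] <= max_pct_mt


# Toy data: two cells, one of them dominated by mitochondrial counts.
triplets = [
    ("GeneA", "cell1", 5), ("GeneB", "cell1", 3), ("MT-CO1", "cell1", 2),
    ("GeneA", "cell2", 1), ("MT-CO1", "cell2", 9),
]
qc = per_cell_qc(triplets, mito_genes={"MT-CO1"})
# Thresholds are toy values chosen for this example, not recommendations.
kept = [cell for cell, m in qc.items() if passes_qc(m, min_genes=3, max_pct_mt=25.0)]
print(qc)
print(kept)
```

Here cell2 is dropped because most of its counts are mitochondrial and it has few detected genes, the typical signature of a damaged or low-quality cell.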