Preprocessing and QC - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

6.1.2 Preprocessing & QC

After sequencing, the first step is to turn raw FASTQ files into a gene × cell count matrix, while performing basic QC to remove empty droplets or low-quality cells.


A. Demultiplexing & Alignment
  1. 10x Genomics (Cell Ranger)

    • Input: FASTQ directory (raw Illumina BCL files must be demultiplexed to FASTQ first, e.g. with cellranger mkfastq or Illumina's BCL Convert)
    • Command:
      cellranger count \
        --id=SampleA \
        --transcriptome=/path/to/refdata-cellranger \
        --fastqs=raw_data/SampleA/ \
        --sample=SampleA \
        --localcores=8 \
        --localmem=64
      
    • What it does:
      • Reads the per‐sample FASTQ files (BCL demultiplexing is a separate upstream step)
      • Aligns reads (STAR internally) to the reference
      • Collapses UMIs, filters barcodes (empty‐droplet removal)
      • Generates outs/filtered_feature_bc_matrix/
  2. Open‐source: STARsolo

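    • Input: paired FASTQ (R1 = cell barcode + UMI, R2 = cDNA); if you trim, trim only the cDNA read so the barcode/UMI positions stay intact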
    • Command:
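      mkdir -p align/STARsolo/SampleA
      # STARsolo expects the cDNA read (R2) first and the barcode/UMI read (R1) second.
      # The whitelist path is a placeholder: use the barcode list that matches your chemistry.
      STAR \
        --runThreadN 8 \
        --genomeDir ref/STAR_index \
        --readFilesIn trimmed/SampleA_R2.fastq.gz trimmed/SampleA_R1.fastq.gz \
        --readFilesCommand zcat \
        --soloType CB_UMI_Simple \
        --soloCBwhitelist /path/to/barcode_whitelist.txt \
        --soloCBstart 1 --soloCBlen 16 \
        --soloUMIstart 17 --soloUMIlen 12 \
        --soloFeatures Gene \
        --outFileNamePrefix align/STARsolo/SampleA/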
      
    • What it does:
      • Aligns reads in splice-aware mode
      • Extracts cell barcodes (CB) & UMIs
      • Produces Solo.out/Gene/filtered directory
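
To make the barcode layout concrete, here is a minimal Python sketch of how a cell barcode and UMI can be sliced out of a barcode read using 1-based start positions and lengths like the --soloCBstart/--soloCBlen and --soloUMIstart/--soloUMIlen values in the command above. The function name `split_barcode_read` and the example read are made up for illustration; this is not how STAR implements it internally.

```python
def split_barcode_read(seq, cb_start=1, cb_len=16, umi_start=17, umi_len=12):
    """Slice the cell barcode and UMI out of a barcode read.

    Start positions are 1-based, mirroring the STARsolo options;
    Python slices are 0-based, hence the `- 1` offsets.
    """
    needed = max(cb_start + cb_len, umi_start + umi_len) - 1
    if len(seq) < needed:
        raise ValueError(f"read has {len(seq)} bases, layout needs {needed}")
    cb = seq[cb_start - 1 : cb_start - 1 + cb_len]
    umi = seq[umi_start - 1 : umi_start - 1 + umi_len]
    return cb, umi


# Made-up 28-nt barcode read: 16-nt barcode followed by 12-nt UMI.
example_read = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC"
cb, umi = split_barcode_read(example_read)
print(cb, umi)
```

A 16 + 12 layout needs at least 28 bases in the barcode read, which is why trimming that read can break barcode extraction.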

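Both pipelines turn aligned reads into counts by collapsing UMIs: reads that share the same cell barcode, gene, and UMI are treated as PCR copies of one molecule and counted once. The sketch below shows that idea with exact matching only; it assumes each read has already been assigned a barcode, UMI, and gene, and the function name `collapse_umis` and the toy reads are hypothetical. Production tools typically also correct sequencing errors in barcodes and UMIs, which this sketch does not attempt.

```python
from collections import defaultdict


def collapse_umis(records):
    """Count unique UMIs per (cell barcode, gene) pair.

    `records` is an iterable of (barcode, umi, gene) tuples, one per read.
    """
    seen = defaultdict(set)
    for barcode, umi, gene in records:
        seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Toy reads (made up): the first two are duplicates of one molecule.
reads = [
    ("AAAC", "TTGA", "GeneX"),
    ("AAAC", "TTGA", "GeneX"),
    ("AAAC", "CCGT", "GeneX"),
    ("GGTA", "TTGA", "GeneY"),
]
umi_counts = collapse_umis(reads)
print(umi_counts)
```

Here four reads collapse to two molecules of GeneX in cell AAAC and one molecule of GeneY in cell GGTA.
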
B. Generating Gene × Cell Matrices

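Whether you use Cell Ranger or STARsolo, the final filtered matrix is written in Matrix Market format.

Outputs (gene × cell count matrices):

  • Cell Ranger
    outs/filtered_feature_bc_matrix/

  • STARsolo
    align/STARsolo/SampleA/Solo.out/Gene/filtered

  Both directories contain the same three files:

    ├── barcodes.tsv.gz    # list of cell barcodes
    ├── features.tsv.gz    # gene IDs and names
    └── matrix.mtx.gz      # sparse count matrix (genes × cells)

  Note: STARsolo writes these files uncompressed (barcodes.tsv, features.tsv, matrix.mtx); gzip them if a downstream loader expects the .gz names.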
    
  1. Inspect file contents:
   zcat matrix.mtx.gz | head -n 5
   zcat barcodes.tsv.gz | head -n 5
   zcat features.tsv.gz | head -n 5
  2. Load into R (Seurat):
   library(Seurat)
   counts <- Read10X(data.dir = "outs/filtered_feature_bc_matrix/")
   seurat_obj <- CreateSeuratObject(counts, project = "SampleA")

  3. Load into Python (Scanpy):

   import scanpy as sc
   adata = sc.read_10x_mtx(
       "outs/filtered_feature_bc_matrix/",
       var_names="gene_symbols",
       cache=True
   )
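
To see what these loaders read, it can help to parse a Matrix Market file by hand. The sketch below uses only the Python standard library and a tiny made-up matrix. It assumes the coordinate layout with integer counts: a `%%MatrixMarket` header, optional `%` comment lines, one size line, then 1-based `row col value` triplets with genes as rows and cells as columns. The function name `read_mtx` is hypothetical.

```python
import io


def read_mtx(handle):
    """Parse a Matrix Market coordinate file into (shape, entries).

    `entries` maps 0-based (gene_index, cell_index) to an integer count.
    """
    header = handle.readline()
    if not header.startswith("%%MatrixMarket"):
        raise ValueError("missing %%MatrixMarket header")
    line = handle.readline()
    while line.startswith("%"):  # skip comment/metadata lines
        line = handle.readline()
    n_rows, n_cols, n_entries = (int(x) for x in line.split())
    entries = {}
    for line in handle:
        if not line.strip():
            continue
        row, col, value = line.split()
        entries[(int(row) - 1, int(col) - 1)] = int(value)  # 1-based to 0-based
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), entries


# Made-up 3 genes x 2 cells matrix with 4 non-zero entries.
toy_mtx = """%%MatrixMarket matrix coordinate integer general
%toy example
3 2 4
1 1 5
3 1 2
2 2 1
3 2 7
"""
shape, entries = read_mtx(io.StringIO(toy_mtx))
print(shape, entries)
```

For a real file, the same function can be given a text-mode handle, for example from `gzip.open("matrix.mtx.gz", "rt")`.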

C. QC Checks

  • Cell count: number of barcodes in barcodes.tsv.gz
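  • Gene count per cell: nFeature_RNA (Seurat) or adata.obs['n_genes_by_counts'] (Scanpy, after running sc.pp.calculate_qc_metrics)
  • UMI count per cell: nCount_RNA (Seurat) or adata.obs['total_counts'] (Scanpy)
  • Mitochondrial fraction: percentage of a cell's UMI counts that come from mitochondrial genes (gene names starting with MT- in human, mt- in mouse)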

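Use these metrics to filter out low-quality cells before downstream analysis: cells with very few genes or UMIs are often empty droplets or debris, and a high mitochondrial fraction often indicates damaged cells. Suitable thresholds depend on the tissue and protocol.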
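
As a rough illustration of how the per-cell metrics above are derived, the following standard-library sketch computes gene count, UMI count, and mitochondrial percentage from sparse (gene, cell) counts and applies a simple filter. The function names, toy data, and thresholds are made up for illustration and are not recommendations; in practice Seurat or Scanpy computes these metrics on the loaded object.

```python
from collections import defaultdict


def cell_qc(entries, gene_names, mito_prefix="MT-"):
    """Compute per-cell QC metrics from {(gene_idx, cell_idx): count}."""
    qc = defaultdict(lambda: {"n_genes": 0, "n_counts": 0, "mito_counts": 0})
    for (gene, cell), count in entries.items():
        if count <= 0:
            continue
        metrics = qc[cell]
        metrics["n_genes"] += 1        # genes detected in this cell
        metrics["n_counts"] += count   # total UMIs in this cell
        # Case-insensitive match so human "MT-" and mouse "mt-" both work.
        if gene_names[gene].upper().startswith(mito_prefix):
            metrics["mito_counts"] += count
    for metrics in qc.values():
        metrics["pct_mito"] = 100.0 * metrics["mito_counts"] / metrics["n_counts"]
    return dict(qc)


def passes_qc(metrics, min_genes, max_pct_mito):
    """Keep cells with enough detected genes and a low mitochondrial share."""
    return metrics["n_genes"] >= min_genes and metrics["pct_mito"] <= max_pct_mito


# Toy data (made up): 3 genes, 2 cells.
gene_names = ["MT-CO1", "ACTB", "GAPDH"]
entries = {(0, 0): 1, (1, 0): 6, (2, 0): 3, (0, 1): 8, (1, 1): 2}

qc = cell_qc(entries, gene_names)
# Illustrative thresholds only; real cutoffs are dataset-specific.
kept = sorted(c for c, m in qc.items() if passes_qc(m, min_genes=2, max_pct_mito=20.0))
print(qc, kept)
```

In this toy example cell 0 has 10% mitochondrial counts and is kept, while cell 1 has 80% and is removed.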