Long‐Read Simulation with BaseBuddy - ChromatinCloud/SeqForge GitHub Wiki

This article covers the science behind long-read sequencing, explains the choice of NanoSim-h for simulation, and details the syntax of the basebuddy long command.

The Science: Understanding Long-Read Sequencing

Long-read sequencing, offered by platforms like Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), produces individual reads that are typically thousands to tens of thousands of base pairs long, with the longest Nanopore reads exceeding a megabase. Short-read technologies, by contrast, generate reads of a few hundred bases at most.

The Oxford Nanopore Process: BaseBuddy simulates Nanopore reads, which are generated via a unique mechanism:

Library Prep: Long strands of DNA are prepared with a motor protein and adapter at one end.
Translocation: The DNA library is loaded onto a flow cell containing thousands of protein nanopores. The motor protein guides a single DNA strand through a pore.
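Signal Detection: As the DNA strand passes through the pore, it disrupts an ionic current flowing through it. Several neighboring bases sit in the pore's sensing region at once, so the measured current reflects a short stretch of sequence (a k-mer) rather than a single nucleotide, and it shifts as each new base enters.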
Basecalling: Sophisticated deep learning algorithms act as "basecallers," translating the raw electrical signal data (called a "squiggle") back into a DNA base sequence (A, C, G, T).
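
The signal-detection step above can be illustrated with a toy model, assuming the current at any moment depends on the k bases inside the pore. Everything here is hypothetical: the k-mer size, the 60 to 120 pA range, and the hash-based level function are placeholders, not a real pore model.

```python
import hashlib
from typing import List

K = 5  # hypothetical k-mer size; real values depend on the pore chemistry


def toy_level(kmer: str) -> float:
    """Map a k-mer to a made-up current level (not a real pore model)."""
    digest = hashlib.sha256(kmer.encode()).digest()
    # Scale the first hash byte into an arbitrary 60-120 pA range.
    return 60.0 + (digest[0] / 255.0) * 60.0


def toy_squiggle(seq: str, k: int = K) -> List[float]:
    """Emit one level per k-mer window as the strand advances base by base."""
    return [toy_level(seq[i:i + k]) for i in range(len(seq) - k + 1)]


signal = toy_squiggle("ACGTTGCAAC")
print(len(signal))  # 6 windows for a 10-base sequence with k = 5
```

Because each base contributes to up to k consecutive windows, a basecaller has to decode overlapping context rather than read one base per signal level.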

Key Characteristics & Advantages:

Length: The defining feature. Long reads can span entire genes, repetitive genomic regions, and large structural variants (SVs) like deletions, insertions, and inversions, which are exceptionally difficult to resolve with short reads.
Error Profile: Traditionally, long reads have a higher single-base error rate than short reads. Most of these errors are small insertions and deletions scattered along the read, although homopolymer runs remain a known weak spot for Nanopore basecalling. The accuracy of modern basecallers continues to improve quickly.
Genome Assembly: The length of these reads is instrumental in achieving complete, "telomere-to-telomere" genome assemblies by bridging gaps and resolving complex repeats.

Why Simulate Long Reads?

Structural Variant (SV) Benchmarking: To test the accuracy of SV detection tools (e.g., Sniffles, cuteSV), researchers need a truth set with known, large-scale variants. Simulating reads from a genome that carries such variants is a practical way to create one, because the correct answer is known by construction.
Assembly Algorithm Development: Simulating long reads with different properties (length distributions, error rates) is crucial for stress-testing and improving genome assembly algorithms.
Evaluating Analysis Workflows: Simulation helps researchers understand how the unique error profile of long reads might affect alignment, variant calling, or other downstream analyses.
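
As a rough illustration of SV benchmarking against a simulated truth set, the sketch below scores a list of calls by precision and recall. The `SV` record, the matching rule (same chromosome and type, position within a tolerance), and the 500 bp default are assumptions made for this example, not part of BaseBuddy or of any particular caller.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SV:
    chrom: str
    pos: int
    svtype: str  # e.g. "DEL", "INS", "INV"


def matches(a: SV, b: SV, tol: int) -> bool:
    """Two SVs match if chromosome and type agree and positions are close."""
    return a.chrom == b.chrom and a.svtype == b.svtype and abs(a.pos - b.pos) <= tol


def benchmark(truth: List[SV], calls: List[SV], tol: int = 500) -> Tuple[float, float]:
    """Return (precision, recall); each truth SV can satisfy at most one call."""
    unmatched = list(truth)
    true_positives = 0
    for call in calls:
        hit = next((t for t in unmatched if matches(call, t, tol)), None)
        if hit is not None:
            unmatched.remove(hit)
            true_positives += 1
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


truth = [SV("chr22", 1000, "DEL"), SV("chr22", 50000, "INS")]
calls = [SV("chr22", 1100, "DEL"), SV("chr22", 90000, "INV")]
print(benchmark(truth, calls))  # (0.5, 0.5)
```

A production comparison would usually also check variant length and breakpoint agreement; this version only shows why a known truth set makes the scoring straightforward.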

The Tooling Selection: Why NanoSim-h?

BaseBuddy integrates NanoSim-h, a fork of the widely cited NanoSim read simulator that is designed specifically to mimic Oxford Nanopore data.

Profile-Based Realism: NanoSim-h's greatest strength is its use of statistical profiles trained on real Nanopore data. It doesn't just add random errors; it models the characteristic error patterns (insertions, deletions, substitutions) and read-length distributions of specific Nanopore chemistries and basecallers.
Pre-Trained Models: It ships with pre-trained models for different flow cells and kits (e.g., R9.4.1), allowing users to generate data that closely mirrors a specific experimental protocol.
Ease of Use: It requires a simple set of inputs (a reference genome, a depth, a model) and produces standard FASTQ output, making it easy to integrate into a pipeline.
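
To make the idea of profile-based simulation concrete, here is a deliberately simplified sketch. It is not NanoSim-h's actual model: the per-base error rates in `TOY_PROFILE` are invented, and an exponential distribution stands in for an empirical read-length distribution.

```python
import random
from typing import Dict

# Hypothetical per-base error rates; a real profile is learned from aligned reads.
TOY_PROFILE: Dict[str, float] = {"sub": 0.03, "ins": 0.03, "del": 0.04}


def add_errors(seq: str, profile: Dict[str, float], rng: random.Random) -> str:
    """Apply independent substitution, insertion, and deletion events per base."""
    out = []
    for base in seq:
        r = rng.random()
        if r < profile["del"]:
            continue  # deletion: drop the base
        if r < profile["del"] + profile["sub"]:
            out.append(rng.choice([b for b in "ACGT" if b != base]))  # substitution
        elif r < profile["del"] + profile["sub"] + profile["ins"]:
            out.append(base)
            out.append(rng.choice("ACGT"))  # insertion after the true base
        else:
            out.append(base)  # base read correctly
    return "".join(out)


def sample_read(reference: str, mean_len: float, profile: Dict[str, float],
                rng: random.Random) -> str:
    """Draw a read length, pick a random start, and add errors to the fragment."""
    length = max(1, min(len(reference), int(rng.expovariate(1.0 / mean_len))))
    start = rng.randrange(len(reference) - length + 1)
    return add_errors(reference[start:start + length], profile, rng)


rng = random.Random(42)  # fixed seed so the toy run is repeatable
reference = "ACGT" * 250
print(len(sample_read(reference, 200.0, TOY_PROFILE, rng)))
```

A trained profile replaces these flat rates and the exponential length draw with distributions fitted to real data, which is what makes the simulated reads resemble a specific chemistry.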

The Syntax: Simulating Long Reads in BaseBuddy

The basebuddy long command provides a clean and simple interface for NanoSim-h.

Core Command Structure:

basebuddy long [REFERENCE_FASTA] [OPTIONS]

REFERENCE_FASTA: (Required) Path to the input reference genome in FASTA format.

Key Options and Usage:

--depth / -d: (Required) The average sequencing coverage to simulate.
    Example: --depth 30
--model: The pre-trained NanoSim-h error model to use. The default is a common R9.4.1 profile.
    Example: --model nanopore_R9.4.1
--outdir / -o: The directory where the output FASTQ files will be saved.
    Example: --outdir ./long_read_output
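
When the command is launched from a script, the options above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of BaseBuddy; it uses only the flags documented on this page and assumes an integer depth.

```python
import shlex
from typing import List, Optional


def build_long_command(reference: str, depth: int,
                       model: Optional[str] = None,
                       outdir: Optional[str] = None) -> List[str]:
    """Build an argument list for `basebuddy long` (hypothetical helper)."""
    if depth <= 0:
        raise ValueError("depth must be a positive coverage value")
    cmd = ["basebuddy", "long", reference, "--depth", str(depth)]
    if model is not None:
        cmd += ["--model", model]  # omit to use the tool's default model
    if outdir is not None:
        cmd += ["--outdir", outdir]
    return cmd


cmd = build_long_command("/refs/chr22.fa", 20, outdir="./chr22_nanopore_sim")
print(shlex.join(cmd))
# basebuddy long /refs/chr22.fa --depth 20 --outdir ./chr22_nanopore_sim
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.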

Practical Example:

To simulate 20x coverage of human chromosome 22 (chr22.fa) with the default Nanopore R9.4.1 model:

basebuddy long /refs/chr22.fa --depth 20 --outdir ./chr22_nanopore_sim

Output: The command will create a directory structure within ./chr22_nanopore_sim/ containing:

FASTQ File(s): The simulated long reads in one or more .fastq files.
Supporting Files: Logs and summary files generated by NanoSim-h.
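
A quick sanity check on the output is to count reads and compute basic length statistics. The sketch below assumes plain, uncompressed FASTQ with exactly four lines per record and searches the output directory recursively, since the exact file names are not specified here.

```python
from pathlib import Path
from typing import Dict, Iterator, List


def read_lengths(fastq: Path) -> Iterator[int]:
    """Yield the length of each read, assuming four-line FASTQ records."""
    with fastq.open() as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record is the sequence
                yield len(line.strip())


def n50(lengths: List[int]) -> int:
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(outdir: str) -> Dict[str, int]:
    """Aggregate read count, total bases, and N50 over every .fastq in outdir."""
    lengths = [n for fq in sorted(Path(outdir).rglob("*.fastq"))
               for n in read_lengths(fq)]
    return {"reads": len(lengths), "bases": sum(lengths), "n50": n50(lengths)}
```

For the practical example above, `summarize("./chr22_nanopore_sim")` would report the totals; dividing `bases` by the reference length gives the achieved coverage, which should be close to the requested depth.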

Common Edge Cases:

RuntimeError: nanosim-h not found in PATH: This means the NanoSim-h executable is not in your environment. The recommended fix is to install it with conda install -c bioconda nanosim-h within the project's Conda environment.
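High Resource Usage: Simulating long reads for a large genome is resource-intensive. A 30x simulation of the human genome will generate roughly 90 gigabases of sequence (about 3.1 Gb × 30), which takes well over 90 GB on disk as uncompressed FASTQ because every base also carries a quality character, and it requires significant memory. It is always best to test commands on a small FASTA file (e.g., a single chromosome) first.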
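
Both edge cases can be checked before launching a long run. The preflight sketch below is an illustration only: the 10 kb mean read length and the two-bytes-per-base FASTQ estimate (one sequence character plus one quality character, ignoring headers) are assumptions, and the function names are hypothetical.

```python
import shutil
from typing import Dict


def nanosim_available(executable: str = "nanosim-h") -> bool:
    """Return True if the executable can be found on PATH."""
    return shutil.which(executable) is not None


def estimate_output(genome_bases: int, depth: float,
                    mean_read_len: int = 10_000) -> Dict[str, int]:
    """Rough size estimate for a simulation at the requested depth."""
    total_bases = genome_bases * depth
    return {
        "total_bases": int(total_bases),
        "approx_reads": int(total_bases / mean_read_len),
        # Sequence plus quality strings: about two bytes per base, headers ignored.
        "approx_fastq_bytes": int(total_bases * 2),
    }


est = estimate_output(3_100_000_000, 30)
print(est["total_bases"])         # 93000000000
print(est["approx_fastq_bytes"])  # 186000000000
if not nanosim_available():
    print("nanosim-h not found on PATH; install it before running basebuddy long")
```

Running the same estimate for a single chromosome first gives a sense of scale before committing disk space to a whole-genome simulation.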