Understanding and Controlling Strand Bias - ChromatinCloud/SeqForge GitHub Wiki

In an ideal sequencing experiment, the reads covering any position in the genome should originate in roughly equal numbers from the forward (+) and reverse (-) DNA strands. Strand bias is a technical artifact that occurs when there is a significant, non-random imbalance in the strand origin of reads supporting a specific allele.

For example, at a heterozygous A/G variant site, you expect reads showing the 'A' allele and reads showing the 'G' allele to each come from a mix of forward and reverse strands. If, however, nearly all reads for the 'G' allele are from the forward strand and all reads for the 'A' allele are from the reverse strand, this indicates a strong strand bias.

What causes it?

Technical Artifacts: Strand bias is overwhelmingly a technical artifact introduced during the NGS library preparation phase. Specific steps, like DNA fragmentation, PCR amplification, or target-capture hybridization, can damage or fail to capture one DNA strand more efficiently than the other. A classic cause is the oxidation of guanine bases, which can lead to strand-specific sequencing errors.
Significance in Analysis: Because strand bias is rarely associated with true biological variation, it is a powerful indicator of a false positive variant call. Variant callers and filtering workflows (like GATK) use statistical tests (e.g., Fisher's Exact Test) to detect strand bias and assign poor quality scores to, or outright filter, variants that exhibit it.
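To make the Fisher's Exact Test idea concrete, here is a minimal sketch that computes a two-sided p-value from the 2x2 table of allele (reference/alternate) by strand (forward/reverse) read counts. The function name fisher_strand and the argument order are illustrative assumptions, not part of BaseBuddy or GATK. Real callers typically report a Phred-scaled version of this value and may preprocess the counts, which this sketch does not attempt.

```shell
# Two-sided Fisher's exact test on strand counts (illustrative helper).
# Usage: fisher_strand REF_FWD REF_REV ALT_FWD ALT_REV
fisher_strand() {
  awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" '
    # log(n!) computed by summing logs, adequate for typical read depths
    function lfact(n,   i, s) { s = 0; for (i = 2; i <= n; i++) s += log(i); return s }
    # log of the hypergeometric probability of one 2x2 table
    function lprob(a, b, c, d,   t) {
      t = lfact(a+b) + lfact(c+d) + lfact(a+c) + lfact(b+d)
      return t - lfact(a) - lfact(b) - lfact(c) - lfact(d) - lfact(a+b+c+d)
    }
    BEGIN {
      obs = lprob(a, b, c, d)
      r1 = a + b; c1 = a + c; n = a + b + c + d
      lo = (r1 + c1 - n > 0) ? r1 + c1 - n : 0
      hi = (r1 < c1) ? r1 : c1
      p = 0
      # Sum every table with the same margins that is no more likely than the observed one
      for (x = lo; x <= hi; x++) {
        lp = lprob(x, r1 - x, c1 - x, n - r1 - c1 + x)
        if (lp <= obs + 1e-9) p += exp(lp)
      }
      if (p > 1) p = 1
      printf "%.6g\n", p
    }'
}
```

With balanced counts (fisher_strand 10 10 10 10) the p-value is 1, meaning no evidence of bias. With fully segregated counts (fisher_strand 20 0 0 20) it drops to the order of 1e-11, the kind of value a strand-bias filter is designed to catch.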

Why simulate it?

Simulating strand bias is crucial for validating the robustness of a variant calling pipeline. By intentionally creating a BAM file with variants that have a known, high degree of strand bias, you can rigorously test whether your filtering steps are correctly identifying and removing these common artifacts. This ensures your final call set is of high confidence.

2. The Tooling and Methodology

BaseBuddy introduces strand bias using a straightforward and transparent method built on the industry-standard SAMtools toolkit. This approach avoids unnecessary complexity while guaranteeing a valid and correctly formatted BAM file.

The basebuddy strand-bias command executes the following logic:

Separate by Strand: The input BAM is split into two temporary files: one containing only reads mapped to the forward strand (using samtools view -F 16, which excludes reads carrying the reverse-strand flag 0x10) and another containing only reads mapped to the reverse strand (using samtools view -f 16, which requires that flag).
Subsample Reads: Based on the user-defined --forward-fraction, the tool calculates the proportion of reads to keep from each file. With --forward-fraction 0.9, it keeps 90% of the forward reads and 10% of the reverse reads. This random subsampling is performed using samtools view -s, which takes the seed and the fraction as a single SEED.FRACTION value, and is made reproducible via the user-provided seed.
Merge and Finalize: The two subsampled sets of reads are merged into a single new BAM file.
Index and Clean Up: The final BAM file is indexed, creating a .bai file. All intermediate files are then deleted.
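The steps above can be sketched with plain SAMtools calls. This is an approximation under stated assumptions, not BaseBuddy's actual source: the function names subsample_arg and strand_bias_sketch are hypothetical, the reverse fraction is taken as 1 minus the forward fraction, and the strand filter and subsampling are folded into one samtools view call per strand rather than run as separate steps. Because -s encodes the fraction after the decimal point, a fraction of exactly 1.0 cannot be expressed this way and would need special handling.

```shell
# Build the SEED.FRACTION value that `samtools view -s` expects (illustrative helper).
subsample_arg() {
  awk -v s="$1" -v f="$2" 'BEGIN { printf "%.4f\n", s + f }'
}

# Usage: strand_bias_sketch IN_BAM OUT_BAM FORWARD_FRACTION SEED
strand_bias_sketch() {
  in_bam=$1; out_bam=$2; fwd_frac=$3; seed=$4
  rev_frac=$(awk -v f="$fwd_frac" 'BEGIN { printf "%.4f", 1 - f }')
  tmp=$(mktemp -d) || return 1

  # -F 16 drops reverse-strand reads; -f 16 keeps only reverse-strand reads
  samtools view -b -F 16 -s "$(subsample_arg "$seed" "$fwd_frac")" \
      -o "$tmp/fwd.bam" "$in_bam" &&
  samtools view -b -f 16 -s "$(subsample_arg "$seed" "$rev_frac")" \
      -o "$tmp/rev.bam" "$in_bam" &&
  # Merging two coordinate-sorted inputs yields a coordinate-sorted output
  samtools merge -f "$out_bam" "$tmp/fwd.bam" "$tmp/rev.bam" &&
  samtools index "$out_bam"
  status=$?

  rm -rf "$tmp"   # remove intermediate files whether or not the run succeeded
  return "$status"
}
```

For example, subsample_arg 42 0.9 yields 42.9000, which samtools reads as seed 42 and keep-fraction 0.9.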

This method is efficient, reproducible, and relies entirely on the most trusted tool for BAM manipulation, ensuring a high-quality, reliable output.

3. The Syntax: Introducing Strand Bias in BaseBuddy

The basebuddy strand-bias command takes an existing BAM file and produces a new one with the desired level of strand-specific imbalance.

Core Command Structure:

basebuddy strand-bias [INPUT_BAM] [OPTIONS]

INPUT_BAM: (Required) Path to the sorted and indexed input BAM file.

Key Options and Usage:

--out-bam / -o: (Required) The file path for the final, biased output BAM.
    Example: --out-bam biased.bam
--forward-fraction: (Required) A number between 0.0 and 1.0 that specifies the target fraction of reads from the forward strand. A value of 0.8 means the output will have approximately 80% forward reads and 20% reverse reads.
    Example: --forward-fraction 0.8
--seed: An integer to serve as a random seed, which makes the read subsampling process reproducible.
    Example: --seed 42

Practical Example:

You have a file my_sample.bam and want to create a new version, my_sample.biased.bam, in which forward-strand reads are about four times as common as reverse-strand reads (80% forward fraction).

basebuddy strand-bias my_sample.bam \
    --out-bam my_sample.biased.bam \
    --forward-fraction 0.8 \
    --seed 123

Output:

my_sample.biased.bam: The new BAM file with the introduced strand bias.
my_sample.biased.bam.bai: The corresponding index for the new BAM file.
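One way to confirm that the output has the intended imbalance is to count mapped reads per strand with samtools view -c. The helper names below (forward_fraction, report_strand_balance) are illustrative assumptions and not BaseBuddy commands. Because subsampling is random, the observed fraction will only approximate the requested one, and it will also depend on how balanced the input was.

```shell
# Fraction of reads on the forward strand, given forward and reverse counts.
forward_fraction() {
  awk -v f="$1" -v r="$2" 'BEGIN {
    if (f + r == 0) { print "NA"; exit }
    printf "%.3f\n", f / (f + r)
  }'
}

# Usage: report_strand_balance BAM
report_strand_balance() {
  bam=$1
  # -F 20 excludes unmapped (0x4) and reverse-strand (0x10) reads
  fwd=$(samtools view -c -F 20 "$bam") || return 1
  # -f 16 -F 4 keeps mapped reverse-strand reads only
  rev=$(samtools view -c -f 16 -F 4 "$bam") || return 1
  echo "forward=$fwd reverse=$rev fraction=$(forward_fraction "$fwd" "$rev")"
}
```

As a worked check of the arithmetic: if the input had 1000 forward and 1000 reverse reads, keeping 80% and 20% respectively leaves roughly 800 and 200, and forward_fraction 800 200 gives 0.800.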

Common Edge Cases:

Missing Index: If the input BAM is not indexed, BaseBuddy will try to index it automatically. It's best practice to ensure it's indexed beforehand.
Invalid Fraction: The command will exit with an error if the --forward-fraction is not between 0.0 and 1.0.
Disk Space: The process creates temporary files that can be as large as the original BAM. Ensure you have sufficient disk space (at least 2x the input BAM size) in the output directory.
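The first two edge cases can be checked before launching a run. The following is a hedged sketch of such pre-flight checks; valid_fraction and has_index are hypothetical helpers, and BaseBuddy's own validation logic may differ.

```shell
# Succeeds only if the argument is a plain decimal number between 0.0 and 1.0.
valid_fraction() {
  case $1 in
    ''|*[!0-9.]*|*.*.*|.) return 1 ;;   # empty, non-numeric, two dots, or a lone dot
  esac
  awk -v f="$1" 'BEGIN { exit !(f + 0 >= 0 && f + 0 <= 1) }'
}

# Succeeds if an index file sits next to the BAM under a common naming scheme.
has_index() {
  [ -f "$1.bai" ] || [ -f "${1%.bam}.bai" ] || [ -f "$1.csi" ]
}
```

A wrapper script could call these as `valid_fraction 0.8 && has_index my_sample.bam` and stop early with a clear message if either check fails.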

Wiki Article: Read Quality Control with BaseBuddy

This article covers the importance of Quality Control (QC) for sequencing data, explains the choice of FastQC as the tool, and details the syntax for the basebuddy qc command.

The Science: The Importance of Quality Control (QC)

Raw data from a next-generation sequencer is never perfect. The complex series of biochemical and optical steps can introduce various errors and biases. Quality control (QC) is the critical first step in any bioinformatics analysis, designed to assess the quality of raw FASTQ data and identify potential problems that could compromise downstream results.

Key QC metrics evaluated include:

Per Base Sequence Quality: A graph of Phred quality scores across the length of the reads. This helps spot systematic drops in quality, for example, towards the 3' end of reads.
Per Sequence Quality Scores: A histogram of the mean quality score of each read. This reveals whether a large fraction of the reads is of poor quality.
Per Base Sequence Content: The percentage of A, C, G, and T at each position. In a random library these proportions should stay roughly constant along the read; strong position-specific deviations can indicate library preparation artifacts or adapter contamination.
GC Content Distribution: A plot of the per-read GC content compared to a theoretical distribution. An unusual shape can suggest contamination or a GC bias from PCR.
Overrepresented Sequences and Adapter Content: A search for specific sequences that appear an abnormal number of times, which is a tell-tale sign of unremoved sequencing adapters.
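The Phred scores behind these quality metrics follow the relation Q = -10 * log10(p), where p is the estimated probability that a base call is wrong. The sketch below converts in both directions needed to read a FASTQ file by hand. It assumes the Phred+33 encoding used by current Illumina and Sanger-style FASTQ files, and the function names are illustrative.

```shell
# Convert a Phred quality score to the base-call error probability.
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%.6g\n", 10 ^ (-q / 10) }'
}

# Decode one FASTQ quality character to its Phred score (Phred+33 assumed).
qual_char_to_phred() {
  ascii=$(printf '%d' "'$1")   # a leading quote makes printf emit the character code
  echo $((ascii - 33))
}
```

For example, phred_to_error 30 gives 0.001 (one expected error in 1000 calls), and the quality character I decodes to a Phred score of 40.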

Performing QC allows a researcher to make an informed decision: Is the data high-quality and ready for analysis? Does it require cleaning (e.g., adapter trimming)? Or was the sequencing run flawed, requiring a do-over?

2. The Tooling Selection: Why FastQC?

BaseBuddy integrates FastQC, the de facto standard for sequencing data QC. It has been a cornerstone of bioinformatics pipelines for over a decade for these reasons:

Comprehensive: FastQC runs a battery of tests that cover all the key metrics listed above, providing a thorough health check of the data.
Intuitive Reports: Its primary output is a self-contained HTML report with clear graphs and plain-English interpretations. A simple "traffic light" system (Green=Pass, Yellow=Warning, Red=Fail) makes it easy to spot potential issues.
Efficient and Fast: FastQC is optimized to process very large FASTQ files quickly and can be multithreaded to analyze several files in parallel.
Ubiquitous: The FastQC report format is understood by virtually every bioinformatician, making it the standard for data sharing, collaboration, and publication.

By wrapping FastQC, BaseBuddy provides direct access to this essential tool, creating a seamless workflow from data simulation to quality assessment.

3. The Syntax: Running QC in BaseBuddy

The basebuddy qc command is a convenient wrapper around FastQC, designed to analyze one or more FASTQ files and organize the reports cleanly.

Core Command Structure:

basebuddy qc [FASTQ_FILES...] [OPTIONS]

FASTQ_FILES...: (Required) One or more paths to input FASTQ files (e.g., .fastq or .fastq.gz).

Key Options and Usage:

--output-dir / -o: (Required) The main directory where QC results will be saved. A run-specific subdirectory will be created inside this location.
    Example: --output-dir ./qc_results
--run-name: An optional, descriptive name for the QC run, which will be used to name the output subdirectory.
    Example: --run-name my_simulation_qc
--threads / -t: The number of threads FastQC should use. Using more threads speeds up the analysis of multiple files.
    Example: --threads 8
--overwrite: A flag to permit overwriting an existing output subdirectory with the same name.

Practical Example:

You have just simulated a pair of read files, sim_reads_R1.fastq.gz and sim_reads_R2.fastq.gz. You want to run FastQC on both.

basebuddy qc sim_reads_R1.fastq.gz sim_reads_R2.fastq.gz \
    --output-dir ./qc/ \
    --run-name my_sim_run_1 \
    --threads 2

Output: The command will create a directory like ./qc/my_sim_run_1_YYYYMMDD_HHMMSS/. Inside, you will find:

FastQC Reports: A separate folder for each input file (e.g., sim_reads_R1_fastqc/).
HTML File: The primary report, fastqc_report.html, inside each of those folders. This can be opened in any web browser.
Manifest File: A manifest_fastqc_run.json that logs the inputs and output locations for the run.
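For a quick pass over many samples without opening each HTML report, the extracted FastQC folders also contain a tab-separated summary.txt listing the PASS, WARN, or FAIL status of every module. A minimal sketch for collecting the non-passing modules follows. The function name list_flagged_modules is illustrative, and it assumes the per-file folders sit directly under the run directory as described above.

```shell
# Print every WARN or FAIL line from the summary.txt files in a run directory.
# Usage: list_flagged_modules RUN_DIR
list_flagged_modules() {
  for summary in "$1"/*_fastqc/summary.txt; do
    [ -f "$summary" ] || continue   # skip the unexpanded pattern when nothing matches
    # Columns are: status, module name, input file name
    awk -F '\t' '$1 != "PASS" { print $1 "\t" $2 "\t" $3 }' "$summary"
  done
}
```

Running list_flagged_modules on the run directory prints one line per flagged module, which can be redirected to a file or piped into sort for review.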