Read Quality Control with BaseBuddy - ChromatinCloud/BaseBuddy GitHub Wiki
This article covers the importance of Quality Control (QC) for sequencing data, explains the choice of FastQC as the tool, and details the syntax for the basebuddy qc command.
1. The Science: The Importance of Quality Control (QC)
Raw data from a next-generation sequencer is never perfect. The complex series of biochemical and optical steps can introduce various errors and biases. Quality control (QC) is the critical first step in any bioinformatics analysis, designed to assess the quality of raw FASTQ data and identify potential problems that could compromise downstream results.
Key QC metrics evaluated include:
Per Base Sequence Quality: A graph of Phred quality scores across the length of the reads. This helps spot systematic drops in quality, for example, towards the 3' end of reads.
Per Sequence Quality Scores: A histogram of the mean quality score of each read. This reveals whether a large fraction of the reads is of poor quality.
Per Base Sequence Content: The percentage of A, C, G, and T at each position. In a random library these proportions stay roughly constant along the read, so position-specific deviations can indicate library preparation artifacts or adapter contamination.
GC Content Distribution: A plot of per-read GC content compared with a theoretical normal distribution. An unusual shape can suggest contamination or a GC bias from PCR.
Overrepresented Sequences and Adapter Content: A search for specific sequences that appear far more often than expected, which is frequently a sign of untrimmed sequencing adapters.
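To make two of these metrics concrete, the sketch below computes the overall GC percentage and the mean Phred quality of a FASTQ file with awk. It is only an illustration of what the numbers mean, not how FastQC or BaseBuddy computes them. The function names (fastq_cat, fastq_gc_percent, fastq_mean_quality) are hypothetical, and the code assumes four-line FASTQ records with Phred+33 quality encoding.

```shell
# Stream a FASTQ file, decompressing it first when the name ends in .gz.
fastq_cat() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Overall GC percentage across all bases (sequence is line 2 of each record).
fastq_gc_percent() {
    fastq_cat "$1" | awk '
        NR % 4 == 2 {
            total += length($0)
            gc += gsub(/[GCgc]/, "")   # gsub returns the number of matches
        }
        END { if (total > 0) printf "%.2f\n", 100 * gc / total }'
}

# Mean Phred score across all bases (quality string is line 4 of each record).
# Assumes Phred+33: score = ASCII code of the character minus 33.
fastq_mean_quality() {
    fastq_cat "$1" | awk '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 0 {
            n = length($0)
            for (j = 1; j <= n; j++) sum += ord[substr($0, j, 1)] - 33
            bases += n
        }
        END { if (bases > 0) printf "%.2f\n", sum / bases }'
}
```

For example, `fastq_gc_percent sim_reads_R1.fastq.gz` prints a single percentage for the whole file. FastQC reports these metrics per position and per read, which is far more informative than a single file-wide average.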
Performing QC allows a researcher to make an informed decision: Is the data high-quality and ready for analysis? Does it require cleaning (e.g., adapter trimming)? Or was the sequencing run flawed, requiring a do-over?
2. The Tooling Selection: Why FastQC?
BaseBuddy integrates FastQC, the de facto standard for sequencing data QC. It has been a cornerstone of bioinformatics pipelines for over a decade, for these reasons:
Comprehensive: FastQC runs a battery of tests that cover all the key metrics listed above, providing a thorough health check of the data.
Intuitive Reports: Its primary output is a self-contained HTML report with clear graphs and plain-English interpretations. A simple "traffic light" system (Green=Pass, Yellow=Warning, Red=Fail) makes it easy to spot potential issues.
Efficient and Fast: FastQC is optimized to process very large FASTQ files quickly and can be multithreaded to analyze several files in parallel.
Ubiquitous: The FastQC report format is understood by virtually every bioinformatician, making it the standard for data sharing, collaboration, and publication.
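The traffic-light verdicts are also available as plain text. FastQC's own extracted output includes a tab-separated summary.txt (status, module name, file name) next to the HTML report. Whether BaseBuddy keeps that file in each report folder is an assumption here. The hypothetical helper below lists only the modules flagged WARN or FAIL.

```shell
# Print "STATUS<TAB>MODULE" for every module that did not pass.
# $1 is the path to a FastQC summary.txt (columns: status, module, file name).
fastqc_flagged_modules() {
    awk -F '\t' '$1 == "WARN" || $1 == "FAIL" { print $1 "\t" $2 }' "$1"
}
```

A call such as `fastqc_flagged_modules sim_reads_R1_fastqc/summary.txt` (path assumed) gives a quick triage list without opening the HTML report.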
By wrapping FastQC, BaseBuddy provides direct access to this essential tool, creating a seamless workflow from data simulation to quality assessment.
3. The Syntax: Running QC in BaseBuddy
The basebuddy qc command is a convenient wrapper around FastQC, designed to analyze one or more FASTQ files and organize the reports cleanly.
Core Command Structure:
basebuddy qc [FASTQ_FILES...] [OPTIONS]
FASTQ_FILES...: (Required) One or more paths to input FASTQ files (e.g., .fastq or .fastq.gz).
Key Options and Usage:
--output-dir / -o: (Required) The main directory where QC results will be saved. A run-specific subdirectory will be created inside this location.
Example: --output-dir ./qc_results
--run-name: An optional, descriptive name for the QC run, which will be used to name the output subdirectory.
Example: --run-name my_simulation_qc
--threads / -t: The number of threads FastQC should use. FastQC processes one file per thread, so extra threads help only when several files are analyzed in the same run.
Example: --threads 8
--overwrite: A flag to permit overwriting an existing output subdirectory with the same name.
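Because FastQC's thread setting controls how many files are processed at once, a thread count above the number of input files brings no benefit. One way to choose a sensible value is sketched below. The helper name qc_threads is hypothetical, and the BaseBuddy call is left as a comment so the snippet runs without the tool installed.

```shell
# Choose a thread count: one per input file, capped at the available cores.
# $1 = number of input files, $2 = number of CPU cores.
qc_threads() {
    if [ "$1" -lt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Example wiring (bash; commented out on purpose):
# files=(sim_reads_R1.fastq.gz sim_reads_R2.fastq.gz)
# cores="$(getconf _NPROCESSORS_ONLN)"
# basebuddy qc "${files[@]}" --output-dir ./qc_results \
#     --threads "$(qc_threads "${#files[@]}" "$cores")"
```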
Practical Example:
You have just simulated a pair of read files, sim_reads_R1.fastq.gz and sim_reads_R2.fastq.gz, and want to run FastQC on both:
basebuddy qc sim_reads_R1.fastq.gz sim_reads_R2.fastq.gz \
  --output-dir ./qc/ \
  --run-name my_sim_run_1 \
  --threads 2
Output: The command will create a directory like ./qc/my_sim_run_1_YYYYMMDD_HHMMSS/. Inside, you will find:
FastQC Reports: A separate folder for each input file (e.g., sim_reads_R1_fastqc/).
HTML File: The primary report, fastqc_report.html, inside each of those folders. This can be opened in any web browser.
Manifest File: A manifest_fastqc_run.json that logs the inputs and output locations for the run.
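Given the layout described above, the most recent run and its reports can be located from the shell. This is a minimal sketch. The helper names latest_run_dir and list_fastqc_reports are hypothetical, and it assumes the run subdirectory is named exactly as shown (run name, underscore, YYYYMMDD_HHMMSS timestamp), a suffix that sorts chronologically as plain text.

```shell
# Print the newest run directory for a given run name.
# $1 = output directory passed to --output-dir, $2 = value of --run-name.
latest_run_dir() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -name "${2}_*" | sort | tail -n 1
}

# List every FastQC HTML report found under a run directory.
list_fastqc_reports() {
    find "$1" -type f -name 'fastqc_report.html' | sort
}
```

For the practical example, `list_fastqc_reports "$(latest_run_dir ./qc my_sim_run_1)"` would print one report path per input file, ready to open in a browser.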