QC Metrics Wiki
This document describes the Quality Control (QC) metrics generated and summarized by the QCD pipeline. The summary table is written to `{prefix}_QC_summary.csv` and integrates results from multiple steps in the workflow.
Input Data Sources
The summary function merges the following data sources:
- Coverage: Read coverage statistics per sample.
- MLST: Multi-locus sequence typing results.
- MultiQC FastQC: Raw and trimmed read quality metrics.
- QUAST: Assembly quality metrics (N50, total length).
- Contig distribution: Number of contigs per sample.
- Skani: Closest reference genome and species assignment.
- Genome size table: Reference genome sizes for length checks.
- Failed samples: Optionally, samples that failed coverage or assembly.
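As a rough illustration of how per-sample tables like these can be combined, the sketch below merges several sources keyed by sample name and fills gaps with "NA". It is not the pipeline's code: the function name `merge_sources`, the dict-of-dicts layout, and the example values are all assumptions made for this example.

```python
from typing import Dict, List, Sequence

NA = "NA"  # placeholder for values a source did not report


def merge_sources(
    sources: Sequence[Dict[str, Dict[str, object]]],
    failed_samples: Sequence[str] = (),
) -> List[Dict[str, object]]:
    """Merge per-sample tables (sample -> {column: value}) into one row per sample.

    Hypothetical helper: samples missing from a source, and samples listed
    only in failed_samples, get "NA" for the columns they lack.
    """
    # Collect column names and sample names in first-seen order.
    columns: List[str] = []
    samples: List[str] = []
    for table in sources:
        for sample, row in table.items():
            if sample not in samples:
                samples.append(sample)
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Failed samples are appended so they still appear in the summary.
    for sample in failed_samples:
        if sample not in samples:
            samples.append(sample)

    rows: List[Dict[str, object]] = []
    for sample in samples:
        merged: Dict[str, object] = {"Sample": sample}
        merged.update({col: NA for col in columns})
        for table in sources:
            merged.update(table.get(sample, {}))
        rows.append(merged)
    return rows


# Illustrative inputs only; real values come from the pipeline outputs.
coverage = {"S1": {"Coverage": 85.2}, "S2": {"Coverage": 12.0}}
mlst = {"S1": {"Scheme": "saureus", "ST": "8"}}
rows = merge_sources([coverage, mlst], failed_samples=["S3"])
```

Here `S2` has no MLST result and `S3` appears only in the failed list, so both rows carry "NA" in the columns they are missing.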
Output Columns
Column Name | Description |
---|---|
Sample | Sample identifier. |
Total_reads | Total number of reads for the sample. |
Total_bp | Total number of base pairs sequenced. |
MeanReadLength | Mean read length. |
Coverage | Estimated genome coverage (X). |
Scheme | MLST scheme used. |
ST | Sequence type (from MLST). |
After_trim_per_base_sequence_content | Per-base sequence content after trimming. |
After_trim_overrepresented_sequences | Overrepresented sequences after trimming. |
After_trim_%GC | GC content (%) after trimming. |
After_trim_Total Bases | Total number of bases after trimming. |
After_trim_Total Sequences | Total number of sequences after trimming. |
After_trim_median_sequence_length | Median sequence length after trimming. |
After_trim_avg_sequence_length | Average sequence length after trimming. |
After_trim_total_deduplicated_percentage | Percentage of deduplicated sequences after trimming. |
After_trim_Sequence length | Sequence length after trimming. |
After_trim_adapter_content | Adapter content after trimming. |
N50 | N50 value from assembly (length at which 50% of the assembly is contained in contigs of this length or longer). |
Total length | Total length of the assembly. |
Total # of contigs | Total number of contigs in the assembly. |
QC Check | Final QC status for the sample (PASS/FAIL/Run FAIL). |
ANI | Average Nucleotide Identity to closest reference (from Skani). |
Align_fraction_ref | Fraction of reference genome aligned. |
Align_fraction_query | Fraction of query genome aligned. |
Ref_name | Closest reference genome name. |
Species | Species assignment based on reference. |
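To make the N50 definition in the table concrete, here is a minimal sketch that computes it from a list of contig lengths. In the pipeline the value is taken from the QUAST report; the function `n50` below is only an illustration of the definition, not QUAST's implementation.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the length L such that contigs of length >= L hold at least 50% of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the cumulative length reaches half the total.
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Total length is 300; the two longest contigs (100 + 80) cover at least 150.
example_n50 = n50([100, 80, 60, 40, 20])
```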
QC Check Logic
A sample is marked as FAIL if any of the following conditions are met:
- `Total # of contigs` > `max_contigs` (from config)
- `Total # of contigs` < `min_contigs` (from config)
- `Coverage` < `coverage` (from config)
- Assembly length is not within ±15% of the expected species genome size
- Missing contig or assembly length information
- Missing contig or assembly length information
Otherwise, the sample is marked as PASS.
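The conditions above can be sketched as a single function. This is an illustration under stated assumptions, not the code in the pipeline: the function name `qc_check`, its parameter names, and the threshold values in the usage lines are hypothetical, and the "Run FAIL" status listed in the column table is not modelled because its conditions are not described here.

```python
from typing import Optional


def qc_check(
    n_contigs: Optional[int],
    assembly_length: Optional[int],
    coverage: float,
    expected_genome_size: int,
    *,
    max_contigs: int,
    min_contigs: int,
    min_coverage: float,
    tolerance: float = 0.15,  # the +/-15% window around the expected genome size
) -> str:
    """Return "PASS" or "FAIL" following the conditions listed in this section."""
    # Missing contig or assembly length information.
    if n_contigs is None or assembly_length is None:
        return "FAIL"
    # Contig count outside the configured range.
    if n_contigs > max_contigs or n_contigs < min_contigs:
        return "FAIL"
    # Coverage below the configured threshold.
    if coverage < min_coverage:
        return "FAIL"
    # Assembly length outside +/-15% of the expected species genome size.
    lower = expected_genome_size * (1 - tolerance)
    upper = expected_genome_size * (1 + tolerance)
    if not lower <= assembly_length <= upper:
        return "FAIL"
    return "PASS"


# Hypothetical thresholds; the real ones are read from the pipeline config.
limits = dict(max_contigs=500, min_contigs=1, min_coverage=20)
ok = qc_check(120, 2_850_000, 85.0, 2_800_000, **limits)
too_long = qc_check(120, 3_300_000, 85.0, 2_800_000, **limits)
```

With a 2.8 Mb expected genome size, the accepted assembly length range is 2.38 Mb to 3.22 Mb, so the second call fails on length alone.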
Additional Notes
- The summary table may include samples that failed coverage or assembly steps, with missing values filled as "NA".
For more details, see the `summary` function in `QCD_report.smk`.