QC metrics - Snitkin-Lab-Umich/QCD GitHub Wiki


QC Metrics Wiki

This page describes the Quality Control (QC) metrics that the QCD pipeline generates and summarizes. The summary table is written to {prefix}_QC_summary.csv and combines results from multiple steps in the workflow.

Input Data Sources

The summary function merges the following data sources:

  • Coverage: Read coverage statistics per sample.
  • MLST: Multi-locus sequence typing results.
  • MultiQC FastQC: Raw and trimmed read quality metrics.
  • QUAST: Assembly quality metrics (N50, total length).
  • Contig distribution: Number of contigs per sample.
  • Skani: Closest reference genome and species assignment.
  • Genome size table: Reference genome sizes for length checks.
  • Failed samples (optional): Samples that failed the coverage or assembly step.
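
The merge described above can be pictured as an outer join keyed on the sample name. The following is a minimal sketch using only the standard library, not the pipeline's actual code: the function name merge_sources, the dict-of-dicts layout, and all sample values are hypothetical, and filling gaps with "NA" mirrors the behaviour noted under Additional Notes.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]


def merge_sources(sources: Iterable[Dict[str, Row]], missing: str = "NA") -> List[Row]:
    """Outer-join per-sample tables (hypothetical helper, keyed by sample name)."""
    tables = list(sources)

    # Collect every sample and column seen in any source, preserving first-seen order.
    samples: List[str] = []
    columns: List[str] = []
    for table in tables:
        for sample, row in table.items():
            if sample not in samples:
                samples.append(sample)
            for col in row:
                if col not in columns:
                    columns.append(col)

    merged: List[Row] = []
    for sample in samples:
        out: Row = {"Sample": sample}
        # Start with every column marked missing, then overlay what each source has.
        for col in columns:
            out[col] = missing
        for table in tables:
            out.update(table.get(sample, {}))
        merged.append(out)
    return merged


# Invented example values; S2 has coverage only, as if its assembly had failed.
coverage = {"S1": {"Coverage": "85.2"}, "S2": {"Coverage": "12.0"}}
mlst = {"S1": {"Scheme": "example_scheme", "ST": "258"}}
quast = {"S1": {"N50": "250000", "Total length": "5600000"}}

summary = merge_sources([coverage, mlst, quast])
for row in summary:
    print(row)
```

Samples that appear in only some sources still get a complete row, which is why a failed sample can show up in the summary with "NA" in its assembly columns.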

Output Columns

Column Name | Description
Sample | Sample identifier.
Total_reads | Total number of reads for the sample.
Total_bp | Total number of base pairs sequenced.
MeanReadLength | Mean read length.
Coverage | Estimated genome coverage (X).
Scheme | MLST scheme used.
ST | Sequence type (from MLST).
After_trim_per_base_sequence_content | Per-base sequence content after trimming.
After_trim_overrepresented_sequences | Overrepresented sequences after trimming.
After_trim_%GC | GC content (%) after trimming.
After_trim_Total Bases | Total number of bases after trimming.
After_trim_Total Sequences | Total number of sequences after trimming.
After_trim_median_sequence_length | Median sequence length after trimming.
After_trim_avg_sequence_length | Average sequence length after trimming.
After_trim_total_deduplicated_percentage | Percentage of deduplicated sequences after trimming.
After_trim_Sequence length | Sequence length after trimming.
After_trim_adapter_content | Adapter content after trimming.
N50 | N50 value from assembly (length at which 50% of the assembly is contained in contigs of this length or longer).
Total length | Total length of the assembly.
Total # of contigs | Total number of contigs in the assembly.
QC Check | Final QC status for the sample (PASS/FAIL/Run FAIL).
ANI | Average Nucleotide Identity to closest reference (from Skani).
Align_fraction_ref | Fraction of reference genome aligned.
Align_fraction_query | Fraction of query genome aligned.
Ref_name | Closest reference genome name.
Species | Species assignment based on reference.
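
To make the N50 definition in the table concrete, here is a small worked sketch. It follows the standard definition (the contig length at which the running total of contigs, sorted longest first, reaches half the assembly length); the function name n50 and the contig lengths are illustrative only, and the pipeline itself takes this value from QUAST rather than computing it this way.

```python
from typing import Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the N50 of an assembly given its contig lengths."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Invented contig lengths: total is 1000, so half the assembly is 500.
lengths = [400, 300, 200, 100]
print(n50(lengths))  # 400 alone covers 400; adding 300 reaches 700 >= 500, so N50 is 300
```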

QC Check Logic

A sample is marked as FAIL if any of the following conditions are met:

  • Total # of contigs > max_contigs (from config)
  • Total # of contigs < min_contigs (from config)
  • Coverage is below the coverage threshold (coverage, from config)
  • Total length (assembly length) is not within ±15% of the expected species genome size
  • Contig count or assembly length information is missing

Otherwise, the sample is marked as PASS.
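
The rules above can be sketched as a single function. This is an illustration under stated assumptions, not the code in QCD_report.smk: the function name qc_check, its argument names, and the threshold values in the example config are hypothetical, while the config keys (max_contigs, min_contigs, coverage) and the ±15% window come from the list above. The sketch does not model the "Run FAIL" status, and it assumes coverage and the expected genome size are always available.

```python
from typing import Mapping, Optional


def qc_check(
    n_contigs: Optional[int],
    assembly_length: Optional[int],
    coverage: float,
    expected_genome_size: int,
    config: Mapping[str, float],
    tolerance: float = 0.15,
) -> str:
    """Return "PASS" or "FAIL" following the conditions listed above."""
    # Missing contig count or assembly length fails immediately.
    if n_contigs is None or assembly_length is None:
        return "FAIL"
    # Contig count must lie within [min_contigs, max_contigs].
    if n_contigs > config["max_contigs"] or n_contigs < config["min_contigs"]:
        return "FAIL"
    # Coverage must meet the configured threshold.
    if coverage < config["coverage"]:
        return "FAIL"
    # Assembly length must be within +/- tolerance of the expected genome size.
    lower = expected_genome_size * (1 - tolerance)
    upper = expected_genome_size * (1 + tolerance)
    if not lower <= assembly_length <= upper:
        return "FAIL"
    return "PASS"


# Hypothetical thresholds; real values come from the pipeline config.
example_config = {"max_contigs": 500, "min_contigs": 1, "coverage": 20}

print(qc_check(120, 5_500_000, 60.0, 5_600_000, example_config))
print(qc_check(120, 4_000_000, 60.0, 5_600_000, example_config))
```

With an expected genome size of 5.6 Mb, the accepted window is roughly 4.76 Mb to 6.44 Mb, so the first call passes and the second fails on assembly length alone.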

Additional Notes

  • The summary table may include samples that failed coverage or assembly steps, with missing values filled as "NA".
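
When reading the summary table downstream, the "NA" placeholders need to be treated as missing rather than as text. A minimal sketch, assuming a comma-separated file with a header row: the helper name load_summary, the reduced column set, and the inline example data are invented for illustration. For a real run, the same function could be given open("{prefix}_QC_summary.csv", newline="") with the actual prefix substituted.

```python
import csv
import io
from collections import Counter

# Invented example with a subset of the real columns; S2 failed assembly.
example_csv = """Sample,Coverage,N50,Total length,Total # of contigs,QC Check
S1,85.2,250000,5600000,95,PASS
S2,12.0,NA,NA,NA,FAIL
"""


def load_summary(handle, missing="NA"):
    """Read summary rows, converting the missing-value marker to None."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({key: (None if value == missing else value) for key, value in row.items()})
    return rows


rows = load_summary(io.StringIO(example_csv))

# Tally QC statuses and list samples that have at least one missing value.
status_counts = Counter(row["QC Check"] for row in rows)
incomplete = [row["Sample"] for row in rows if None in row.values()]

print(dict(status_counts))
print(incomplete)
```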

For more details, see the summary function in QCD_report.smk.