Pipeline Overview - CDCgov/phoenix GitHub Wiki

Pipeline Summary:

Pipeline Workflow


QC

  1. PhiX174 read removal and adapter removal using BBDuk

  2. Filtering, trimming, and base correction using fastp that includes:

    • quality trimming with a sliding window of size 20 and a quality threshold of 30
    • quality pruning at 3' and 5' ends
    • removal of short reads
    • forced polyG tail trimming
  3. Contamination check of trimmed reads using Kraken2.
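
To make the window-based quality trimming in step 2 concrete, here is a minimal Python sketch. It is not fastp's implementation: it assumes Phred+33 quality encoding, a left-to-right scan, and a cut at the first window whose mean quality falls below the threshold. The function name `sliding_window_trim` is hypothetical.

```python
def sliding_window_trim(seq, qual, window=20, min_mean_q=30, offset=33):
    """Cut a read at the first window whose mean Phred quality is below min_mean_q.

    Assumptions (not taken from fastp): Phred+33 encoding, left-to-right scan,
    and reads shorter than one window are returned unchanged.
    """
    scores = [ord(ch) - offset for ch in qual]
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_q:
            # Drop this window and everything after it.
            return seq[:start], qual[:start]
    return seq, qual


if __name__ == "__main__":
    # 20 high-quality bases (Q40) followed by 20 low-quality bases (Q2).
    trimmed_seq, trimmed_qual = sliding_window_trim("A" * 40, "I" * 20 + "#" * 20)
    print(len(trimmed_seq))
```

A read trimmed this way can end up very short, which is why a separate short-read removal step follows in the list above.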

Analysis of Trimmed Reads

  1. QC Metrics Generated (all metrics are generated post-trimming for both paired and unpaired reads):
    • Number of total reads/bases
    • Percent of reads/bases remaining (from raw sequences)
    • Number of Q20/Q30 bases
    • Percent Q20/Q30 bases
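
The metrics listed above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own code: it assumes Phred+33 quality strings are already in memory and that the raw read and base counts are supplied by the caller. The function name and dictionary keys are hypothetical.

```python
def read_qc_metrics(quality_strings, raw_read_count, raw_base_count, offset=33):
    """Summarize trimmed reads from their quality strings (assumed Phred+33)."""
    total_bases = 0
    q20_bases = 0
    q30_bases = 0
    for qual in quality_strings:
        for ch in qual:
            score = ord(ch) - offset
            total_bases += 1
            if score >= 20:
                q20_bases += 1
            if score >= 30:
                q30_bases += 1

    def percent(part, whole):
        # Guard against empty input rather than dividing by zero.
        return 100.0 * part / whole if whole else 0.0

    total_reads = len(quality_strings)
    return {
        "total_reads": total_reads,
        "total_bases": total_bases,
        "pct_reads_remaining": percent(total_reads, raw_read_count),
        "pct_bases_remaining": percent(total_bases, raw_base_count),
        "q20_bases": q20_bases,
        "q30_bases": q30_bases,
        "pct_q20": percent(q20_bases, total_bases),
        "pct_q30": percent(q30_bases, total_bases),
    }


if __name__ == "__main__":
    print(read_qc_metrics(["II5#", "IIII"], raw_read_count=4, raw_base_count=16))
```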

Analysis using Trimmed Reads

  1. Gene detection and allele calling for antibiotic resistance (AR) genes using srst2 in gene mode. We have curated an AR gene database that is a combination of three AR gene databases with redundancies removed and gene names standardized.
    • This step is only run with -entry CDC_PHOENIX
    • The curated database includes genes from these AR gene databases (for specifics on versions see "database updates" section of CHANGELOG.md):
  2. Contamination is checked by using Kraken2 on the trimmed reads.
  3. srst2 MLST
    • This step is only run with -entry CDC_PHOENIX
    • For PHoeNIx >=v2.0.0 a "custom" MLST database is used (the same one is used for the MLST program). This database is created by pulling organism, scheme and allele information from a static version of PubMLST.org (https://pubmlst.org/static/data/dbases.xml) to make a database in the form expected by SRST2 and the MLST program.

Assembly

  1. Assembly of trimmed reads using SPAdes
  2. Filter the assembly to remove any scaffolds shorter than 500 bp.
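
As an illustration of the 500 bp length filter, the following Python sketch parses FASTA text and keeps only scaffolds of at least 500 bp. It is a simplified stand-in for whatever the pipeline runs internally; both function names are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_scaffolds(records, min_length=500):
    """Keep scaffolds whose length is at least min_length bases."""
    return [(header, seq) for header, seq in records if len(seq) >= min_length]


if __name__ == "__main__":
    fasta = ">short\n" + "A" * 499 + "\n>long\n" + "C" * 250 + "\n" + "G" * 250 + "\n"
    kept = filter_scaffolds(parse_fasta(fasta))
    print([header for header, _ in kept])
```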

QC of Assembled Scaffolds >= 500bps

  1. Assess assembly quality using QUAST and custom scripts
  2. QC Metrics Generated:
  • Trimmed coverage (total trimmed bases / assembly length)
  • Assembly ratio (assembly size / median genome size of species)
    • The NCBI Assembly stats file is calculated based on this file from NCBI.
    • The NCBI Assembly stats file is written in a tab delimited format in the following order
      1. Species
      2. Assembly_Size_Min
      3. Assembly_Size_Max
      4. Assembly_Size_Median
      5. Assembly_Size_Mean
      6. Assembly_Size_StDev
      7. Assembly_count
      8. GC_Min
      9. GC_Max
      10. GC_Median
      11. GC_Mean
      12. GC_Stdev
      13. GC_count
      14. CDS_Min
      15. CDS_Max
      16. CDS_Median
      17. CDS_Mean
      18. CDS_Stdev
      19. CDS_count
      20. Consensus_TAXID
    • Standard deviation is only calculated for cases where there are >10 reference genomes
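
The two ratios above, and the lookup of the median genome size, can be sketched as follows. This is an assumption-laden illustration rather than the pipeline's script: it reads the species from the first column and the median assembly size from the fourth column, following the column order listed above, and it skips any row (such as a header) whose fourth field is not numeric. All function names are hypothetical.

```python
import csv
import io


def load_median_sizes(stats_text):
    """Map species name -> median assembly size from tab-delimited stats text."""
    medians = {}
    for row in csv.reader(io.StringIO(stats_text), delimiter="\t"):
        if len(row) < 4:
            continue
        try:
            medians[row[0]] = float(row[3])
        except ValueError:
            # Header line or malformed row: skip it.
            continue
    return medians


def trimmed_coverage(total_trimmed_bases, assembly_length):
    """Trimmed coverage = total trimmed bases / assembly length."""
    return total_trimmed_bases / assembly_length


def assembly_ratio(assembly_length, median_genome_size):
    """Assembly ratio = assembly size / median genome size of the species."""
    return assembly_length / median_genome_size


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    stats = "Species\tMin\tMax\tMedian\nExample species\t4500000\t5900000\t5000000\n"
    medians = load_median_sizes(stats)
    print(trimmed_coverage(250_000_000, 5_000_000))
    print(assembly_ratio(5_500_000, medians["Example species"]))
```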

Analysis of Assembled Scaffolds >= 500bps

  1. Assess genome assembly for completeness using BUSCO. This step is only run with -entry CDC_PHOENIX

  2. The Mash distance is calculated against a pre-calculated Mash sketch of all complete RefSeq bacterial genomes, and the 20 best isolate matches by distance are passed to FastANI to speed up species identification.

    • Note that because we take the top 20 distances it is possible to get more than 20 isolates passed to FastANI. In other words, if the 20th distance has several isolates that are the same distance from the query sequence all those isolates are passed to FastANI.
  3. Calculate the average nucleotide identity (between genomes) using FastANI to determine species.

  4. Type multiple loci to characterize isolates of microbial species using MLST

    • For PHoeNIx <v2.0.0 the database that is included in the MLST program is used.
    • For PHoeNIx >=v2.0.0 a "custom" MLST database is used (the same one is used for SRST2). This database is created by pulling organism, scheme and allele information from a static version of PubMLST.org (https://pubmlst.org/static/data/dbases.xml) to make a database in the form expected by SRST2 and the MLST program.
  5. AR genes and hypervirulence genes are detected using GAMMA. We have curated an AR gene database that is a combination of three AR gene databases with redundancies removed and gene names standardized. Plasmid markers are detected with GAMMA-S.

  6. PROKKA is run on the scaffolds to generate a translated .faa file and an annotated .gff file, which are passed to AMRFinder.
  7. AMRFinderPlus is run and the point mutations are reported in the Phoenix_Output_Report.tsv. The translated .faa and annotated .gff files from PROKKA are passed to AMRFinder as described in the AMRFinder documentation.
  8. In addition to running Kraken2 on the trimmed reads, Kraken2 is run on the weighted assembled scaffolds using the same database. This additional step allows us to check whether any contamination made it into the assembly, and this taxon call is used if FastANI fails.
    • Kraken2 is also run in its normal (non-weighted) mode on the scaffolds. This step is only run with -entry CDC_PHOENIX
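
The tie handling in the Mash-to-FastANI hand-off described earlier in this list can be sketched as below. This is one reading of that note, not the pipeline's code: hits are ranked by distance, the distance of the 20th hit becomes the cutoff, and every hit at or below that cutoff is kept, so ties at the cutoff push the total above 20. The function name and the `(genome_id, distance)` tuple layout are hypothetical.

```python
def select_for_fastani(mash_hits, top_n=20):
    """Keep the top_n closest hits plus any hits tied with the top_n-th distance.

    mash_hits is a list of (genome_id, distance) tuples.
    """
    ranked = sorted(mash_hits, key=lambda hit: hit[1])
    if len(ranked) <= top_n:
        return ranked
    cutoff = ranked[top_n - 1][1]
    return [hit for hit in ranked if hit[1] <= cutoff]


if __name__ == "__main__":
    # 19 unique distances, then three genomes tied at 0.05, then one farther away.
    hits = [(f"g{i}", 0.001 * i) for i in range(1, 20)]
    hits += [("g20", 0.05), ("g21", 0.05), ("g22", 0.05), ("g23", 0.09)]
    print(len(select_for_fastani(hits)))
```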

Pipeline QC Checks

Notes on Evaluating Genome Assembly Quality

There are 3 "C"s we are concerned with when evaluating genome assemblies:

  • Contiguity: the size and number of contigs.
  • Completeness: the content of contigs, particularly the gene content.
  • Correctness: ordering and location of contigs.

Auto PASS/FAIL

Evaluating the quality of a genome assembly is more of an art than a matter of clear-cut rules. The auto "PASS/FAIL" metrics are what we deem to be the bare minimum quality standards:

  • >30x coverage (default, but can be increased with --coverage in phx >=2.0.0)
  • Assembly ratio stdev <2.58
    • The assembly ratio is the ratio between the total number of bases in the sample assembly compared to the expected genome size.
  • Min assembly length >1,000,000bp
  • <500 scaffolds in assembly
  • Integrity of FASTQ files:
    • Uncorrupted files
    • R1 and R2 must have an equal number of reads
    • There must be reads remaining after trimming steps
    • There must be scaffolds remaining after scaffolds < 500 bp are filtered out

In addition to this information, you should also consider other QC metrics (see more below), what species is being sequenced (some species complexes might have lower quality assemblies) and what you plan to do with the data. If there are particular metrics you are interested in then please submit a feature request for consideration.
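
The numeric part of the auto PASS/FAIL criteria can be expressed as a small check. The sketch below is a literal reading of the thresholds listed above and is not the pipeline's implementation. It assumes strict comparisons at the boundaries, skips the assembly ratio check when no standard deviation is available (passed as `None`), and leaves out the FASTQ integrity checks. The function name and failure labels are hypothetical.

```python
def auto_qc_failures(coverage, ratio_stdev, assembly_length, scaffold_count,
                     min_coverage=30):
    """Return the list of failed criteria; an empty list means PASS."""
    failures = []
    if not coverage > min_coverage:
        failures.append("coverage")
    # Assumption: with no stdev available, this criterion is not evaluated.
    if ratio_stdev is not None and not ratio_stdev < 2.58:
        failures.append("assembly_ratio_stdev")
    if not assembly_length > 1_000_000:
        failures.append("assembly_length")
    if not scaffold_count < 500:
        failures.append("scaffold_count")
    return failures


if __name__ == "__main__":
    print(auto_qc_failures(85.0, 0.4, 5_100_000, 120))
    print(auto_qc_failures(25.0, 3.0, 900_000, 600))
```

Returning the failed criteria rather than a bare boolean keeps the reason for a FAIL visible in any downstream report.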

WARNINGS

Warnings are defined as "out of line with what is expected and MAY cause problems downstream". The following will produce WARNINGS in the synopsis file:

  • <1,000,000 total reads in either the raw or the trimmed reads
  • % of reads with a Q30 average: <90% for R1 or <70% for R2 (checked for both raw and trimmed reads)
  • >200 and <500 scaffolds
  • %GC content >2.58 stdev away from the mean %GC content of the species determined
  • Contamination check on the Kraken2 results for trimmed reads and weighted assembly scaffolds
    • <30% unclassified reads/weighted scaffolds
    • <50% of reads/weighted scaffolds assigned to top genus hit
    • Confirm there is only 1 genus with >25% of assigned reads/weighted scaffolds
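
A few of the warning conditions above can be sketched as simple threshold checks. This is an illustrative subset only (read counts, scaffold count and %GC), not the pipeline's code. It assumes the species mean and standard deviation for %GC are supplied by the caller and skips the %GC check when no standard deviation is available. The function name and warning labels are hypothetical.

```python
def synopsis_warnings(raw_reads, trimmed_reads, scaffold_count,
                      sample_gc, species_gc_mean, species_gc_stdev):
    """Return labels for the warning conditions that are triggered."""
    warnings = []
    if raw_reads < 1_000_000:
        warnings.append("raw_read_count")
    if trimmed_reads < 1_000_000:
        warnings.append("trimmed_read_count")
    if 200 < scaffold_count < 500:
        warnings.append("scaffold_count")
    # Number of standard deviations between the sample %GC and the species mean.
    if species_gc_stdev:
        distance = abs(sample_gc - species_gc_mean) / species_gc_stdev
        if distance > 2.58:
            warnings.append("gc_content")
    return warnings


if __name__ == "__main__":
    print(synopsis_warnings(2_000_000, 900_000, 250, 50.8, 50.5, 0.2))
```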

ALERTS

Alerts are defined as "something to note, but doesn't mean it's a poor-quality assembly". The following will produce ALERTS in the synopsis file:

  • No orphaned reads found after trimming
  • <10 reference genomes for species identified so no stdev for assembly ratio or %GC content calculated
  • >150x coverage or between 30-40x coverage
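
The alert conditions above follow the same pattern. The sketch below is a literal reading of the three bullets, not the pipeline's code, and it assumes the 30-40x range is inclusive at both ends. The function name and alert labels are hypothetical.

```python
def synopsis_alerts(orphaned_reads, reference_genome_count, coverage):
    """Return labels for the alert conditions that are triggered."""
    alerts = []
    if orphaned_reads == 0:
        alerts.append("no_orphaned_reads")
    if reference_genome_count < 10:
        alerts.append("no_stdev_available")
    # Assumption: the 30-40x range includes both endpoints.
    if coverage > 150 or 30 <= coverage <= 40:
        alerts.append("coverage")
    return alerts


if __name__ == "__main__":
    print(synopsis_alerts(orphaned_reads=0, reference_genome_count=4, coverage=35.0))
```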