Pipeline Overview - CDCgov/phoenix GitHub Wiki
Pipeline Summary:
QC
- PhiX174 read removal and adapter removal using BBDuK.
- Filtering, trimming, and base correction using fastp, which includes:
  - quality trimming with a window size of 20 and quality of 30
  - quality pruning at the 3' and 5' ends
  - removal of short reads
  - forced polyG tail trimming
- Contamination check of trimmed reads using Kraken2.
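To make the "window size of 20 and quality of 30" setting concrete, here is a minimal sketch of sliding-window quality trimming. It is a simplified illustration only and not fastp's actual implementation; the function names and the choice to cut at the start of the first failing window are assumptions.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(quals, window=20, min_mean_q=30):
    """Return how many leading bases to keep.

    Windows are scanned from the 5' end. At the first window whose mean
    quality is below min_mean_q, the read is cut at the start of that window.
    (Assumption: simplified behaviour, for illustration only.)
    """
    if len(quals) < window:
        # Too short to evaluate a full window; left for the short-read filter.
        return len(quals)
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean_q:
            return start
    return len(quals)


# A read with 30 good bases followed by 20 poor ones.
keep = sliding_window_trim([40] * 30 + [10] * 20)
```

Reads that end up very short after this kind of trimming would then be dropped by the short-read removal step.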
Analysis of Trimmed Reads
- QC metrics generated (calculated post-trimming for both paired and unpaired reads):
  - Number of total reads/bases
  - Percent of reads/bases remaining (from raw sequences)
  - Number of Q20/Q30 bases
  - Percent Q20/Q30 bases
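The metrics listed above can be sketched as follows. This is an illustrative calculation only, assuming Phred+33 quality strings and counting a base as Q20/Q30 when its score is at least 20/30; `read_qc_metrics` and its arguments are hypothetical names, not part of PHoeNIx.

```python
def read_qc_metrics(quality_strings, raw_total_bases=None):
    """Summarise read counts and Q20/Q30 content from FASTQ quality strings."""
    total_reads = total_bases = q20 = q30 = 0
    for qs in quality_strings:
        total_reads += 1
        for ch in qs:
            q = ord(ch) - 33  # Phred+33 decoding (assumed encoding)
            total_bases += 1
            if q >= 20:
                q20 += 1
            if q >= 30:
                q30 += 1
    metrics = {
        "total_reads": total_reads,
        "total_bases": total_bases,
        "q20_bases": q20,
        "q30_bases": q30,
        "pct_q20": 100 * q20 / total_bases if total_bases else 0.0,
        "pct_q30": 100 * q30 / total_bases if total_bases else 0.0,
    }
    if raw_total_bases:
        # Percent of bases remaining relative to the raw (pre-trimming) data.
        metrics["pct_bases_remaining"] = 100 * total_bases / raw_total_bases
    return metrics


example = read_qc_metrics(["II5", "I+"], raw_total_bases=10)
```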
Analysis using Trimmed Reads
- Gene detection and allele calling for antibiotic resistance (AR) using srst2 in gene mode. We have curated an AR gene database that is a combination of three AR gene databases with redundancies removed and gene names standardized.
  - This step is only run with -entry CDC_PHOENIX
  - The curated database includes genes from these AR gene databases (for specifics on versions see "database updates" section of CHANGELOG.md):
- Contamination is checked by using Kraken2 on the trimmed reads.
- srst2 MLST
  - This step is only run with -entry CDC_PHOENIX
  - For PHoeNIx >=2.0.0 a "custom" MLST database is used (the same one is used for the MLST program). This database is created by pulling organism, scheme, and allele information from a static version of PubMLST.org (https://pubmlst.org/static/data/dbases.xml) to make a database in the form expected by SRST2 and the MLST program.
Assembly
- Assembly of trimmed reads using SPAdes
- Filter the assembly to remove any scaffolds less than 500 bp in length.
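The length filter can be sketched in a few lines. This is a minimal illustration, assuming a plain FASTA input; `parse_fasta` and `filter_scaffolds` are hypothetical helpers and not the script PHoeNIx actually uses.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_scaffolds(records, min_len=500):
    """Keep only scaffolds that are at least min_len bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


fasta = [">a", "A" * 300, "C" * 300, ">b", "G" * 499, ">c", "T" * 500]
kept = filter_scaffolds(parse_fasta(fasta))
kept_names = [name for name, _ in kept]
```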
QC of Assembled Scaffolds >= 500bps
- Assess assembly quality using QUAST and custom scripts
- QC metrics generated:
  - Trimmed coverage (total trimmed bases / assembly length)
  - Assembly ratio (assembly size / median genome size of species)
    - The NCBI Assembly stats file is calculated based on this file from NCBI.
    - The NCBI Assembly stats file is written in a tab-delimited format with columns in the following order:
      - Species
      - Assembly_Size_Min
      - Assembly_Size_Max
      - Assembly_Size_Median
      - Assembly_Size_Mean
      - Assembly_Size_StDev
      - Assembly_count
      - GC_Min
      - GC_Max
      - GC_Median
      - GC_Mean
      - GC_Stdev
      - GC_count
      - CDS_Min
      - CDS_Max
      - CDS_Median
      - CDS_Mean
      - CDS_Stdev
      - CDS_count
      - Consensus_TAXID
    - Standard deviation is only calculated for species that have >10 reference genomes
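As a rough sketch of how these two metrics relate to the stats file, the example below parses one tab-delimited row and applies the two formulas given above. The column names follow the order listed (with the median size column assumed to be `Assembly_Size_Median`), and the helper names and example numbers are hypothetical.

```python
STATS_COLUMNS = [
    "Species",
    "Assembly_Size_Min", "Assembly_Size_Max", "Assembly_Size_Median",
    "Assembly_Size_Mean", "Assembly_Size_StDev", "Assembly_count",
    "GC_Min", "GC_Max", "GC_Median", "GC_Mean", "GC_Stdev", "GC_count",
    "CDS_Min", "CDS_Max", "CDS_Median", "CDS_Mean", "CDS_Stdev", "CDS_count",
    "Consensus_TAXID",
]


def parse_stats_line(line):
    """Map one tab-delimited row of the stats file onto the column names."""
    return dict(zip(STATS_COLUMNS, line.rstrip("\n").split("\t")))


def trimmed_coverage(total_trimmed_bases, assembly_length):
    """Trimmed coverage = total trimmed bases / assembly length."""
    return total_trimmed_bases / assembly_length


def assembly_ratio(assembly_length, median_genome_size):
    """Assembly ratio = assembly size / median genome size of the species."""
    return assembly_length / median_genome_size


# Made-up row: species name followed by 19 placeholder values.
row = parse_stats_line("\t".join(["Escherichia coli"] + [str(i) for i in range(1, 20)]))
coverage = trimmed_coverage(250_000_000, 5_000_000)
ratio = assembly_ratio(5_200_000, 5_000_000)
```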
Analysis of Assembled Scaffolds >= 500bps
- Assess genome assembly for completeness using BUSCO. This step is only run with -entry CDC_PHOENIX
- The Mash distance is calculated against a pre-calculated sketch of all complete RefSeq bacteria created with Mash, and the top 20 best isolate matches based on distance are passed to FastANI for increased speed in species ID.
  - Note that because we take the top 20 distances it is possible to get more than 20 isolates passed to FastANI. In other words, if several isolates share the 20th distance from the query sequence, all of those isolates are passed to FastANI.
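The tie behaviour described in the note can be sketched as below. This is one possible reading, assuming every hit tied with the 20th smallest distance is kept; `top_hits_with_ties` is a hypothetical helper and the example distances are made up.

```python
def top_hits_with_ties(hits, n=20):
    """Return the n closest hits plus any hits tied with the nth distance.

    hits is a list of (isolate_name, distance) pairs.
    """
    ordered = sorted(hits, key=lambda hit: hit[1])
    if len(ordered) <= n:
        return ordered
    cutoff = ordered[n - 1][1]  # distance of the nth best hit
    return [hit for hit in ordered if hit[1] <= cutoff]


# With n=3, isolates "c" and "d" share the 3rd distance, so 4 hits come back.
hits = [("e", 0.05), ("a", 0.01), ("c", 0.03), ("b", 0.02), ("d", 0.03)]
selected = top_hits_with_ties(hits, n=3)
selected_names = [name for name, _ in selected]
```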
- Calculate the average nucleotide identity (between genomes) using FastANI to determine species.
- Type multiple loci to characterize isolates of microbial species using MLST
  - For PHoeNIx <v2.0.0 the database that is included in the MLST program is used.
  - For PHoeNIx >=v2.0.0 a "custom" MLST database is used (the same one is used for SRST2). This database is created by pulling organism, scheme, and allele information from a static version of PubMLST.org (https://pubmlst.org/static/data/dbases.xml) to make a database in the form expected by SRST2 and the MLST program.
- AR genes and hypervirulence genes are detected using GAMMA. We have curated an AR gene database that is a combination of three AR gene databases with redundancies removed and gene names standardized. Plasmid markers are detected with GAMMA-S.
  - The curated database includes genes from these AR gene databases (for specifics on versions see "database updates" section of CHANGELOG.md):
  - Additional databases are used:
    - Database of hypervirulence genes from Russo et al.
    - PlasmidFinder
      - Reference paper
      - PHoeNIx v1.1.0 includes PlasmidFinder up to the 2022-03-30 commit 9002e72
- PROKKA is run on the scaffolds to generate a translated .faa file and an annotated .gff file, which will be passed to AMRFinder.
- AMRFinderPlus is run and the point mutations are reported in the Phoenix_Output_Report.tsv. The translated .faa and annotated .gff files from PROKKA are passed to AMRFinder as described in the AMRFinder documentation.
  - The database that is included in PHX v1.1.0 is 2022-08-19.1; this matches what is in the combined database.
- In addition to running Kraken2 on the trimmed reads, Kraken2 is run on the weighted assembled scaffolds using the same database. This additional step allows us to check if any contamination made it into the assembly, and this taxa call will be used if FastANI fails.
- Kraken2 is also run normally on the scaffolds (non-weighted). This step is only run with -entry CDC_PHOENIX
Pipeline QC Checks
Notes on Evaluating Genome Assembly Quality
There are 3 "C"s we are concerned with when evaluating genome assemblies:
- Contiguity: the size and number of contigs.
- Completeness: the content of contigs, particularly the gene content.
- Correctness: ordering and location of contigs.
Auto PASS/FAIL
Evaluating the quality of a genome assembly is more of an art than a matter of clear-cut rules. The auto "PASS/FAIL" metrics are what we deem to be the bare minimum quality standards:
- >30x coverage (default, but can be increased with --coverage in phx >=2.0.0)
- Assembly ratio stdev <2.58
  - The assembly ratio is the ratio of the total number of bases in the sample assembly to the expected genome size.
- Min assembly length >1,000,000 bp
- <500 scaffolds in assembly
- Integrity of FASTQ files:
  - Uncorrupted files
  - R1 and R2 must have an equal number of reads
- There must be reads remaining after trimming steps
- There must be scaffolds remaining after filtering out scaffolds <500 bp
In addition to this information, staff should also consider other QC metrics (see more below), what species is being sequenced (some species complexes might have lower quality assemblies), and what they plan to do with the data. If there are particular metrics you are interested in, please submit a feature request for consideration.
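The auto PASS/FAIL thresholds above can be expressed as a small check. This is an illustrative sketch, not the pipeline's own code: the function name, argument names, and failure labels are hypothetical, the file-corruption check is left out, and skipping the assembly ratio check when no stdev is available is an assumption.

```python
def auto_pass_fail(coverage, assembly_ratio_stdev, assembly_length,
                   scaffold_count, r1_reads, r2_reads, trimmed_reads,
                   min_coverage=30):
    """Return ("PASS" or "FAIL", list of failed checks)."""
    failures = []
    if not coverage > min_coverage:
        failures.append("coverage")
    # Assumption: when no stdev could be calculated, this check is skipped.
    if assembly_ratio_stdev is not None and not assembly_ratio_stdev < 2.58:
        failures.append("assembly_ratio_stdev")
    if not assembly_length > 1_000_000:
        failures.append("assembly_length")
    if scaffold_count == 0:
        failures.append("no_scaffolds_after_filtering")
    elif not scaffold_count < 500:
        failures.append("scaffold_count")
    if r1_reads != r2_reads:
        failures.append("unequal_r1_r2_reads")
    if trimmed_reads == 0:
        failures.append("no_reads_after_trimming")
    return ("PASS" if not failures else "FAIL", failures)


good = auto_pass_fail(coverage=55.0, assembly_ratio_stdev=0.4,
                      assembly_length=5_100_000, scaffold_count=87,
                      r1_reads=1000, r2_reads=1000, trimmed_reads=900)
bad = auto_pass_fail(coverage=30.0, assembly_ratio_stdev=3.1,
                     assembly_length=900_000, scaffold_count=620,
                     r1_reads=1000, r2_reads=998, trimmed_reads=900)
```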
WARNINGS
Warnings are defined as "out of line with what is expected and MAY cause problems downstream". The following will produce WARNINGS in the synopsis file:
- <1,000,000 total reads (checked for both raw and trimmed reads)
- % reads with Q30 average for R1 (<90%) and R2 (<70%) (checked for both raw and trimmed reads)
- >200 and <500 scaffolds
- Checking that %GC content isn't >2.58 stdev away from the mean %GC content for the species determined
- Contamination check on Kraken2 trimmed and assembly weighted data
  - <30% unclassified reads/weighted scaffolds
  - <50% of reads/weighted scaffolds assigned to top genus hit
  - Confirm there is only 1 genus with >25% of assigned reads/weighted scaffolds
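A few of these warning conditions are sketched below for illustration. The function and label names are hypothetical, only a subset of the checks is covered (the unclassified and top-genus percentage thresholds are left out), and the GC check assumes the species mean and stdev come from the NCBI Assembly stats file.

```python
def qc_warnings(total_reads, r1_q30_pct, r2_q30_pct, scaffold_count,
                genus_percents, gc_pct=None, gc_mean=None, gc_stdev=None):
    """Return a list of warning labels for one sample."""
    warnings = []
    if total_reads < 1_000_000:
        warnings.append("low_read_count")
    if r1_q30_pct < 90:
        warnings.append("r1_q30")
    if r2_q30_pct < 70:
        warnings.append("r2_q30")
    if 200 < scaffold_count < 500:
        warnings.append("scaffold_count")
    # More than one genus holding >25% of assigned reads/weighted scaffolds.
    if sum(1 for pct in genus_percents.values() if pct > 25) > 1:
        warnings.append("multiple_genera")
    # %GC more than 2.58 standard deviations from the species mean.
    if gc_pct is not None and gc_mean is not None and gc_stdev:
        if abs(gc_pct - gc_mean) / gc_stdev > 2.58:
            warnings.append("gc_content")
    return warnings


sample_warnings = qc_warnings(
    total_reads=800_000, r1_q30_pct=92, r2_q30_pct=65, scaffold_count=250,
    genus_percents={"Escherichia": 60, "Klebsiella": 30},
    gc_pct=50.8, gc_mean=50.6, gc_stdev=0.3,
)
```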
ALERTS
Alerts are defined as "something to note, but doesn't mean it's a poor-quality assembly". The following will produce ALERTS in the synopsis file:
- No orphaned reads found after trimming
- <10 reference genomes for species identified so no stdev for assembly ratio or %GC content calculated
- >150x coverage or between 30-40x coverage
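The alert conditions can be sketched in the same way. This is a minimal illustration with hypothetical names; treating the 30-40x range as inclusive at both ends is an assumption.

```python
def qc_alerts(orphaned_reads, reference_genome_count, coverage):
    """Return a list of alert labels for one sample."""
    alerts = []
    if orphaned_reads == 0:
        alerts.append("no_orphaned_reads")
    if reference_genome_count < 10:
        # No stdev for assembly ratio or %GC content can be calculated.
        alerts.append("few_reference_genomes")
    if coverage > 150 or 30 <= coverage <= 40:
        alerts.append("coverage")
    return alerts


sample_alerts = qc_alerts(orphaned_reads=0, reference_genome_count=4, coverage=35.0)
```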