Pipeline Overview - CDCgov/phoenix GitHub Wiki
- PhiX174 read removal and adapter removal using BBDuK.
- Filtering, trimming, and base correction using fastp, which includes:
  - quality trimming with a window size of 20 and quality of 30
  - quality pruning at 3' and 5' ends
  - removal of short reads
  - forced polyG tail trimming
- Contamination check of trimmed reads using Kraken2.
- QC metrics generated (all metrics are generated post-trimming for both paired and unpaired reads):
  - Number of total reads/bases
  - Percent of reads/bases remaining (from raw sequences)
  - Number of Q20/Q30 bases
  - Percent Q20/Q30 bases
- Gene detection and allele calling for antibiotic resistance (AR) using srst2 in gene mode. We have curated an AR gene database that is a combination of three AR gene databases with redundancies removed and gene names standardized.
  - This step is only run with -entry CDC_PHOENIX
  - The curated database includes genes from three AR gene databases (for specifics on versions see the "database updates" section of CHANGELOG.md).
- Contamination is checked by using Kraken2 on the trimmed reads.
- srst2 MLST
  - This step is only run with -entry CDC_PHOENIX
  - For PHoeNIx >=2.0.0 a "custom" MLST database is used (the same one is used for the MLST program). This database is created by pulling organism, scheme and allele information from a static version of PubMLST.org (https://pubmlst.org/static/data/dbases.xml) to make a database in the form expected by SRST2 and the MLST program.
- Assembly of trimmed reads using SPAdes
  - Filter the assembly to remove any scaffolds less than 500 bp in length.
- Assess assembly quality using QUAST and custom scripts
  - QC metrics generated:
    - Trimmed coverage (total trimmed bases / assembly length)
    - Assembly ratio (assembly size / median genome size of species)
  - The NCBI Assembly stats file is calculated based on a file from NCBI.
  - The NCBI Assembly stats file is written in a tab-delimited format with columns in the following order:
    - Species
    - Assembly_Size_Min
    - Assembly_Size_Max
    - Assembly_Size_Median
    - Assembly_Size_Mean
    - Assembly_Size_StDev
    - Assembly_count
    - GC_Min
    - GC_Max
    - GC_Median
    - GC_Mean
    - GC_Stdev
    - GC_count
    - CDS_Min
    - CDS_Max
    - CDS_Median
    - CDS_Mean
    - CDS_Stdev
    - CDS_count
    - Consensus_TAXID
  - Standard deviation is only calculated for cases where there are >10 reference genomes
- Assess genome assembly for completeness using BUSCO. This step is only run with -entry CDC_PHOENIX
- The Mash distance is calculated against a pre-calculated sketch of all complete RefSeq bacteria created with Mash, and the top 20 best isolate matches based on distance are passed into FastANI for increased speed in species ID.
  - Note that because we take the top 20 distances it is possible for more than 20 isolates to be passed to FastANI. In other words, if several isolates share the 20th distance from the query sequence, all of those isolates are passed to FastANI.
- Calculate the average nucleotide identity (between genomes) using FastANI to determine species.
- Type multiple loci to characterize isolates of microbial species using MLST
  - For PHoeNIx <v2.0.0 the database that is included in the MLST program is used.
  - For PHoeNIx >=v2.0.0 a "custom" MLST database is used (the same one is used for SRST2). This database is created by pulling organism, scheme and allele information from a static version of PubMLST.org (https://pubmlst.org/static/data/dbases.xml) to make a database in the form expected by SRST2 and the MLST program.
- AR genes and hypervirulence genes are detected using GAMMA. We have curated an AR gene database that is a combination of three AR gene databases with redundancies removed and gene names standardized. Plasmid markers are detected with GAMMA-S.
  - The curated database includes genes from three AR gene databases (for specifics on versions see the "database updates" section of CHANGELOG.md).
  - Additional databases are used:
    - Database of hypervirulence genes from Russo et al.
    - PlasmidFinder
      - Reference paper
      - PHoeNIx v1.1.0 includes the database up to the 2022-03-30 commit 9002e72
- PROKKA is run on the scaffolds to generate a translated .faa file and an annotated .gff file, which are passed to AMRFinder.
- AMRFinderPlus is run and the point mutations are reported in the Phoenix_Output_Report.tsv. The translated .faa and annotated .gff files from PROKKA are passed to AMRFinder as described in the AMRFinder documentation.
  - The database included in PHX v1.1.0 is 2022-08-19.1; this matches what is in the combined database.
- In addition to running Kraken2 on the trimmed reads, Kraken2 is run on the weighted assembled scaffolds using the same database. This additional step allows us to check if any contamination made it into the assembly, and this taxa call will be used if FastANI fails.
- Kraken2 is also run in its normal mode on the scaffolds (non-weighted). This step is only run with -entry CDC_PHOENIX
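The tie-handling rule described in the Mash step above can be sketched as follows. This is only an illustration, not PHoeNIx's actual code: the function name `select_top_hits` and the `(isolate_name, mash_distance)` tuple shape are assumptions, and it interprets "top 20 distances" as "the 20 closest hits plus anything tied with the 20th".

```python
def select_top_hits(hits, n=20):
    """Return the n closest hits plus any hits tied with the n-th distance.

    `hits` is a list of (isolate_name, mash_distance) tuples; this shape is
    hypothetical and chosen only for illustration.
    """
    ranked = sorted(hits, key=lambda hit: hit[1])
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]  # distance of the n-th best hit
    # Keep every hit at or below the cutoff, so ties at the cutoff survive.
    return [hit for hit in ranked if hit[1] <= cutoff]


# Made-up distances: two isolates tie for second place.
hits = [("iso_a", 0.010), ("iso_b", 0.020), ("iso_c", 0.020), ("iso_d", 0.030)]
selected = select_top_hits(hits, n=2)
print([name for name, _ in selected])
```

With `n=2`, three isolates are returned because `iso_b` and `iso_c` share the second-best distance, which mirrors how more than 20 isolates can reach FastANI.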
Genes identified by GAMMA are filtered so that only those with >=98% AA identity and >=90% gene length are included in the GRiPHin summary. Similarly, for SRST2, genes are filtered to report only those with >=98% NT identity and >=90% gene length in the summary files.
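As a rough sketch of the reporting filter described above, assuming each hit carries a percent identity and a percent gene length (the `GeneHit` class, its field names, and the example values are hypothetical and not GAMMA or SRST2 output columns):

```python
from dataclasses import dataclass


@dataclass
class GeneHit:
    gene: str
    percent_identity: float  # AA identity for GAMMA, NT identity for SRST2
    percent_length: float    # percent of the gene length covered


def keep_for_summary(hit, min_identity=98.0, min_length=90.0):
    """Apply the >=98% identity and >=90% gene length thresholds."""
    return hit.percent_identity >= min_identity and hit.percent_length >= min_length


# Made-up hits for illustration only.
hits = [
    GeneHit("blaKPC-2", 100.0, 100.0),
    GeneHit("blaOXA-1", 97.5, 100.0),   # fails the identity threshold
    GeneHit("aac(3)-IIa", 99.2, 85.0),  # fails the length threshold
]
reported = [h.gene for h in hits if keep_for_summary(h)]
print(reported)
```

Both thresholds are inclusive, matching the ">=" in the text.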
Updates for PHX >=2.2.0:
For entry points that run SRST2 (CDC_PHOENIX, CDC_SCAFFOLDS and CDC_SRA), GRiPHin will "dedup" SRST2 calls that are also identified by GAMMA with the following algorithm:

To provide clarity, the columns in the GRiPHin summary that contain a “Big 5” carbapenemase gene (i.e., blaIMP, blaKPC, blaNDM, blaOXA-48-like, and blaVIM) or an acquired blaOXA gene are highlighted in orange. However, not all alleles of these genes confer carbapenemase activity. Thus, to further refine which alleles are highlighted, the list of alleles is cross-checked against data in the β-lactamase Database - Structure and Function (BLDB). Only alleles that are listed as having carbapenemase or inhibitor-resistant (IR) carbapenemase activity in the "Functional Information" column of BLDB are highlighted in the GRiPHin summary.
If the MLST source is "assembly" the output is coming from MLST. Here is a list of allele markers:
- '~' : full length novel allele
- '?' : partial match (>min_cov & > min_ID). Default min_cov = 10, Default min_ID=95%
- '-' : Allele is missing
If the MLST source is "reads" the output is coming from SRST2. Here is a list of allele markers:
- '*' : Full length match with 1+ SNP (Novel)
- '?' : edge depth is below N or average depth is below X (Default edge_depth = 2, Default average_depth = 5)
- '-' : No allele assigned, usually because no alleles achieved >90% coverage
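The two marker legends above can be read as a lookup keyed by MLST source. The sketch below assumes an allele call arrives as a short string such as `~12`, `12*`, or `-`; that string format and the function name `describe_allele` are assumptions for illustration, not the exact output format of either tool.

```python
# Marker meanings copied from the legends above, keyed by MLST source.
MARKERS = {
    "assembly": {  # output from MLST
        "~": "full length novel allele",
        "?": "partial match",
        "-": "allele is missing",
    },
    "reads": {  # output from SRST2
        "*": "full length match with 1+ SNP (novel)",
        "?": "edge depth or average depth below threshold",
        "-": "no allele assigned",
    },
}


def describe_allele(allele, source):
    """Return the legend entry for an allele call, or 'no marker'."""
    allele = allele.strip()
    legend = MARKERS[source]
    if allele == "-":
        return legend["-"]
    for marker in ("~", "*", "?"):
        if marker in legend and marker in allele:
            return legend[marker]
    return "no marker"


print(describe_allele("~12", "assembly"))
print(describe_allele("12*", "reads"))
```

The same `?` symbol means different things depending on the source, which is why the lookup is keyed by source first.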
There are 3 "C"s we are concerned with when evaluating genome assemblies:
- Contiguity: the size and number of contigs.
- Completeness: the content of contigs, particularly the gene content.
- Correctness: ordering and location of contigs.
Evaluating the quality of a genome assembly is more of an art than a matter of clear-cut rules. The automatic "PASS/FAIL" metrics are what we deem to be the bare minimum quality standards:
- >30x coverage (default, but can be increased with --coverage in phx >=2.0.0)
- Assembly ratio stdev <2.58
- The assembly ratio is the ratio between the total number of bases in the sample assembly compared to the expected genome size.
- Min assembly length >1,000,000bp
- <500 scaffolds in assembly
- Integrity of FASTQ files:
- Uncorrupted files
- R1 and R2 must have an equal number of reads
- There must be reads remaining after trimming steps
- There must be scaffolds remaining after filtering < 500 bp
In addition to this information, staff should also consider other QC metrics (see more below), what species is being sequenced (some species complexes might have lower quality assemblies) and what you plan to do with the data. If there are particular metrics you are interested in then please submit a feature request for consideration.
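The numeric PASS/FAIL thresholds listed above can be sketched as a single check. This is a simplified illustration, not PHoeNIx's implementation: the function and argument names are hypothetical, the FASTQ integrity checks are omitted, and the "assembly ratio stdev" is assumed here to be the number of standard deviations between the assembly length and the species mean from the NCBI Assembly stats file.

```python
def auto_qc_failures(trimmed_bases, assembly_length, scaffold_count,
                     expected_mean, expected_stdev, min_coverage=30.0):
    """Return the names of the automatic checks that fail (empty list = PASS)."""
    failures = []
    if assembly_length <= 1_000_000:  # min assembly length >1,000,000 bp
        failures.append("assembly_length")
    if scaffold_count >= 500:         # <500 scaffolds in assembly
        failures.append("scaffold_count")
    # Trimmed coverage = total trimmed bases / assembly length; must be >30x.
    if assembly_length > 0 and trimmed_bases / assembly_length <= min_coverage:
        failures.append("coverage")
    # Assumption: stdev is None (or 0) when there are too few reference genomes,
    # in which case this check is skipped.
    if expected_stdev:
        z = abs(assembly_length - expected_mean) / expected_stdev
        if z >= 2.58:
            failures.append("assembly_ratio_stdev")
    return failures


# Made-up numbers: 300 Mbp of trimmed reads over a 5 Mbp assembly (60x coverage).
print(auto_qc_failures(300_000_000, 5_000_000, 120, 5_100_000, 200_000))
```

Returning the list of failed checks, instead of a single boolean, keeps the reason for a FAIL visible when reviewing a run.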
Warnings are defined as "out of line with what is expected and MAY cause problems downstream". The following will produce WARNINGS in the synopsis file:
- <1,000,000 total reads in either the raw or the trimmed reads
- Percent of reads with a Q30 average: <90% for R1 or <70% for R2 (checked for both trimmed and raw reads)
- >200 and < 500 scaffolds
- Checking that %GC content isn't >2.58 stdev away from the mean %GC content for the species determined
- Contamination check on kraken trimmed and assembly weighted data
- <30% unclassified reads/weighted scaffolds
- <50% of reads/weighted scaffolds assigned to top genus hit
- Confirm there is only 1 genus with >25% of assigned reads/weighted scaffolds
- FastANI identity <95% or FastANI coverage <90%
- BUSCO <97% match
- SRST2 failed gene detection
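A few of the warning conditions above have unambiguous numeric thresholds and can be sketched as below. The function name, argument names, and warning labels are hypothetical. The %GC, Kraken2 contamination, and SRST2 checks are left out, and the sketch takes one set of read metrics, although the text applies the read checks to both raw and trimmed reads.

```python
def synopsis_warnings(total_reads, q30_r1, q30_r2, scaffold_count,
                      ani_identity, ani_coverage, busco_percent):
    """Return labels for the numeric warning conditions that are triggered."""
    warnings = []
    if total_reads < 1_000_000:
        warnings.append("total_reads")
    if q30_r1 < 90.0:
        warnings.append("q30_r1")
    if q30_r2 < 70.0:
        warnings.append("q30_r2")
    if 200 < scaffold_count < 500:
        warnings.append("scaffold_count")
    if ani_identity < 95.0 or ani_coverage < 90.0:
        warnings.append("fastani")
    if busco_percent < 97.0:
        warnings.append("busco")
    return warnings


# Made-up metrics: low read count, low R2 Q30, many scaffolds, low ANI coverage.
print(synopsis_warnings(800_000, 93.0, 65.0, 250, 98.7, 88.0, 99.1))
```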
Alerts are defined as "something to note, but doesn't mean it's a poor-quality assembly". The following will produce ALERTS in the synopsis file:
- No orphaned reads found after trimming
- <10 reference genomes for species identified so no stdev for assembly ratio or %GC content calculated
- >100x coverage or between 30-40x coverage
We DON'T recommend changing the databases that come with each phx version because we typically run into new issues with each database update that require fixes to PHoeNIx's code, which can cause problems for reproducibility and citation. However, we recognize there might be situations where this is necessary, so we are providing the following documentation. Be aware that due to our workload we will not be able to assist with errors related to personal changes to PHoeNIx databases. Additionally, some databases are created by editing downloaded information with internal scripts which, at this time, cannot be provided because CDC-specific details such as paths are hard-coded.
The following databases are updated for each new release if the underlying information is new. In other words, if NCBI updates the AMRFinder+ database we will add that information; if not, the file is kept the same.
The phiX.fasta was taken from NCBI and has not changed since the original PHX release.
"${baseDir}/assets/databases/REFSEQ_20240124_Bacteria_complete.msh.zx"
Make sure the version of the AMRFinder database used is the same as the one used in the AR curated database.
- Download the necessary database with:
wget -r -p -np -e robots=off https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/<version>/<date_of_release>/
- Copy the terminal folder to your current location and remove the empty parent directory tree.
mkdir amrfinderdb_v<version>_<date_of_release>/
cp -r ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/<version>/<date_of_release>/* ./amrfinderdb_v<version>_<date_of_release>/.
- Set up the database
Note:
For database 3.10_20220819.1 - We used hmmer/3.1b2
For databases >=3.11_20231115 - We created a conda environment for v3.2.0 with mamba create --name hmmer -c bioconda hmmer
cd ./amrfinderdb_v<version>_<date_of_release>/
module load ncbi-blast+/2.15.0
makeblastdb -in AMRProt -dbtype prot
makeblastdb -in AMR_CDS -dbtype nucl
hmmpress -f AMR.LIB
for f in AMR_DNA-*; do makeblastdb -in "$f" -dbtype nucl; done
- tar.gz the folder and save it in the folder phoenix/assets/databases
tar -czvf amrfinderdb_v<version>_<date_of_release>.tar.gz amrfinderdb_v<version>_<date_of_release>/
If you would rather use amrfinder_update to pull the latest database, do the following to make it work in PHoeNIx.
In this example we are downloading v3.11, for which the latest release was 20230223.1.
amrfinder_update -d $PWD/amrfinderdb_v3.11_20230223.1
This will create two folders in amrfinderdb_v3.11_20230223.1: latest and 2023-02-23.1. We just want the 2023-02-23.1 folder zipped.
Rename the folder: mv 2023-02-23.1 amrfinderdb_v3.11_20230223.1
Zip the renamed folder: tar -czvf amrfinderdb_v3.11_20230223.1.tar.gz amrfinderdb_v3.11_20230223.1/
Move the file amrfinderdb_v3.11_20230223.1.tar.gz to the folder phoenix/assets/databases
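The naming convention used in the example above (release folder `2023-02-23.1` becoming `amrfinderdb_v3.11_20230223.1`) can be captured in a small helper. The function names are hypothetical; the sketch only builds the names and does not touch the filesystem.

```python
def amrfinder_folder_name(version, release):
    """Build the database folder name from a version and a dated release.

    Example inputs follow the text above: version "3.11", release "2023-02-23.1".
    """
    # Drop the dashes from the release date: "2023-02-23.1" -> "20230223.1"
    return f"amrfinderdb_v{version}_{release.replace('-', '')}"


def amrfinder_tarball_name(version, release):
    """Name of the .tar.gz archive to place in phoenix/assets/databases."""
    return amrfinder_folder_name(version, release) + ".tar.gz"


print(amrfinder_folder_name("3.11", "2023-02-23.1"))
print(amrfinder_tarball_name("3.11", "2023-02-23.1"))
```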
"${baseDir}/assets/databases/NCBI_Assembly_stats_20240124.txt"
"${baseDir}/assets/databases/taxes_20230516.csv"
"${baseDir}/assets/databases/ResGANNCBI_20240229_srst2.fasta"
"${baseDir}/assets/databases/PF-Replicons_20240124.fasta"
"${baseDir}/assets/databases/HyperVirulence_20220414.fasta"
The genes in the database were derived from "Identification of Biomarkers for Differentiation of Hypervirulent Klebsiella pneumoniae from Classical K. pneumoniae" (Russo et al.).
The database currently has 105 sequences from the following 5 genes:
- rmpA - 19
- rmpA2 - 9
- peg-344 - 16
- iroB - 23
- iucA - 38
#### custom_mlstdb
"${baseDir}/assets/databases/mlst_db_20240124.tar.gz"
"${baseDir}/assets/databases/nodes_20240129.dmp.gz"
"${baseDir}/assets/databases/names_20240129.dmp.gz"