Pipeline Overview - CDCgov/phoenix GitHub Wiki

Pipeline Summary:

v2 2 0-with-centar


QC

  1. PhiX174 read removal and adapter removal using BBDuK

  2. Filtering, trimming, and base correction using fastp that includes:

    • quality trimming with a window size of 20 and quality of 30
    • quality pruning at 3' and 5' ends
    • removal of short reads
    • forced polyG tail trimming
  3. Contamination check of trimmed reads using Kraken2.

Analysis of Trimmed Reads

  1. QC Metrics Generated (all data generated for paired and unpaired reads generated post-trimming):
    • Number of total reads/bases
    • Percent of reads/bases remaining (from raw sequences)
    • Number of Q20/Q30 bases
    • Percent Q20/Q30 bases

Analysis using Trimmed Reads

  1. Gene detection and allele calling for antibiotic resistance (AR) srst2 in gene mode. We have curated an AR gene database that is a combination of three AR gene databases with redundancies removed and gene names standardized.
    • This step is only run with -entry CDC_PHOENIX
    • The curated database includes genes from these AR gene databases (for specifics on versions see "database updates" section of CHANGELOG.md):
  2. Contamination is checked by using Kraken2 on the trimmed reads.
  3. srst2 MLST
    • This step is only run with -entry CDC_PHOENIX
    • For PHoeNix >=2.0.0 a "custom" MLST database is used (the same one is used for the MLST program). This database is created by pulling organism, scheme and allele information from a static version of PubMLST.org (https://pubmlst.org/static/data/dbases.xml) to make a database in the form expected by SRST2 and the MLST program.

Assembly

  1. Assembly of trimmed reads using SPAdes
  2. Filter reads to remove any scaffolds less than 500bp in length.

QC of Assembled Scaffolds >= 500bps

  1. Assess assembly quality using QUAST and custom scripts
  2. QC Metrics Generated:
  • Trimmed coverage (total trimmed bases / assembly length)
  • Assembly ratio (assembly size / median genome size of species)
    • The NCBI Assembly stats file is calculated based on this file from NCBI.
    • The NCBI Assembly stats file is written in a tab delimited format in the following order
      1. Species
      2. Assembly_Size_Min
      3. Assembly_ Size_Max
      4. Assembly_vMedian
      5. Assembly_ Size_Mean
      6. Assembly_ Size_StDev
      7. Assembly_count
      8. GC_Min
      9. GC_Max
      10. GC_Median
      11. GC_Mean
      12. GC_Stdev
      13. GC_count
      14. CDS_Min
      15. CDS_Max
      16. CDS_Median
      17. CDS_Mean
      18. CDS_Stdev
      19. CDS_count
      20. Consensus_TAXID
    • Standard dev is only calculate for cases where there are have >10 reference genomes

Analysis of Assembled Scaffolds >= 500bps

  1. Assess genome assembly for completeness using BUSCO. This step is only run with -entry CDC_PHOENIX

  2. The mast distance is calculated from a pre-calculated sketch of all complete refseq bacteria created with Mash and the top 20 best isolate matches based on distances are passed into FastANI for increased speed in species ID.

    • Note that because we take the top 20 distances it is possible to get more than 20 isolates passed to FastANI. In other words, if the 20th distance has several isolates that are the same distance from the query sequence all those isolates are passed to FastANI.
  3. Calculate the average nucleotide identity (between genomes) using FastANI to determine species.

  4. Type multiple loci to characterize isolates of microbial species using MLST

    • For PHoeNIx <v2.0.0 the database that is included in the MLST program is used.
    • For PHoeNIx >=v2.0.0 a "custom" MLST database is used (the same one is used for SRST2). This database is created by pulling organism, scheme and allele information from a static version of PubMLST.org (https://pubmlst.org/static/data/dbases.xml) to make a database in the form expected by SRST2 and the MLST program.
  5. AR genes and hypervirulence genes are detected using GAMMA. We have curated an AR gene database that is a combination of three AR gene databases with redundancies removed and gene names standardized. Plasmid markers are detected with GAMMA-S.

  1. PROKKA is run on the scaffolds to generated a translated .faa file and an annotated .gff file, which will be passed to AMRFinder.
  2. AMRFinderPlus is run and the point mutations are reported in the Phoenix_Output_Report.tsv. The translated .faa and annotated .gff files from PROKKA are passed to AMRFinder as described in the AMRFinder documentation.
  3. In addition to running Kraken2 on the trimmed reads, KRAKEN2 is run on the weighted assembled scaffolds using the same database. This additional step allows us to check if any contamination made it into the assembly and this taxa call will be used if FastANI fails.
  • Kraken2 is also run in its normally on the scaffolds (non-weighted). This step is only run with -entry CDC_PHOENIX

Understanding GRiPHin Summary

AR Gene Reporting

Genes identified by GAMMA are filtered only those with >=98% AA identity and >=90% gene length to be included in GRiPHin Summary. Similarly, for SRST2, genes are filtered to report only those with >=98% NT identity and >=90% gene length to be included in the summary files.

Updates for PHX >=2.2.0:
For entry points that run SRST2 (CDC_PHOENIX, CDC_SCAFFOLDS and CDC_SRA), GRiPHin will "dedup" SRST2 calls that are also identified by GAMMA with the following algorithm:

Highlighting of "big-5" genes

To provide clarity, the columns in the GRiPHin summary that contain a “Big 5” carbapenemase gene (i.e., blaIMP, blaKPC, blaNDM, blaOXA-48-like, and blaVIM) or an acquired blaOXA gene are highlighted in orange. However, not all alleles of these genes confer carbapenemase activity. Thus to further refine the list of alleles that will be highlighted, the list of alleles are cross checked with data in the β-lactamase Database - Structure and Function (BLDB). Only alleles that are list to have carbapenemase or inhibitor-resistant (IR) carbapenemase activity in the "Functional Information" column of BLDB are highlighted in the GRiPHin summary.

MLST Allele Symbols

If the MLST source is "assembly" the output is coming from MLST. Here is a list of allele markers:

  • '~' : full length novel allele
  • '?' : partial match (>min_cov & > min_ID). Default min_cov = 10, Default min_ID=95%
  • '-' : Allele is missing

If the MLST source is "reads" the output is coming from SRST2. Here is a list of allele markers:

  • '*' : Full length match with 1+ SNP (Novel)
  • '?' : edge depth is below N or average depth is below X (Default edge_depth = 2, Default average_depth = 5)
  • '-' : No allele assigned, usually because no alleles achieved >90% coverage

Pipeline QC Checks

Notes on Contamination detection

Internal testing with in silico data shows PHoeNIx can detect approximately >=15% contamination from another species. We are developing methods to improve detection to a lower limit of >=5% as above this percentage contamination can cause false positive detection of AR genes. These methods will likely be in the v2.3.0 release of PHoeNIx.

Notes on Evaluating Genome Assembly Quality

There are 3 "C"s we are concerned with when evaluating genome assemblies:

  • Contiguity: the size and number of contigs.
  • Completeness: the content of contigs, particularly the gene content.
  • Correctness: ordering and location of contigs.

Auto PASS/FAIL

Evaluating the quality of a genome assembly is more of an art than clear cut rules. The auto "PASS/FAIL" are metrics we deem to be the bare minimum quality standards and are:

  • >30x coverage (default, but can be increased with --coverage in phx >=2.0.0)
  • Assembly ratio stdev <2.58
    • The assembly ratio is the ratio between the total number of bases in the sample assembly compared to the expected genome size.
  • Min assembly length >1,000,000bp
  • <500 scaffolds in assembly
  • Integrity of FASTQ files:
    • Uncorrupted files
    • R1 and R2 must have an equal number of reads
    • There must be reads remaining after trimming steps
    • There must be scaffolds remaining after filtering < 500 bp

In addition to this information, staff should also consider other QC metrics (see more below), what species is being sequenced (some species complexes might have lower quality assemblies) and what you plan to do with the data. If there are particular metrics you are interested in then please submit a feature request for consideration.

WARNINGS

Warnings are defined as "out of line with what is expected and MAY cause problems downstream". The following will produce WARNINGS in the synopsis file:

  • <1,000,000 total reads for each raw and trimmed reads
  • % reads with Q30 average for R1 (<90%) and R2 (<70%) -- Checked for both trimmed and raw reads
  • >200 and < 500 scaffolds
  • Checking that %GC content isn't >2.58 stdev away from the mean %GC content for the species determined
  • Contamination check on kraken trimmed and assembly weighted data
    • <30% unclassified reads/weighted scaffolds
    • <50% of reads/weighted scaffolds assigned to top genus hit
    • Confirm there is only 1 genera with >25% of assigned reads/weighted scaffolds
  • FastANI identity <95% or FastANI coverage <90%
  • BUSCO <97% match
  • SRST2 failed gene detection

ALERTS

Alerts are defined as "something to note, but doesn't mean it's a poor-quality assembly". The following will produce ALERTS in the synopsis file:

  • No orphaned reads found after trimming
  • <10 reference genomes for species identified so no stdev for assembly ratio or %GC content calculated
  • >100x coverage or between 30-40x coverage

Documentation for Databases

Standard Databases

We DON'T recommend changing the databases that come with each PHoeNIx version because, we typically run into new issues for each database update that require fixes to PHoeNIx's code, which can cause issues for reproducibility and citation. However, we recognize there might be situations where this might be necessary, so we are providing the following documentation for this. Be aware that due to our workload we will not be able to assist with errors related to personal changes to PHoeNIx databases. Additionally, some databases are created by editing downloaded information with internal scripts which have not been generalized yet, i.e. contain hard-coded paths.

The following databases are updated for each new release if the underlying information is new. In other words, if NCBI updates AMRFinder+ database we will add that information and if not, then the file is kept the same.

bbdukdb

The phiX.fasta was taken from NCBI and has Not changed since the original PHX release.

zipped_sketch

All complete RefSeq bacteria entries are downloaded in fasta form using the ncbi datasets tool using the command:
datasets download genome taxon bacteria --assembly-level "complete, chromosome" --include genome --annotated --assembly-source RefSeq --exclude-atypical
Before, January 2024 the ncbi_genome_downloader tool was used. Next, to aid in downstream functionality the fasta files are unzipped and taxonomy is added to the filenames. Filters are then applied to remove unidentified, unverified, metagenomic, and endosymbionts. Any fasta files containing these identifiers are moved to a different folder and are tracked. Lastly, a mash sketch is created using the naming convention REFSEQ_YYYYMMDD_Bacteria_complete.msh.xz.

amrfinder_db

Make sure the version of AMRFinder Database used is the same as the one used in the AR Curated Database.

  1. Download the necessary database with:
    wget -r -p -np -e robots=off https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/<version>/<date_of_release>/

  2. Copy the terminal folder to your current location and remove the empty parent directory tree.

mkdir amrfinderdb_v<version>_<date_of_release>/
cp -r ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/<version>/<date_of_release>/* ./amrfinderdb_v<version>_<date_of_release>/.
  1. Set up database

Note:
For database 3.10_20220819.1 - We used hmmer/3.1b2
For databases >=3.11_20231115 - We created a conda environment for v3.2.0 with mamba create --name hmmer -c bioconda hmmer

cd ./amrfinderdb_v<version>_<database_date>/
module load ncbi-blast+/2.15.0
makeblastdb -in AMRProt -dbtype prot
makeblastdb -in AMR_CDS -dbtype nucl
hmmpress -f AMR.LIB
for f in AMR_DNA-*; do makeblastdb -in $f -dbtype nucl; done
  1. tar.gz the folder and save in the folder phoenix/assets/databases
    tar -czvf amrfinderdb_v<version>_<date_of_release>.tar.gz amrfinderdb_v<version>_<date_of_release>/

If you rather use amrfinder_update to pull the latest database do the following to make it work in PHoeNIx.
In this example we are downloading v3.11 and the latest release was 20230223.1.

amrfinder_update -d $PWD/amrfinderdb_v3.11_20230223.1

This will create two folders in amrfinderdb_v3.11_20230223.1: latest and 2023-02-23.1. We just want the 2023-02-23.1 folder zipped.

Rename the folder mv 2023-02-23.1 amrfinderdb_v3.11_20230223.1

Zip the file tar -czvf amrfinderdb_v3.11_20230223.1.tar.gz 2023-02-23.1/

Move the file amrfinderdb_v3.11_20230223.1.tar.gz to the folder phoenix/assets/databases

ncbi_assembly_stats

NCBI_Assembly_stats_<date_of_creation>.txt: This database is created from downloading and aggregating NCBI’s Prokaryotes File.

After the file is downloaded, it is converted to a TSV and the TaxID is used to group and calculate the sequence stats, min/max/mean/median/stdev for assembly length, gc content, and cds counts, for each grouping.

taxa - DEPRECATED as of <= v1.1.1

taxes_<date_of_creation>.csv: This database was a manually created quick look-up for the upper taxonomy of known genera. PHoeNIx now uses NCBIs names and nodes files instead, described below.

ardb

ResGANNCBI_<date_of_creation>_srst2.fasta:

We constructed an AR gene database using the non-redundant entries from:

NCBI AMRFinder
ResFinder
ARG-ANNOT

The following clean up steps are performed:

  • Genes were set to the correct reading frame
  • Any sequences including premature stop codons (i.e., not at the end of the gene, this also included operons with multiple genes) were removed.
  • Known point mutations are removed, currently consists of only Eat(A)
  • Delete known partial genes, currently only blaSHV-39 and blaSHV-122

gamdbpf

PF-Replicons_<date_of_creation>.fasta: The plasmid database used in PHoeNIx is a slightly modified version of the PlasmidFinder database. To allow ease of access, the fasta files which are spread across the many folders are deduplicated since some genes occur in multiple classes and then combined into a single fasta.

hvgamdb

HyperVirulence_<date_of_creation>.fasta: The genes in the database were derived from Identification of Biomarkers for Differentiation of Hypervirulent Klebsiella pneumoniae from Classical K. pneumoniae.

The database currently has 105 sequences from the following 5 genes:

  • rmpA - 19
  • rmpA2 - 9
  • peg-334 - 16
  • iroB - 23
  • iucA - 38

MLST

mlst_db_<date_of_creation>.tar.gz: This MLST database mirrors what is available in pubMLST's public databases, and is created using Torsten Seemans Updater. For both MLST tools within PHoeNIx, the output must match this general structure, 'db/pubMLST/species_scheme_name/profile_and_allele_files'. Custom databases can be specified as well, if they match this structure.

*The script used to create this has recently become unusable to access data submitted to pubMLST after 12/31/2024. A new script is being developed to create this same complete database while adhering to pubMLSTs new API access requirements.

To pass a custom, or newer than packaged, mlst database to PHX, use --custom_mlstdb to a zipped file with the following contents (the number of allele files will vary depending on species)

mlst_db_structure

Names & Nodes

names_<date_of_creation>.dmp.gz and nodes_<date_of_creation>.dmp.gz:
These files are downloaded from NCBI and are contained within the taxonomy dump zip file. These files allow linking of TaxIDs to create a fully detailed taxonomy list.

Species Specific Databases

CENTAR

Toxicity Genes

Cdiff_toxins_srst2_<date_of_creation>.fasta:
This NT fasta database was created manually based on the requests of the Clostridioides difficile Subject Matter Experts. The genes have had slight modifications since initial creation, but has generally stayed static. Genes in this database include tcdA/B/C/R/E, cdtA/B/R, and known non-toxicity genes. No specific script or tools were used to find or create this database.

Known Cdiff plasmid sequences

Cdiff_plasmids_<date_of_creation>.fasta:
Another manually created fasta database of known Clostridioides difficile plasmid sequences from literature searches. No specific script or tools were used to find or create this database.

Previously linked ST-RTs

CEMB_ST_RT_Crosswalk_<date_of_creation>.fasta:
This look-up file was created based upon our previous work in trying to link STs to RTs. Although incomplete, it does add value to some of the stronger links discovered. <>

Clostridioides difficile specific AR genes of importance

These databases were manually curated by our team as a result of a literature search to identify cdiff specific antimirobial resistance genes. The majority discovered were found to be protein sequence specific and are represented in the AA fasta file. One gene has a point mutation conferring resistance and is described in the NT fasta file.

centar_ar_db_wt_AA_<date_of_creation>.fasta: centar_ar_db_wt_NT_<date_of_creation>.fasta:

Clostridioides difficile Diffbase Files

These databases were downloaded from the work done by the University of Waterloo (https://diffbase.uwaterloo.ca/). The fasta file consists of protein sequence references for alleles of the tcdA and tcdB toxicity genes. The definitions file provides common designations of toxinotypes which can be assigned using the pairing of subtypes for tcdA and tcdB.

Diffbase_Toxin_AB_<date_of_creation>.fa: Diffbase_toxinotype_definitions_<date_of_creation>.tsv:

⚠️ **GitHub.com Fallback** ⚠️