Running PHoeNIx on Different Systems - CDCgov/phoenix GitHub Wiki

How do I Adjust CPUs/Memory Usage?

Nextflow uses config files, all found in the conf folder, to determine how many CPUs and how much memory to give each job. The test profile has its own config, which limits the test sample to 2 CPUs; you can edit it to speed up the test run.

PHoeNIx is structured so that each process that is run is "labeled" high, medium or low. See SPAdes as an example. The CPU/memory requirements are then determined by the process labels in the base.config file. Editing the CPUs/memory in base.config for a label will cause all processes with that label to use those resources.

If you used git clone for your install, simply edit the base.config file, save it in the same place, and you are good to go. However, if you used nextflow run or are running on Nextflow Tower, the best way to change the CPUs/memory for each process is to create a new config file with labels like those in base.config and pass it to Nextflow with the -c option on the command line.
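As a rough sketch of the second approach, the snippet below writes a small override config. The label names (process_low, process_medium, process_high), the resource values, and the file name custom_resources.config are assumptions made for illustration; check base.config in the conf folder for the labels your PHoeNIx version actually uses.

```shell
# Write a hypothetical override config. Label names and values are
# assumptions, so compare them against conf/base.config before use.
cat > custom_resources.config <<'EOF'
process {
    withLabel:process_low {
        cpus   = 2
        memory = '8 GB'
    }
    withLabel:process_medium {
        cpus   = 8
        memory = '32 GB'
    }
    withLabel:process_high {
        cpus   = 16
        memory = '64 GB'
    }
}
EOF

# The file can then be passed with -c, for example:
# nextflow run cdcgov/phoenix -r v1.0.0 -entry PHOENIX -profile singularity \
#     -c custom_resources.config --input samplesheet.csv --kraken2db $PATH_TO_DB
```

Settings in a file passed with -c take precedence over the pipeline's own defaults, so only the labels you want to change need to appear in it.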

Configuration for Running PHoeNIx on a High Performance Computing (HPC) Cluster

To run PHoeNIx on an HPC and submit jobs to a cluster, you will need to make a config file for your executor. We provide a template in the conf folder to edit. How to pass this to PHoeNIx depends on the type of install you are using. For full details on configs, see the Nextflow documentation.

If you used git clone to install

After editing the template in the conf folder, save it in the same place and then run PHoeNIx with:

nextflow run $PATH_TO_INSTALL/phoenix -entry PHOENIX -profile singularity,custom_HPC --input samplesheet.csv --kraken2db $PATH_TO_DB

If you used nextflow pull or nextflow run to install

After editing the template from the conf folder, save it to a new location and pass it to PHoeNIx with the -c parameter. Now you can run PHoeNIx with:

nextflow run cdcgov/phoenix -r v1.0.0 -entry PHOENIX -profile singularity -c mycustom_config.config --input samplesheet.csv --kraken2db $PATH_TO_DB
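For orientation, the file passed with -c in the command above might look something like the sketch below. The executor ('slurm'), the queue name ('short') and the queueSize value are placeholders chosen for illustration, not values taken from the PHoeNIx template; use whatever scheduler and queue your cluster provides.

```shell
# Hypothetical executor config: 'slurm' and the queue name 'short' are
# placeholders for your own scheduler and queue.
cat > mycustom_config.config <<'EOF'
process {
    executor = 'slurm'
    queue    = 'short'
}

executor {
    queueSize = 50   // cap on jobs queued at once (illustrative value)
}
EOF
```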

Running PHoeNIx with the STaPH-B Toolkit

  1. Follow the directions on the STaPH-B Toolkit GitHub page for installing the toolkit. PHoeNIx is in version 2 of the toolkit, so if you have version 1 installed you will need to update it. You will still need Nextflow and Singularity to run PHoeNIx using the toolkit.
  • Run staphb-tk --list_workflows and phoenix should be listed
  2. To run phoenix use staphb-tk phoenix -entry PHOENIX -profile singularity,test --input samplesheet.csv --kraken2db $KRAKEN_DB_PATH --outdir $PWD/toolkit

REMEMBER that KRAKEN_DB_PATH is a variable you create and it needs to be the FULL path to the Kraken DB that INCLUDES a trailing /
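One way to avoid forgetting the trailing slash is a small helper like the sketch below. The function name with_trailing_slash and the example path are made up for illustration and are not part of PHoeNIx or the toolkit.

```shell
# Strip one trailing slash from the path, if present, then add one back.
with_trailing_slash() {
    printf '%s/\n' "${1%/}"
}

# Example (the path is a placeholder for your own database location):
# KRAKEN_DB_PATH="$(with_trailing_slash /full/path/to/kraken_db)"
```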

Running a different version of PHoeNIx using the toolkit

The default is to use the latest version of PHoeNIx on GitHub, but you may want to specify a version so that you are consistently running the same one. To run a different version of phoenix, use the -wv parameter for the toolkit like:

staphb-tk -wv v1.0.1 phoenix -entry PHOENIX -profile singularity,test --kraken2db $KRAKEN_DB_PATH --input samplesheet.csv --outdir $PWD/toolkit

Remember to put the -wv after staphb-tk AND before phoenix so the argument gets sent to the toolkit. This would be the same as running:

nextflow run cdcgov/phoenix -r v1.0.1 -entry PHOENIX -profile test --kraken2db $KRAKEN_DB_PATH

Running PHoeNIx with Singularity using the STaPH-B Toolkit

The default for staphb-tk is to use Docker. You can see this when you run PHoeNIx: along with its normal run information, it prints the config file that is being passed by the toolkit.

Anything in this config file will override the default settings in the PHoeNIx pipeline, per Nextflow's hierarchical use of config files.

This file contains the following (as of version 2.0.1 of the toolkit):

To switch and use singularity you just need to pass the singularity config that comes with the toolkit using the -c parameter like this:

staphb-tk -wv v1.0.1 -c $PATH_TO_SINGULARITY_CONFIG phoenix -entry PHOENIX -profile test --kraken2db $KRAKEN_DB_PATH 

Note: here again PATH_TO_SINGULARITY_CONFIG is just a placeholder and you need to replace it with the path to the config file. It can be found at ~/staphb_toolkit/config/singularity.config

If you get the following error:

Error executing process > 'PHOENIX:PHOENIX_EXTERNAL:AMRFINDERPLUS_UPDATE (update)'

Caused by:
  Process `PHOENIX:PHOENIX_EXTERNAL:AMRFINDERPLUS_UPDATE (update)` terminated with an error exit status (127)

Command executed:

  mkdir amrfinderdb
  amrfinder_update -d amrfinderdb
  tar czvf amrfinderdb.tar.gz -C $(readlink amrfinderdb/latest) ./
  
  cat <<-END_VERSIONS > versions.yml
  "PHOENIX:PHOENIX_EXTERNAL:AMRFINDERPLUS_UPDATE":
      amrfinderplus: $(amrfinder --version)
      amrfinderplus_db_version: $(head amrfinderdb/latest/version.txt)
  END_VERSIONS

Command exit status:
  127

Command output:
  (empty)

Command error:
  /bin/bash: line 0: cd: /scicomp/scratch/qpk9/d6/85357401e4b77ebb45bcbb01cc73e8: No such file or directory
  /bin/bash: .command.run: No such file or directory

Work dir:
  /scicomp/scratch/qpk9/d6/85357401e4b77ebb45bcbb01cc73e8

You will need to open the singularity.config file, add singularity.autoMounts = true on a new line at the end of the file, save it, and rerun the pipeline.
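If you would rather make that change from the command line, something along these lines should work. The helper name enable_automounts is made up for this sketch; it only appends the setting when no line in the file already assigns singularity.autoMounts, so an existing value (including false) is left untouched.

```shell
# Append singularity.autoMounts = true to the config file given as the
# first argument, unless a line assigning that setting is already present.
enable_automounts() {
    config_file="$1"
    if ! grep -q '^[[:space:]]*singularity\.autoMounts[[:space:]]*=' "$config_file"; then
        printf '\nsingularity.autoMounts = true\n' >> "$config_file"
    fi
}

# Example:
# enable_automounts ~/staphb_toolkit/config/singularity.config
```

Because of the check, running it more than once does not add duplicate lines.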

Running PHoeNIx >=v2.1.0 on ICA

  1. Email [email protected] with the subject "PHoeNIx ICA Request" and provide the email address that you would like the phoenix pipeline bundle shared with.

"PHOENIX" is the default entry run, but you can run any of the phoenix entry points you want by typing this into the entry field. Make sure to use ALL CAPS.

Running PHoeNIx <=v1.1.1 on Terra

For states that are using Terra.bio to run bioinformatic workflows, like SARS-CoV-2 genomic characterization, we have provided PHoeNIx as a workflow that can be imported into your workspace.

  1. Upload your samples just as you would for other analyses. Your metadata.tsv file should contain at minimum the headers entity:sample_id, Read_1, and Read_2 in a tab-delimited file. Once files are uploaded, proceed to importing the workflow into your workspace.
  2. Email [email protected], with the subject line "krakenDB invite request" to request access to the sharefile link and provide the email address to send invite to. Download the hash.k2d, opts.k2d, and taxo.k2d files needed for PHoeNIx from the CDC sharefile link. You CANNOT use a different krakenDB for this as it needs to match the ktax_map.k2 file that is included in the pipeline. At this time this is not downloadable via command line.
  3. Upload the hash.k2d, opts.k2d, and taxo.k2d files into a Google bucket folder, either in your Google Cloud workspace or through Terra: under the "Data" tab in your workspace, use the left-hand navigation pane to go to the "Files" tab and click the blue "Upload" button in the upper right-hand corner.
  4. Under the "Data" tab in your workspace use the left hand navigation pane to go to the "Workspace Data" tab. Create a Key/Value pair for the uploaded kraken2 database.
  • The key is just a string of your choosing.
  • The value should be the google bucket location of the kraken2 database folder where the hash.k2d, opts.k2d, and taxo.k2d files can be found. You can find the hyperlink by going to "Files" tab and right clicking on the kraken2_db folder you uploaded and then click "copy link address" in the pop up menu. This can then be copied to the value field in the "workspace data" tab.
- Note that you need to have the kraken2db in this "Files" tab. That is, Terra needs to have permissions to the Google bucket where the kraken2db is saved. For more information visit https://support.terra.bio/hc/en-us/articles/360045971452-Accessing-data-from-an-external-bucket

Steps 4

  5. Under the workflows tab click the + blue circle in the "Find a workflow" box to add a new workflow.
  6. In the pop-up window click dockstore, which will take you to the dockstore website where we will search for the PHoeNIx workflow.

Steps 5-6

  7. Search "phoenix" in the search bar of the dockstore webpage.
  8. The PHoeNIx workflow should then appear in the search output. Make sure it says "WDL" under the format column as there is also a nextflow version of the pipeline that will not work on Terra. Click the workflow hyperlink.

Steps 7-8

  9. Now you should be on a dockstore page for the PHoeNIx workflow. Click on the latest version of the pipeline on the right-hand side of the page in the "recent versions" menu. The greyed-out "Terra" button should now turn blue if it was not already. Click it.

Note: If you run the "main" branch, this will run the latest version of the pipeline. If you want to run a specific version, you can pick that branch instead. Right now the only stable release of PHX is v1.0.0.

Step 9

  10. Select the Destination Workspace you want to import the workflow into from the drop-down menu. Then click the blue "import" button.

Step 10

  11. Importing the workflow should immediately take you to its workflow space in your chosen workspace. Click the "Outputs" tab and then click the "Use defaults" hyperlink to auto-fill the output names. If you forget to do this, your output won't be saved!

Step 11

  12. Click the blue "Select Data" button and select the samples you want to analyze from the pop-up window. Then click save to close the pop-up window.
  13. Click the "Inputs" tab and fill in the following REQUIRED fields.
  • kraken2db - workspace.kraken2_db (here kraken2_db should be the key you used when you added the kraken database to your workspace)
  • read1 - this.read_1 (here "read_1" should match the name of the 2nd column in your metadata.tsv file)
  • read2 - this.read_2 (here "read_2" should match the name of the 3rd column in your metadata.tsv file)
  • samplename - this.sample_id (here "sample_id" should match the name of the 1st column in your metadata.tsv file)
  14. There are optional fields for CPU, disk_size and memory that you can adjust if there are errors regarding lack of resources. You will need more than 50 GB of memory. DO NOT add anything for the docker option.
  15. Click the blue "Save" button on the right-hand side of the page and this will cause the greyed-out "RUN ANALYSIS" button to turn blue. Click "RUN ANALYSIS" and this will launch the pipeline.

Step 12-15

  16. Once your run is complete, navigate to the "DATA" tab and click on "sample" in the left-hand navigation pane.
  17. Click the blue gear "SETTINGS" button above the table. This will open a pop-up menu of all the output that is produced by PHoeNIx.
  18. Select the following fields and place them in this order:
  • qc_outcome
  • warning_count
  • coverage
  • genome_length
  • assembly_ratio
  • scaffold_count
  • species
  • taxa_confidence
  • taxa_source
  • mlst_1
  • mlst_scheme_1
  • mlst_2
  • mlst_scheme_2
  • gc_percent
  • kraken2_trimmed
  • kraken2_weighted
  • beta_lactam_resistance_genes
  • other_ar_genes
  • hypervirulence_genes
  • amrfinder_point_mutations
  • qc_reason
  19. Click "save this column selection" and give it a name so you can load this column selection more quickly next time.
  20. These fields will be the same as those in the Phoenix_Output_Report.tsv file, which is an overview of the entire run.
  21. All files from the phoenix run on a sample are available for download in a zipped file that is found in the full_results column. Or you can download a particular file by selecting the column from the "settings" button and then clicking the hyperlink found in that column.
  22. You will probably also want to have a look at the synopsis file that is found in the synopsis column, which will explain the WARNINGS and ALERTS for a particular sample.

Running PHoeNIx >=v2.0.2 on Terra

NOTE: v2.0.0 and v2.0.1 had a bug that causes Terra to crash, so they are not available on Terra.

For those using Terra.bio to run bioinformatic workflows, we have provided PHoeNIx as a workflow that can be imported into your workspace.

  1. Upload your samples just as you would for other analysis.
  • -entry PHOENIX and CDC_PHOENIX: Your metadata.tsv file should contain at minimum the headers entity:sample_id, Read_1, and Read_2 in a tab-delimited file.

Steps 1A

  • -entry SCAFFOLDS and CDC_SCAFFOLDS: Your metadata.tsv file should contain at minimum the headers entity:sample_assembly_id and assembly in a tab-delimited file.

Steps 1B

  • -entry SRA and CDC_SRA: Your metadata.tsv file requires only one column, with the header entity:sample_srr_id, in a tab-delimited file.

Steps 1C

Once files are uploaded proceed to importing the workflow into your workspace.
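The three metadata.tsv layouts described above could be produced with something like the following sketch. Only the header names come from this page; the output file names, sample names, bucket paths and SRR accession are placeholders, not real data.

```shell
# Sample names and gs:// paths below are placeholders, not real data.

# PHOENIX / CDC_PHOENIX: paired-end reads
printf 'entity:sample_id\tRead_1\tRead_2\n' > metadata_reads.tsv
printf 'sample01\tgs://my-bucket/sample01_R1.fastq.gz\tgs://my-bucket/sample01_R2.fastq.gz\n' >> metadata_reads.tsv

# SCAFFOLDS / CDC_SCAFFOLDS: one assembly per sample
printf 'entity:sample_assembly_id\tassembly\n' > metadata_assemblies.tsv
printf 'sample01\tgs://my-bucket/sample01.scaffolds.fa.gz\n' >> metadata_assemblies.tsv

# SRA / CDC_SRA: a single column of SRR accessions
printf 'entity:sample_srr_id\n' > metadata_sra.tsv
printf 'SRR0000000\n' >> metadata_sra.tsv
```

Using printf with explicit \t keeps the columns separated by real tab characters, which is easy to get wrong when typing the file by hand in an editor that converts tabs to spaces.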

  1. For PHoeNIx >=2.0.0 you will need to download the public Standard-8 version kraken2 database created on or after March 14th, 2023 from Ben Langmead's github page. You CANNOT use an older version of the public kraken databases on Ben Langmead's github page. We thank @BenLangmead and @jenniferlu717 for taking the time to include an extra file in public kraken databases created after March 14th, 2023 to allow them to work in PHoeNIx!

  2. Upload the Kraken2 database, in its compressed .tar.gz form, into a Google bucket folder, either in your Google Cloud workspace or through Terra: under the "Data" tab in your workspace, use the left-hand navigation pane to go to the "Files" tab and click the blue "Upload" button in the upper right-hand corner.

Steps 4

  1. Under the "Data" tab in your workspace use the left hand navigation pane to go to the "Workspace Data" tab. Create a Key/Value pair for the uploaded kraken2 database.
  • The key is just a string of your choosing.
  • The value should be the google bucket location of the kraken2 database folder where the .tar.gz compressed folder can be found. You can find the hyperlink by going to "Files" tab and right clicking on the kraken2_db folder you uploaded and then click "copy link address" in the pop up menu. This can then be copied to the value field in the "workspace data" tab.
- Note that you need to have the kraken2db in this "Files" tab. That is, Terra needs to have permissions to the Google bucket where the kraken2db is saved. For more information visit https://support.terra.bio/hc/en-us/articles/360045971452-Accessing-data-from-an-external-bucket

Steps 4

  1. If you already have the PHoeNIx workflow imported, v2.0.2 will be in the drop-down menu. If you need to import the workflow, see steps 5-10 in the section "Running PHoeNIx <=v1.1.1 on Terra" above.

  2. Select a version >=2.0.2 from the drop down Version menu. Click the "Outputs" table and then click the "Use defaults" hyperlink to auto fill the output names. If you forget to do this you won't have output saved!

Step 11

  1. Click the "Select Data" blue button and select the samples you want to analyze from the pop up window. Then click save to close the pop up window.

  2. Click the "Inputs tab" and fill in the following REQUIRED fields.

  • -entry PHOENIX or CDC_PHOENIX
    • entry - "PHOENIX" or "CDC_PHOENIX" (case matters so use all caps and don't forget the quotes.)
    • kraken2db - workspace.kraken2_db (here kraken2_db should be the key you used when you added the kraken database to your workspace)
    • read1 - this.read_1 (here "read_1" should match the name of the 2nd column in your metadata.tsv file)
    • read2 - this.read_2 (here "read_2" should match the name of the 3rd column in your metadata.tsv file)
    • samplename - this.sample_id (here "sample_id" should match the name of the 1st column in your metadata.tsv file)

Steps 13A

  • -entry SCAFFOLDS or CDC_SCAFFOLDS
    • entry - "SCAFFOLDS" or "CDC_SCAFFOLDS" (case matters so use all caps and don't forget the quotes.)
    • kraken2db - workspace.kraken2_db (here kraken2_db should be the key you used when you added the kraken database to your workspace)
    • input_assembly - this.assembly (here "assembly" should match the name of the 2nd column in your metadata.tsv file)
    • samplename - this.sample_assembly_id (here "sample_assembly_id" should match the name of the 1st column in your metadata.tsv file)

Note here the orange box is an optional parameter for -entry SCAFFOLDS or CDC_SCAFFOLDS.

Steps 13B

  • -entry SRA or CDC_SRA
    • entry - "SRA" or "CDC_SRA" (case matters so use all caps and don't forget the quotes.)
    • kraken2db - workspace.kraken2_db (here kraken2_db should be the key you used when you added the kraken database to your workspace)
    • samplename - this.sample_srr_id (here "sample_srr_id" should match the name of the 1st column in your metadata.tsv file)
      NOTE: When running PHoeNIx on Terra the --use_sra argument is not available. The sample names will be the SRR number and not the sample name from NCBI.

Steps 13C

  1. There are OPTIONAL fields for CPU, disk_size and memory that you can adjust if there are errors regarding lack of resources. You will need more than 50 GB of memory. Additionally, there are optional coverage and scaffold_ext fields.
  • Coverage: To raise the coverage cutoff above 30x, enter a number greater than 30 in this field.
  • Scaffold_ext: String that is the FULL extension of the scaffold files you wish to input. For example, if your file names are <sample_id>.fa.gz, use --scaffolds_ext '.fa.gz'. A regex of this extension (ex. *.scaffolds.fa.gz) will then be used as default. Assemblies must end in '.fa.gz' or '.fasta.gz'.
  1. Click the blue "Save" button on the right-hand side of the page and this will cause the greyed-out "RUN ANALYSIS" button to turn blue. Click "RUN ANALYSIS" and this will launch the pipeline.
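Since assemblies given to the SCAFFOLDS entry points must end in '.fa.gz' or '.fasta.gz', it can save a failed run to check file names before uploading. The helper below is an illustrative sketch, not part of PHoeNIx; the function name and the example file names are invented.

```shell
# Succeed only when the file name ends in .fa.gz or .fasta.gz.
has_supported_assembly_ext() {
    case "$1" in
        *.fa.gz|*.fasta.gz) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: list files in the current directory that would be rejected.
# for f in *; do
#     has_supported_assembly_ext "$f" || echo "unsupported extension: $f"
# done
```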

Step 12-15

  1. Once your run is complete navigate to the "DATA" tab and click on "sample" on the left hand navigation pane.
  2. Click the blue gear "SETTINGS" button above the table. This will open a pop up menu of all the output that is produced by PHoeNIx.
  3. Select the following fields and place them in this order:
  • qc_outcome
  • warning_count
  • estimated_coverage
  • genome_length
  • assembly_ratio
  • scaffold_count
  • gc_percent
  • species
  • taxa_confidence
  • taxa_coverage
  • taxa_source
  • kraken2_trimmed
  • kraken2_weighted
  • mlst_1
  • mlst_scheme_1
  • mlst_2
  • mlst_scheme_2
  • beta_lactam_resistance_genes
  • other_ar_genes
  • hypervirulence_genes
  • amrfinder_point_mutations
  • qc_reason
  1. Click "save this column selection" and give it a name so we can load this column selection quicker next time.
  2. These fields will be the same as those in the Phoenix_Output_Report.tsv file that is an overview of the entire run.
  3. A great place to start is to download the GRiPHin_Report.xlsx file, which has similar information to the Phoenix_Output_Report.tsv, but is easier to read and provides a summary of warnings and alerts for the isolate. For more details see the wiki.
  4. All files from the phoenix run on a sample are available for download in a zipped file that is found in the full_results column. Or you can download a particular file by selecting the column from the "settings" button and then clicking the hyperlink found in that column.
  5. You will probably also want to have a look at the synopsis file that is found in the synopsis column, which will explain the WARNINGS and ALERTS for a particular sample.

Summarizing PHoeNIx >=v2.1.0 output.

Because of how PHoeNIx is run on Terra, the output files Phoenix_Summary.tsv, GRiPHin_Summary.tsv and GRiPHin_Summary.xlsx each contain only one sample's information. Ideally, a summary would show all the samples together for easy review, as the PHoeNIx CLI version does. To do this, Terra users can run another workflow, combine_phoenix_output, to combine all the output for a sample set into one file.

  1. Import the combine_phoenix_output workflow just like you did with the PHoeNIx workflow. First, click on Workflows tab in Terra and then click the "Find a Workflow" box. A pop-up window will appear, click "Dockstore" under the "Find Additional Workflows" to go to the Dockstore web page. Once there search for Phoenix and click the combine_phoenix_output workflow.


  1. The previous step should have taken you to the home page for the workflow. Under the "Launch with" section on the right hand side of the page click on the green Terra icon. Follow the steps to import the workflow to your desired workspace the same as you did with PHX.

  2. To use the workflow, first run Phoenix (any entry point). Once PHoeNIx has completed successfully, you will find the imported combine_phoenix_output workflow by clicking on Workflows tab in Terra. Select the version of the pipeline to run in the drop down menu (only >=v2.1.0 available). Next, select the sample_set you want to run. Notice we are running a set and not an individual sample!

Note: this workflow is part of the PHoeNIx GitHub so the version of the combine_phoenix_output pipeline needs to match the version of PHoeNIx run on the samples.

  1. A new pop-up menu will open for you to select a sample set you want to have output combined into the same summary file(s). For this workflow all the inputs are optional, which allows you to create only the files you want. There are 3 summary files you can pick to create:
  • GRiPHin_Summary.xlsx: An Excel file that summarizes the output of PHoeNIx for all samples. Includes QC, Taxa, MLST, AR genes, HV genes and plasmid indicators. Notably, the AR genes in this file are organized 1 gene (with resistance classification) per column. The rows contain %ID/%Coverage/#Contig for each gene. The column headers for Big-5 genes are highlighted for easy identification.
  • GRiPHin_Summary.tsv: A tab delimited file that contains the exact same information as the excel version. However, the merged cells are removed and the file is optimized for easy parsing via command line.
  • Phoenix_Summary.tsv: A tab-delimited file that summarizes the output of PHoeNIx for all samples. Includes the same information as the GRiPHin style files with some notable differences. First, in this file the warnings are just counted, whereas in GRiPHin both alerts and warnings are provided as a short explanation so you know the exact issue. Secondly, the AR genes are provided as a comma-separated list in columns separated by resistance type.
  • BiosampleAttributes_Microbe.1.0.xlsx and Sra_Microbe.1.0.xlsx are partially filled out excel sheets for uploading to NCBI. As a reminder, please do not submit raw sequencing data to the CDC HAI-Seq BioProject (531911) that is auto populated in this sheet unless you are a state public health laboratory, a CDC partner or have been directed to do so by DHQP. The BioProject accession IDs in this file are specifically designated for domestic HAI bacterial pathogen sequencing data, including from the Antimicrobial Resistance Laboratory Network (AR Lab Network), state public health labs, surveillance programs, and outbreaks. For inquiries about the appropriate BioProject location for your data, please contact [email protected].

For a more detailed explanation of each file see the run specific output section of the wiki.
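Because the .tsv summaries are plain tab-delimited text, a single column can be pulled out by header name with a short awk wrapper. The sketch below is a generic helper, not a PHoeNIx tool; the function name tsv_column is invented, and the header used in the usage example is a guess, so check the first line of your own file for the real column names.

```shell
# Print every value of the column named $2 from the tab-delimited file $1.
# Exits with status 1 if no header with that exact name is found.
tsv_column() {
    awk -F'\t' -v col="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) exit 1
            next
        }
        { print $c }
    ' "$1"
}

# Example (the header name is an assumption, not taken from a real file):
# tsv_column Phoenix_Summary.tsv qc_outcome | sort | uniq -c
```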

For each file type you want to create provide the input in the attributes column like so:

Note that here we are using the this.samples. style of notation. The plural denotes that we are running this on a sample set rather than just one sample. The notation uses samples because the entity type was set as sample in the imported samplesheet. In other words, the header had "entity:sample" as its first column. If we had used "entity:sample_sra" in our input file, our notation for this workflow would be this.sample_sra.


You have the option of changing the prefix on the file name if you want to be able to distinguish between multiple files using one of the first 3 rows that are the *_prefix variables. This is a string so you need to put quotes around the string or a yellow ! will appear next to the row.

  1. Once your input is set go into the outputs tab, select "Use defaults" as you did with the phoenix workflow.

  2. Click the blue "Save" button on the right hand side of the web page and then "run analysis" when the pop up window opens.

  3. Your output will now show up in the "Data" tab like this:

Click on any of the hyperlinks to download the file.
