Microbiome Helper 2: Long-read metagenomic taxonomic and functional profiling (LangilleLab/microbiome_helper GitHub Wiki)

Authors: Nidhi Gohil. Modifications by: NA

**Please note: We are still testing/developing this, so use with caution! :)**

PB-metagenomics-tools offers several pipelines for taxonomic profiling: Sourmash (decomposes reads into k-mers), Minimap-MEGAN (aligns HiFi reads with minimap2 and summarizes with MEGAN-LR), and DIAMOND-MEGAN (translation alignment against the NCBI nr database, summarized with MEGAN-LR). Both the Minimap-MEGAN and DIAMOND-MEGAN pipelines provide annotations based on NCBI and GTDB, and output NCBI taxonomic counts in Kraken (kreport) and MetaPhlAn (mpa) formats, which is a plus if the PacBio data needs to be compared with Illumina Kraken-profiled data for the same set of samples. Overall, I like to go with the DIAMOND-MEGAN pipeline, as it also performs functional classification and provides annotations based on multiple databases (EC, SEED, InterPro2GO, eggNOG), with a new KEGG option available.

Initial points

The options below are not necessarily best for your data; it is important to explore different options, especially at the read QC stage.

  • PacBio platforms output demultiplexed HiFi data in fastq.gz format. The PB-metagenomics-tools pipelines require the input data to be converted to FASTA format. While converting FASTQ to FASTA, make sure that any renaming of samples is done correctly, and check that only the FASTAs of interest are being specified at each step (e.g. with the ls command).
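As a hedged sketch of this conversion step (assuming standard GNU tools; the pipeline docs also suggest seqtk), the loop below converts every fastq.gz in the current directory into a SAMPLE.fasta file in inputs/. The demo read created at the top is only there to make the snippet self-contained; in practice you would run the loop over your real demultiplexed files.

```shell
# Sketch: FASTQ -> FASTA conversion with awk (seqtk seq -a works equally well).
# The demo read below is a stand-in so this snippet runs on its own.
mkdir -p inputs
printf '@read1\nACGT\n+\nIIII\n' | gzip > sample1.fastq.gz

for fq in *.fastq.gz; do
    sample="${fq%.fastq.gz}"
    # keep the header (line 1 of every 4, '@' -> '>') and the sequence (line 2)
    zcat "$fq" | awk 'NR % 4 == 1 { sub(/^@/, ">"); print } NR % 4 == 2 { print }' \
        > "inputs/${sample}.fasta"
done

ls inputs/    # sanity check: only the FASTAs of interest should be listed
```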

Requirements

Memory and disk space requirements: Running this pipeline using the default settings should require <60GB of memory. Approximately 150GB of disk space is required for the DIAMOND-indexed NCBI-nr database. Additionally, depending on the size of the HiFi reads fasta file, 40-200GB of disk space may be required per sample (using the default hit_limit setting of --top 5):

  • Very large HiFi reads fasta files (>2 million reads @ 10kb) can produce protein-SAM outputs 100-180GB in size, which can result in RMA files 20-35GB in size (150-215GB total).
  • Smaller HiFi reads fasta files (<1 million reads @ 8kb) can produce protein-SAM outputs 40-80GB in size, which can result in RMA files 6-11GB in size (40-90GB total).

After completion, the large protein-SAM files can be deleted. The protein-SAM files are not removed automatically in case different formats of RMA files need to be produced by the user. The run time and the size of output protein-SAM files can be greatly reduced by changing the hit_limit setting to -k 5 instead (see Configuring the Analysis section below).
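For example, the cleanup could look like the sketch below. The output directory name here is an assumption (check where your run actually wrote the protein-SAM files), and the demo file is created only so the snippet is self-contained.

```shell
# Sketch: reclaim disk space once the RMA files have been checked.
# '2-sam' is a hypothetical output directory name; verify yours first.
mkdir -p 2-sam && touch 2-sam/sample1.protein.sam   # demo stand-in file
ls -lh 2-sam/*.sam    # confirm these are the large protein-SAM files
rm 2-sam/*.sam        # delete only after the RMA files look correct
```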

Repository and dependencies:

Snakemake will also need to be installed. Snakemake v8+ is now required, and the workflows have been tested using v8.25.

A conda environment file, snakemake-environment.yml, is provided in the main directory; it supplies snakemake v8.25 along with the packages required for cluster execution. Create the environment via conda env create -f snakemake-environment.yml, and then activate it via conda activate pb-metagenomics-tools to run the workflows.

Alternatively, instructions for installing snakemake using conda can be found in the snakemake documentation.

  • MEGAN6 community edition (version 6.19+) (https://software-ab.cs.uni-tuebingen.de/download/megan6/welcome.html). To obtain MEGAN6 community edition, clone the repository as follows: git clone https://github.com/husonlab/megan-ce.git. The sam2rma and rma2info binaries are required for this pipeline; both should be present in the tools bin distributed with MEGAN. The full paths to sam2rma and rma2info must be specified in config.yaml. The binaries differ slightly between the community and ultimate editions. **If you want to access KEGG annotations, use the ultimate edition.**

The newest MEGAN mapping file for NCBI-nr accessions should also be downloaded from the MEGAN download page and unpacked before use. The current unpacked file is ~7GB. The full path to the unpacked database file (ending with .db) must be specified in config.yaml.

The NCBI-nr protein database is available at ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz. The gzipped database (nr.gz) is ~80GB in size. The database must be indexed with DIAMOND prior to running the pipeline, which can be accomplished with the following command: diamond makedb --in nr.gz --db diamond_nr_db --threads 24. This will produce a diamond_nr_db.dmnd file that is ~150GB. The full path to the diamond_nr_db.dmnd file must be specified in config.yaml.

You can always use a customized protein database, for example a subset of the NCBI nr database.

Configuring the analysis:

To configure the analysis, the main configuration file (config.yaml) and sample configuration file (configs/Sample-Config.yaml) should be edited.

Main configuration file (config.yaml)

The main configuration file contains several parameters, each of which is described in the configuration file.

NEW: You can now change the number of fasta chunks to run with DIAMOND. Previous versions hard-coded this to 4 chunks, which is optimal for a HiFi fasta of 2.5 million reads. Smaller files will benefit from fewer chunks (such as 2 or 1). You can also increase the number of chunks for larger files, but the upper limit is 9. Increasing the number of chunks may also slow down the workflow, and the recommended value for DIAMOND is 4.

Depending on your system resources, you may choose to change the number of threads used in the diamond and sam2rma settings. Additionally, the block_size parameter of diamond will affect the speed of the analysis and memory requirements.

The hit_limit argument allows you to specify the type of hit limit method and corresponding value. You can choose between the --top method or -k method, which are used with the range-culling mode (see DIAMOND documentation). The default is --top 5, meaning a hit will only be deleted if its score is more than 5% lower than that of a higher scoring hit over at least 50% of its query range. Using -k 5 instead means that a hit will only be deleted if at least 50% of its query range is spanned by at least 5 higher or equal scoring hits. In general, the -k method will keep far fewer hits, and specifying -k 1 will keep a single hit per query range. This can be useful for 1) very simple metagenomic communities, or 2) reducing the output file size. If you choose to modify the hit_limit argument, you will want to supply the complete DIAMOND flag (e.g., -k 3 or --top 10).

Finally, consider the minSupportPercent argument, which is the minimum support (as a percentage of assigned reads) required to report a taxon. The default in MEGAN is 0.05, but with HiFi the best value appears to be 0.01. This provides an optimal trade-off between precision and recall, with near-perfect detection of species down to ~0.04% abundance. To avoid any filtering based on this threshold, use a value of 0 instead. This will report ALL assigned reads, potentially including thousands of false positives at ultra-low abundances (<0.01%), similar to results from short-read methods (e.g., Kraken2, Centrifuge). Make sure you filter such files after the analysis to reduce false positives! Note: this parameter only affects the filtered RMA file; a second, unfiltered RMA file is also produced by default.

You must also specify the full paths to sam2rma, rma2info, the MEGAN mapping database file, and the indexed NCBI-nr database (diamond_nr_db.dmnd).
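As a rough sketch, the settings discussed above might appear in config.yaml something like this. The key names, nesting, and values are illustrative assumptions, not the actual file layout; follow the comments in your own copy of config.yaml.

```yaml
# Illustrative only: key names and nesting may differ from the real config.yaml.
sam2rma: /path/to/megan/tools/sam2rma
rma2info: /path/to/megan/tools/rma2info
megan_db: /path/to/megan-map.db      # unpacked MEGAN mapping file (~7GB)
db: /path/to/diamond_nr_db.dmnd      # DIAMOND-indexed NCBI-nr (~150GB)
chunks: 4                            # fasta chunks for DIAMOND (1-9; 4 recommended)
block_size: 4                        # DIAMOND block size; affects speed and memory
hit_limit: "--top 5"                 # or "-k 5" to shrink protein-SAM outputs
threads: 24
minSupportPercent: 0.01              # MEGAN default is 0.05; 0 disables filtering
```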

Sample configuration file (configs/Sample-Config.yaml)

The example sample configuration file is called Sample-Config.yaml and is located in the configs/ directory. Here you should specify the sample names that you wish to include in the analysis.
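A minimal sketch of what the sample configuration might contain is below; the exact key name is an assumption, so check the provided example file for the real structure.

```yaml
# Illustrative sketch; each name must match a SAMPLE.fasta file in inputs/
samplenames:
  - SampleA
  - SampleB
```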

IMPORTANT: All samples specified in the sample configuration file must have a fasta file of HiFi reads (SAMPLE.fasta) in the inputs/ folder. To convert HiFi fastq.gz files to FASTA files, use the following command: seqtk seq -a in.fq.gz > out.fa

The pipeline can be run for any number of samples (though be aware of the disk space requirements). You can also configure the file to run for only a subset of the samples present in the inputs/ folder. Please note that input files that do not follow the naming convention (SAMPLE.fasta) will not be recognized by the workflow. You can use the actual files or symlinks to them; both are compatible with snakemake.
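For instance, a symlink sketch (all paths here are illustrative; the demo file stands in for a real HiFi FASTA stored elsewhere on disk):

```shell
mkdir -p inputs
# demo stand-in for a real HiFi FASTA stored elsewhere on disk
printf '>read1\nACGT\n' > SampleA.hifi_reads.fasta
# link it into inputs/ under the SAMPLE.fasta naming convention;
# using an absolute path keeps the symlink valid from any working directory
ln -sf "$PWD/SampleA.hifi_reads.fasta" inputs/SampleA.fasta
head -n 1 inputs/SampleA.fasta   # → >read1
```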

Executing Snakemake

Before attempting to run this snakemake analysis, please ensure that the pre-analysis requirements have been satisfied, the analysis has been configured properly (using the general and sample configuration files), and the input files are available in the inputs/ folder.

Start a screen session (e.g. screen -S megan; the session name is up to you) so that the workflow keeps running even if your connection drops. Detach with Ctrl-a d and reattach later with screen -r megan.

Finally, you can execute the workflow using:

snakemake --snakefile Snakefile-diamond-megan.smk --configfile configs/Sample-Config.yaml -j 48 --software-deployment-method conda

There are a couple of important arguments included here:

-j 48 specifies that there are 48 threads available to use. You should change this to match the resources available. If more threads are specified in the configuration file than are available here, snakemake automatically scales them down to this number.
--software-deployment-method conda allows conda to install the programs and environments required for each step. This is essential.

Upon execution, the first step will be conda downloading packages and creating the correct environment. Afterwards, the jobs should begin running, and you will see progress on the screen.

Outputs

Successful runs will result in several new directories: