OPERA MS Utilities - CSB5/OPERA-MS GitHub Wiki

Introduction

OPERA-MS-UTILS provides utilities to customize OPERA-MS assembly, quality-check (QC) reads from different sequencing technologies, and perform essential post-assembly metagenomic analyses:

  • opera-ms-db: generate a custom database used in OPERA-MS's reference-based clustering
  • read-concordance: compute abundance profile correlation between long and short-read sequencing data
  • binning: bin OPERA-MS contigs using MetaBAT2 or MaxBin2
  • bin-evaluation: assess bin quality with CheckM
  • novel-species: analyze OPERA-MS metagenome-assembled genomes (MAGs) to identify closely related species and MAGs from novel species

Installation

After cloning the repository, most of the tools should be functional. To see which tools are functional or non-functional, and which utility commands require them, run:

python OPERA-MS-UTILS.py check-dependency

For non-functional tools, please refer to the dependency installation guide.

To use the read-concordance and novel-species utilities, tool-specific databases need to be downloaded using the following commands:

# novel-species analysis database
python OPERA-MS-UTILS.py utils-db --dbtype novel-species

# read-concordance analysis database [kraken2 database]
python OPERA-MS-UTILS.py utils-db --dbtype read-concordance

Note that if you already have a Kraken2 database (a directory containing the {hash,opts,taxo}.k2d index files), you can create a symbolic link to it instead:

ln -s /path_to/kraken2/db/ OPERA-MS/utils_db/kraken_db
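
Before linking an existing database, it can help to confirm that the directory really holds the three index files named above. The following is a minimal sketch of such a check; the function name `missing_kraken2_files` is hypothetical and not part of OPERA-MS-UTILS.

```python
from pathlib import Path

# Index files named in the text above: {hash,opts,taxo}.k2d
REQUIRED_K2D_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_kraken2_files(db_dir):
    """Return the required Kraken2 index files that are absent from db_dir."""
    db_path = Path(db_dir)
    return [name for name in REQUIRED_K2D_FILES if not (db_path / name).is_file()]
```

For example, `missing_kraken2_files("/path_to/kraken2/db/")` returns an empty list when all three index files are present, and the names of the absent files otherwise.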

Usage

All OPERA-MS-UTILS commands use a config file, out_dir/opera-ms-utils.config, which is automatically generated during OPERA-MS assembly.
To test OPERA-MS-UTILS on the OPERA-MS test files, use the following commands:

cd test_files

# read-concordance analysis
python ../OPERA-MS-UTILS.py read-concordance \
                            --config RESULTS_UTILS/opera-ms-utils.config

# bin-evaluation analysis
python ../OPERA-MS-UTILS.py bin-evaluation \
                            --config RESULTS_UTILS/opera-ms-utils.config \
                            --binner opera_ms_clusters

# novel-species identification analysis
python ../OPERA-MS-UTILS.py novel-species \
                            RESULTS_UTILS/opera-ms-utils.config \
                            --out novel_species \
                            --binner opera_ms_clusters

Note that we do not provide a test for the binning command, as the short-read coverage of the test_files dataset is too low for MetaBAT2 and MaxBin2 to produce meaningful results.

OPERA-MS-UTILS commands

opera-ms-db

Creates a custom genome database for use in OPERA-MS's reference-based clustering. The generation process requires a directory containing all genome files and a file listing the genome names and their taxonomy (only the species level is required). As an example, you can find the taxonomy file used to generate OPERA-MS's default database from GTDB representative genomes.

Usage

python OPERA-MS-UTILS.py opera-ms-db \
                         --genomes-dir <genomes> \
                         --taxonomy <taxonomy.txt> \
                         --db-name <DB>

Parameters

-g, --genomes-dir:    Directory that contains genome files
-x, --taxonomy:       Species-level taxonomy of genomes
-d, --db-name:        Database name
-t, --threads:        Number of threads [default: 2]

read-concordance

This QC tool computes the correlation of long and short-read taxonomic profiles (genus/species-level). Taxonomic profiles are computed using Kraken2.

Usage

python OPERA-MS-UTILS.py read-concordance --config <config file>

Parameters

-c, --config:                 OPERA-MS-UTILS config file
-a, --abundance-threshold:    Abundance threshold at which a taxon is considered present [default: 0.1]
-t, --thread:                 Number of threads [default: 2]

Output
Output files can be found in out_dir/read-concordance:

  • correlation_value.txt: Pearson correlation at species and genus level
  • S_abundance_comparison.txt and G_abundance_comparison.txt: species-level and genus-level abundance profiles, respectively, for short-read and long-read data
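
To illustrate the kind of comparison reported in correlation_value.txt, here is a minimal sketch that computes a Pearson correlation between a short-read and a long-read abundance profile. It is not the tool's implementation. It assumes that abundances are percentages, that a taxon is kept when it reaches the threshold in either profile, and that a taxon missing from one profile has abundance 0 there. All function names and the example values are made up for illustration.

```python
import math


def present_taxa(profile, threshold=0.1):
    """Taxa whose abundance reaches the threshold in the given profile."""
    return {taxon for taxon, abundance in profile.items() if abundance >= threshold}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def profile_correlation(short_profile, long_profile, threshold=0.1):
    # Assumption: keep a taxon if it is present in either profile;
    # a taxon absent from one profile counts as abundance 0 there.
    taxa = sorted(
        present_taxa(short_profile, threshold) | present_taxa(long_profile, threshold)
    )
    xs = [short_profile.get(t, 0.0) for t in taxa]
    ys = [long_profile.get(t, 0.0) for t in taxa]
    return pearson(xs, ys)


# Made-up abundances (in percent) for illustration only.
short_reads = {"Escherichia coli": 40.0, "Bacteroides fragilis": 35.0, "Rare sp.": 0.05}
long_reads = {"Escherichia coli": 45.0, "Bacteroides fragilis": 30.0, "Klebsiella pneumoniae": 5.0}
print(round(profile_correlation(short_reads, long_reads), 3))
```

In this toy input, "Rare sp." falls below the threshold in both profiles and is excluded, while "Klebsiella pneumoniae" is kept with a short-read abundance of 0.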

binning

Streamlined binning of OPERA-MS assembled contigs using MetaBAT2 or MaxBin2.

Usage

python OPERA-MS-UTILS.py binning --config <config file>

Parameters

-c, --config:         OPERA-MS-UTILS config file
-b, --binner:         Binning method [default: metabat2] {metabat2, maxbin2}
-s, --sample-name:    Sample name [default: OPERA-MS output folder]
-t, --thread:         Number of threads [default: 2]

Output
Bins are located in out_dir/binner{metabat,maxbin2}/all.

bin-evaluation

Streamlined bin evaluation using CheckM.

Usage

python OPERA-MS-UTILS.py bin-evaluation --config <config file>

Parameters

-c, --config:              OPERA-MS-UTILS config file
-b, --binner:              Bins for evaluation [default: metabat2] {metabat2, maxbin2, opera_ms_clusters}
-H, --high-qual-mags:      Completeness and contamination thresholds for high quality bins [default: 90,5]
-M, --medium-qual-mags:    Completeness and contamination thresholds for medium quality bins [default: 50,10]
-t, --thread:              Number of threads [default: 2]

Output:
Output files can be found in out_dir/binner{metabat2, maxbin2, opera_ms_clusters}:

  • high_quality and medium_quality directories that include sequences for high and medium quality bins
  • bin_info.txt that indicates bin quality (HIGH, MEDIUM, LOW) and various assembly statistics (e.g. completeness, N50, N90)
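
The HIGH, MEDIUM and LOW labels follow from the completeness and contamination thresholds listed under Parameters. A minimal sketch of that classification is shown below. It assumes the thresholds are inclusive (completeness at or above the first value, contamination at or below the second), which the documentation does not state, and the function names are hypothetical.

```python
def parse_thresholds(value):
    """Parse a 'completeness,contamination' string such as '90,5'."""
    completeness, contamination = (float(part) for part in value.split(","))
    return completeness, contamination


def bin_quality(completeness, contamination, high="90,5", medium="50,10"):
    """Label a bin HIGH, MEDIUM or LOW from its CheckM-style estimates."""
    # Assumption: thresholds are inclusive (>= completeness, <= contamination).
    for label, spec in (("HIGH", high), ("MEDIUM", medium)):
        min_completeness, max_contamination = parse_thresholds(spec)
        if completeness >= min_completeness and contamination <= max_contamination:
            return label
    return "LOW"
```

With the default thresholds, a bin at 95% completeness and 8% contamination is labelled MEDIUM: it is complete enough for HIGH but exceeds the 5% contamination limit.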

novel-species

Identifies species closely related to OPERA-MS MAGs, as well as MAGs from novel species, by comparison with representative genomes from GTDB v89.0 and representative MAGs from Pasolli et al. and Almeida et al. The procedure is the one described in Chng et al. Briefly, each OPERA-MS MAG is compared with the reference genomes/MAGs using the Mash distance. OPERA-MS MAGs whose closest genome/MAG has a Mash distance >0.05 are considered to come from novel species. Finally, novel MAGs are hierarchically clustered (single linkage on Mash distance) to identify species-level clusters at a threshold of 0.05.
This analysis must be executed after the binning and bin-evaluation commands.
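
The two steps described above (flagging novel MAGs, then grouping them into species-level clusters) can be sketched as follows. This is an illustrative sketch rather than the tool's code: it assumes the Mash distances have already been computed and are supplied as plain dictionaries, all names and distances are made up, and treating a distance exactly equal to the threshold as "same species" is an assumption. Cutting a single-linkage clustering at a threshold gives the same groups as the connected components of the graph that links every pair at or below that threshold, so a union-find structure is enough.

```python
from itertools import combinations


def novel_mags(closest_ref_distance, threshold=0.05):
    """MAGs whose closest reference genome/MAG lies beyond the threshold."""
    return sorted(m for m, d in closest_ref_distance.items() if d > threshold)


def single_linkage_clusters(mags, distance, threshold=0.05):
    """Cut a single-linkage clustering of `mags` at `threshold`.

    `distance` maps frozenset({a, b}) to the distance between a and b.
    """
    parent = {m: m for m in mags}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for a, b in combinations(mags, 2):
        # Assumption: a distance equal to the threshold still links the pair.
        if distance[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for m in mags:
        clusters.setdefault(find(m), []).append(m)
    return sorted(sorted(c) for c in clusters.values())


# Made-up distances for illustration only.
closest = {"mag1": 0.01, "mag2": 0.08, "mag3": 0.09, "mag4": 0.20}
pairwise = {
    frozenset(("mag2", "mag3")): 0.03,
    frozenset(("mag2", "mag4")): 0.15,
    frozenset(("mag3", "mag4")): 0.12,
}
novel = novel_mags(closest)
print(novel)
print(single_linkage_clusters(novel, pairwise))
```

In this toy input, mag1 is close to a known reference and is not novel, while mag2 and mag3 are within 0.05 of each other and end up in the same species-level cluster.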

Usage

python OPERA-MS-UTILS.py novel-species <config file(s)> --out <outdir>

Parameters

-o, --out:                  Output directory
-b, --binner:               Bins used for the analysis [default: metabat2] {metabat2, maxbin2, opera_ms_clusters}
-q, --mags-qual:            Quality of the MAGs used [default: high] {medium, high}
-c, --cluster-threshold:    Maximum distance at which 2 genomes are considered to be from the same species [default: 0.05]
-t, --threads:              Number of threads [default: 2]

Output:
Analysis results are summarized in the out_path/MAGs_info.txt file.

Dependencies

  • python 3.6
  • Mash - (tested with version 2.2)
  • Kraken2 - (tested with version 2.0.7)
  • MetaBAT2 - (tested with version 2.12.1)
  • MaxBin2 - (tested with version 2.2.7)
  • CheckM - (tested with version 1.1.2)

Dependency installation guide

Most of the tools have their binaries included; however, MaxBin2 and CheckM need to be installed manually. If you already have a local installation of a tool, you can create a symbolic link named after the tool (maxbin2 and/or checkm) in the OPERA-MS/tools_opera_ms directory so that OPERA-MS-UTILS can use it. If not, the following guide provides the commands and hints needed to install these dependencies.


Create an OPERA-MS conda environment

conda create -n opera-ms python=3.6
conda activate opera-ms


MaxBin2 installation using conda

conda install -c bioconda maxbin2
ln -sf $(dirname $(readlink -f `which run_MaxBin.pl`)) tools_opera_ms/maxbin2

Finally, the MaxBin2 "setting" file should be updated as described here.


CheckM installation using conda

conda install -c bioconda checkm-genome
ln -sf $(readlink -f `which checkm`) tools_opera_ms/

Lastly, the CheckM data root (checkm data setRoot) should be set up as explained here.


Python library installation

pip install pandas==1.0.3
pip install scikit-learn==0.22.2