OPERA‐MS‐DB from GTDB release - CSB5/OPERA-MS GitHub Wiki

Generating OPERA-MS database using GTDB release files

This page explains how to produce a fresh OPERA-MS database from the latest GTDB release.

Step 1 - Download genomes and taxonomy files

Download the files from the GTDB server. For example:

wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/genomic_files_reps/gtdb_genomes_reps_r214.tar.gz
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/bac120_taxonomy_r214.tsv.gz
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/ar53_taxonomy_r214.tsv.gz
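The three URLs above differ only by the release number, so they can be generated from it. The following is a minimal sketch, assuming other releases keep the same directory layout and file names as release 214.1 (this is not guaranteed, so check the server listing first). `gtdb_urls` is a hypothetical helper, not part of OPERA-MS or GTDB.

```shell
# Print the three download URLs for a given GTDB release.
# Assumption: the server layout matches the one used by release 214.1.
gtdb_urls() {
    release="$1"   # major release number, e.g. 214
    minor="$2"     # minor version, e.g. 1
    base="https://data.gtdb.ecogenomic.org/releases/release${release}/${release}.${minor}"
    printf '%s\n' \
        "${base}/genomic_files_reps/gtdb_genomes_reps_r${release}.tar.gz" \
        "${base}/bac120_taxonomy_r${release}.tsv.gz" \
        "${base}/ar53_taxonomy_r${release}.tsv.gz"
}

gtdb_urls 214 1
```

The output can be piped to wget, for example `gtdb_urls 214 1 | wget -c -i -`, where `-c` resumes an interrupted download and `-i -` reads the URLs from standard input.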

Step 2 - Untar all genome files

Extracting all the genome files requires a lot of disk space: the current release tarball (release 214.1) is around 75 GB and must be fully extracted.

tar xzvf gtdb_genomes_reps_r214.tar.gz
rm gtdb_genomes_reps_r214.tar.gz
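Because the tarball and the extracted files coexist until the tarball is removed, it can help to check the free space before extracting. This is a minimal sketch: `has_free_kb` is a hypothetical helper, and the 150 GiB threshold is an assumption (roughly twice the tarball size, since the extracted genomes are themselves gzip-compressed), not a measured figure.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 KiB free.
has_free_kb() {
    dir="$1"
    needed_kb="$2"
    # df -P prints one line per filesystem; column 4 is the available space in KiB
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$needed_kb" ]
}

# Assumed requirement: 150 GiB expressed in KiB. Adjust to your own estimate.
NEEDED_KB=$((150 * 1024 * 1024))
if has_free_kb . "$NEEDED_KB"; then
    echo "enough free space to extract"
else
    echo "less than 150 GiB free in the current directory" >&2
fi
```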

Step 3 - Prepare files

Concatenate the two taxonomy files

zcat ar53_taxonomy_r214.tsv.gz bac120_taxonomy_r214.tsv.gz | gzip > all_taxonomy_r214.tsv.gz

List the genome files

find gtdb_genomes_reps_r214 -type f -name '*.fna.gz' | gzip > all_genomes.txt.gz
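Before running the script, the two prepared files can be sanity-checked. The sketch below counts the listed genomes and looks for malformed taxonomy lines. It assumes each taxonomy line holds an accession and a lineage separated by a single tab; both function names are hypothetical.

```shell
# Count the lines of a gzip-compressed text file (gzip -dc behaves like zcat).
count_gz_lines() {
    gzip -dc "$1" | wc -l | tr -d ' '
}

# Count the lines that do not have exactly two tab-separated fields.
# Assumption: a valid taxonomy line is "<accession><TAB><lineage>".
count_bad_taxonomy_lines() {
    gzip -dc "$1" | awk -F '\t' 'NF != 2 { bad++ } END { print bad + 0 }'
}

# Example calls, to be run where the files from this step were created:
#   count_gz_lines all_genomes.txt.gz
#   count_bad_taxonomy_lines all_taxonomy_r214.tsv.gz
```

An empty genome list or a non-zero count of malformed lines suggests that one of the earlier steps did not complete.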

Step 4 - Run the Python script

Installing the tqdm Python package is optional, but it enables progress bars.

python {OPERA-MS}/src_utils/make_operams_db_from_gtdb.py all_genomes.txt.gz all_taxonomy_r214.tsv.gz

By default, the script only creates symlinks to the original files. To move the files instead, use the --move argument. Use --threads {n} to specify the number of CPUs and --outdir to specify a different output directory (the default is operams_db).
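As an illustration of combining these options, the sketch below prints the full command instead of running it, so it can be reviewed first. The checkout path, the thread count, and the output directory name are placeholders, and placing the options after the two input files is an assumption about how the script parses its arguments.

```shell
# Hypothetical location of the OPERA-MS checkout; adjust to your setup.
OPERAMS_DIR="${OPERAMS_DIR:-/path/to/OPERA-MS}"

# Print (rather than run) the database-building command.
# Assumption: options may follow the two positional input files.
make_db_cmd() {
    threads="$1"   # value passed to --threads
    outdir="$2"    # value passed to --outdir
    echo "python ${OPERAMS_DIR}/src_utils/make_operams_db_from_gtdb.py" \
         "all_genomes.txt.gz all_taxonomy_r214.tsv.gz" \
         "--threads ${threads} --outdir ${outdir}"
}

make_db_cmd 8 operams_db_r214
```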

Step 5 - Symlink your database to your OPERA-MS directory

The database directory inside your OPERA-MS directory must be named OPERA-MS-DB. Do not symlink the whole output directory; symlink the folders and files within it instead (the genomes_X folders and the genome_length.txt file).

From your OPERA-MS directory:

mkdir OPERA-MS-DB
ln -s {absolute_outdir_path}/* OPERA-MS-DB
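The path must be absolute because a relative symlink target is resolved from the directory that contains the link, not from the directory where ln was run. A minimal sketch for spotting links that do not resolve (`broken_links` is a hypothetical helper, and `-maxdepth` assumes GNU or BSD find):

```shell
# Print every symlink directly inside directory $1 whose target does not exist.
broken_links() {
    # test -e follows the link, so it fails when the target is missing
    find "$1" -maxdepth 1 -type l ! -exec test -e {} \; -print
}

# Example call from the OPERA-MS directory (prints nothing if all links resolve):
#   broken_links OPERA-MS-DB
```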

Step 6 - Run mash

Install mash if it is not already installed:

mamba install -c bioconda mash

Create the list of input files and run mash on it. Both commands must be launched from the OPERA-MS root directory, and the mash output file must be named genomes.msh. You can use more than 4 CPUs by changing the -p option.

find OPERA-MS-DB/ -type f -name '*.fna.gz' > OPERA-MS-DB/genomes_list.txt
mash sketch -o OPERA-MS-DB/genomes.msh -p 4 -l OPERA-MS-DB/genomes_list.txt
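Since the list contains paths that go through symlinks, it can be worth confirming that every listed genome is readable before starting a long sketching run. A minimal sketch, with `missing_genomes` as a hypothetical helper:

```shell
# Print every path listed in file $1 that does not resolve to a readable file.
missing_genomes() {
    while IFS= read -r path; do
        [ -r "$path" ] || printf '%s\n' "$path"
    done < "$1"
}

# Example call from the OPERA-MS root directory (prints nothing if all files are readable):
#   missing_genomes OPERA-MS-DB/genomes_list.txt
```

Once the sketch has been written, `mash info -H OPERA-MS-DB/genomes.msh` should print its header only, which gives a quick way to confirm that the file was created.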