OPERA‐MS‐DB from GTDB release - CSB5/OPERA-MS GitHub Wiki
Generating OPERA-MS database using GTDB release files
This page explains how to produce a fresh OPERA-MS database from the latest GTDB release.
Step 1 - Download genomes and taxonomy files
Download the files from the GTDB server. For example:
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/genomic_files_reps/gtdb_genomes_reps_r214.tar.gz
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/bac120_taxonomy_r214.tsv.gz
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/ar53_taxonomy_r214.tsv.gz
Step 2 - Untar all genome files
Extracting all the genome files requires a lot of disk space: the current release tarball (release 214.1) is about 75 GB and has to be extracted in full.
tar xzvf gtdb_genomes_reps_r214.tar.gz
rm gtdb_genomes_reps_r214.tar.gz
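Before extracting, it can help to check how much space is free on the target filesystem. The sketch below is only an illustration: free_gb and has_free_gb are hypothetical helper names, and the 150 GB threshold is an assumption (roughly the tarball plus an extracted copy of similar size), so adjust it to your setup.

```shell
# Print the free space, in whole GB, on the filesystem holding directory $1.
# -P forces one output line per filesystem; -k reports 1024-byte blocks.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if at least $2 GB are free under directory $1.
has_free_gb() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# 150 is an assumed threshold, not a figure from the GTDB documentation.
if has_free_gb . 150; then
    echo "enough free space to extract here"
else
    echo "less than 150 GB free in $(pwd)" >&2
fi
```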
Step 3 - Prepare files
Concatenate the two taxonomy files
zcat ar53_taxonomy_r214.tsv.gz bac120_taxonomy_r214.tsv.gz | gzip > all_taxonomy_r214.tsv.gz
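A quick sanity check on the merged file is to count rows per domain, which should show both archaeal and bacterial entries. This is a minimal sketch, assuming the taxonomy table is tab-separated with the accession in the first column and a lineage beginning with d__Archaea or d__Bacteria in the second; count_domains is a hypothetical helper name.

```shell
# Count taxonomy rows per domain in a gzipped taxonomy table ($1).
# Assumes column 2 is a ';'-separated lineage whose first rank is the domain.
count_domains() {
    gzip -dc "$1" | awk -F '\t' '
        { split($2, ranks, ";"); n[ranks[1]]++ }
        END { for (d in n) print d "\t" n[d] }' | sort
}

# Usage, from the directory holding the merged file:
#   count_domains all_taxonomy_r214.tsv.gz
```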
List genome files
find gtdb_genomes_reps_r214 -type f -name '*.fna.gz' | gzip > all_genomes.txt.gz
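It may be worth confirming that every listed genome has a taxonomy row before running the script. The sketch below rests on two assumptions that are not stated on this page: genome file names begin with the GCA_/GCF_ accession, and taxonomy identifiers are that accession with an RS_ or GB_ prefix. missing_taxonomy is a hypothetical helper name.

```shell
# Print genome accessions that have no row in the taxonomy table.
# $1: gzipped list of genome paths, $2: gzipped taxonomy table.
missing_taxonomy() {
    tax_ids=$(mktemp)
    genome_ids=$(mktemp)
    # Assumed format: drop the RS_/GB_ prefix from the first taxonomy column.
    gzip -dc "$2" | cut -f 1 | sed -E 's/^(RS|GB)_//' | sort -u > "$tax_ids"
    # Assumed format: strip the directory, keep the leading GCA_/GCF_ accession.
    gzip -dc "$1" | sed -E 's#.*/##; s/^(GC[AF]_[0-9]+\.[0-9]+).*/\1/' | sort -u > "$genome_ids"
    # comm -23 keeps lines found only in the first (genome) file.
    comm -23 "$genome_ids" "$tax_ids"
    rm -f "$tax_ids" "$genome_ids"
}

# Usage: missing_taxonomy all_genomes.txt.gz all_taxonomy_r214.tsv.gz
```

An empty output means every listed genome was matched under these assumptions.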
Step 4 - Run the Python script
Installing tqdm is optional, but it provides progress bars.
python {OPERA-MS}/src_utils/make_operams_db_from_gtdb.py all_genomes.txt.gz all_taxonomy_r214.tsv.gz
By default the script only creates symlinks to the original files; to move the files instead, use the --move argument. Use --threads {n} to specify the number of CPUs and --outdir to specify a different output directory (default is operams_db).
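As an illustration, the options can be combined in a small wrapper. This is only a sketch: build_operams_db, OPERAMS_DIR, THREADS and OUTDIR are hypothetical names, the fallback of 1 thread is an assumption rather than the script's documented default, and placing the options after the two input files assumes the script accepts them in that position.

```shell
# Hypothetical wrapper around the command above.
# OPERAMS_DIR, THREADS and OUTDIR are illustrative shell variables,
# not settings read by OPERA-MS itself.
build_operams_db() {
    python "${OPERAMS_DIR:?set OPERAMS_DIR to your OPERA-MS directory}/src_utils/make_operams_db_from_gtdb.py" \
        all_genomes.txt.gz all_taxonomy_r214.tsv.gz \
        --threads "${THREADS:-1}" \
        --outdir "${OUTDIR:-operams_db}"
}

# Usage:
#   OPERAMS_DIR=/path/to/OPERA-MS THREADS=8 build_operams_db
```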
Step 5 - Symlink your database to your OPERA-MS directory
The database directory must be named OPERA-MS-DB and be located inside your OPERA-MS directory. Do not symlink the output directory itself; symlink the folders and files within it (the genomes_X folders and the genome_length.txt file).
cd OPERA-MS
mkdir OPERA-MS-DB
ln -s {absolute_outdir_path}/* OPERA-MS-DB
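The same linking step can be written as a loop that resolves the output directory to an absolute path first, which avoids dangling links when a relative path is given by mistake. A minimal sketch, with link_operams_db as a hypothetical helper name:

```shell
# Link every entry of the script's output directory into OPERA-MS-DB.
# $1: output directory of the script, $2: OPERA-MS root directory.
link_operams_db() {
    # Resolve to an absolute path so the links stay valid from any directory.
    src=$(cd "$1" && pwd) || return 1
    mkdir -p "$2/OPERA-MS-DB"
    for entry in "$src"/*; do
        # Skip the unexpanded pattern if the output directory is empty.
        [ -e "$entry" ] || continue
        ln -s "$entry" "$2/OPERA-MS-DB/"
    done
}

# Usage: link_operams_db operams_db /path/to/OPERA-MS
```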
Step 6 - Run mash
Install mash if it is not already installed:
mamba install -c bioconda mash
Create the list of input files and run mash on it. Both commands must be launched from the OPERA-MS root directory, and the mash output file must be named genomes.msh. You can use more than 4 CPUs by changing the -p option.
find -L OPERA-MS-DB/ -type f -name '*.fna.gz' > OPERA-MS-DB/genomes_list.txt
mash sketch -o OPERA-MS-DB/genomes.msh -p 4 -l OPERA-MS-DB/genomes_list.txt
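Because the database is made of symlinks, a moved or deleted source file leaves a broken entry in the list. The sketch below prints such entries so they can be fixed before sketching; broken_entries is a hypothetical helper name.

```shell
# Print the entries of a file list ($1) that do not resolve to a readable file,
# for example symlinks whose target has been moved.
broken_entries() {
    while IFS= read -r path; do
        [ -r "$path" ] || printf '%s\n' "$path"
    done < "$1"
}

# Usage, from the OPERA-MS root directory:
#   broken_entries OPERA-MS-DB/genomes_list.txt
# After sketching, the result can be inspected with:
#   mash info OPERA-MS-DB/genomes.msh
```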