Microbiome Helper 2 Setting up environments for analysis - LangilleLab/microbiome_helper GitHub Wiki
Authors: Robyn Wright Modifications by: NA
Note that this is still a work in progress! We don't guarantee that everything will work :)
Launched instance from previous CBW-ICG-2024 image.
Update conda:
conda update conda
Install latest QIIME2:
conda env create \
--name qiime2-amplicon-2025.4 \
--file https://raw.githubusercontent.com/qiime2/distributions/refs/heads/dev/2025.4/amplicon/released/qiime2-amplicon-ubuntu-latest-conda.yml
Got this warning:
For Linux 64, Open MPI is built with CUDA awareness but this support is disabled by default.
To enable it, please set the environment variable OMPI_MCA_opal_cuda_support=true before
launching your MPI processes. Equivalently, you can set the MCA parameter in the command line:
mpiexec --mca opal_cuda_support 1 ...
In addition, the UCX support is also built but disabled by default.
To enable it, first install UCX (conda install -c conda-forge ucx). Then, set the environment
variables OMPI_MCA_pml="ucx" OMPI_MCA_osc="ucx" before launching your MPI processes.
Equivalently, you can set the MCA parameters in the command line:
mpiexec --mca pml ucx --mca osc ucx ...
Note that you might also need to set UCX_MEMTYPE_CACHE=n for CUDA awareness via UCX.
Please consult UCX's documentation for detail.
Install fastqc and multiqc in QIIME2 environment:
conda activate qiime2-amplicon-2025.4
mamba install bioconda::fastqc
mamba install bioconda::multiqc
Update R:
sudo apt update
sudo apt install r-base
Remove previous environments:
conda env remove -n qiime2-amplicon-2024.2-backup
conda env remove -n picrust2
conda env remove -n biobakery3
conda env remove -n anvio-7
conda env remove -n rgi
conda env remove -n checkm
conda env remove -n functional
conda env remove -n taxonomic
Install PICRUSt2 from conda:
conda create -n picrust2-v2.6.2
mamba activate picrust2-v2.6.2
mamba install bioconda::picrust2
Install kneaddata:
mamba create -n kneaddata-v0.12.2
mamba activate kneaddata-v0.12.2
mamba install bioconda::kneaddata
mamba install bowtie2 #unnecessary as already installed
mamba install bioconda::trimmomatic
mamba install bioconda::trf
mamba install bioconda::fastqc
mamba install bioconda::multiqc
mamba install conda-forge::parallel
Install Kraken2:
mamba create -n kraken2-v2.14
mamba activate kraken2-v2.14
mamba install bioconda::bracken
mamba install bioconda::kraken2
mamba install conda-forge::parallel
kraken2 --version always reports 2.1.3, but mamba reports this when installing:
  Package   Version   Build              Channel    Size
  ──────────────────────────────────────────────────────
  Reinstall:
  ○ kraken2   2.14    pl5321h077b44d_0   bioconda   Cached
Install Anvi'o:
conda create -y --name anvio-8 python=3.10
conda activate anvio-8
mamba install -y -c conda-forge -c bioconda python=3.10 \
sqlite=3.46 prodigal idba mcl muscle=3.8.1551 famsa hmmer diamond \
blast megahit spades bowtie2 bwa graphviz "samtools>=1.9" \
trimal iqtree trnascan-se fasttree vmatch r-base r-tidyverse \
r-optparse r-stringi r-magrittr bioconductor-qvalue meme ghostscript \
nodejs=20.12.2
mamba install -y -c bioconda fastani
curl -L https://github.com/merenlab/anvio/releases/download/v8/anvio-8.tar.gz \
--output anvio-8.tar.gz
pip install anvio-8.tar.gz
mamba install bioconda::concoct
mamba install bioconda::metabat2
mamba install bioconda::maxbin2
mamba install bioconda::das_tool
mamba install bioconda::binsanity
mamba install bioconda::gtdbtk
mamba install usearch
#note that at some point I needed to downgrade scikit-learn to v 1.1.0 when I got pickle errors
#pip install scikit-learn==1.1.0
Install CheckM2:
mamba create -n checkm2 -c bioconda -c conda-forge checkm2
Install RGI (for CARD):
mamba create -n rgi-v6.0.4
mamba activate rgi-v6.0.4
mamba install bioconda::rgi
mamba install conda-forge::parallel
RStudio server:
sudo apt update
sudo apt upgrade
sudo apt-get install r-base
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2025.05.1-513-amd64.deb
sudo gdebi rstudio-server-2025.05.1-513-amd64.deb
Programs for getting a tree of metagenomic reads:
mamba create -n get_tree python=3.9
mamba activate get_tree
mamba install conda-forge::ete3
mamba install conda-forge::pandas
Programs for functional annotation of metagenomic reads with HMMs:
mamba create -n annotate_hmm
mamba activate annotate_hmm
mamba install anaconda::biopython
mamba install bioconda::clustalo
mamba install bioconda::hmmer
mamba install bioconda::raxml
mamba install bioconda::epa-ng
mamba install bioconda::gappa
mamba install conda-forge::r-castor
mamba install conda-forge::ete3
mamba install conda-forge::pandas
These commands may vary depending on your operating system, but these are the details of the setup that has worked for us.
These are installed using an install of conda/Anaconda that is accessible to everyone in /opt/anaconda3/, and all users are added to a group anaconda.
conda create -n quality_control_feb2026
conda activate quality_control_feb2026
conda install bioconda::fastqc #version 0.12.1
conda install bioconda::multiqc #version 1.33
conda install bioconda::kneaddata #version 0.12.4
conda install conda-forge::parallel #version 20260122
Note that I have found it necessary/helpful to first set:
conda config --set channel_priority flexible
conda env create \
--name qiime2-amplicon-2026.1 \
--file https://raw.githubusercontent.com/qiime2/distributions/refs/heads/dev/2026.1/amplicon/released/qiime2-amplicon-ubuntu-latest-conda.yml
Install MMseqs2:
conda create -n mmseqs2-18.8cc5c
conda activate mmseqs2-18.8cc5c
conda install bioconda::mmseqs2
Create UniRef90 database:
#Done on 2nd Feb 2026
mkdir mmseqs2_db #note that you should first navigate to wherever you'd like to install this!
mmseqs databases UniRef90 mmseqs2_db/UniRef90_2026-01 /tmp
In our functional annotation pipeline, we use information on gene lengths for normalisation, and we also map UniRef90 IDs to EC numbers. To do this, we have generated several files containing this information, intended to be used with our scripts. Details of the steps required to make these files are below, but note that some of them are large and did take a while to generate.
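As a rough illustration of why gene lengths matter for normalisation (the exact formula used in our scripts may differ), here is a minimal sketch assuming reads-per-kilobase-style scaling, with made-up gene names and counts:

```python
# Hypothetical read counts per UniRef90 gene family (made up for illustration)
counts = {'UniRef90_A': 300, 'UniRef90_B': 300}
# Gene lengths in nucleotides (as stored in the GeneLength file described below)
lengths = {'UniRef90_A': 1000, 'UniRef90_B': 3000}

# Reads-per-kilobase style normalisation: at equal true abundance, a longer
# gene accumulates more reads, so divide each count by the length in kb
rpk = {gene: counts[gene] / (lengths[gene] / 1000) for gene in counts}
print(rpk)  # {'UniRef90_A': 300.0, 'UniRef90_B': 100.0}
```

After normalisation the two genes no longer look equally abundant: the count for the 3 kb gene is scaled down threefold relative to the 1 kb gene.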
Get the UniRef90 fasta file so we can get gene length information (this file is ~40GB):
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
Get the file to take UniRef90 to EC mapping information from (and unzip it - note this file is ~500GB uncompressed):
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz
gunzip uniprot_trembl.dat.gz
This followed the conversation from here.
Get the EC number descriptions:
wget https://ftp.expasy.org/databases/enzyme/enzyme.dat
Make a dictionary of EC numbers to their descriptions (Python code):
import pickle
descriptions = {}
for row in open('enzyme.dat', 'r'):
    row = row.replace('\n', '').split('   ') #fields are separated by three spaces
    if row[0] == 'ID':
        this_id = row[1]
    if row[0] == 'DE':
        descriptions[this_id] = row[1]
with open('EC_descriptions.dict', 'wb') as f:
    pickle.dump(descriptions, f)
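The parsing logic above can be sanity-checked on a miniature snippet in the enzyme.dat flat-file format (a two-letter line code, three spaces, then the content); the entries below are real-looking but hard-coded for illustration:

```python
# Miniature enzyme.dat-style snippet (hard-coded for illustration):
# two-letter line code, three spaces, then the line content
sample = """ID   1.1.1.1
DE   Alcohol dehydrogenase.
ID   1.1.1.2
DE   Alcohol dehydrogenase (NADP(+)).
"""

descriptions = {}
for row in sample.splitlines():
    row = row.split('   ')
    if row[0] == 'ID':
        this_id = row[1]       # current EC number
    if row[0] == 'DE':
        descriptions[this_id] = row[1]  # description for the current EC number

print(descriptions['1.1.1.1'])  # Alcohol dehydrogenase.
```

Note that entries with multi-line DE descriptions keep only the last DE line with this approach.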
Get gene lengths from the fasta file - note that these sequences are amino acids, so the lengths are multiplied by 3 to give nucleotide lengths for our purposes of working with metagenomic data:
from Bio import SeqIO
import pickle
import bz2
import gzip
lengths = {}
with gzip.open('uniref90.fasta.gz', 'rt') as f:
    for record in SeqIO.parse(f, "fasta"):
        lengths[record.name] = len(str(record.seq))*3
with bz2.BZ2File('GeneLength.pbz2', 'wb') as f:
    pickle.dump(lengths, f)
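Since uniref90.fasta.gz is ~40GB, the amino-acid-to-nucleotide conversion can be illustrated on a tiny in-memory FASTA record without Biopython (the record name and sequence below are made up; a protein of N residues is encoded by 3N nucleotides, not counting the stop codon):

```python
# Made-up two-line FASTA record, 20 residues total
fasta = """>UniRef90_EXAMPLE made-up record for illustration
MKTAYIAKQR
QISFVKSHFS
"""

lengths = {}
name = None
for line in fasta.splitlines():
    if line.startswith('>'):
        # record name = first whitespace-separated token after '>'
        name = line[1:].split()[0]
        lengths[name] = 0
    else:
        # amino-acid residues -> nucleotides (x3)
        lengths[name] += len(line.strip()) * 3

print(lengths['UniRef90_EXAMPLE'])  # 20 residues * 3 = 60
```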
Get the mapping from UniRef90 to EC numbers:
import pickle
import bz2
mapping = {}
count = 0
for row in open('uniprot_trembl.dat', 'r'):
    if count % 10000 == 0: print(count, len(mapping))
    count += 1
    this_row = row.replace('\n', '').split('   ')
    if this_row[0] == 'ID':
        this_id = this_row[1]
    elif 'EC=' in row:
        mapping[this_id] = row.split('EC=')[1].split(' ')[0].replace(';\n', '')
#convert UniProt entry names (ACCESSION_SPECIES) to UniRef90-style IDs
new_mapping = {}
for up in mapping:
    new_mapping['UniRef90_'+up.split('_')[0]] = mapping[up]
with bz2.BZ2File('ECmapped_2026-01.pbz2', 'wb') as f:
    pickle.dump(new_mapping, f, protocol=0)
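The two parsing steps can be sketched on made-up uniprot_trembl.dat-style lines (the entry name and EC number below are illustrative, not real entries):

```python
# Made-up flat-file lines: an ID line (three-space-separated fields) and a
# DE line carrying an EC number
id_line = 'ID   A0A023GPI8_CANBL   Unreviewed;   256 AA.'
de_line = 'DE   EC=3.1.1.3;\n'

# Entry name is the second three-space-separated field of the ID line
this_id = id_line.split('   ')[1]
# EC number is whatever follows 'EC=' up to the terminating ';'
ec = de_line.split('EC=')[1].split(' ')[0].replace(';\n', '')

# TrEMBL entry names have the form ACCESSION_SPECIES; the accession part is
# what appears in UniRef90 cluster IDs
mapping = {'UniRef90_' + this_id.split('_')[0]: ec}
print(mapping)  # {'UniRef90_A0A023GPI8': '3.1.1.3'}
```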
Add the previous mapping and the HUMAnN mapping (v3.6 utility mapping file) to the mapping file that we just made:
import pickle
import bz2
humann = 'map_level4ec_uniref90.txt'
humann_map = {}
for row in open(humann, 'r'):
    row = row.replace('\n', '').split('\t')
    ec = row[0]
    for uniref_id in row[1:]:
        humann_map[uniref_id] = ec
with bz2.BZ2File('ECmapped_2026-01.pbz2', 'rb') as f:
    new_mapping = pickle.load(f)
with bz2.BZ2File('/bigpool/shared/mmseqs2_db/UniRef90_Dhwani/ECmapped.pbz2', 'rb') as f:
    old_mapping = pickle.load(f)
#keeping priority by age of the databases (later operands of | win): the new_mapping made from the 2026-01 download takes priority, then the HUMAnN v3.6 mapping, then the older one that Dhwani put together
combined_mapping = old_mapping | humann_map | new_mapping
with bz2.BZ2File('ECmapped.pbz2', 'wb') as f:
    pickle.dump(combined_mapping, f, protocol=0)
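The priority order of the dict union above can be checked on toy dicts with deliberately conflicting keys (keys and EC numbers below are made up); later operands of `|` win, which is why the newest mapping goes last:

```python
# Toy mappings with overlapping keys (made up for illustration)
old_mapping = {'UniRef90_A': '1.1.1.1', 'UniRef90_B': '2.2.2.2'}
humann_map = {'UniRef90_B': '3.3.3.3', 'UniRef90_C': '4.4.4.4'}
new_mapping = {'UniRef90_C': '5.5.5.5'}

# dict union (Python 3.9+): for duplicate keys, the rightmost operand wins
combined = old_mapping | humann_map | new_mapping
print(combined)
# UniRef90_B comes from humann_map, UniRef90_C from new_mapping
```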
Install Kraken2:
mamba create -n kraken2-v2.17.1
mamba activate kraken2-v2.17.1
conda install bioconda::bracken #version
conda install bioconda::kraken2 #version 2.17.1
conda install conda-forge::parallel #version 20260122
Install GeCoCheck (genome_coverage_checker):
cd /home/robyn/tools/genome_coverage_checker_all/v0.0.4
wget https://github.com/R-Wright-1/genome_coverage_checker/archive/refs/tags/v0.0.4.tar.gz
tar -xvf v0.0.4.tar.gz
cd genome_coverage_checker-0.0.4
conda env create --name GeCoCheck-v0.0.4 -f coveragechecker-env.yaml
conda activate GeCoCheck-v0.0.4
pip install --editable .
Install PICRUSt2 from source:
wget https://github.com/picrust/picrust2/archive/refs/tags/v2.6.3.zip
unzip v2.6.3.zip
cd picrust2-2.6.3/
conda env create -f picrust2-env.yaml
conda activate picrust2
pip install --editable .