Microbiome Helper 2 Setting up environments for analysis - LangilleLab/microbiome_helper GitHub Wiki

Authors: Robyn Wright Modifications by: NA

Note that this is still a work in progress! We don't guarantee that everything will work :)

Setup of AWS MH2-2025 image

Launched instance from previous CBW-ICG-2024 image.

Update conda:

conda update conda

Install latest QIIME2:

conda env create \
  --name qiime2-amplicon-2025.4 \
  --file https://raw.githubusercontent.com/qiime2/distributions/refs/heads/dev/2025.4/amplicon/released/qiime2-amplicon-ubuntu-latest-conda.yml

Got this warning:

For Linux 64, Open MPI is built with CUDA awareness but this support is disabled by default.                                                   
To enable it, please set the environment variable OMPI_MCA_opal_cuda_support=true before                                                       
launching your MPI processes. Equivalently, you can set the MCA parameter in the command line:                                                 
mpiexec --mca opal_cuda_support 1 ...                                                                                                          
                                                                                                                                               
In addition, the UCX support is also built but disabled by default.                                                                            
To enable it, first install UCX (conda install -c conda-forge ucx). Then, set the environment                                                  
variables OMPI_MCA_pml="ucx" OMPI_MCA_osc="ucx" before launching your MPI processes.                                                           
Equivalently, you can set the MCA parameters in the command line:                                                                              
mpiexec --mca pml ucx --mca osc ucx ...                                                                                                        
Note that you might also need to set UCX_MEMTYPE_CACHE=n for CUDA awareness via UCX.                                                           
Please consult UCX's documentation for detail.   

Install fastqc and multiqc in QIIME2 environment:

conda activate qiime2-amplicon-2025.4
mamba install bioconda::fastqc
mamba install bioconda::multiqc

Update R:

sudo apt update
sudo apt install r-base

Remove previous environments:

conda env remove -n qiime2-amplicon-2024.2-backup
conda env remove -n picrust2
conda env remove -n biobakery3
conda env remove -n anvio-7
conda env remove -n rgi
conda env remove -n checkm
conda env remove -n functional
conda env remove -n taxonomic

Install PICRUSt2 from conda:

conda create -n picrust2-v2.6.2
mamba activate picrust2-v2.6.2
mamba install bioconda::picrust2

Install kneaddata:

mamba create -n kneaddata-v0.12.2
mamba activate kneaddata-v0.12.2
mamba install bioconda::kneaddata
mamba install bowtie2 #unnecessary as already installed
mamba install bioconda::trimmomatic
mamba install bioconda::trf
mamba install bioconda::fastqc
mamba install bioconda::multiqc
mamba install conda-forge::parallel

Install Kraken2:

mamba create -n kraken2-v2.14
mamba activate kraken2-v2.14
mamba install bioconda::bracken
mamba install bioconda::kraken2
mamba install conda-forge::parallel

Kraken 2 always reports its version as 2.1.3, even though the installer lists version 2.14:

Package    Version  Build             Channel        Size
─────────────────────────────────────────────────────────────
  Reinstall:
─────────────────────────────────────────────────────────────

  o kraken2     2.14  pl5321h077b44d_0  bioconda     Cached

Install Anvi'o:

conda create -y --name anvio-8 python=3.10
conda activate anvio-8
mamba install -y -c conda-forge -c bioconda python=3.10 \
        sqlite=3.46 prodigal idba mcl muscle=3.8.1551 famsa hmmer diamond \
        blast megahit spades bowtie2 bwa graphviz "samtools>=1.9" \
        trimal iqtree trnascan-se fasttree vmatch r-base r-tidyverse \
        r-optparse r-stringi r-magrittr bioconductor-qvalue meme ghostscript \
        nodejs=20.12.2
mamba install -y -c bioconda fastani
curl -L https://github.com/merenlab/anvio/releases/download/v8/anvio-8.tar.gz \
        --output anvio-8.tar.gz
pip install anvio-8.tar.gz
mamba install bioconda::concoct
mamba install bioconda::metabat2
mamba install bioconda::maxbin2
mamba install bioconda::das_tool
mamba install bioconda::binsanity
mamba install bioconda::gtdbtk
mamba install usearch

#note that at some point I needed to downgrade scikit-learn to v 1.1.0 when I got pickle errors
#pip install scikit-learn==1.1.0

Install CheckM2:

mamba create -n checkm2 -c bioconda -c conda-forge checkm2

Install RGI (for CARD):

mamba create -n rgi-v6.0.4
mamba activate rgi-v6.0.4
mamba install bioconda::rgi
mamba install conda-forge::parallel

RStudio server:

sudo apt update 
sudo apt upgrade
sudo apt-get install r-base
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2025.05.1-513-amd64.deb
sudo gdebi rstudio-server-2025.05.1-513-amd64.deb

Programs for getting a tree of metagenomic reads:

mamba create -n get_tree python=3.9
mamba activate get_tree
mamba install conda-forge::ete3
mamba install conda-forge::pandas

Programs for functional annotation of metagenomic reads with HMMs:

mamba create -n annotate_hmm
mamba activate annotate_hmm
mamba install anaconda::biopython
mamba install bioconda::clustalo
mamba install bioconda::hmmer
mamba install bioconda::raxml
mamba install bioconda::epa-ng
mamba install bioconda::gappa
mamba install conda-forge::r-castor
mamba install conda-forge::ete3
mamba install conda-forge::pandas

Setup on our Langille lab servers

These commands may vary depending on your operating system, but I am providing details of what has worked for setup for us.

Everything below is installed with a shared conda/Anaconda installation in /opt/anaconda3/ that is accessible to everyone, and all users are added to a group called anaconda.

Quality control programs (fastqc, multiqc, kneaddata)

conda create -n quality_control_feb2026
conda activate quality_control_feb2026

conda install bioconda::fastqc #version 0.12.1
conda install bioconda::multiqc #version 1.33
conda install bioconda::kneaddata #version 0.12.4
conda install conda-forge::parallel #version 20260122

QIIME2

Note that I have found it necessary/helpful to first set: conda config --set channel_priority flexible

conda env create \
  --name qiime2-amplicon-2026.1 \
  --file https://raw.githubusercontent.com/qiime2/distributions/refs/heads/dev/2026.1/amplicon/released/qiime2-amplicon-ubuntu-latest-conda.yml

MMSeqs2

Install:

conda create -n mmseqs2-18.8cc5c
conda activate mmseqs2-18.8cc5c

conda install bioconda::mmseqs2

Create UniRef90 database:

#Done on 2nd Feb 2026
mkdir mmseqs2_db #note that you should first navigate to wherever you'd like to install this!
mmseqs databases UniRef90 mmseqs2_db/UniRef90_2026-01 /tmp

Getting information on the gene lengths

In our functional annotation pipeline, we use gene lengths for normalisation and also map UniRef90 IDs to EC numbers. To do this, we have generated several files containing this information that are intended to be used with our scripts. The steps required to make them are detailed below, but note that some of the files are large and the steps took a while to run.

Get the UniRef90 fasta file so we can get gene length information (this file is ~40GB):

wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz

Get the file that the UniRef90 to EC mapping is taken from, and unzip it (note that this file is ~500GB uncompressed):

wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz
gunzip uniprot_trembl.dat.gz 

This followed conversation from here.

Get the EC number descriptions:

wget https://ftp.expasy.org/databases/enzyme/enzyme.dat

Make dictionary of EC number to descriptions (Python code):

import pickle

descriptions = {}
with open('enzyme.dat', 'r') as f:
  for row in f:
    row = row.replace('\n', '').split('   ')
    if row[0] == 'ID':
      this_id = row[1]
    if row[0] == 'DE':
      #if an entry has several DE lines, only the last one is kept
      descriptions[this_id] = row[1]

with open('EC_descriptions.dict', 'wb') as f:
  pickle.dump(descriptions, f)
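As a quick check of the parsing logic above, the same steps can be run on a few lines laid out like enzyme.dat. This is only a minimal sketch: the two-entry sample and the `parse_descriptions` helper are made up for illustration and are not part of our scripts.

```python
import pickle


def parse_descriptions(lines):
    """Map each EC number (ID line) to its description (DE line)."""
    descriptions = {}
    this_id = None
    for row in lines:
        row = row.rstrip('\n').split('   ')
        if row[0] == 'ID':
            this_id = row[1]
        elif row[0] == 'DE' and this_id is not None:
            descriptions[this_id] = row[1]
    return descriptions


# Hypothetical sample in the enzyme.dat layout (not real file contents)
sample = [
    "ID   1.1.1.1\n",
    "DE   Alcohol dehydrogenase.\n",
    "//\n",
    "ID   1.1.1.2\n",
    "DE   Alcohol dehydrogenase (NADP(+)).\n",
    "//\n",
]

descriptions = parse_descriptions(sample)

# Round-trip through pickle, in the same way as for EC_descriptions.dict
restored = pickle.loads(pickle.dumps(descriptions))
print(restored['1.1.1.1'])
```

Loading the real EC_descriptions.dict works the same way, using `pickle.load` on the file opened in 'rb' mode.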

Get gene lengths from the fasta file. Note that the sequences are amino acids, so each length is multiplied by 3 to give a length in nucleotides for working with metagenomic data:

from Bio import SeqIO
import pickle
import bz2
import gzip

lengths = {}
with gzip.open('uniref90.fasta.gz', 'rt') as f:
  for record in SeqIO.parse(f, "fasta"):
    lengths[record.name] = len(str(record.seq))*3

with bz2.BZ2File('GeneLength.pbz2', 'wb') as f:
  pickle.dump(lengths, f)
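To illustrate how a length dictionary like this might be used for normalisation, here is a minimal sketch that converts read counts to reads per kilobase. The IDs, counts, lengths and the `reads_per_kilobase` helper are all hypothetical, and reads per kilobase is an assumption chosen for illustration rather than necessarily the exact normalisation used in our scripts.

```python
import bz2
import os
import pickle
import tempfile


def reads_per_kilobase(counts, lengths):
    """Divide each read count by the gene length in kilobases.

    Genes without a known length are skipped.
    """
    rpk = {}
    for gene, count in counts.items():
        if gene in lengths:
            rpk[gene] = count * 1000 / lengths[gene]
    return rpk


# Hypothetical lengths: amino acid lengths multiplied by 3, as above
lengths = {'UniRef90_EXAMPLE1': 300 * 3, 'UniRef90_EXAMPLE2': 1000 * 3}
# Hypothetical read counts per UniRef90 ID
counts = {'UniRef90_EXAMPLE1': 9, 'UniRef90_EXAMPLE2': 6, 'UniRef90_MISSING': 4}

# Write and re-read the lengths in the same format as GeneLength.pbz2
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'GeneLength_example.pbz2')
    with bz2.BZ2File(path, 'wb') as f:
        pickle.dump(lengths, f)
    with bz2.BZ2File(path, 'rb') as f:
        loaded = pickle.load(f)

rpk = reads_per_kilobase(counts, loaded)
print(rpk)
```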

Get mapping from UniRef to EC:

import pickle
import bz2

mapping = {}
count = 0
with open('uniprot_trembl.dat', 'r') as f:
  for row in f:
    #print progress every 10,000 lines
    if count % 10000 == 0: print(count, len(mapping))
    count += 1
    this_row = row.replace('\n', '').split('   ')
    if this_row[0] == 'ID':
      this_id = this_row[1]
    elif 'EC=' in row:
      #if an entry has several lines containing EC=, the last one is kept
      mapping[this_id] = row.split('EC=')[1].split(' ')[0].replace(';\n', '')

new_mapping = {}
for up in mapping:
  new_mapping['UniRef90_'+up.split('_')[0]] = mapping[up]

with bz2.BZ2File('ECmapped_2026-01.pbz2', 'wb') as f:
  pickle.dump(new_mapping, f, protocol=0)
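Because the full file takes a long time to process, it can help to try the parsing on a few lines first. The sketch below applies the same logic to a small in-memory sample. The two records and the `parse_ec_mapping` helper are invented for illustration, and the sample only approximates the layout of the real flat file.

```python
def parse_ec_mapping(lines):
    """Return {UniRef90 ID: EC number} from UniProt flat-file style lines."""
    mapping = {}
    this_id = None
    for row in lines:
        fields = row.rstrip('\n').split('   ')
        if fields[0] == 'ID':
            # Entry name, e.g. A0A000_9ACTN
            this_id = fields[1]
        elif 'EC=' in row and this_id is not None:
            # Text after 'EC=' up to the first space, without a trailing ';'
            mapping[this_id] = row.split('EC=')[1].split(' ')[0].rstrip(';\n')
    # The part of the entry name before '_' is used to build the UniRef90 ID
    return {'UniRef90_' + up.split('_')[0]: ec for up, ec in mapping.items()}


# Hypothetical records (not real UniProt entries)
sample = [
    "ID   A0A000_9ACTN            Unreviewed;       394 AA.\n",
    "DE   SubName: Full=Hypothetical example protein;\n",
    "DE            EC=2.3.1.47 {ECO:0000313|EMBL:EXAMPLE};\n",
    "//\n",
    "ID   B0B111_ECOLX            Unreviewed;       120 AA.\n",
    "DE   SubName: Full=Another example protein;\n",
    "DE            EC=1.1.1.1;\n",
    "//\n",
]

ec_mapping = parse_ec_mapping(sample)
print(ec_mapping)
```

Note that an ID built this way only matches a UniRef90 cluster when the entry is the representative of that cluster.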

Add the previous mapping and the HUMAnN mapping (v3.6 utility mapping file) to the mapping file that we just made:

import pickle
import bz2

humann = 'map_level4ec_uniref90.txt'

#each row is an EC number followed by the tab-separated UniRef90 IDs that map to it
humann_map = {}
with open(humann, 'r') as f:
  for row in f:
    row = row.replace('\n', '').split('\t')
    ec = row[0]
    for uniref in row[1:]:
      humann_map[uniref] = ec
    
with bz2.BZ2File('ECmapped_2026-01.pbz2', 'rb') as f:
  new_mapping = pickle.load(f)

with bz2.BZ2File('/bigpool/shared/mmseqs2_db/UniRef90_Dhwani/ECmapped.pbz2', 'rb') as f:
  old_mapping = pickle.load(f)

#keeping priority for age of the databases - new_mapping that I made with 2026-01 download takes priority, then humann v3.6 mapping, then the old one that Dhwani put together
combined_mapping = old_mapping | humann_map | new_mapping

with bz2.BZ2File('ECmapped.pbz2', 'wb') as f:
  pickle.dump(combined_mapping, f, protocol=0)
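The priority order comes from how the `|` operator merges dictionaries (Python 3.9 or later): when a key appears more than once, the value from the right-most dictionary is kept. A minimal sketch with made-up IDs and EC numbers:

```python
# Hypothetical mappings; the IDs and EC numbers are invented for illustration
old_mapping = {'UniRef90_A': '1.1.1.1', 'UniRef90_B': '2.2.2.2'}
humann_map = {'UniRef90_B': '3.3.3.3', 'UniRef90_C': '4.4.4.4'}
new_mapping = {'UniRef90_C': '5.5.5.5'}

# Right-most dictionary wins: new_mapping > humann_map > old_mapping
combined = old_mapping | humann_map | new_mapping
print(combined)
# {'UniRef90_A': '1.1.1.1', 'UniRef90_B': '3.3.3.3', 'UniRef90_C': '5.5.5.5'}

# Equivalent merge for Python versions older than 3.9
combined_compat = {**old_mapping, **humann_map, **new_mapping}
```

Here UniRef90_B takes the HUMAnN value over the old one, and UniRef90_C takes the new value over the HUMAnN one.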

Kraken 2

mamba create -n kraken2-v2.17.1
mamba activate kraken2-v2.17.1
conda install bioconda::bracken #version
conda install bioconda::kraken2 #version 2.17.1
conda install conda-forge::parallel #version 20260122

GeCoCheck

cd /home/robyn/tools/genome_coverage_checker_all/v0.0.4
wget https://github.com/R-Wright-1/genome_coverage_checker/archive/refs/tags/v0.0.4.tar.gz
tar -xvf v0.0.4.tar.gz
cd genome_coverage_checker-0.0.4

conda env create --name GeCoCheck-v0.0.4 -f coveragechecker-env.yaml
conda activate GeCoCheck-v0.0.4
pip install --editable .

PICRUSt2

wget https://github.com/picrust/picrust2/archive/refs/tags/v2.6.3.zip
unzip v2.6.3.zip
cd picrust2-2.6.3/

conda env create -f picrust2-env.yaml
conda activate picrust2
pip install --editable .