prepare shotgun databases - mucosal-immunology-lab/microbiome-analysis GitHub Wiki
Double check first to see whether the shotgun databases you require have already been prepared on the cluster before running the steps below.
You require the following databases (at a minimum) to run the Sunbeam pipeline and assign taxonomy.
- Host genome(s) for decontamination
- Kraken databases for taxonomy
- Bracken databases (related to kraken2 – for abundance correction)
The recommended smux parameters for database preparation are:
# Start a new interactive session
smux n --time=7-00:00:00 --mem=32GB --cpuspertask=2 --ntasks=1 -J Build-Databases
Host genomes are required so that host reads can be removed before metagenomic analysis. Note that they must be located in a separate folder, be decompressed, and have the file extension .fasta.
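The folder requirements above (decompressed files ending in .fasta) can be checked and fixed with a small helper. The following is a minimal sketch, not part of the pipeline: the function name `normalise_host_genomes` is hypothetical, and it assumes downloads arrive as gzipped files or with the common `.fa` / `.fna` extensions.

```shell
# Hypothetical helper: make every genome file in a host folder
# decompressed and named with the .fasta extension.
normalise_host_genomes() {
    local host_dir="$1"
    local f
    shopt -s nullglob   # unmatched globs expand to nothing instead of themselves

    # Decompress gzipped files in place (gunzip removes the .gz original).
    for f in "$host_dir"/*.gz; do
        gunzip "$f"
    done

    # Rename common FASTA extensions to .fasta.
    for f in "$host_dir"/*.fa "$host_dir"/*.fna; do
        mv "$f" "${f%.*}.fasta"
    done

    shopt -u nullglob
}

# Example call (the path is a placeholder):
# normalise_host_genomes /path/to/host_genomes
```

Files that already end in .fasta are left untouched, so the helper can be re-run safely.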
For shotgun metagenomics data of human-derived samples, we will combine 2 genomes together to ensure maximum removal of human genetic material.
- CHM13: The Telomere-to-Telomere (T2T) consortium's CHM13 genome is the assembly produced by sequencing the CHM13hTERT human cell line with multiple technologies. The sequencing data included 30x PacBio HiFi, 120x Oxford Nanopore, 70x PacBio CLR, and 50x 10X Genomics, as well as BioNano DLS and Arima Genomics HiC. Its use was highlighted in a 2023 article by Gihawi et al. on the importance of host decontamination in shotgun microbiome data, which led to the retraction of a 2020 Nature publication.
- GRCh38 (1000 genomes): This genome is the GRCh38 reference genome from the 1000 genomes project.
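Conceptually, combining the two genomes means concatenating their FASTA files. The sketch below illustrates that idea only; it is not necessarily what prepare_human_genome.sh does. The function name `combine_fasta`, the labels, and the file names are all hypothetical, and prefixing the headers is an assumption made here so that sequence names shared by both assemblies (such as `chr1`) stay unique.

```shell
# Hypothetical helper: concatenate FASTA files into one output file,
# prefixing every sequence header with a label.
# Arguments: <output> followed by pairs of <label> <fasta>.
combine_fasta() {
    local out="$1"
    shift
    : > "$out"   # create or truncate the output file
    while [ "$#" -ge 2 ]; do
        local label="$1" fasta="$2"
        shift 2
        # Rewrite ">name" as ">label_name" and append to the output.
        sed "s/^>/>${label}_/" "$fasta" >> "$out"
    done
}

# Example call (file names are placeholders):
# combine_fasta human_combined.fasta CHM13 chm13.fasta GRCh38 grch38.fasta
```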
The image below provides a quick look at the CHM13 genome vs. GRCh38, but good news – the Y chromosome has been included with the CHM13 genome now!
 
To prepare a combined FASTA file from these two genomes, run the prepare_human_genome.sh script. Make sure you set the desired location for the database to be located.
# Run the human genome preparation script
bash prepare_human_genome.sh
Kraken2 will be used to assign taxonomy – we will go in-depth here and include archaeal, fungal, and viral reference libraries too. Another thing pointed out by Gihawi et al. in their 2023 paper was the importance of including host genomes in the kraken2 database. The rationale behind this is that even if some host genetic material remains in the data at the time of taxonomic assignment, it should be caught and assigned to the host instead of incorrectly annotated as a bacterial read.
As such, we will include human, mouse, and rat libraries into the kraken2 database. Even if you think you will just use human-derived samples, there's no harm in including the mouse and rat libraries just in case.
Prepare and run the following script, prepare_kraken2db.sh, to generate your database. Ensure you alter the KRAKEN2_DIR and KRAKEN2_DB variables to whatever you require.
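For orientation, a custom kraken2 build generally follows the pattern taxonomy, libraries, build. The sketch below only prints the `kraken2-build` commands rather than running them, and it is not the content of prepare_kraken2db.sh. The function name, the `THREADS` variable, the default values, and the FASTA paths passed as arguments are assumptions. Genomes without a named downloadable library (assumed here to include mouse and rat) would be added with `--add-to-library`, which expects FASTA headers that kraken2 can map to a taxonomy ID.

```shell
# Hypothetical dry run: print the kraken2-build commands for a custom database.
# Arguments: FASTA files to add on top of the named libraries.
print_kraken2_build_plan() {
    local db="${KRAKEN2_DB:-kraken2_db}"   # placeholder default
    local threads="${THREADS:-16}"         # placeholder default
    local lib genome

    echo "kraken2-build --download-taxonomy --db $db"

    # Libraries that kraken2-build can download by name.
    for lib in archaea bacteria fungi viral human; do
        echo "kraken2-build --download-library $lib --db $db"
    done

    # Extra genomes supplied as FASTA files.
    for genome in "$@"; do
        echo "kraken2-build --add-to-library $genome --db $db"
    done

    echo "kraken2-build --build --db $db --threads $threads"
}

# Example call (paths are placeholders):
# KRAKEN2_DB=/path/to/db print_kraken2_build_plan mouse.fasta rat.fasta
```

Printing the plan first makes it easy to review the library list before committing to downloads that can take many hours.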
Building the kraken2 database will take quite a long time, and will require more resources than we initially set for the interactive session. You therefore have the option of starting a new interactive smux session with more resources, or you can simply submit the job with sbatch using the command below.
# Install kraken2 and download/prepare the required databases
sbatch prepare_kraken2db.sh
Bracken is used to correct species abundances. Prepare and run the following script, prepare_brackendb.sh, ensuring you have set the directory variables to the right values.
Building the bracken database takes a lot of resources, and it is recommended to use 10-20 CPUs for the task. It will take many hours (or perhaps days) to generate the database using a single CPU. As such, either start a new interactive smux session with more resources, or submit the job via sbatch.
# Install bracken and prepare the database
sbatch prepare_brackendb.sh
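As a rough illustration of what a bracken build involves, the sketch below prints one `bracken-build` command per read length. It is not the content of prepare_brackendb.sh: the function name, the `THREADS` variable, the defaults, and the example read lengths are assumptions. The k-mer length of 35 matches the kraken2 default and should be changed if the kraken2 database was built with a different value.

```shell
# Hypothetical dry run: print a bracken-build command for each read length.
# Arguments: one or more read lengths (e.g. 100 150).
print_bracken_build_plan() {
    local db="${KRAKEN2_DB:-kraken2_db}"   # placeholder default
    local threads="${THREADS:-16}"         # placeholder default
    local kmer_len=35                      # kraken2's default k-mer length
    local read_len

    for read_len in "$@"; do
        echo "bracken-build -d $db -t $threads -k $kmer_len -l $read_len"
    done
}

# Example call (path and read lengths are placeholders):
# KRAKEN2_DB=/path/to/db print_bracken_build_plan 100 150
```

If the script itself does not request enough CPUs, options given on the sbatch command line (for example `--cpus-per-task=16`) override the `#SBATCH` directives inside it.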