Basic Metagenomics Workflow
We use the "Minoche" filtering method from illumina-utils (https://github.com/merenlab/illumina-utils), followed by adaptor removal with cutadapt (https://cutadapt.readthedocs.io/en/stable/) or Trimmomatic (the example script below uses Trimmomatic's ILLUMINACLIP step) to make sure there are no Illumina adaptor sequences left on the raw reads (ICBR does NOT remove adaptors). To begin, create a config file for each of your samples. Stephanie wrote a handy little script to make the config files for you from the sequencing file names, which is particularly handy if you have many libraries to process.
This is an example config file. Change "project_name" to your sample name and replace the email and full paths as appropriate for your data. When sequencing on the NextSeq, you will get 4 R1 and 4 R2 files for each sample (one per lane); simply list them all in the config file (the example below lists two lanes). The output of quality filtering will be a single R1 and a single R2 file for the sample. Repeat for each metagenome library.
[general]
project_name = 14-73
researcher_email = [email protected]
input_directory = /blue/juliemeyer/stephlei/ToxicCyanos/
output_directory = /blue/juliemeyer/stephlei/ToxicCyanos/
[files]
pair_1 = 14-73_S9_L001_R1_001.fastq.gz, 14-73_S9_L002_R1_001.fastq.gz
pair_2 = 14-73_S9_L001_R2_001.fastq.gz, 14-73_S9_L002_R2_001.fastq.gz
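Stephanie's script isn't reproduced here, but a minimal sketch of that kind of generator might look like this (assuming file names like 14-73_S9_L001_R1_001.fastq.gz, one sample per S-number, and reads in the current directory):
#!/bin/bash
# write one illumina-utils config file per sample, listing all lanes of R1 and R2
for r1 in *_L001_R1_001.fastq.gz; do
    sample=${r1%%_S*}                       # e.g. 14-73
    {
        printf '[general]\nproject_name = %s\n' "$sample"
        printf 'researcher_email = %s\n' "[email protected]"   # replace with your email
        printf 'input_directory = %s\noutput_directory = %s\n' "$PWD/" "$PWD/"
        printf '[files]\n'
        printf 'pair_1 = %s\n' "$(ls "${sample}"_S*_R1_001.fastq.gz | paste -sd, -)"
        printf 'pair_2 = %s\n' "$(ls "${sample}"_S*_R2_001.fastq.gz | paste -sd, -)"
    } > "config_${sample}.txt"
done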
Note: using only Trimmomatic may also work well; assess how many reads you lose during quality filtering to decide whether you need a finer or a coarser tool.
The memory per CPU in this example is the maximum value for our research subscription. If you need more memory, you have the option to submit a script with "burst" (basically a low-priority, but high-memory, scheduler); see the UFRC page here for more details: https://help.rc.ufl.edu/doc/Account_and_QOS_limits_under_SLURM. Also note that I collect QC data on the raw reads at the beginning, on the Minoche-filtered reads in the middle (Minoche filtering automatically produces a stats report), and on the Trimmomatic-processed reads at the end. Best practice is to keep all of this QC data, as you should report, at a minimum, how many reads went into your assemblies.
#!/bin/bash
#SBATCH --job-name=QC_%j
#SBATCH --output=QC_%j.log
#SBATCH --error=QC_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user [email protected]
#SBATCH --mem-per-cpu=100gb
#SBATCH --time=4-00:00:00
#SBATCH --account=juliemeyer
#SBATCH --qos=juliemeyer-b
module load fastqc
fastqc *.fastq.gz
### don't forget to first create config files for each library; put the config files in the same directory as your reads
module load illumina-utils/2.11
iu-filter-quality-minoche --visualize-quality-curves config_14-73.txt
iu-filter-quality-minoche --visualize-quality-curves config_16-8.txt
gzip *.fastq
#remove illumina adaptors
module load trimmomatic
trimmomatic PE -phred33 -summary 14-73_trim_stats.txt 14-73-QUALITY_PASSED_R1.fastq.gz 14-73-QUALITY_PASSED_R2.fastq.gz 14-73_trimmed_paired_R1.fastq.gz 14-73_trimmed_unpaired_R1.fastq.gz 14-73_trimmed_paired_R2.fastq.gz 14-73_trimmed_unpaired_R2.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-PE.fa:2:30:10 ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 -summary 16-8_trim_stats.txt 16-8-QUALITY_PASSED_R1.fastq.gz 16-8-QUALITY_PASSED_R2.fastq.gz 16-8_trimmed_paired_R1.fastq.gz 16-8_trimmed_unpaired_R1.fastq.gz 16-8_trimmed_paired_R2.fastq.gz 16-8_trimmed_unpaired_R2.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-PE.fa:2:30:10 ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
fastqc *trimmed_paired_R*.fastq.gz
Use the program nonpareil on the quality-filtered reads (and, if applicable, after removing reads that map to host/algal symbiont genomes).
# nonpareil needs uncompressed fastq input
gunzip *.fastq.gz
module load nonpareil/3.3.4
# -s input reads, -T algorithm (kmer), -f input format, -b output prefix
nonpareil -s McAH_nonclad_1.fastq -T kmer -f fastq -b McAH
gzip *.fastq
My current favorite assembler is megahit, but I have also used IDBA-UD and metaSPAdes. Here is an example script for megahit. A decent-sized metagenome (e.g., ~3 GB each for the quality-filtered, adaptor-trimmed R1 and R2 fastq files, or ~40 million reads) will probably take a few days to assemble. Follow the assembly with quast to get an idea of how well the assembly worked. You are looking for fewer, longer contigs rather than tons of little contigs (<5 kbp). A good metagenome assembly will have a longest contig in the hundreds of thousands of base pairs.
#!/bin/bash
#SBATCH --job-name=metag_%j
#SBATCH --output=metag_%j.log
#SBATCH --error=metag_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user [email protected]
#SBATCH --mem-per-cpu=100gb
#SBATCH --time=4-00:00:00
#SBATCH --account=juliemeyer
#SBATCH --qos=juliemeyer-b
module load megahit
module load quast
megahit -1 14-73_trimmed_paired_R1.fastq.gz -2 14-73_trimmed_paired_R2.fastq.gz -o 14-73_megahit_out/ --min-contig-len 1000
quast.py -o quast_14-73_Megahit 14-73_megahit_out/final.contigs.fa
megahit -1 16-8_trimmed_paired_R1.fastq.gz -2 16-8_trimmed_paired_R2.fastq.gz -o 16-8_megahit_out/ --min-contig-len 1000
quast.py -o quast_16-8_Megahit 16-8_megahit_out/final.contigs.fa
At a minimum, keep the following for each metagenome:
- original sequencing reads
- final quality-filtered reads used for assembly
- fastqc of original raw sequencing reads
- fastqc of final quality-filtered reads used for assembly
- nonpareil estimates of sequencing coverage
- quast report (pdf)
- fasta of assembled genome (rename it with strain name, for example, "MM17-34_contigs.fasta" instead of "contigs.fasta")
There are many ways to analyze metagenomic data, but the first thing you will most likely want to do with the assemblies is pull out metagenome-assembled genomes (MAGs). This will complement analysis of the assembled metagenomes. Hopefully you will get a few high-quality MAGs for each metagenome, from the most dominant organisms in your sample.
FYI, I tried BinSanity extensively and it takes much longer than MetaBAT, with more confusing output, so for now my favorite binner is MetaBAT. The first step is to simplify the definition lines of the fasta file containing the assembled metagenome before mapping; the mapping may be used for several downstream analyses, including anvi'o, so keep the sorted bam file.
If you are using IMG-annotated fastas for mapping, you can simplify the definition lines with a simple sed command in place of the binsanity simplify-fasta command in the script below, thus preserving the IMG contig numbers for future reference. Here, I am fixing the definition lines of the fasta nucleic acid (fna) file from the IMG download and giving it a name more informative to me, ending in .fa (which is needed for anvi'o). Do this for each fasta file:
# keep only the text after the last ": " in each defline, e.g. ">2574179787 : contig_1" becomes ">contig_1"
sed 's/>.*: />/g' 2574179787.fna > E_2574179787.fa
Then map the reads to the assembled metagenome so that MetaBAT has coverage data to help parse out MAGs. Additional information on MetaBAT is here: https://bitbucket.org/berkeleylab/metabat/wiki/Best%20Binning%20Practices
#!/bin/bash
#SBATCH --job-name=map_bin_%j
#SBATCH --output=map_bin_%j.log
#SBATCH --error=map_bin_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user [email protected]
#SBATCH --mem-per-cpu=100gb
#SBATCH --time=4-00:00:00
#SBATCH --account=juliemeyer
#SBATCH --qos=juliemeyer-b
# simplify the definition lines of the fasta file
module load binsanity
simplify-fasta -i /blue/juliemeyer/stephlei/ToxicCyanos/14-73_contigs.fa -o /blue/juliemeyer/stephlei/ToxicCyanos/14-73_simple.fasta
# map sequencing reads to assembly with simplified definition line
module load bowtie2
mkdir 14-73_map/
cd 14-73_map/
bowtie2-build -f ../14-73_simple.fasta map_14-73
bowtie2 -q --phred33 -x map_14-73 -1 ../14-73_trimmed_paired_R1.fastq.gz -2 ../14-73_trimmed_paired_R2.fastq.gz -S map_14-73.sam
module load samtools
samtools view -bS -o map_14-73.bam map_14-73.sam
samtools sort map_14-73.bam -o map_14-73_sort.bam
samtools index map_14-73_sort.bam
cd ../
module load metabat/2.13
jgi_summarize_bam_contig_depths --outputDepth 14-73_depth.txt /blue/juliemeyer/stephlei/ToxicCyanos/14-73_map/*sort.bam
# bin against the simplified fasta so contig names match the depth file
metabat2 -i 14-73_simple.fasta -o metabat_14-73 -a 14-73_depth.txt -t 24
mkdir metabat_14-73_fastas
mv metabat*fa metabat_14-73_fastas
module load checkm
checkm lineage_wf -t 8 -x fa /blue/juliemeyer/stephlei/ToxicCyanos/metabat_14-73_fastas/ /blue/juliemeyer/stephlei/ToxicCyanos/metabat_14-73_fastas/checkm/
Keep the sorted bam file from the mapping. You will also have a directory of fasta files, one per MAG (bin). The checkm summary gives the broad taxonomy of each MAG, plus the completeness and contamination of the genome (a population genome). The map_bin_[jobid].log file is the summary of the checkm process; I copy it to a more meaningful name. This works from the command line, but doesn't seem to work within the script:
cp map_bin_*.log HH0818_checkm.txt
A very useful resource for confirming the quality and taxonomy of MAGs, beyond what checkm tells you, is the MiGA Online tool (http://enve-omics.ce.gatech.edu:3000/). Create a free account to save your results.
What you do from this point will depend on the goals of your project. First, I like to assess how good the sequencing was with nonpareil. This gives an estimate of the coverage of your sequencing effort through a rarefaction analysis of your metagenome reads; in other words, did I sequence enough to capture the diversity present in my sample? Submit the quality-controlled R1 fastq.gz file, along with your email address and an informative job name, to the online portal here: http://enve-omics.ce.gatech.edu/nonpareil/. I save the coverage graph by right-clicking it and report the coverage estimate as part of my metagenome stats.
Next, you will likely want to annotate the assembled metagenomes and/or MAGs. I prefer to do this through IMG/MER (https://img.jgi.doe.gov/cgi-bin/mer/main.cgi), where I can keep my data private (and shareable within our group) for up to 2 years, but can also easily compare to publicly available genomes and metagenomes with a convenient web interface. Sign up for a free account with your UF email and you are good to go. You can also annotate metagenomes with MG-RAST, or genomes/MAGs with RAST. I personally don't find MG-RAST very user-friendly or sophisticated, but it can be convenient for a fast, quick-and-dirty look at the data. I like RAST annotations for genomes/MAGs because you get a hierarchical functional classification that is easy to compare across groups/organisms: while individual genes may not be shared across genomes, the functions may still be shared but accomplished by different genes, and having that hierarchy lets me look at broader functions, not just specific genes. Lastly, you can also annotate metagenomes or genomes from the command line (it is part of a typical anvi'o workflow) using Prokka.
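For example, here is a minimal Prokka run on one MAG (the module name and bin file name are illustrative; see prokka --help for the full set of options):
module load prokka
# annotate a single bin; set --outdir and --prefix per MAG
prokka --outdir prokka_metabat_14-73.1 --prefix metabat_14-73.1 metabat_14-73_fastas/metabat_14-73.1.fa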
Anvi'o is a visualization tool for looking at both metagenomes and genomes. It can be used as an alternative binning method for metagenomes, or as a way to look at bins you've generated in other programs (I haven't used it this way yet, so I don't have much to say on it). See the anvi'o tutorial on metagenomes here: http://merenlab.org/2016/06/22/anvio-tutorial-v2/
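As a minimal sketch of the first steps in that workflow (command names are from the anvi'o tutorial; the module name and sample names here are assumptions, and this is why we kept the simplified deflines and the sorted bam):
module load anvio
# build a contigs database from the assembly with simplified deflines
anvi-gen-contigs-database -f 14-73_simple.fasta -o 14-73_contigs.db -n 'metagenome_14_73'
# profile the sorted, indexed bam from the mapping step above
anvi-profile -i 14-73_map/map_14-73_sort.bam -c 14-73_contigs.db -o 14-73_profile -S sample_14_73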
You may also want to do comparative genomics of MAGs and/or whole genomes ("pangenome"-type analysis), which is well facilitated by anvi'o. An example of this is my BBD metagenomes paper (https://www.frontiersin.org/articles/10.3389/fmicb.2017.00618/full). See the anvi'o tutorial on pangenome analysis: http://merenlab.org/2016/11/08/pangenomics-v2/
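In outline, following that tutorial (the external-genomes file is something you create yourself, mapping genome names to the contigs databases made above):
# external-genomes.txt: two-column table of genome names and contigs database paths
anvi-gen-genomes-storage -e external-genomes.txt -o MYPAN-GENOMES.db
anvi-pan-genome -g MYPAN-GENOMES.db -n MYPAN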
You might want to know the abundance of specific OTUs in a metagenomic dataset using the quality-controlled reads (rather than the assembled metagenome, which has lost the abundance information); for that, see SingleM. Or you may want to know the phylogenetic makeup of your assembled metagenome; for that, see GraftM.
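For SingleM, a minimal sketch (the module name is an assumption, and the flag names follow older SingleM releases, so check singlem pipe -h for your version):
module load singlem
# build an OTU table of single-copy marker genes straight from the quality-filtered reads
singlem pipe --sequences 14-73_trimmed_paired_R1.fastq.gz --otu_table 14-73_otu_table.csv --threads 4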
For GraftM, here I want to see the phylogenetic makeup of the reads (just R1) from 15 metagenome libraries:
module load graftm
graftM graft --forward McAH-QUALITY_PASSED_R1_cut.fastq.gz McAI-QUALITY_PASSED_R1_cut.fastq.gz McBH-QUALITY_PASSED_R1_cut.fastq.gz McBI-QUALITY_PASSED_R1_cut.fastq.gz McD1-QUALITY_PASSED_R1_cut.fastq.gz McD2-QUALITY_PASSED_R1_cut.fastq.gz McD3-QUALITY_PASSED_R1_cut.fastq.gz McD4-QUALITY_PASSED_R1_cut.fastq.gz McD6-QUALITY_PASSED_R1_cut.fastq.gz McEH-QUALITY_PASSED_R1_cut.fastq.gz McEI-QUALITY_PASSED_R1_cut.fastq.gz McGH-QUALITY_PASSED_R1_cut.fastq.gz McGNI-QUALITY_PASSED_R1_cut.fastq.gz McLGH-QUALITY_PASSED_R1_cut.fastq.gz McLGI-QUALITY_PASSED_R1_cut.fastq.gz --output_directory graftM_SCTLD --graftm_package /ufrc/data/reference/graftm/7/7.07.ribosomal_protein_L2_rplB.gpkg
You could also choose other gene databases, like nifH. Look around in the /ufrc/data/reference/graftm/7/ directory for the other options.
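For example, to list the available gene packages:
ls /ufrc/data/reference/graftm/7/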
This list is not meant to be exhaustive, just to give ideas of where to start. I encourage lab members to add more ideas/tools that they find useful.
Heatmaps showing relative abundance of functions in metagenomes, separated by samples or by taxonomy; probably a very overused visualization, but it is a convenient way to show a lot of data in a small space. Download annotation data from IMG/MER and/or RAST and use R to organize, visualize, do stats, and plot heatmaps.
Bubble plots are an alternative to heatmaps, where bubble size correlates with abundance and bubbles can be colored by sample type, taxonomy, etc. Again, use R for preparing/analyzing/plotting data.
Phylogenetic trees of particular genes of interest: pull out the full-length genes from your annotated metagenomes or MAGs, along with sequences from other groups for comparison. I use the free software MEGA (https://www.megasoftware.net/) to align sequences and to make trees from either nucleic acid sequences or amino acid sequences.
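For example, once an annotation gives you the contig and coordinates of a gene of interest, you can pull the nucleotide sequence out of the assembly with samtools (the contig name and coordinates below are made up for illustration):
module load samtools
# index the assembly once, then extract a region as contig:start-end
samtools faidx 14-73_contigs.fa
samtools faidx 14-73_contigs.fa k141_1029:3450-4920 > gene_of_interest.fna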
Gene neighborhoods/gene synteny: for functions achieved by gene clusters, how conserved are the operons across species? I haven't made this type of figure for several years, so I'm not sure of the best program/method for generating it. For an example, see Figure 5 in this recent paper: https://rdcu.be/bDJnU
GToTree: a new tool for "phylogenomics", basically building phylogenetic trees from whole genomes/MAGs. I have not played around with this yet, but it looks like a very promising tool that uses a lot of files you may have already generated in the course of binning and anvi'o. See the GToTree GitHub repository and paper.
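Based on the GToTree documentation, a minimal invocation might look like this (the module name and input list file are assumptions; -H picks the single-copy gene HMM set):
module load gtotree
# our_mag_fastas.txt: paths to genome/MAG fasta files, one per line
GToTree -f our_mag_fastas.txt -H Bacteria -j 4 -o GToTree_out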