
Anvi'o pipeline

Anvi'o allows for analysis of the co-assembled shotgun sequencing reads by creating metagenome-assembled genomes (MAGs) and providing tools for subsequent analysis.


Installation

Anvi'o provides detailed installation instructions here. You will want to have this installed on both the cluster and your local computer.

Once you have followed the instructions, whenever you want to use the Anvi'o tools, simply activate the environment.

conda activate anvio-7.1

Perform Sunbeam pipeline assembly and coassembly steps

If your files are too large, or you have too many samples, you may find that you can't run the coassembly step on the genomics partition. You will know this is the case if the job appears to stall after a long time – in reality, the job has been killed at the 4-hour time limit.

If this is the case, simply remove the --partition and --qos flags, and then increase the time limit.

# Assembly with Megahit - assembles reads to produce longer contiguous sequences
sunbeam run --configfile sunbeam_config.yml --cluster "sbatch --job-name=sunbeam_all_assembly --account=of33 --time=04:00:00 --mem-per-cpu=8G --ntasks=1 --cpus-per-task=20 --partition=genomics --qos=genomics" -j 8 -w 60 -p all_assembly --max-jobs-per-second 1 --keep-going

# Coassembly - produces a single file with all contiguous sequences
sunbeam run --configfile sunbeam_config.yml --cluster "sbatch --job-name=sunbeam_all_coassembly --account=of33 --time=04:00:00 --mem-per-cpu=8G --ntasks=1 --cpus-per-task=20 --partition=genomics --qos=genomics" -j 8 -w 60 -p --use-conda all_coassemble --rerun-incomplete --max-jobs-per-second 1 --keep-going
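
For example, if you need to drop the genomics partition as described above, the coassembly call might look like the following – the 12-hour limit here is just an assumption, so adjust the time (and memory) to suit your data:

# Coassembly without the genomics partition, with a longer time limit
sunbeam run --configfile sunbeam_config.yml --cluster "sbatch --job-name=sunbeam_all_coassembly --account=of33 --time=12:00:00 --mem-per-cpu=8G --ntasks=1 --cpus-per-task=20" -j 8 -w 60 -p --use-conda all_coassemble --rerun-incomplete --max-jobs-per-second 1 --keep-going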

Start a new Anvi'o folder (anvio) for your project

Anvi'o will create a lot of output files. Therefore it is a good idea to create a new folder specifically for these analyses.

We will create this new folder within the main project directory, i.e. the one that contains the sunbeam_config.yml and samples.csv files.

# While in the main project directory...
mkdir anvio; cd anvio

# Copy the coassembly into the new folder
cp ../sunbeam_output/assembly/coassembly/all_final_contigs.fa .

Generate contigs database

Firstly, we need to reformat the coassembly file, as Anvi'o requires simple headers. We can also filter by contig length at this stage using the -l argument. Then we will create a contigs database, which will keep all of the original contig information (ORFs, k-mer frequencies, functional and taxonomic information, etc.)

  • The -T argument in anvi-gen-contigs-database specifies the number of processors to use, while the -n argument allows you to give a project name.

Then we will run the hidden Markov models (HMM) to decorate our contigs database with hits from HMM models that ship with the platform (which, at this point, constitute multiple published bacterial single-copy gene collections).

# Reformat fasta file and filter for contigs >1000
anvi-script-reformat-fasta all_final_contigs.fa -o contigs-fixed.fa -l 1000 --simplify-names

# Create a contig database
anvi-gen-contigs-database -f contigs-fixed.fa -o contigs.db -T 12 -n 'Shotgun Sequencing'

# Run hmms (hidden markov models using single-copy core gene collections)
anvi-run-hmms -c contigs.db -T 12

At this stage you should transfer your outputs to a local computer (rather than the cluster) so you can use the interactive visualisation tools not available on the cluster.

The most efficient way to do this is with rclone – see here for installation instructions.

# {LOCAL}
# Navigate to local folder you want to transfer data into
cd [/path/to/local/anvio]

# Copy cluster files into the local folder
rclone copy -v massive:mf33_scratch/Matt/ShotgunSequencing/anvio/ .

Now that we have our files locally too, we can view the contig statistics.

# {LOCAL}
# Have a look at contig stats
anvi-display-contigs-stats contigs.db

Map reads to the coassembly

Among other things (like enabling variant detection), mapping our reads for each sample to the coassembly gives us coverage information for each contig in each sample, which will help us with our efforts to recover metagenome-assembled genomes (MAGs).

For this purpose we need to generate alignment BAM files for each of our individual samples.

There are two possibilities here. In theory you should be able to simply use the all_mapping tool from the Sunbeam pipeline; however, this does not always seem to work. Therefore, we discuss two potential options: the default Sunbeam option, and a work-around using bash loops.

Mapping reads with Sunbeam

To map your reads with Sunbeam, the code is simple, and follows the same pattern as previous steps, such as the QC and taxonomic assignment steps.

Before we can get this to work though, we need to edit the sunbeam_config.yml file to provide the location of the coassembly contigs-fixed.fa file. There is a section of this file called mapping, and it should be edited similar to the following example.

  • The samtools_opts: '-F 4' line instructs samtools to discard any unmapped reads and reduces overall file size.
mapping:
  suffix: mapping
  genomes_fp: /fs03/mf33/Matt/ShotgunSequencing/anvio/contigs-fixed.fa
  samtools_opts: '-F 4'
  threads: 16

From here, we can run the call to Sunbeam's all_mapping tool.

# All mapping - maps coassembly back to the original sequences
sunbeam run --configfile sunbeam_config.yml --cluster "sbatch --job-name=sunbeam_all_mapping --account=of33 --time=04:00:00 --mem-per-cpu=8G --ntasks=1 --cpus-per-task=20 --partition=genomics --qos=genomics" -j 8 -w 60 -p all_mapping --max-jobs-per-second 1 --keep-going

If this doesn't work, i.e. there is no subdirectory created within sunbeam_output called mapping, then proceed with the alternative method below.

Mapping reads with bash loops and bowtie2

We first need to use bowtie2 to create an index from our contigs-fixed.fa file. This tool should have been installed when you installed Anvi'o.

bowtie2-build contigs-fixed.fa assembly

Now we can loop through our samples and map our individual samples to the coassembly. We can do this with the following code. You may need to adjust the base file paths for R1s and R2s as required if the relative paths don't work for you.

The indexed coassembly should now be present in your current folder as assembly.1.bt2, assembly.2.bt2 etc. You need to use ./assembly for the -x argument of bowtie2. Or rather, if you have done things exactly as above, there is no need to change anything!

#!/bin/bash

# A simple loop to serially map all samples.

# How many threads should each mapping task use?
NUM_THREADS=8

# Create a directory for mapping results if it doesn't exist
mkdir -p mapping

# This assumes that your main Sunbeam folder is the parent folder 
# of your current "anvio" folder, and that you have not renamed 
# the samples.csv file created by sunbeam init
for sample in `awk -F, 'BEGIN {OFS=","} { print $1 }' ../samples.csv`
do
    if [ "$sample" == "sample" ]; then continue; fi
    
    # Make sure that "ls ../sunbeam_output/qc/decontam/$sample*_1*" returns R1 files for all your samples in samples.csv
    R1s=`ls ../sunbeam_output/qc/decontam/$sample*_1* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`
    R2s=`ls ../sunbeam_output/qc/decontam/$sample*_2* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`
    
    echo "Running:" $R1s
    echo "Running:" $R2s
    
    bowtie2 --threads $NUM_THREADS -x ./assembly -1 $R1s -2 $R2s --no-unal -S mapping/$sample.sam
    samtools view -F 4 -bS mapping/$sample.sam > mapping/$sample-RAW.bam
    anvi-init-bam mapping/$sample-RAW.bam -o mapping/$sample.bam
    rm mapping/$sample.sam mapping/$sample-RAW.bam
done

# Mapping is done, and we no longer need bowtie2-build files
rm *.bt2

A breakdown of the code (in case you want to edit this) is as follows:

bowtie2 options:

  • --threads (-p): specifies how many CPUs we want to use.
  • -x: the "basename" of the index we built before; in our case it is called "assembly".
  • -1/-2: the forward and reverse reads of the samples we're using.
  • --no-unal: tells bowtie2 not to record the reads that don't align, to save storage space.
  • -S: specifies the name of the output SAM file we want to use.

samtools options:

  • view: is the subprogram of samtools that we are calling.
  • -F 4: exclude alignments with the 4 bit set in the FLAG field – this is the marker for an unmapped read.
  • -b: output file in the BAM format.
  • -S: included just for backward compatibility in case the input is in SAM format, but samtools now detects the file type itself.

You may have also noticed that we snuck the anvi-init-bam step into the loop, which prepares our BAM files by sorting and indexing them 🤩. Conveniently, we then delete the intermediate files including the unsorted BAM files.

Decontaminate samples using negative controls

If you have negative controls for your shotgun sequencing data, then you can decontaminate your samples using the R decontam package. We will describe here the process for doing this.

Create contig count tables for each sample

Firstly we need to run a bash loop on our BAM files from the previous step to create a counts table for the contig matches as follows. Create the new bash script in your main project directory, and it will create a new folder within your mapping folder filled with whitespace-delimited counts of each contig for each sample.

#!/bin/bash

# Create a directory for mapping results if it doesn't exist
mkdir -p mapping/alignment_counts

# Generate contig counts
for FILE_NAME in `ls mapping/*.bam`
do
    FILE="${FILE_NAME##*/}"
    rm -f "mapping/alignment_counts/${FILE}_RNAME.txt"
    samtools view "mapping/${FILE}" > "mapping/alignment_counts/tmp.txt"
    echo "count  RNAME" >> "mapping/alignment_counts/${FILE}_RNAME.txt"
    cut -f 3 "mapping/alignment_counts/tmp.txt" | sort | uniq -c >> "mapping/alignment_counts/${FILE}_RNAME.txt"
    rm "mapping/alignment_counts/tmp.txt"
done

Use R to prepare a phyloseq object from your counts

We will read in each of the contig alignment counts into a list element, adding the sample name as a third column via the mutate function. We need to use read_table here because the way that the bash uniq -c function outputs the data includes an initial right-aligned column for the counts, and then a second left-aligned column for the contig IDs. Therefore the usual read_csv or read_tsv functions cannot handle this.

# Load the packages used in this section (assumed to already be installed)
library(here)
library(tidyverse)  # readr, dplyr, tidyr, tibble, ggplot2
library(phyloseq)
library(decontam)
library(ggsci)      # provides scale_fill_npg / scale_colour_npg
library(phylotools)

# Set base file path of alignment read counts
base_fp <- here('anvio', 'mapping', 'alignment_counts')

# Read files into a list
alignment_counts_filenames <- list.files(path = base_fp, pattern = '.txt')
alignment_counts_raw <- list()
for (file in alignment_counts_filenames) {
  sample_name <- gsub('.bam_RNAME.txt', '', file)
  alignment_counts_raw[[sample_name]] <- read_table(here(base_fp, file), col_types = 'nc') %>%
    mutate(sample = sample_name)
}

From here, we will combine the list elements via rbind using the do.call function, and pivot the data wider, which in one step gives us the count table structure 🎉. We will then replace all of the NA values (i.e. contigs that didn't appear in a sample) with 0. Next, creating a column with contig IDs as numeric data will let us reorder the contigs numerically – after ordering we can delete this column as we no longer need it.

# Reformat list into a single wide data.frame
alignment_counts <- do.call(rbind, alignment_counts_raw) %>%
  remove_rownames() %>%
  pivot_wider(names_from = 'sample', values_from = 'count') %>%
  replace(is.na(.), 0)  %>% # fill in all the gaps with zero
  mutate(contig_num = as.numeric(gsub('c_0*', '', RNAME))) %>% # create column to order contigs numerically
  arrange(contig_num) %>%
  dplyr::select(-contig_num) %>%
  column_to_rownames(var = 'RNAME')

We are assuming here that you have a set of metadata for your samples and negative controls, with rownames identical to the sample names used for sequencing. As such, we will reorder the samples to match the metadata order, double check that the sample names are identical, and then incorporate the data into a phyloseq object. For this purpose we also need to create the contig ID version of a phyloseq tax_table.

# Match column order to metadata
alignment_counts <- alignment_counts[, rownames(metadata_filtered)]
identical(colnames(alignment_counts), rownames(metadata_filtered))

# Generate a "taxonomy" table for the contig data
contig_df <- data.frame('contig_id' = rownames(alignment_counts), row.names = rownames(alignment_counts))

# Assemble contig counts into a phyloseq object
OTU <- otu_table(alignment_counts, taxa_are_rows = TRUE)
META <- sample_data(metadata_filtered)
sample_names(META) <- colnames(alignment_counts)
TAX <- tax_table(as.matrix(contig_df))

contig_data_raw <- phyloseq(OTU, META, TAX)

saveRDS(contig_data_raw, here(base_fp, 'contig_alignment_counts_ps.rds'))

Use R and decontam to identify contaminants

To do this, we will use the decontam package available from Bioconductor. This method uses the prevalence (presence/absence across samples) of each sequence feature in true positive samples compared to those in negative controls to identify contaminants. Because we want to be strict with this, we will use the special prevalence threshold value of 0.5, which identifies contaminants as anything more prevalent in negative controls than in true samples.

# Recover phyloseq object
base_fp <- here('anvio', 'mapping', 'alignment_counts')
contig_data_raw <- readRDS(here(base_fp, 'contig_alignment_counts_ps.rds'))

# Re-identify the negative controls
sample_data(contig_data_raw)$is_neg <- sample_data(contig_data_raw)$sample_type == 'negative_control'

# Run the isContaminant function
contamdf_prev <- isContaminant(contig_data_raw, method = 'prevalence', neg = 'is_neg', threshold = 0.5)
table(contamdf_prev$contaminant)

We can see how many contigs were identified as contaminants with the table function above, but we can also generate a plot to visualise this, as shown below.

# Make phyloseq object of presence-absence in negative controls and true samples
ps_pa <- transform_sample_counts(contig_data_raw, function(abund) 1 * (abund > 0))
ps_pa_pos <- subset_samples(ps_pa, is_neg == FALSE) # only true samples
ps_pa_neg <- subset_samples(ps_pa, is_neg == TRUE) # only negative controls

# Make data.frame of prevalence in positive and negative samples
df_pa <- data.frame('pa_pos' = taxa_sums(ps_pa_pos), 
                    'pa_neg' = taxa_sums(ps_pa_neg),
                    'contaminant' = contamdf_prev$contaminant)

# Visualise the contaminants
(decontam_plot <- ggplot(data = df_pa, aes(x = pa_neg, y = pa_pos)) + 
    geom_point(aes(colour = contaminant, fill = contaminant), 
               shape = 21, alpha = 0.5, 
               position = position_jitter(width = 0.05, height = 0.05)) +
    geom_abline(slope = 9, linetype = 2) +
    scale_fill_npg(name = 'Contaminant') +
    scale_colour_npg(name = 'Contaminant') +
    labs(title = 'Decontam results',
         subtitle = 'Prevalence method threshold: 0.5',
         x = 'Prevalence (Negative Controls)',
         y = 'Prevalence (True Samples)')
)
ggsave(here('figures', '01_decontam', 'contaminant_scatter.pdf'), decontam_plot,
       width = 18, height = 14, units = 'cm')

From here, we will remove the contaminant contigs from the phyloseq object, and save it to an RDS file.

# Remove contaminants
contig_data_clean <- prune_taxa(!contamdf_prev$contaminant, contig_data_raw)

# Keep only samples
contig_data_clean <- subset_samples(contig_data_clean, is_neg == FALSE)
saveRDS(contig_data_clean, here('output', '01_decontam', 'contig_data_clean_decontam.rds'))

Clean your contigs FASTA

Now that we have a list of all the remaining contigs we want to keep, we will filter our contigs-fixed.fa file, using the phylotools package to read and write the FASTA file.

# Read in the contigs file
contigs_fixed <- phylotools::read.fasta(file = here('anvio', 'contigs-fixed.fa'))

# Filter the contigs using the contig IDs from your decontaminated phyloseq
contigs_to_keep <- as.vector(tax_table(contig_data_clean)[,1])
contigs_fixed <- contigs_fixed %>%
  dplyr::filter(seq.name %in% contigs_to_keep)
saveRDS(contigs_fixed, here('output', '01_decontam', 'contigs_fixed_fasta_df.rds'))

# Rewrite the contigs back to FASTA format
phylotools::dat2fasta(contigs_fixed, here('anvio', 'contigs-clean.fa'))

Re-prepare your contigs database

Now that we have a clean contigs FASTA file, we need to go back and re-do the steps above, from generating the contigs database down to mapping the reads. We already have our cleaned counts data ready for use in R however, so we don't have to redo anything there.

You can quickly copy everything new back from your local computer to the cluster with rclone while you are inside your main Anvi'o project directory.

# Copy local files back into the cluster folder
rclone copy -v . massive:mf33_scratch/Matt/ShotgunSequencing/anvio/

Unless you prefer to delete the original files and start completely fresh, it is suggested to add -clean to the end of your file names. This will however require you to adjust the code above to make use of the new, cleaned files, e.g. contigs-clean.fa, contigs-clean.db, and a mapping-clean output folder.
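
As a rough sketch (re-using the earlier commands with the -clean naming convention), the first steps of the re-run would look something like this:

# Re-generate the contigs database from the decontaminated FASTA
anvi-gen-contigs-database -f contigs-clean.fa -o contigs-clean.db -T 12 -n 'Shotgun Sequencing'
anvi-run-hmms -c contigs-clean.db -T 12

# Re-build the bowtie2 index, then re-run the mapping loop from earlier,
# this time writing the BAM files to a mapping-clean folder
bowtie2-build contigs-clean.fa assembly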

Run NCBI COGs step

Let's head back to the cluster and run the NCBI clusters of orthologous genes (COGs) steps to annotate genes with functions.

# Install COGs database (only the first time)
anvi-setup-ncbi-cogs -T 12

# Run COGs to annotate genes with functions
anvi-run-ncbi-cogs -c contigs-clean.db -T 12 --sensitive

Extract protein sequences and eggNOG functional annotations

Do not use just yet... There seems to be an issue with the parser for the eggNOG-mapper outputs... 🤯 I'll come back to this later 🤔

We will now extract the translated protein sequences for our gene calls, and output the result to a new file called amino-acid-sequences.fa. Anvi'o also struggles to read eggNOG annotation outputs back in, so to help it match up the gene caller IDs, we need to add a g_ prefix to the start of each defline of the FASTA file.

# Get amino acid sequences
anvi-get-sequences-for-gene-calls -c contigs-clean.db --get-aa-sequences -o amino-acid-sequences.fa

# Because Anvi'o struggles to read eggNOG output, add a 'g_' prefix to the start of each defline
sed -i 's/>/>g_/' amino-acid-sequences.fa

Next, we can use this file with eggNOG-mapper to get an idea about the functionality of our gene calls, and then use another Anvi'o function, anvi-script-run-eggnog-mapper, to import the results. For eggNOG-mapper installation instructions, see here. You will need to make sure you have the eggNOG DIAMOND database installed as per their instructions.

#!/bin/bash
#SBATCH --job-name=eggnog_bac
#SBATCH --account=of33
#SBATCH --time=4:00:00
#SBATCH --mem-per-cpu=8G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --partition=genomics
#SBATCH --qos=genomics
mkdir -p emapper-results tmp
emapper.py -i amino-acid-sequences.fa --output amino-acid-sequences.fa --itype proteins -m diamond --data_dir /home/mmacowan/mf33/Databases/eggNOG --output_dir ./emapper-results --temp_dir ./tmp --cpu 12

Now we have the results, we can import them into our contigs-clean.db database.

# Use eggNOG database and emapper for functional annotation
anvi-script-run-eggnog-mapper -c contigs-clean.db \
  -T 12 \
  --annotation emapper-results/amino-acid-sequences.fa.emapper.annotations \
  --use-version 2.1.6

Extract protein sequences and GhostKOALA functional annotations

👻🐨

As with the eggNOG steps above, we will extract the translated protein sequences for our gene calls, and output the result to a new file called amino-acid-sequences-gk.fa. The parser used by GhostKOALA for reading in FASTA files throws an error if we provide it with deflines starting with numbers, so again, we'll add a prefix to help.

# Get amino acid sequences
anvi-get-sequences-for-gene-calls -c contigs-clean.db --get-aa-sequences -o amino-acid-sequences-gk.fa

# Add genecall_ to the start of each defline
sed -i 's/>/>genecall_/' amino-acid-sequences-gk.fa

The GhostKOALA tool will only accept files up to a maximum size of 300MB and 500,000 sequences. If your files are larger than this, you will need to separate them. One way to do this is using the csplit bash command. In the example below, if we have a file with about 600,000 gene calls, we could split the file in half using a REGEX pattern that will split the file at gene call 300,000:

csplit amino-acid-sequences-gk.fa /'>genecall_300000'/
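
By default, csplit writes the resulting pieces to files named xx00, xx01, and so on in the current directory, so you may want to rename them to something more recognisable before uploading each piece separately:

# Rename the csplit output pieces (names here are just suggestions)
mv xx00 amino-acid-sequences-gk-part1.fa
mv xx01 amino-acid-sequences-gk-part2.fa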

Now, submit the amino-acid-sequences-gk.fa file to the GhostKOALA webserver. Click the 'Choose File' button beneath the section that says 'Upload query amino acid sequences in FASTA format'.

You will then be asked to provide an email address. You will receive an email and have to click a provided link to actually submit your job.

Download the GhostKOALA parser and local tools

Clone the GhostKOALA parser repository from GitHub.

# Create a new GhostKOALA folder within your main anvio folder
mkdir GhostKOALA; cd GhostKOALA

# Clone the GitHub repository into this new folder
git clone https://github.com/edgraham/GhostKoalaParser.git

The KEGG orthology table

There is a file in the parser folder called KO_Orthology_ko00001.txt, the head of it looks like this:

Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K00844 HK; hexokinase [EC:2.7.1.1]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K12407 GCK; glucokinase [EC:2.7.1.2]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K00845 glk; glucokinase [EC:2.7.1.2]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K00886 ppgK; polyphosphate glucokinase [EC:2.7.1.63]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K08074 ADPGK; ADP-dependent glucokinase [EC:2.7.1.147]

Parsing the results from GhostKOALA

The file you download from GhostKOALA will be called user_ko.txt; go ahead and change it to something more identifiable, such as ghost-koala-results.txt and then move it into the GhostKOALA folder.
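
For example, assuming the results landed in your Downloads folder and you are working inside the GhostKOALA folder (adjust the paths as needed):

# Rename and move the downloaded GhostKOALA results
mv ~/Downloads/user_ko.txt ./ghost-koala-results.txt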

# Convert the KEGG annotations for use with Anvi'o
python GhostKoalaParser/KEGG-to-anvio \
  --KeggDB GhostKoalaParser/samples/KO_Orthology_ko00001.txt \
  -i ghost-koala-results.txt

# Rename the output file
mv KeggOrthology_Table1.txt ghost-koala-results-anvio.txt

Import the GhostKOALA functions

We can head back to the main Anvi'o folder now, and simply call the following function:

anvi-import-functions -c contigs-clean.db -i GhostKOALA/ghost-koala-results-anvio.txt

Profile your BAM files and then create a merged profile

Create the individual profiles from BAM files

In contrast to the contigs-db, an Anvi'o single-profile-db stores sample-specific information about contigs. Profiling a BAM file with Anvi'o using anvi-profile creates a single profile that reports properties for each contig in a single sample based on mapping results. Each single-profile-db links to a single contigs-db, and Anvi'o can merge single profiles that link to the same contigs database into a merged profile database (which will be covered later).

In other words, the profiling step makes sense of each BAM file separately by utilizing the information stored in the contigs database. It is one of the most critical (and also most complex and computationally demanding) steps of the metagenomic workflow.

To run this process for each of our samples, we will need to run the following bash script. You may want to change the name of the profiled folder to something like profiled_2500 if you think you may need to test multiple minimum contig lengths. 2500 is the recommended minimum length, and although you can try shorter lengths, NEVER go below 1000.

#!/bin/bash

# A simple loop to serially profile all samples.

# How many threads should each mapping task use?
NUM_THREADS=12

# Create a directory for mapping results if it doesn't exist
mkdir -p profiled

# Profile each file
for sample in `awk -F, 'BEGIN {OFS=","} { print $1 }' ../samples.csv`
do
    file="mapping-clean/$sample.bam"
    output_name="profiled_$sample"
    output_dir="profiled/$sample"
    
    echo $file
    
    anvi-profile -i $file \
    -c contigs-clean.db \
    --output-dir $output_dir \
    --sample-name $output_name \
    --min-contig-length=2500 \
    -T $NUM_THREADS
done

Merge the profiles

Next we will merge all of the Anvi'o profiles using the program anvi-merge. Change the main folder name if required, e.g. profiled_2500.

anvi-merge profiled/*/PROFILE.db -o MetaG-merge -c contigs-clean.db --enforce-hierarchical-clustering

Visualise your merged Anvi'o profile

You can visualise your merged Anvi'o profile by copying your data from the cluster to your local computer as above.
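
As before, something like the following should work locally (using the same example remote path as earlier – substitute your own):

# {LOCAL}
cd [/path/to/local/anvio]
rclone copy -v massive:mf33_scratch/Matt/ShotgunSequencing/anvio/ .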

Then, the following basic command will allow you to visualise your data, and do several things such as manual binning of your metagenomic sequences.

anvi-interactive -p MetaG-merge/PROFILE.db -c contigs-clean.db

Once you click the Draw button at the bottom left of the screen, you will be shown a circular plot, with the hierarchical clustering of your contigs in the centre, and your samples around the outside. Various summary statistics, such as mean coverage, are shown in the top-right 90° of the circle.

1) Manual binning

To manually bin your sequences, navigate to the Bins tab, and by trial and error select different branches in the phylogenetic tree. You will be shown the genome completeness and redundancy metrics to the right. Ideally you do not want to see it turn red, but keeping redundancy below 15% is probably okay for now; you can always filter this later, but we may miss out on information if we are too strict with lower depth sequencing runs.

You can click the Recalculate / Show Taxonomy for Bins button below the bins to check taxonomy and name your bins appropriately.

Make sure to save your bins using the Store bin collection button.

2) Binning with MetaBAT2

Head to https://bitbucket.org/berkeleylab/metabat/src/master/ to see the full installation instructions.

If you are working on a Linux system however, try:

sudo apt install metabat

Create a new directory in your Anvi'o project directory, generate a depth file, and then run MetaBAT2.

# Create a new directory
mkdir metabat; cd metabat

# Create the depth file
jgi_summarize_bam_contig_depths --outputDepth metabat_depth_file ../mapping-clean/*.bam

Now we can run MetaBAT2 itself.

# Run MetaBAT2
metabat2 -i ../contigs-clean.fa \
  -o metabat_binned_2500/metabat \
  -a metabat_depth_file \
  -m 2500 -v

3) Binning with Rosella

Do not use just yet... There seems to be an issue with this process at the moment... 🤯 I'll come back to this later 🤔

Head to https://rhysnewell.github.io/rosella/ to see the installation and running instructions.

# Create a new environment and install Rosella
conda create -n rosella -c bioconda rosella
conda activate rosella
rosella --version

# Install flight (a requirement for Rosella)
# On the cluster, you will need to use `pip3` instead of `pip`
pip install git+https://github.com/rhysnewell/flight.git#egg=flight-genome

Briefly, we will use the pre-computed depth file from MetaBAT2 to run Rosella. This will likely take some time, so consider running on the cluster (with the rosella environment activated).

#!/bin/bash
#SBATCH --job-name=rosella
#SBATCH --account=of33
#SBATCH --time=2-00:00:00
#SBATCH --mem-per-cpu=6G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24

rosella recover -i ../metabat/metabat_depth_file \
  -r ../contigs-clean.fa \
  --output-directory rosella_out \
  --threads 24

4) Binning with BusyBee

Similar to Rosella, this binning tool uses dimensionality reduction and clustering to construct contig bins. It is available as a web tool here.

Simply click Submit new job and provide the contigs-clean.fa co-assembly FASTA file and select your settings. Make sure you disable the Adjust FASTA names option so that you can easily map back to Anvi'o later. Enable the Taxonomic annotation option.

It is also recommended that you test different options in the Bootstrapping parameters. A good value for the minimum contig length is 1000, with a minimum probability of 0.05. Also try setting the Transformation option to CLR.

Once the tool has run, you can view your results, and download data via the provided link.

Import binning results

You will probably need some level of manipulation of the data in R to generate the import format that Anvi'o wants. Essentially you should have a .txt file with two tab-separated columns: the first column should contain the contig IDs, and the second column should contain the bin names.
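
As an illustration (the contig IDs and bin names here are hypothetical, and there should be no header row), the file would look something like this:

c_000000000001	Bin_1
c_000000000002	Bin_1
c_000000000003	Bin_2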

You can import the binning result similar to the following example (see more information here):

anvi-import-collection external-binning-results/BusyBee-bins.txt \
  -c contigs-clean.db \
  -p MetaG-merge/PROFILE.db \
  -C BusyBee \
  --contigs-mode

Comparing multiple binning results

If you have more than one set of binning results, you can combine them together for viewing using the following code as an example. This is required, because Anvi'o can only view one collection set at a time.

# Merge the binning results
anvi-script-merge-collections -c contigs-clean.db \
  -i external-binning-results/*.txt \
  -o binning-results.tsv

# View the binning results
anvi-interactive -p MetaG-merge/PROFILE.db \
  -c contigs-clean.db \
  -A binning-results.tsv

Manually curating automatic binning outputs

The manual curation steps we take now will help to improve our bins. The first step is to view the bins, and attempt to assign taxonomic classifications to them (by manually renaming the bins with predicted taxonomy). Then we can look at the genome completeness and redundancy.

# View bins and manually assign taxonomic names to the bins
anvi-interactive -p MetaG-merge/PROFILE.db \
  -c contigs-clean.db \
  --collection-autoload BusyBee

# View estimated genome completeness and redundancy
anvi-estimate-genome-completeness -p MetaG-merge/PROFILE.db \
  -c contigs-clean.db \
  -C BusyBee

At this stage we should merge bins with the same taxonomy, using the code below as an example.

anvi-merge-bins -p MetaG-merge/PROFILE.db \
  --collection-name BusyBee \
  --bin-names-list "Veillonella_1, Veillonella_2" \
  --new-bin-name Veillonella

If you were to check anvi-estimate-genome-completeness again, you would likely see that the completion percentage has increased, but most likely the redundancy would have increased a lot too.

So, what we can do now is refine the bins as shown below with our Veillonella example. This will bring up an interactive window where we can choose new bins that will replace the existing bin we are editing.

anvi-refine -p MetaG-merge/PROFILE.db \
  -c contigs-clean.db \
  -C BusyBee \
  -b Veillonella

WARNING: It is critical that you don't play the game of using single copy genes as the driver of identifying or refining your bins. In theory, you could hand-pick every split to achieve the highest completion percentage with 0% redundancy estimates. While this would look great, the purpose here is not to get these estimates to perfection, but rather to make sure that the binning is not completely inappropriate. Of course, you can remove groups of branches via the hierarchical clustering, but selecting small nodes simply to remove the redundant single copy genes should be avoided.

From here, we repeat the process to refine our bins until we are happy with the end result (or rather as happy as we can be).

Phylogenomics

The Anvi'o tutorial for this will go into more detail, and can be viewed here.

Selecting genes from an HMM profile

Putting genomes into a phylogenetic context is a common way of comparing them to each other. To do this, we will extract a FASTA file of concatenated single-copy core genes.

anvi-get-sequences-for-hmm-hits -c contigs-clean.db \
  -p MetaG-merge/PROFILE.db \
  -o seqs-for-phylogenomics.fa \
  --list-hmm-sources

This will show us the HMM sources, and will look something like:

AVAILABLE HMM SOURCES
===============================================
* 'Archaea_76' (76 models with 1,242 hits)
* 'Bacteria_71' (71 models with 2,184 hits)
* 'Protista_83' (83 models with 174 hits)
* 'Ribosomal_RNA_12S' (1 model with 0 hits)
* 'Ribosomal_RNA_16S' (3 models with 21 hits)
* 'Ribosomal_RNA_18S' (1 model with 0 hits)
* 'Ribosomal_RNA_23S' (2 models with 36 hits)
* 'Ribosomal_RNA_28S' (1 model with 0 hits)
* 'Ribosomal_RNA_5S' (5 models with 0 hits)

Since we are interested in bacteria, we will choose the Bacteria_71 dataset, and ask for a list of the available genes.

# Get a list of single-copy genes
anvi-get-sequences-for-hmm-hits -c contigs-clean.db \
  -p MetaG-merge/PROFILE.db \
  -o phylogenomics/seqs-for-phylogenomics.fa \
  --hmm-source Bacteria_71 \
  --list-available-gene-names

* Bacteria_71 [type: singlecopy]: ADK, AICARFT_IMPCHas, ATP-synt, ATP-synt_A,                                                           
Chorismate_synt, EF_TS, Exonuc_VII_L, GrpE, Ham1p_like, IPPT, OSCP, PGK,
Pept_tRNA_hydro, RBFA, RNA_pol_L, RNA_pol_Rpb6, RRF, RecO_C, Ribonuclease_P,
Ribosom_S12_S23, Ribosomal_L1, Ribosomal_L13, Ribosomal_L14, Ribosomal_L16,
Ribosomal_L17, Ribosomal_L18p, Ribosomal_L19, Ribosomal_L2, Ribosomal_L20,
Ribosomal_L21p, Ribosomal_L22, Ribosomal_L23, Ribosomal_L27, Ribosomal_L27A,
Ribosomal_L28, Ribosomal_L29, Ribosomal_L3, Ribosomal_L32p, Ribosomal_L35p,
Ribosomal_L4, Ribosomal_L5, Ribosomal_L6, Ribosomal_L9_C, Ribosomal_S10,
Ribosomal_S11, Ribosomal_S13, Ribosomal_S15, Ribosomal_S16, Ribosomal_S17,
Ribosomal_S19, Ribosomal_S2, Ribosomal_S20p, Ribosomal_S3_C, Ribosomal_S6,
Ribosomal_S7, Ribosomal_S8, Ribosomal_S9, RsfS, RuvX, SecE, SecG, SecY, SmpB,
TsaE, UPF0054, YajC, eIF-1a, ribosomal_L24, tRNA-synt_1d, tRNA_m1G_MT,
Adenylsucc_synt

From here we can select a subset of genes, say a bunch of ribosomal genes of interest, and output the concatenated genes. We will return only the best hit for each, and return the FASTA file as amino acids instead of DNA (which should provide better results for single-copy genes coming from distant clades).

anvi-get-sequences-for-hmm-hits -c contigs-clean.db \
  -p MetaG-merge/PROFILE.db \
  -o phylogenomics/seqs-for-phylogenomics.fa \
  --hmm-source Bacteria_71 \
  -C BusyBee1000 \
  --gene-names Ribosomal_L1,Ribosomal_L2,Ribosomal_L3,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6 \
  --concatenate-genes \
  --return-best-hit \
  --get-aa-sequences

Computing the phylogenomic tree

Now that we have our concatenated genes, we can perform the phylogenomic analysis. We can then view the resulting proper newick tree immediately in anvi-interactive manual mode. See extra information and viewing options here.

One useful viewing option is to switch from colours to bin names by selecting Text from Main > Layers > bin_name and re-clicking Draw.

# Compute a phylogenomic tree
anvi-gen-phylogenomic-tree -f phylogenomics/seqs-for-phylogenomics.fa \
  -o phylogenomics/phylogenomic-tree.txt

# View the tree in manual mode
anvi-interactive --tree phylogenomics/phylogenomic-tree.txt \
  -p temp-profile.db \
  --title "Phylogenomics of IGD Bins" \
  --manual

But! We can do a lot more with our phylogenetic tree, such as using it to reorganise our bins while showing their distribution across samples.

anvi-interactive -p MetaG-merge/PROFILE.db \
  -c contigs-clean.db \
  -C BusyBee \
  --tree phylogenomics/phylogenomic-tree.txt

Pangenomics analysis

Run KEGG KOfams

While we don't need this right now, we will need it later for the metabolism prediction steps, and doing so now means we don't need to re-split our genomes later.

anvi-run-kegg-kofams -c contigs-clean.db -T 8
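
If this is the first time you are using KOfams on this machine, you will first need to download the KEGG data (a one-time setup step, also referenced in the external-genomes section below):

# One-time setup of the KEGG KOfam data
anvi-setup-kegg-kofams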

Split your bins into separate profiles

Once you have your bins saved (with a certain name, e.g. bins_manual), you can use anvi-split to separate your sequences into their respective genomes. The -C argument is the collection name (in our case bins_manual) and the -o output argument should be the name of a new output folder (i.e. doesn't exist currently) and will contain a new folder for each bin name.

anvi-split -p MetaG-merge/PROFILE.db \
  -c contigs-clean.db \
  -C BusyBee \
  -o genomes_split

Generate an Anvi'o genomes storage

Follow the detailed instructions on generating a genomes storage here. Essentially you will want to turn external genome FASTA files into Anvi'o contig databases using anvi-gen-contigs-database, and populate the database with the most useful information (such as annotating genes with functions etc.). Store these in a separate pangenomics/external-genomes folder, with genomes from different species located in different folders for nice file organisation.

Key annotation tools include:

  • anvi-run-hmms: uses HMMs to annotate your genes against an hmm-source.
  • anvi-run-scg-taxonomy: associates single-copy core genes with taxonomic data.
  • anvi-scan-trnas: identifies the tRNA genes.
  • anvi-run-ncbi-cogs: tries to assign functions to your genes using the COGs database.

Moraxella catarrhalis example

In this example, our particular dataset has high estimated completeness and low redundancy for M. catarrhalis. As such, we will retrieve a set of public genomes for this bacterium, and create a contigs database for each.

Head to the NCBI taxonomy database and search for your species, in this case the NCBI taxonomy ID is 480. On the right side of the screen, click the Genome direct link. If there is a representative genome, download its genome in FASTA format. If there are additional genomes, click the RefSeq link in the line that reads Download sequence and annotation from RefSeq or GenBank. This will take you to the list of genomes at the FTP site, where you can select other genomes – you want to download the file with the suffix _genomic.fna.gz.
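
If you prefer to fetch the genomes from the command line, a minimal sketch might look like the following – the URL here is just a placeholder for the FTP link you selected above:

# Placeholder URL – substitute the FTP link for the genome you selected
wget <ftp-link-to-genome>_genomic.fna.gz
gunzip *_genomic.fna.gz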

Move the downloaded files into a taxa-specific folder within the external genomes folder, e.g. in pangenomics/external-genomes/moraxella-catarrhalis. From here, create a new bash script called prepare_external_genomes.sh in the external-genomes folder.

For reference, the $1 part of the script indicates that the first argument after the script call in bash is to be used here; in our case the name of the folder to be processed.

#!/bin/bash

for FILE_NAME in `ls $1/*.f*`
do
    # Prepare variables
    FILE="${FILE_NAME##*/}"
    SAMPLE="${FILE%.*}"
    echo "Running genome: ${SAMPLE}"
    # Generate contigs database
    echo "Fixing FASTA file names"
    anvi-script-reformat-fasta "${1}/${FILE}" -o "${1}/${FILE}-fixed" --simplify-names --seq-type NT
    echo "Generating contigs database..."
    anvi-gen-contigs-database -f "${1}/${FILE}-fixed" -o "${1}/${SAMPLE}.db"
    echo "Annotating genes via HMMs..."
    anvi-run-hmms -c "${1}/${SAMPLE}.db" --num-threads 8
    echo "Running SCG taxonomy..."
    anvi-setup-scg-taxonomy
    anvi-run-scg-taxonomy -c "${1}/${SAMPLE}.db"
    echo "Scanning for tRNAs..."
    anvi-scan-trnas -c "${1}/${SAMPLE}.db"
    echo "Running NCBI COGs annotation..."
    anvi-run-ncbi-cogs -c "${1}/${SAMPLE}.db" -T 8
    echo "Running KEGG KOfam annotation..."
    anvi-run-kegg-kofams -c "${1}/${SAMPLE}.db" -T 8
done

This will loop through the FASTA files and create populated databases for each. You can customise the number of threads for lines with -T. The loop will take quite some time depending on how many genomes you want to process (maybe ~3-5 minutes per genome).

Running the script from the external-genomes folder, you can simply add the folder name to the script call to process all external genomes. For example:

bash prepare_external_genomes.sh moraxella-catarrhalis

N.B. make sure you have set up your NCBI COGs (anvi-setup-ncbi-cogs) and KEGG KOfams (anvi-setup-kegg-kofams) before running.

Now that we have our prepared external databases, we need to set up two files – one for the external databases and one for the internal database. Create these files within a new folder, such as pangenomics/genomes-storage/moraxella-catarrhalis.

These should be tab-separated text files with the following formats (using our Moraxella example):

External databases

name contigs_db_path
M_catarrhalis_ASM106284 ../../external-genomes/moraxella-catarrhalis/m_catarrhalis_ASM106284v1.db
M_catarrhalis_ASM362611 ../../external-genomes/moraxella-catarrhalis/m_catarrhalis_ASM362611v1.db
M_catarrhalis_ASM937905 ../../external-genomes/moraxella-catarrhalis/m_catarrhalis_ASM937905v1.db

Internal database

name bin_id collection_id profile_db_path contigs_db_path
M_catarrhalis_local Moraxella_catarrhalis BusyBee ../../../MetaG-merge/PROFILE.db ../../../contigs-clean.db

Now we can use these two files to generate our genomes storage database.

anvi-gen-genomes-storage -i m-catarrhalis-internal-genomes.txt \
  -e m-catarrhalis-external-genomes.txt \
  -o M-catarrhalis-GENOMES.db

Computing and visualising the pangenome

Now that we have our genomes storage, we can characterise the pangenome.

anvi-pan-genome -g M-catarrhalis-GENOMES.db \
  -n Moraxella_catarrhalis \
  -o PAN \
  --num-threads 10

You can view the initial (ugly) pangenome display as follows.

anvi-display-pan -g M-catarrhalis-GENOMES.db \
  -p PAN/Moraxella_catarrhalis-PAN.db \
  --title "Moraxella catarrhalis Pangenome"

Adding average nucleotide identity

Pangenomic analysis offers insight into similarities and dissimilarities between genomes given the amino acid sequences of the open reading frames we identified. We could also compare the average nucleotide identity between genomes.

We can do this using the following function, and add this information to the pangenomic database.

anvi-compute-genome-similarity -e m-catarrhalis-external-genomes.txt \
  -i m-catarrhalis-internal-genomes.txt \
  --program pyANI \
  -o ANI \
  -T 8 \
  --pan-db PAN/Moraxella_catarrhalis-PAN.db

Functional enrichment analyses

We can now consider whether there are functional differences between genomes. To do so, we need a categorical variable to group by, and a tab-delimited text file describing which genomes go into which group.

name group
M_catarrhalis_52347_C01 genome_other
M_catarrhalis_ASM1060391 genome_other
M_catarrhalis_ASM362611 genome_other
M_catarrhalis_ASM937939 genome_other
M_catarrhalis_52347_E01 genome_other
M_catarrhalis_ASM106284 genome_other
M_catarrhalis_ASM937905 genome_other
M_catarrhalis_rep genome_rep
M_catarrhalis_local genome_local

Let's import this data.

anvi-import-misc-data -p PAN/Moraxella_catarrhalis-PAN.db \
  --target-data-table layers \
  additional-layers-data.txt

We can then compute functional enrichment. But keep in mind this is more useful if you have groups of genomes that are genuinely different (e.g. different species), rather than just different genomes of the same species.

anvi-compute-functional-enrichment-in-pan -p PAN/Moraxella_catarrhalis-PAN.db \
  -g M-catarrhalis-GENOMES.db \
  --category group \
  --annotation-source COG20_FUNCTION \
  -o functional-enrichment.txt

Metabolism prediction

To run these steps, you will first have to run anvi-run-kegg-kofams on your local contigs database. This should have been run earlier before genome splitting.

Prepare a metagenomes file

For metabolism prediction of a local genome AND those of external genomes of the same taxa, for example Moraxella catarrhalis, we need to prepare a tab-delimited genomes file pointing to the locations of each contigs database. Since we have already prepared them previously via the anvi-split function for our local genome and in the pangenomics section for the external genomes, we can prepare something like below.

To keep things tidy, we should first create a new folder to store metabolic predictions, and a sub-folder for the bacteria of interest.

# Create the new directory and change into it
mkdir metabolism metabolism/moraxella-catarrhalis
cd metabolism/moraxella-catarrhalis

# Create the metagenomes list file (tab-separated), with the contents shown below
touch metagenomes-list.txt
name contigs_db_path
M_catarrhalis_52347_C01 ../../pangenomics/external-genomes/moraxella-catarrhalis/m_catarrhalis_52347_C01.db
M_catarrhalis_ASM1060391 ../../pangenomics/external-genomes/moraxella-catarrhalis/m_catarrhalis_ASM1060391v1.db
M_catarrhalis_ASM362611 ../../pangenomics/external-genomes/moraxella-catarrhalis/m_catarrhalis_ASM362611v1.db
M_catarrhalis_ASM937939 ../../pangenomics/external-genomes/moraxella-catarrhalis/m_catarrhalis_ASM937939v1.db
M_catarrhalis_52347_E01 ../../pangenomics/external-genomes/moraxella-catarrhalis/m_catarrhalis_52347_E01.db
M_catarrhalis_ASM106284 ../../pangenomics/external-genomes/moraxella-catarrhalis/m_catarrhalis_ASM106284v1.db
M_catarrhalis_ASM937905 ../../pangenomics/external-genomes/moraxella-catarrhalis/m_catarrhalis_ASM937905v1.db
M_catarrhalis_rep ../../pangenomics/external-genomes/moraxella-catarrhalis/m_catarrhalis_rep_genome.db
M_catarrhalis_local ../../genomes_split/Moraxella_catarrhalis_1/CONTIGS.db

Estimate metabolism in the Moraxella catarrhalis genomes

Now we can run the anvi-estimate-metabolism program on the metagenomes from within the same folder we created above. This will look at the KOfam annotations in each genome and match them to KEGG module definitions to estimate the completeness of each module. And yes, the -e flag is technically the one for external genomes, but it does work in this case too where we want to compare our local genome with the external ones.

anvi-estimate-metabolism -e metagenomes-list.txt \
  -O Moraxella \
  --matrix-format

When the program has finished, there will be three output files:

  • A matrix of module completeness scores.
  • A matrix of module presence/absence (as binary 1 or 0 outputs) – the threshold for presence by default is 75%.
  • A matrix counting the number of hits to each KO in each genome.

Visualise module completion as a heatmap

We can view the module completeness matrix as a heatmap by generating a newick tree. For example:

anvi-matrix-to-newick Moraxella-completeness-MATRIX.txt

Then we use the --dry-run flag to ask anvi-interactive to give us a new profile database, and then run the interactive interface to view the module completion across our genomes.

# Dry run to get the profile database
anvi-interactive -d Moraxella-completeness-MATRIX.txt \
  -p Moraxella_metabolism_PROFILE.db \
  --manual-mode \
  --dry-run

# Now use the profile for interactive viewing
anvi-interactive --manual-mode \
  -d Moraxella-completeness-MATRIX.txt \
  -t Moraxella-completeness-MATRIX.txt.newick \
  -p Moraxella_metabolism_PROFILE.db \
  --title "Moraxella catarrhalis metabolism heatmap"

Because the KEGG module IDs aren't very informative, let's generate a useful miscellaneous data file to import into our profile database, which matches the module IDs to text descriptions.

# Determine where the MODULES.db is
export ANVIO_MODULES_DB=`python -c "import anvio; import os; print(os.path.join(os.path.dirname(anvio.__file__), 'data/misc/KEGG/MODULES.db'))"`

# Start an empty file
echo -e "module\tclass\tcategory\tsubcategory\tname" > modules_info.txt

# Get module classes
sqlite3 $ANVIO_MODULES_DB "select module, data_value from kegg_modules where data_name='CLASS'" | \
  sed 's/; /|/g' | \
  tr '|' '\t' >> module_class.txt

# Get module names
sqlite3 $ANVIO_MODULES_DB "select module, data_value from kegg_modules where data_name='NAME'" | \
  tr '|' '\t' > module_names.txt

# Join everything
paste module_class.txt <(cut -f 2 module_names.txt) >> modules_info.txt

# Remove the unneeded files
rm module_names.txt module_class.txt
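
You can quickly check that the resulting file looks sensible before importing it:

# Inspect the first few lines of the modules info file
head -n 5 modules_info.txt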

Now we can import this file into the profile database.

anvi-import-misc-data modules_info.txt \
  -p Moraxella_metabolism_PROFILE.db \
  -t items

Get more detailed metabolism output

This time we will try running the detailed metabolism output, using the long-format output.

anvi-estimate-metabolism -e metagenomes-list.txt \
  -O Moraxella_metabolism \
  --kegg-output-modes modules,kofam_hits

Citation

If you used this repository in a publication, please mention its URL.

In addition, you may cite the tools used by this pipeline:

  • Anvi'o: Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, Delmont TO. 2015. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ 3:e1319 https://doi.org/10.7717/peerj.1319

Rights

  • Copyright (c) 2023 Mucosal Immunology lab, Monash University, Melbourne, Australia.
  • Authors: M. Macowan