Microbiome Helper 2 Annotation of reads contigs with CARD RGI - LangilleLab/microbiome_helper GitHub Wiki

Authors: Robyn Wright, Modifications by: NA

Please note: We are still testing/developing this so use with caution :)

Introduction

We are often interested in annotating our data with antibiotic resistance genes. Many bioinformatic tools have been designed for this purpose, but a drawback of many of them is that they aren't well maintained after they're developed. One of the most popular options is the Comprehensive Antibiotic Resistance Database (CARD) Resistance Gene Identifier (RGI). It has been shown to work well and is well maintained, with the database being updated regularly, so it is what we tend to use. RGI can be run with either reads or contigs; the length of the sequences we use typically doesn't matter.

1. Download the database

First, make a directory for this data to be stored:

mkdir card_data
cd card_data

Now we can download the data (following the instructions in the RGI documentation):

wget https://card.mcmaster.ca/latest/data --no-check-certificate

Extract card.json from the downloaded archive:

tar -xvf data ./card.json

Load the database and then find out what version the database is:

rgi load --card_json ./card.json --local
rgi database --version --local

We should see that this is version 3.2.9. Note that --version is an option that works with many tools.

Now we'll get another part of the database, the CARD variants ("wildcard") data, and extract it:

wget -O wildcard_data.tar.bz2 https://card.mcmaster.ca/latest/variants
mkdir -p wildcard
tar -xjf wildcard_data.tar.bz2 -C wildcard
gunzip wildcard/*.gz
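
Before moving on, it can be worth confirming that the extraction produced the files that the final rgi load step expects. The sketch below is not part of RGI: check_wildcard_files is a hypothetical helper name, and the three file names are simply the ones that the load command on this page refers to.

```shell
# Hypothetical helper (not an RGI command): report whether the wildcard files
# that the final `rgi load` command refers to are present and non-empty.
check_wildcard_files() {
  dir="${1:-wildcard}"
  missing=0
  for f in index-for-model-sequences.txt all_amr_61mers.txt 61_kmer_db.json; do
    if [ -s "$dir/$f" ]; then
      echo "found: $dir/$f"
    else
      echo "MISSING: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}

# Run from inside card_data; `|| true` keeps a script going if files are missing
check_wildcard_files wildcard || true
```

If anything is reported as missing, re-running the download and extraction is probably quicker than debugging the load step afterwards.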

Now we'll get the annotations:

rgi card_annotation -i localDB/card.json > card_annotation.log 2>&1
rgi wildcard_annotation -i wildcard --card_json localDB/card.json -v 3.2.9 > wildcard_annotation.log 2>&1

And finally load the database:

rgi load \
  --card_json localDB/card.json \
  --debug --local \
  --card_annotation card_database_v3.2.9.fasta \
  --wildcard_annotation wildcard_database_v3.2.9.fasta \
  --wildcard_index wildcard/index-for-model-sequences.txt \
  --wildcard_version 3.2.9 \
  --amr_kmers wildcard/all_amr_61mers.txt \
  --kmer_database wildcard/61_kmer_db.json \
  --kmer_size 61

Note that if you didn't see version 3.2.9 above, you should change the version number in the wildcard_annotation and load commands (including the .fasta file names) to reflect the version that you have downloaded (i.e. the one that is current!).
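
One way to avoid editing the version number in several places is to keep it in a single shell variable and build the file names from it. This is only a sketch: CARD_VERSION, CARD_FASTA and WILDCARD_FASTA are variable names chosen here for illustration, and the file name patterns are copied from the commands above.

```shell
# Assumption: set this to whatever `rgi database --version --local` printed
CARD_VERSION="3.2.9"

# File names follow the pattern used in the commands above
CARD_FASTA="card_database_v${CARD_VERSION}.fasta"
WILDCARD_FASTA="wildcard_database_v${CARD_VERSION}.fasta"

echo "$CARD_FASTA"
echo "$WILDCARD_FASTA"
```

These variables could then be passed to -v, --card_annotation, --wildcard_annotation and --wildcard_version in place of the hard-coded values.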

2. Run CARD RGI

Set up the folders:

cd ..
mkdir card_out

And then we'll run CARD RGI using parallel:

parallel -j 1 'rgi main -i {} -o card_out/{/.} -t contig -a DIAMOND -n 1 --include_loose --local --clean' ::: kneaddata_out/*.fastq

You can see a description of all of the options for yourself by typing rgi main --help, but there are some that might not be so obvious:

  • -t - whether the data input is contigs (also use this option for reads!) or proteins
  • -a - the alignment tool to use (DIAMOND or BLAST)
  • --include_loose - that we want to include loose hits in addition to strict and perfect hits
  • --local - that we want to use the local database (i.e., that we don't need to download the database)
  • --clean - that we want to remove temporary files when we're done
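
The {} and {/.} placeholders in the parallel command are filled in by GNU parallel: {} is the input file as given, and {/.} is its file name with the directory and the last extension removed. The function below mimics that naming with plain shell parameter expansion, so you can see which output prefix each input would get. rgi_out_prefix is a hypothetical helper name and the sample file names are made up for illustration.

```shell
# Hypothetical helper: mimic GNU parallel's {/.} to show the RGI output prefix
rgi_out_prefix() {
  base="${1##*/}"              # strip the directory, like {/}
  echo "card_out/${base%.*}"   # strip the last extension, like {/.}
}

# Made-up input names; in practice these would come from kneaddata_out/*.fastq
for f in kneaddata_out/sampleA.fastq kneaddata_out/sampleB.fastq; do
  echo "$f -> $(rgi_out_prefix "$f")"
done
```

If GNU parallel is installed, adding --dry-run to the parallel command prints the commands it would run without executing them, which is another way to check the output names.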

Once this finishes running, we can make a heatmap:

rgi heatmap --input card_out/ --output card_heatmap

In these heatmaps, yellow represents a perfect hit, teal represents a strict hit, and purple represents no hit.

You can search for these genes on the CARD website to find out some more information about them.
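
Alongside the heatmap, it can be handy to tally how many perfect, strict and loose hits each sample has. The sketch below assumes that each run wrote a tab-separated .txt file into card_out with a header row containing a Cut_Off column; count_cutoffs is a hypothetical helper name, not an RGI command, so check the column name against your own output before relying on it.

```shell
# Hypothetical helper: count hits per Cut_Off category in one RGI tabular output.
# Assumes a tab-separated file whose header row contains a "Cut_Off" column.
count_cutoffs() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Cut_Off") col = i; next }
    col     { n[$col]++ }
    END     { for (k in n) print k "\t" n[k] }
  ' "$1" | sort
}

for f in card_out/*.txt; do
  [ -e "$f" ] || continue   # skip if no outputs exist yet
  echo "== $f"
  count_cutoffs "$f"
done
```

Looking the column up by name rather than by position keeps the sketch working if the column order differs between RGI versions.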