DIAMOND nr - ababaian/serratus GitHub Wiki

DIAMOND nr search (AWS EC2)

To quickly test a collection of sequences against the BLAST nr (non-redundant protein) database, we set up an AWS EC2 instance with a DIAMOND database.

The objective is to test whether a sequence is already known.

1. Launch EC2 instance

Launch an EC2 instance via the AWS Console with the parameters below. Use a c5n.xlarge for the initial, network-bound download steps, then switch to an r5d.4xlarge for building the index and running the search.

EC2 Set-up Parameters

  • OS: Amazon Linux 2 AMI (HVM) x86
  • ami: ami-0be2609ba883822ec
  • instance: c5n.xlarge // r5d.4xlarge
  • description: "c5n.xlarge (- ECUs, 4 vCPUs, 3.4 GHz, -, 10.5 GiB memory, EBS only)"
  • description: "r5d.4xlarge (16 vCPU 128 GB 2 x 300 NVMe SSD)"
  • storage: 450 GiB SSD (gp3)
  • encryption: false
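
The same parameters could also be passed to the AWS CLI instead of the Console. The following is only a sketch: it prints the `aws ec2 run-instances` command rather than running it, `KEY_NAME` is a hypothetical key pair name, and `/dev/xvda` is assumed to be the root device of this AMI. Networking and security group options are left out and would need to match your account.

```shell
# Sketch only: prints the AWS CLI command instead of executing it.
# KEY_NAME is a hypothetical key pair name; replace it with your own.
KEY_NAME="${KEY_NAME:-my-key}"

print_launch_cmd() {
  # $1: instance type, e.g. c5n.xlarge or r5d.4xlarge
  # /dev/xvda is assumed to be the root device name for this AMI.
  printf '%s\n' "aws ec2 run-instances \
--image-id ami-0be2609ba883822ec \
--instance-type $1 \
--count 1 \
--key-name $KEY_NAME \
--block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=450,VolumeType=gp3,Encrypted=false}'"
}

print_launch_cmd c5n.xlarge
```

Review the printed command before pasting it into a shell; the AMI ID is region-specific.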

2. Install DIAMOND

# From base amazon linux 2
sudo yum install -y docker git

# From `serratus-align` container
mkdir diamond; cd diamond

# Install diamond2 
# Libraries for building diamond2
sudo yum -y install git gcc gcc-c++ glibc-devel \
  cmake patch automake zlib-devel make

# grab latest with fix from Benjamin
git clone https://github.com/bbuchfink/diamond.git
cd diamond

mkdir bin; cd bin
cmake ..
make -j4
sudo cp ./diamond /usr/bin/diamond
sudo chmod 755 /usr/bin/diamond

3. Download nr database

As of 210721, the uncompressed nr database (as downloaded below) is 192GB. The download and decompression may take some time.

# DOWNLOAD BLAST DB - NR
mkdir -p ~/nr; cd ~/nr
wget -O - ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz \
 | pigz -d - \
 > nr.fa
 
# And taxonomy data
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
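
Because the uncompressed FASTA (192GB) and the DIAMOND database built in the next step (198GB) sit on the same volume, roughly 390GB of free space is needed in total. A minimal sketch of a free-space check that could be run before the download follows; the helper name `has_free_gb` is hypothetical, and it treats GB as GiB for simplicity.

```shell
# Return success if the filesystem holding directory $1 has at least $2 GiB free.
has_free_gb() {
  dir=$1
  need_gb=$2
  # df -Pk: POSIX output format, sizes in 1024-byte blocks; column 4 is "Available".
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# 192 (nr.fa) + 198 (nr.dmnd) = 390
has_free_gb . 390 || echo "Less than 390 GiB free in $(pwd)" >&2
```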

4. Create DIAMOND nr database

For this step, switch to an r5d.4xlarge instance, which provides more threads and more memory.

# Switch to r5d.4xlarge instance with 450 GB block storage
# Make diamond nr db
# md5sum: 7158f0b4dfddc6f8e3c9d349a09e4f23
# size: 198GB

diamond makedb -p 14 --in nr.fa \
  --taxonmap prot.accession2taxid.gz \
  --taxonnodes nodes.dmp \
  --taxonnames names.dmp \
  -d nr
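
Once `makedb` finishes, the resulting file can be compared against a recorded checksum. The sketch below assumes the md5sum noted above refers to `nr.dmnd`; a database built from a different nr release will not match it. `verify_md5` is a hypothetical helper name.

```shell
# Compare the md5 checksum of file $1 with the expected hex digest $2.
verify_md5() {
  file=$1
  expected=$2
  actual=$(md5sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got $actual)" >&2
    return 1
  fi
}

# Usage (assumes the checksum above belongs to nr.dmnd):
# verify_md5 ~/nr/nr.dmnd 7158f0b4dfddc6f8e3c9d349a09e4f23
```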

5. Run DIAMOND search

INFA='epsy_120_diamond.fa'
OUT='epsy_120_diamond'

# Diamond blastp alignment
time diamond blastp \
  -q "$INFA" \
  -d ~/nr/nr.dmnd \
  --masking 0 \
  --unal 1 \
  --mid-sensitive -l 1 \
  -p14 -k1 \
  -f 6 qseqid  qstart qend qlen qstrand \
       sseqid  sstart send slen \
       pident evalue \
       full_qseq \
  > "$OUT".pro
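
Since the objective is to find sequences that are not already known, the queries without a hit are the ones of interest. The sketch below assumes the 12-column layout requested above (column 6 is `sseqid`) and assumes that, with `--unal 1`, DIAMOND marks unaligned queries with `*` in that column; check a few lines of the real output before relying on it. `list_unaligned` and the output filename are hypothetical.

```shell
# Print the query IDs (column 1) of rows whose subject ID (column 6) is "*".
# Assumes tab-separated output with the column order used in the search above.
list_unaligned() {
  awk -F'\t' '$6 == "*" {print $1}' "$1"
}

# Usage:
# list_unaligned epsy_120_diamond.pro > epsy_120_unknown.ids
```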