DIAMOND nr - ababaian/serratus GitHub Wiki
DIAMOND nr search (AWS EC2)
To test a collection of sequences against the BLAST nr (non-redundant protein) database quickly, we set up an AWS EC2 instance with a DIAMOND database.
The objective is to test whether a sequence is already known.
1. Launch EC2 instance
Launch an EC2 instance via the AWS Console with the parameters below. Use a c5n.xlarge
for the initial, network-bound download, then switch to an r5d.4xlarge
for creating the index or running the search.
EC2 Set-up Parameters
- OS: Amazon Linux 2 AMI (HVM) x86
- ami: ami-0be2609ba883822ec
- instance: c5n.xlarge // r5d.4xlarge
- description: "c5n.xlarge (- ECUs, 4 vCPUs, 3.4 GHz, -, 10.5 GiB memory, EBS only)"
- description: "r5d.4xlarge (16 vCPU 128 GB 2 x 300 NVMe SSD)"
- storage: 450 GiB SSD (gp3)
- encryption: false
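The console parameters above could also be expressed as an AWS CLI call. The following is a minimal sketch, not the procedure used by the authors: `KEY_NAME` is a hypothetical placeholder, the root device name `/dev/xvda` is an assumption for this AMI, and AMI IDs are region-specific. The function only prints the command so it can be reviewed before running.

```shell
# Sketch: assemble an `aws ec2 run-instances` command from the parameters above.
# KEY_NAME is a hypothetical placeholder; /dev/xvda is an assumed root device name.
AMI_ID='ami-0be2609ba883822ec'
INSTANCE_TYPE='c5n.xlarge'   # use r5d.4xlarge for makedb / search
KEY_NAME='my-key'            # hypothetical EC2 key pair name

build_launch_cmd() {
  # Print the full command on one line for review or copy-paste.
  printf '%s ' aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type "$INSTANCE_TYPE" \
    --count 1 \
    --key-name "$KEY_NAME" \
    --block-device-mappings \
    "'DeviceName=/dev/xvda,Ebs={VolumeSize=450,VolumeType=gp3,Encrypted=false}'"
  printf '\n'
}

build_launch_cmd
```

Appending `--dry-run` to the printed command asks AWS to validate permissions and parameters without launching anything.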
2. Install DIAMOND
# From base amazon linux 2
sudo yum install -y docker git
# From `serratus-align` container
mkdir diamond; cd diamond
# Install diamond2
# Libraries for building diamond2
sudo yum -y install git gcc gcc-c++ glibc-devel \
cmake patch automake zlib-devel make
# grab latest with fix from Benjamin
git clone https://github.com/bbuchfink/diamond.git
cd diamond
mkdir bin; cd bin
cmake ..
make -j4
sudo cp ./diamond /usr/bin/diamond
sudo chmod 755 /usr/bin/diamond
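Before moving on, it may help to confirm that the tools used in the later steps are actually on the `PATH`. This is a small sketch; the helper name `require_tools` is hypothetical, and the tool list in the usage comment is taken from the commands that appear in this page.

```shell
# Sketch: report any required command that is not on PATH.
# Returns 0 if all tools are found, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (on the instance):
#   require_tools diamond wget pigz unzip && diamond version
```

The build steps above do not install `pigz`, so a "missing: pigz" message here means it needs to be installed before the download step.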
3. Download nr database
As of 210721, the uncompressed nr database (as below) is 192GB. The download and decompression may take some time.
# DOWNLOAD BLAST DB - NR
mkdir -p ~/nr; cd ~/nr
wget -O - ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz \
| pigz -d - \
> nr.fa
# And taxonomy data
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
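Because the download is large and streamed through a pipe, a quick sanity check afterwards can catch a truncated transfer or a full disk. The helpers below are a sketch with hypothetical names; they only count FASTA header lines and report free space.

```shell
# Sketch: count FASTA records (header lines starting with '>') in a file.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Sketch: free space, in whole GiB, on the filesystem holding a directory.
free_gib() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Usage (on the instance):
#   count_fasta_records ~/nr/nr.fa
#   free_gib ~/nr
```

With the sizes quoted in this page (192GB for `nr.fa`, 198GB for the DIAMOND database), `free_gib ~/nr` should report roughly 200 GiB or more before building the index.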
4. Create DIAMOND nr database
For this process, we switch to an r5d.4xlarge
instance for more parallel threads and more memory.
# Switch to r5d.4xlarge instance with 450 GB block storage
# Make diamond nr db
# md5sum: 7158f0b4dfddc6f8e3c9d349a09e4f23
# size: 198GB
diamond makedb -p 14 --in nr.fa \
--taxonmap prot.accession2taxid.gz \
--taxonnodes nodes.dmp \
--taxonnames names.dmp \
-d nr
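The md5sum noted in the comments above can be checked once `makedb` finishes. This is a sketch under an assumption: the source does not say which file the checksum belongs to, so it is treated here as that of `nr.dmnd`. A database built from a later nr snapshot will not match it. The helper name `verify_md5` is hypothetical.

```shell
# Sketch: compare a file's md5sum against an expected value.
# Prints "OK: <file>" on a match; returns 1 and reports the actual sum otherwise.
verify_md5() {
  file=$1
  expected=$2
  actual=$(md5sum "$file" | cut -d' ' -f1)
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file ($actual)" >&2
    return 1
  fi
}

# Usage (on the instance; assumes the checksum refers to nr.dmnd):
#   verify_md5 ~/nr/nr.dmnd 7158f0b4dfddc6f8e3c9d349a09e4f23
#   diamond dbinfo -d ~/nr/nr.dmnd
```

`diamond dbinfo` prints summary information about the database file, which is a cheaper check than hashing a file of this size.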
5. Run DIAMOND search
INFA='epsy_120_diamond.fa'
OUT='epsy_120_diamond'
# Diamond blastp alignment
time diamond blastp \
-q "$INFA" \
-d ~/nr/nr.dmnd \
--masking 0 \
--unal 1 \
--mid-sensitive -l 1 \
-p14 -k1 \
-f 6 qseqid qstart qend qlen qstrand \
sseqid sstart send slen \
pident evalue \
full_qseq \
> "$OUT".pro
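Since the objective is to test whether each sequence is known, the tabular output can be summarized by counting queries with and without a hit. This is a sketch with a hypothetical helper name. It assumes that queries reported only because of `--unal 1` carry `*` in the `sseqid` column, which is column 6 in the field order used above, and that `-k1` yields one row per query.

```shell
# Sketch: count queries with a hit in nr ("known") and without ("unknown").
# Column 6 is sseqid in the -f 6 layout above; '*' is assumed to mark
# unaligned queries emitted because of --unal 1.
summarize_hits() {
  awk -F'\t' '
    $6 == "*" { unknown++; next }
              { known++ }
    END       { printf "known=%d unknown=%d\n", known, unknown }
  ' "$1"
}

# Usage:
#   summarize_hits "$OUT".pro
```

To list the unmatched queries themselves, `awk -F'\t' '$6 == "*" { print $1 }' "$OUT".pro` prints their `qseqid` values under the same assumption.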