Lab 2: Introduction to Command Line (2) - jacksonhturner/turner_epp531 GitHub Wiki

This document details the steps necessary for the completion of Lab 2 of EPP531.

In-class portion of Lab 2 (1-9)

1. Download the arabidopsis genome. wget

2. Unzip/decompress the file. gunzip TAIR10_chr_all.fas.gz

3. See what the genome looks like. less TAIR10_chr_all.fas

4. Count the number of chromosomes. grep -c '>' TAIR10_chr_all.fas

5. Download the protein sequences. wget

6. Loading the program "Blast" spack load blast-plus

Downloading the zebrafish protein sequences. curl -o zebrafish.1.protein.faa.gz -L

7. Make a database for blast. makeblastdb -in zebrafish.1.protein.faa -title Danio -dbtype prot -out danio

8. Run blast to compare "mgProteome.fasta" peptide sequence to Zebrafish database. blastp -query ../Lab1/Data/mgProteome.fasta -db danio -out proteins_blastp.txt

9. Discuss the results.

HW portion of Lab 2: run the same commands using the program "Diamond" (10)

The documentation provided below is designed to fulfill the homework assignment for Lab 2, where I leverage diamond blastp to compare zebrafish and arabidopsis protein sequences. The processes described here are executed in /pickett_sphinx/projects/EPP531_AGA/turner/Lab2.

1.) Load diamond

Since diamond is available through spack, I call it with the following command:

spack load diamond

Because there are multiple installations of diamond on our server, I choose to call the most recent version.

spack load /h7fm4ww

2.) Create a diamond reference

To compare amino acid sequences between the taxa selected for this exercise, I must create a diamond reference file from one of them. I generate a diamond reference for the zebrafish proteome through the following command:

diamond makedb --in zebrafish.1.protein.faa -d danio

3.) Align amino acid sequences with diamond

Similar to BLAST, diamond utilizes a blastp function to align amino acid sequences. To align amino acid sequences of selected taxa, the following command is executed:

diamond blastp -q Araport11_pep_20220914.aa -d danio.dmnd -o diamond_results.tsv

The query sequence cannot be aligned due to the presence of the ' ' character in at least one sequence. To resolve this error, I used the sed command to replace all instances of ' ' with '_'.

sed 's/ /_/g' Araport11_pep_20220914.aa > 2Araport11_pep_20220914.aa

Once all ' ' characters have been removed from the query, blastp may be executed successfully.

diamond blastp -q 2Araport11_pep_20220914.aa -d danio.dmnd -o diamond_results.tsv

A small selection of amino acid sequence alignment statistics may be viewed using the less command.

less diamond_results.tsv

4.) Interpreting results

The .tsv file reported by diamond provides 12 columns to allow for the interpretation of pairwise alignment results. A guide to the interpretation of each column is publicly available through this link.

5.) Concluding thoughts

Diamond is an invaluable tool for pairwise sequence alignment when using an amino acid reference. Its speed, reported as 100,000x faster than BLAST, and customizability set it apart from existing protein sequence aligners. These strengths provide it a variety of use cases, including metagenomics and systematics. Further information concerning diamond is available for public use through its github page.