Unix V: Identifying homologous sequences of a SARS Cov protein - BDC-training/VT25 GitHub Wiki
Course: VT25 Unix applied to genomic data (SC00036)
Here we will extract homologous sequences of the coronavirus spike protein and identify possible sequence motifs.
SARS-CoV carries the S gene which encodes for the spike glycoprotein responsible for the fusion of the virus membrane with the host membrane. Download the protein from the Uniprot database, using this link: https://www.uniprot.org/uniprot/P59594.fasta. Note: You can use wget
Q1. How long is the sequence?
To investigate if we can find homologous sequences in other organisms, we will be using BLAST+ executables, a suite of command-line tools to run BLAST in the command line. This allows users to perform BLAST searches on their own server without size, volume and database restrictions. And it can be integrated directly into workflows.
Blast has already been installed in the cluster, thus we only need to load
it so we can use the different tools:
module load blast+/2.14
To get a description of how to use the command, most programs can be run with the flag -h
, -help
or --help
:
blastn -h
Let's set up the database where we will be searching for our protein:
- Create a
db
directory - Go into the directory
- Download the nucleotide virus database from the NCBI, use
https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
andwget
- Extract the data
Have a look at the data.
Q2. How many coronavirus genomes are in the database? How many come from Human?
- Build the database using
makeblastdb
, one of the tools from BLAST+ that produces BLAST databases from FASTA files. If you typemakeblastdb -h
you will get information on all the flags you can use. For this exercise, use the following flags with the correct parameters:
-in [input_file]
-input_type [type]
-dbtype [molecule_type]
-parse_seqids
-out ncbi_genome_viruses_db
You will note that several files were created, these are needed to access the data with a unique identifier and allows you to associate every sequence with a taxonomic node.
Now, let's check if there are homologous sequences of the human spike protein
in our virus database. blast
is a tool that finds regions of similarity between biological sequences and there are different flavors: blastn, blastx, blastp and tblastn (check the BLAST webpage to read about the programs). We will be using tblastn
, run it using:
-db [database_name]
-query [input_file]
-out P59594.blastout
Remember that you can type tblastn -h
to check the different flags and you can add other thresholds if you want.
Inspect P59594.blastout
. You will see it is the same report format if you would use the web application.
Q3. From which organisms are the best hits? Are the sequences with high E-values coronaviruses?
This is a difficult format if you would like to automate a process. Rerun your command adding -outfmt 6
and save the output as P59594.tab.blastout
. The column names are:
1 qaccver means Query accesion.version
2 saccver means Subject accession.version
3 pident means Percentage of identical matches
4 length means Alignment length
5 mismatch means Number of mismatches
6 gapopen means Number of gap openings
7 qstart means Start of alignment in query
8 qend means End of alignment in query
9 sstart means Start of alignment in subject
10 send means End of alignment in subject
11 evalue means Expect value
12 bitscore means Bit score
Usually hits with E-values less than 0.001 show good degree of similarity while hits with greater E-values show a distant phylogenetic relationship.
Q4. How many sequences have E-value equal or less than 0.001?. What is the coverage percentage of the hits with E-value greater than 0.001 with respect to our protein length (bp=1255)?
Let´s extract the sequences. As an example, take the first identifier of your blast table and run blastdbcmd
, a tool to extract sequences from a BLAST database. Use:
-db [dbname]
-dbtype [molecule_type]
-entry [sequence_identifier]
-range [x-x]
Now, automate the process to extract all the sequences of your results and save them in a file named P59594_hits.fasta
. Remember to check blastdbcmd -help
for more options.
Now, let´s find some motifs. First we need to install MEME, a software tool for finding patterns in sequences. This is only done once! after this installation you can use meme
directly.
- Follow Steps 1 and 2 from the "Quick Install from Source (for macOS, Linux and WSL)" website
- Download the software from https://meme-suite.org/doc/download.html [SELECT the most recent
meme
tarball] - Untar the file
- Go into the
meme
directory - Run the following command to configure the file to do the installation
./configure --prefix=$HOME/meme --enable-build-libxml2 --enable-build-libxslt
- Run the actual installation process:
make make test make install
- Download the software from https://meme-suite.org/doc/download.html [SELECT the most recent
NOTE: There will be some Errors, don't worry it will still be "installed", although not in the correct location.
Have a look under the src
directory, you should see meme
as an executable program along with other programs.
- To run
meme
via the command line without having to type the absolute path, open your~/.bashrc
with a text editor and add the following line (don´t forget to add the correct path):
export PATH=/home/path_to_meme_installation/src/:$PATH
- Save the document and close it
- Run this command to excute the
export
command you just added:
source ~/.bashrc
Now you can run meme
to identify possible motifs in P59594_hits.fasta
. Run the tools so you can identify 5 motifs with an e-value threshold of 0.001. You should run the tool in the background (&
) since it takes around 36 mins to complete.
Have a look at the html
report. Remember you can use firefox
to open it.
Q5. Are the motifs found in all the samples?
Unix applied to genomic data
Home:Developed by Alina Orozco and Marcela Dávila, 2021. Updated by Marcela Dávila, 2024