Unix V: Identifying homologous sequences of a SARS Cov protein - BDC-training/VT25 GitHub Wiki

Course: VT25 Unix applied to genomic data (SC00036)

Here we will extract homologous sequences of the coronavirus spike protein and identify possible sequence motifs.


Motif


SARS-CoV carries the S gene which encodes for the spike glycoprotein responsible for the fusion of the virus membrane with the host membrane. Download the protein from the Uniprot database, using this link: https://www.uniprot.org/uniprot/P59594.fasta. Note: You can use wget

Q1. How long is the sequence?

To investigate if we can find homologous sequences in other organisms, we will be using BLAST+ executables, a suite of command-line tools to run BLAST in the command line. This allows users to perform BLAST searches on their own server without size, volume and database restrictions. And it can be integrated directly into workflows.

Blast has already been installed in the cluster, thus we only need to load it so we can use the different tools:

module load blast+/2.14

To get a description of how to use the command, most programs can be run with the flag -h, -help or --help:

blastn -h

Let's set up the database where we will be searching for our protein:

  • Create a db directory
  • Go into the directory
  • Download the nucleotide virus database from the NCBI, use https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz and wget
  • Extract the data

Have a look at the data.

Q2. How many coronavirus genomes are in the database? How many come from Human?

  • Build the database using makeblastdb, one of the tools from BLAST+ that produces BLAST databases from FASTA files. If you type makeblastdb -h you will get information on all the flags you can use. For this exercise, use the following flags with the correct parameters:
  -in            [input_file]                         
  -input_type    [type]
  -dbtype        [molecule_type]
  -parse_seqids
  -out           ncbi_genome_viruses_db

You will note that several files were created, these are needed to access the data with a unique identifier and allows you to associate every sequence with a taxonomic node.

Now, let's check if there are homologous sequences of the human spike protein in our virus database. blast is a tool that finds regions of similarity between biological sequences and there are different flavors: blastn, blastx, blastp and tblastn (check the BLAST webpage to read about the programs). We will be using tblastn, run it using:

   -db       [database_name]
   -query    [input_file]
   -out      P59594.blastout

Remember that you can type tblastn -h to check the different flags and you can add other thresholds if you want.

Inspect P59594.blastout. You will see it is the same report format if you would use the web application.

Q3. From which organisms are the best hits? Are the sequences with high E-values coronaviruses?

This is a difficult format if you would like to automate a process. Rerun your command adding -outfmt 6 and save the output as P59594.tab.blastout. The column names are:

   1  qaccver   means Query accesion.version
   2  saccver   means Subject accession.version
   3  pident    means Percentage of identical matches
   4  length    means Alignment length
   5  mismatch  means Number of mismatches
   6  gapopen   means Number of gap openings
   7  qstart    means Start of alignment in query
   8  qend      means End of alignment in query
   9  sstart    means Start of alignment in subject
   10 send      means End of alignment in subject
   11 evalue    means Expect value
   12 bitscore  means Bit score

Usually hits with E-values less than 0.001 show good degree of similarity while hits with greater E-values show a distant phylogenetic relationship.

Q4. How many sequences have E-value equal or less than 0.001?. What is the coverage percentage of the hits with E-value greater than 0.001 with respect to our protein length (bp=1255)?

Let´s extract the sequences. As an example, take the first identifier of your blast table and run blastdbcmd, a tool to extract sequences from a BLAST database. Use:

  -db       [dbname]
  -dbtype   [molecule_type]
  -entry    [sequence_identifier]
  -range    [x-x]

Now, automate the process to extract all the sequences of your results and save them in a file named P59594_hits.fasta. Remember to check blastdbcmd -help for more options.

Now, let´s find some motifs. First we need to install MEME, a software tool for finding patterns in sequences. This is only done once! after this installation you can use meme directly.

  • Follow Steps 1 and 2 from the "Quick Install from Source (for macOS, Linux and WSL)" website
    • Download the software from https://meme-suite.org/doc/download.html [SELECT the most recent meme tarball]
    • Untar the file
    • Go into the meme directory
    • Run the following command to configure the file to do the installation
      ./configure --prefix=$HOME/meme --enable-build-libxml2 --enable-build-libxslt
      
    • Run the actual installation process:
      make
      make test
      make install
      

NOTE: There will be some Errors, don't worry it will still be "installed", although not in the correct location.

Have a look under the src directory, you should see meme as an executable program along with other programs.

  • To run meme via the command line without having to type the absolute path, open your ~/.bashrc with a text editor and add the following line (don´t forget to add the correct path):
export PATH=/home/path_to_meme_installation/src/:$PATH
  • Save the document and close it
  • Run this command to excute the export command you just added:
source ~/.bashrc

Now you can run meme to identify possible motifs in P59594_hits.fasta. Run the tools so you can identify 5 motifs with an e-value threshold of 0.001. You should run the tool in the background (&) since it takes around 36 mins to complete.

Have a look at the html report. Remember you can use firefox to open it.

Q5. Are the motifs found in all the samples?


Home: Unix applied to genomic data


Developed by Alina Orozco and Marcela Dávila, 2021. Updated by Marcela Dávila, 2024