BMA231 I: Sequence Bioinformatics - bcfgothenburg/HT24 GitHub Wiki

Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)


The purpose of this exercise is to introduce you to general web-tools used in Sequence Bioinformatics.



Introduction

To explore different databases and tools, you will be using data from the Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2). This is a new strain of coronavirus that just recently has been identified and has been associated with the disease commonly known as COVID-19 (as if you don't already know that!)

Wu, Fan et al. “A new coronavirus associated with human respiratory disease in China.” Nature vol. 579,7798 (2020): 265-269. doi:10.1038/s41586-020-2008-3 reported the genomic sequence of this virus from a single patient using NGS technology, in December 2019.

Huang, Y., Yang, C., Xu, Xf. et al. Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19. Acta Pharmacol Sin 41, 1141–1149 (2020). https://doi.org/10.1038/s41401-020-0485-4 provided a nice overview of the spike protein structure and its relevance to antivirus drug development. This spike protein will be the focus of this practical.

The genome

To have a better insight of the genomic organization of SARS-CoV-2, Wu, Fan et al. compared this new virus to 2 representative members of the same taxonomic genus (Betacoronavirus). One associated with humans: SARS-CoV Tor2 and the other associated with bats bat SL-CoVZC45.

Go to NCBI and use the accession numbers of these sequences (AY274119.3 for SARS-CoV Tor2 and MG772933.1 for bat SL-CoVZC45) to find the corresponding Nucleotide sequence hit. Click on it to display the GenBank report and answer the following questions.

Q1. How long are SARS-CoV Tor2 and SL-CoVZC45 genomes? Include a screenshot

Q2. Which country reported each sequence? Include a screenshot

Q3. For each genome, How many proteins do they seem to have? Which ones are the three longest ones? Hint: you can click on the Protein link under Related information or you can click on Graphics under the name of the sequence for a more visual inspection). Include a screenshot

Now let's inspect the SARS-CoV-2 isolate Wuhan-Hu-1. The ID for this sequences is: MN908947, according to Wu F. et al. paper. However there is a curated version: NC_045512, let's use this one.

Q4. How long is this genome?

We will analyze this data as if we didn't know anything about it, hopefully we will reach to the same information that has been published!

Sequence similarity using BLAST

One first thing to do would be to compare our "unknown" sequence to a database to get some annotation. In this case we will compare our sequence with the ones from the article.

Download the nucleotide sequence in FASTA format:

  • Click on the FASTA link (just below the name of the sequence)
    You will see only the genomic sequence
  • Click on Send to (on the right side)
  • Select File
  • Select FASTA as Format
  • Select Show GI
  • Click on Create File
  • Save it as covid_genome.fasta

Note: When using webtools it is common to use only the ID number, since the tools are coupled to the database. So this is just to let you know that you can download the data in case you need to.

Go to the NCBI's BLAST webpage

  • Select Global Align under Specialized searches
    We will be aligning 2 sequences rather than one sequence against a database
  • Type AY274119 under Enter Query Sequence
    The ID for the SARS-CoV Tor2
  • Upload the covid_genome.fasta file under Enter Subject Sequence
    Our "unknown" sequence
  • Leave the default parameters as they are
  • Click on BLAST

You will see the report and the hits that produced significant alignments, in our case only one.

Go to the Graphic Summary tab, you can see how sequences aligned all the way with a high score (depicted with a red line, from position 1 to >27500)

Go to the Alignments tab:

Q5. How similar are both sequences? Are there any gaps? Include a screenshot

Go to the Dot Plot tab:

Q6. Which sequence is depicted in the X-axis? What can you conclude from this graph? Include a screenshot

Repeat the same analysis, but now using Bat SARS-like sequence.

This text is from the Wu F. et al. paper: "WHCV was most closely related to the bat coronavirus SL-CoVZC45 with 82.3% amino acid identity (and around 77.2% amino acid identity to SARS-CoV)"

Q7. Did your analysis give similar results? Why?

The proteome

Now let's identify potential proteins by searching ORFs (Open Reading Frames).

Go to ORFfinder. This is an NCBI tool that searchers for ORFs given a DNA sequence, in our case, the SARS-CoV-2 genome.

  • Copy/paste your sequence or Type the accession number.
  • Leave the default parameters as they are
  • Click on Submit

You will see the entire genomic sequence displayed, where every red box is a predicted ORF:

Q8. How many ORFS were found? Include a screenshot

Q9. How many nucleotides are there in the longest and shortest ORF? Include a screenshot. Hint: You can order the table below the visualization by clicking on the headers

Compare these ORFs with the three longest proteins from SARS-CoV Tor2 and the bat SL-CoVZC45, you made a note on them when we inspected the genome of these viruses:

Q10. Which ORFs from SARS-CoV-2 correspond to these three longest proteins? From which strand and Frame are they transcribed? Include a screenshot

In Figure 1 of Wu F. et al. paper, you can visualize the similarity between the 3 genomes.

The spike glycoprotein

Go back to the GenBank record of NC_045512 and locate the gene S (the spike glycoprotein). You can see that this gene produces a structural protein called spike protein and is coded from nucleotides 21563 to 25384.

Let's download the sequences so we can have a closer look:

  • Click on the CDS link of CDS 21563..25384
    You will see it highlighted at the end of the entry
  • Click FASTA as Display
    Now you have the nucleotide sequence of the region where the spike protein is coded
  • Click on the Send to link, on the right side
  • Select File
  • Select FASTA as Format
  • Select Show GI
  • Click on Create File
  • Save it as spike_genomic.fasta

Let's translate it into a protein. Go to EMBL-EBI services. The European Bioinformatics Institute (EMBL-EBI) has a lot of freely available and up-to-date molecular data resources. Here we will use sixpack, a tool that will give us the translation of our DNA for each frame:

  • In the Search box type sixpack
  • Go to the tool
  • Upload the spike_genomic.fasta file
  • Leave the default parameters as they are
  • Click on Submit

Now you can see all the different frames and the actual sequence, with a summary of the ORFs at the end of the screen.

Q11. How many ORFs were found by sixpack? Include a screenshot

Click on the Tool Output tab, here you will have each ORF in FASTA format

Q12. Which ORF codes for the spike protein? Include a screenshot

Q13. Do you think there is another protein coded in this region besides the spike protein? Why?

Save the spike protein:

  • Copy/paste the aminoacid sequence into a text editor
  • Name it spike_protein.fasta

Functional annotation with BLAST

Remember we are assuming that our sequence is "unknown". Let's see if we find more information about our protein within NCBI. To do this, first we need to identify if there are any matches in the database so we can infer any possible function. Go to the NCBI's BLAST webpage.

Q14. Which blast algorithm should we use and why?

Q15. If instead of using a protein we used the gene sequence, is it necessary to translate the nucleotide sequence (gene) into an aminoacid sequence (protein) before using BLAST?

  • Select Protein BLAST under Web BLAST
  • Upload the spike_protein.fasta file
  • Select UniProtKB/Swiss-Prot(swissprot) as Database
  • Check that blastp is selected as Algorithm
  • Leave the default parameters as they are
  • Click on BLAST

Q16. Based on the description of the matching sequences, what is the name of our protein? Make a note of the Sequence ID of the best matching sequence. Include a screenshot

Go to the Graphic Summary tab:

Q17. Are all the sequences covering the entire length of our protein sequence? Include a screenshot

Go to the Alignments tab and scroll down until the last hit is shown. The expected value is quite small (~3e-87), however the identity to our protein is only around 30%:

Q18. Do you think this is a true hit? Why?

Go to the Taxonomy tab:

Q19. Are the sequences found afecting only human? Include a screenshot

Q20. Are all the sequences from viruses infecting the respiratory system? Include a screenshot

Go back to the Descrptions tab.

  • Select all 50 sequences
  • Click on Dowload
  • Select FASTA (complete sequence)
  • Save the file as spike_proteins.fasta

We will use this file later on.

Functional annotation with UniProt

Go to UniProt and use the sequence ID from the best hit we got (you made a note under the Functional annotation with BLAST) and search the database. We could've blasted the protein as well:

Q21. What is the UniProt ID for this sequence? Include a screenshot

Have a look at the Function section:

Q22. What does SPIKE_SARS2 do? No more than 20 words

Go to the Family & Domains section.

Q23. Which domain/region is the responsible for the virus binding to the human cell? Hint: Browse the table under Features. Include a screenshot

To corroborate our knowledge on the function of the spike protein, watch any of these 2 videos (or both):

  1. Animation of SARS-CoV-2 entry into human host-cell (1:54 min) from the Thorn Lab

  2. Model of Membrane Fusion by SARS CoV-2 Spike Protein (0:42 sec) from Digizyme.

Evolution of the spike protein

Let’s investigate how different spike protein sequences are related through evolution. The first thing to do is to align the sequences so we can see any similarities/differences and then visualize these differences with a phylogenetic tree.

Go to Clustal Omega, another service from EMBL-EBI. This is a program that aligns multiple sequences, in our case 50 spike proteins from different organism:

  • Upload the spike_proteins.fasta file
  • Leave the default parameters as they are
  • Click on Submit

In the Alignments tab, you will see already some differences between the sequences (in terms of length and aminoacid identity).

Now, let's download the alignment file so we can use it in a third party tool:

  • Click on the Result Files tab
  • Save the "Alignment in CLUSTAL format with base/residue numbering" file as spike_proteins.clustal_num

Go to the Phylogenetic Tree tab. Here you will see the graphical representation of the alignment:

Q24. Is this phylogenetic distribution similar to the one is shown in the Extended Data Fig. 5. This task is quite difficult since we don't have the taxonomy displayed in our sequences (and though there are ways to fix this, it is a little outside of our scope). Use This document to compare the taxonomy in Extended Data Fig. 5. The table displays all 50 proteins in the same order as in your phylogenetic tree (check it just to be sure!). It also has information about their taxonomy.

Let's explore a little more the alignment file with JalView, a free program for multiple sequence alignment editing, visualisation and analysis.

First, download JalView (if you don't alreay have it):

  • Check that Java is installed in your computer:
  • Go to this Download page
  • Download an installer or archive for your platform
  • Follow the instructions

Open JalView and open the spike_proteins.clustal_num file (File -> Input alignment -> From File).

Try the following color schemes:

  • Colour -> Percentage Identity
    Here you can see how each column is colored according to the percentage of the residues that agree with the consensus sequence.

Q25. Is there any particular region(s) where the percentage of Identity is evidently high? Include a screenshot

  • Colour -> Clustalx
    This coloring scheme represents the category of the mayority of the residues in each column, for example:
    • Blue -> Hydrophobic
    • Red -> Positive charged
    • Magenta-> Negative charged
    • Green-> Polar
    • Pink-> Cysteines
    • Orange -> Glycines
    • Yellow-> Prolines
    • Cyan-> Aromatic

Q26. Is there any particular region(s) where a specific residue category is more evident? Include a screenshot

  • Colour -> Hydrophobicity
    This is color scale, where the most hydrophobic residues are colored red, while the most hydrophilic are colored blue

Q27. Is there any particular region(s) that are highly hydrophobic/hydrophilic? Include a screenshot

To help in the visualization, we can cluster our sequences based on their phylogeny:

  • Select all the sequences ID (use <shift> on the ID column)
  • Calculate -> Calculate Tree or PCA ...
  • Check Neighbour Joining as Tree
  • Click on Calculate

In the Tree window, click within the screen. A red line will appear (where you clicked). This will color the different sequences depending on their branch. Draw a line as indicated in the figure, this will separate the Alphacoronavirus, Gammacoronavirus, Sarbecovirus (Betacoronavirus) and Merbecovirus (Betacoronavirus) by color.

Go to the Alignment window. You will see how the different sequences are now grouped and easier to identify (hopefully!).

In Extended Data Fig. 7 from the Wu F. et al. paper, a partial alignment of different SARS-like CoVs is shown. There are 5 key aminoacids and 2 motifs that are involved in the interaction with human ACE2, marked by orange squares (TGNYN AND FSPDGKPCTP-PAL) and red boxes respectively.

Let's identify the 2 motifs in our alignment. You could use Select -> Find, to search for short aminoacid patterns. Note: have in mind that the sequences the authors used to create the alignment are different from our sequences, so the positions of these key aminoacids/motifs will NOT be the same

Q28. In which sequences did you find the patterns? Are these conserved in other taxonomic groups? Include a screenshot

The structure of the spike protein

To visualize the spike protein structure, go to PDB. Search for 6VW1. This entry will show us the region where the spike protein is binding the ACE2 receptor.

Q29. Which method was used to elucidate this structure? Include a screenshot

Under Macromolecules, there are 2 molecules listed, one is the ACE2 receptor and the other is the RBD region of the spike protein. These are depicted in the 3D View, the ACE2 receptor is the orange structure and the spike protein is in green.

Q30. Some molecules/reagents are necessary when the 3D structure of molecules are elucidated. In our case, there are four small molecules included in the solution containing the spike protein. Which are they? Include a screenshot

Go to the Structure tab. ACE2 is depicted in green while the RBD portion in the spike protein is in purple.

If you zoom in, you will also see the other molecules and their interactions, depicted with blue boxes ad green spheres.

  • Select 2: SARS-CoV-2 chimeric RBD under the Sequence of section
  • In the sequence window identify the TGNYN motif, around position 450
  • Click in one of the residues
    This will highlight the residue in the structure view
  • Check all the residues

Q31. Do all the residues interact with ACE2? Include a screenshot

A clinical application

Now let's investigate some clinical samples from the spike protein.

Go to the NCBI Virus portal.

  • Click on the Find Data drop down menu at the top of the page
  • Select Up-to-date SARS-CoV-2
  • Click on the Protein tab (since we are looking for the spike protein)

Under the Refine Results menu (to your left)

  • Locate Has Proteins

    • Click on the + sign
    • Type surface glycoprotein
    • Select the suggestion: surface glycoprotein
  • Under Nucleotide Completeness

    • Click on the + sign
    • Select complete

All these sequences (more than 1,000,000) are from patients from different countries (Geographic Regions) that have been obtained from different tissues or fluids (Isolation Sources) at different time points (Collection Date). To make this easier (and quicker!) do the following:

  • Define what you want to compare, for instance, you may be interested in how the spike protein differs in samples from:
    • lung vs placenta vs blood (Tissue/Specimen/Source), or
    • swabs done in Asia vs Oceania vs Benin (Geographic Region), or
    • oronasopharynx in Egypt through time (Collection Date)
  • Using the Refine Results menu, select your samples according to your criteria. Make sure to have:
    • At least 3 groups and at least 4 sequences per group
  • In a text file, make note of the protein accession numbers of your selected sequences

Once you have selected all your sequences:

  • Reset the Refine Results menu
  • Open the Accession category
  • Paste your accession numbers
  • Click Submit
  • Check that under the Protein tab you get the correct number of proteins to download
  • Select all the proteins
  • Click on the Download drop down menu (at the top left of the page)
  • Under Sequence data (FASTA format) select Protein
  • Select Download Selected Records
  • Select Build custom
  • Select GenBank Title on the right panel and click Remove
  • Select Geo Location, Isolation Source and Collection Date
    This will help you to identify your sequences in the alignment and phylogenetic tree. If for example, you don't care about the Collection Date you can skip it
  • Click Add
  • Click Download
  • Save the file as spike_samples.fasta

Now you are ready to investigate the phylogenetic relationship of your selected sequences. Run Clustal Omega and answer the following questions. Note: You can have a look at the results from Clustal Omega or you can use JalView.

Q32. What is your interpretation of the phylogenetic tree? Does it make sense? Include a screenshot

Q33. Are there any mutations in any of the 2 motifs involved in the interaction with human ACE2 ? (GGNYN AND YQAGSTPCNGVEGF). Include a screenshot

Q34. Are there a lot of differences among your sequences? or are they fairly similar?. Include a screenshot

Well done! now you have the general idea of databases and general tools used in sequence bioinformatics



Developed by Marcela Dávila, 2021. Modified by Marcela Dávila, 2022

⚠️ **GitHub.com Fallback** ⚠️