BMA231 I: Sequence Bioinformatics - bcfgothenburg/HT24 GitHub Wiki
Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)
The purpose of this exercise is to introduce you to general web-tools used in Sequence Bioinformatics.
To explore different databases and tools, you will be using data from the Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2). This is a new strain of coronavirus that just recently has been identified and has been associated with the disease commonly known as COVID-19 (as if you don't already know that!)
Wu, Fan et al. “A new coronavirus associated with human respiratory disease in China.” Nature vol. 579,7798 (2020): 265-269. doi:10.1038/s41586-020-2008-3 reported the genomic sequence of this virus from a single patient using NGS technology, in December 2019.
Huang, Y., Yang, C., Xu, Xf. et al. Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19. Acta Pharmacol Sin 41, 1141–1149 (2020). https://doi.org/10.1038/s41401-020-0485-4 provided a nice overview of the spike protein
structure and its relevance to antivirus drug development. This spike protein
will be the focus of this practical.
To have a better insight of the genomic organization of SARS-CoV-2
, Wu, Fan et al. compared this new virus to 2 representative members of the same taxonomic genus (Betacoronavirus). One associated with humans: SARS-CoV Tor2
and the other associated with bats bat SL-CoVZC45
.
Go to NCBI and use the accession numbers of these sequences (AY274119.3 for SARS-CoV Tor2 and MG772933.1 for bat SL-CoVZC45) to find the corresponding Nucleotide sequence
hit. Click on it to display the GenBank
report and answer the following questions.
Q1. How long are
SARS-CoV Tor2
andSL-CoVZC45
genomes? Include a screenshot
Q2. Which country reported each sequence? Include a screenshot
Q3. For each genome, How many proteins do they seem to have? Which ones are the three longest ones? Hint: you can click on the
Protein
link under Related information or you can click onGraphics
under the name of the sequence for a more visual inspection). Include a screenshot
Now let's inspect the SARS-CoV-2 isolate Wuhan-Hu-1
. The ID for this sequences is: MN908947, according to Wu F. et al. paper. However there is a curated version: NC_045512
, let's use this one.
Q4. How long is this genome?
We will analyze this data as if we didn't know anything about it, hopefully we will reach to the same information that has been published!
One first thing to do would be to compare our "unknown" sequence to a database to get some annotation. In this case we will compare our sequence with the ones from the article.
Download the nucleotide sequence in FASTA format:
- Click on the
FASTA
link (just below the name of the sequence)
You will see only the genomic sequence - Click on
Send to
(on the right side) - Select
File
- Select
FASTA
as Format - Select
Show GI
- Click on
Create File
- Save it as
covid_genome.fasta
Note: When using webtools it is common to use only the ID number, since the tools are coupled to the database. So this is just to let you know that you can download the data in case you need to.
Go to the NCBI's BLAST webpage
- Select
Global Align
under Specialized searches
We will be aligning 2 sequences rather than one sequence against a database - Type
AY274119
under Enter Query Sequence
The ID for theSARS-CoV Tor2
- Upload the
covid_genome.fasta
file under Enter Subject Sequence
Our "unknown" sequence - Leave the default parameters as they are
- Click on
BLAST
You will see the report and the hits that produced significant alignments, in our case only one.
Go to the Graphic Summary tab, you can see how sequences aligned all the way with a high score (depicted with a red line, from position 1 to >27500)
Go to the Alignments tab:
Q5. How similar are both sequences? Are there any gaps? Include a screenshot
Go to the Dot Plot tab:
Q6. Which sequence is depicted in the X-axis? What can you conclude from this graph? Include a screenshot
Repeat the same analysis, but now using Bat SARS-like
sequence.
This text is from the Wu F. et al. paper: "WHCV was most closely related to the bat coronavirus SL-CoVZC45 with 82.3% amino acid identity (and around 77.2% amino acid identity to SARS-CoV)"
Q7. Did your analysis give similar results? Why?
Now let's identify potential proteins by searching ORFs (Open Reading Frames).
Go to ORFfinder. This is an NCBI tool that searchers for ORFs given a DNA sequence, in our case, the SARS-CoV-2
genome.
- Copy/paste your sequence or Type the accession number.
- Leave the default parameters as they are
- Click on
Submit
You will see the entire genomic sequence displayed, where every red box is a predicted ORF:
Q8. How many ORFS were found? Include a screenshot
Q9. How many nucleotides are there in the longest and shortest ORF? Include a screenshot. Hint: You can order the table below the visualization by clicking on the headers
Compare these ORFs with the three longest proteins from SARS-CoV Tor2
and the bat SL-CoVZC45
, you made a note on them when we inspected the genome of these viruses:
Q10. Which ORFs from
SARS-CoV-2
correspond to these three longest proteins? From which strand and Frame are they transcribed? Include a screenshot
In Figure 1 of Wu F. et al. paper, you can visualize the similarity between the 3 genomes.
Go back to the GenBank record of NC_045512
and locate the gene S
(the spike glycoprotein). You can see that this gene produces a structural protein called spike protein
and is coded from nucleotides 21563 to 25384.
Let's download the sequences so we can have a closer look:
- Click on the
CDS
link of CDS 21563..25384
You will see it highlighted at the end of the entry - Click
FASTA
as Display
Now you have the nucleotide sequence of the region where thespike protein
is coded
- Click on the Send to link, on the right side
- Select
File
- Select
FASTA
as Format - Select
Show GI
- Click on
Create File
- Save it as
spike_genomic.fasta
Let's translate it into a protein. Go to EMBL-EBI services. The European Bioinformatics Institute (EMBL-EBI) has a lot of freely available and up-to-date molecular data resources. Here we will use sixpack
, a tool that will give us the translation of our DNA for each frame:
- In the Search box type
sixpack
- Go to the tool
- Upload the
spike_genomic.fasta
file - Leave the default parameters as they are
- Click on
Submit
Now you can see all the different frames and the actual sequence, with a summary of the ORFs at the end of the screen.
Q11. How many ORFs were found by sixpack? Include a screenshot
Click on the Tool Output tab, here you will have each ORF in FASTA format
Q12. Which ORF codes for the
spike protein
? Include a screenshot
Q13. Do you think there is another protein coded in this region besides the
spike protein
? Why?
Save the spike protein
:
- Copy/paste the aminoacid sequence into a text editor
- Name it
spike_protein.fasta
Remember we are assuming that our sequence is "unknown". Let's see if we find more information about our protein within NCBI. To do this, first we need to identify if there are any matches in the database so we can infer any possible function. Go to the NCBI's BLAST webpage.
Q14. Which blast algorithm should we use and why?
Q15. If instead of using a protein we used the gene sequence, is it necessary to translate the nucleotide sequence (gene) into an aminoacid sequence (protein) before using BLAST?
- Select
Protein BLAST
under Web BLAST - Upload the
spike_protein.fasta
file - Select
UniProtKB/Swiss-Prot(swissprot)
as Database - Check that
blastp
is selected as Algorithm - Leave the default parameters as they are
- Click on
BLAST
Q16. Based on the description of the matching sequences, what is the name of our protein? Make a note of the Sequence ID of the best matching sequence. Include a screenshot
Go to the Graphic Summary tab:
Q17. Are all the sequences covering the entire length of our protein sequence? Include a screenshot
Go to the Alignments tab and scroll down until the last hit is shown. The expected value is quite small (~3e-87), however the identity to our protein is only around 30%:
Q18. Do you think this is a true hit? Why?
Go to the Taxonomy tab:
Q19. Are the sequences found afecting only human? Include a screenshot
Q20. Are all the sequences from viruses infecting the respiratory system? Include a screenshot
Go back to the Descrptions tab.
- Select all 50 sequences
- Click on
Dowload
- Select
FASTA (complete sequence)
- Save the file as
spike_proteins.fasta
We will use this file later on.
Go to UniProt and use the sequence ID from the best hit we got (you made a note under the Functional annotation with BLAST) and search the database. We could've blasted the protein as well:
Q21. What is the UniProt ID for this sequence? Include a screenshot
Have a look at the Function section:
Q22. What does
SPIKE_SARS2
do? No more than 20 words
Go to the Family & Domains section.
Q23. Which domain/region is the responsible for the virus binding to the human cell? Hint: Browse the table under Features. Include a screenshot
To corroborate our knowledge on the function of the spike protein
, watch any of these 2 videos (or both):
-
Animation of SARS-CoV-2 entry into human host-cell (1:54 min) from the Thorn Lab
-
Model of Membrane Fusion by SARS CoV-2 Spike Protein (0:42 sec) from Digizyme.
Let’s investigate how different spike protein sequences are related through evolution. The first thing to do is to align the sequences so we can see any similarities/differences and then visualize these differences with a phylogenetic tree.
Go to Clustal Omega, another service from EMBL-EBI. This is a program that aligns multiple sequences, in our case 50 spike proteins
from different organism:
- Upload the
spike_proteins.fasta
file - Leave the default parameters as they are
- Click on
Submit
In the Alignments tab, you will see already some differences between the sequences (in terms of length and aminoacid identity).
Now, let's download the alignment file so we can use it in a third party tool:
- Click on the Result Files tab
- Save the "Alignment in CLUSTAL format with base/residue numbering" file as
spike_proteins.clustal_num
Go to the Phylogenetic Tree tab. Here you will see the graphical representation of the alignment:
Q24. Is this phylogenetic distribution similar to the one is shown in the Extended Data Fig. 5. This task is quite difficult since we don't have the taxonomy displayed in our sequences (and though there are ways to fix this, it is a little outside of our scope). Use This document to compare the taxonomy in Extended Data Fig. 5. The table displays all 50 proteins in the same order as in your phylogenetic tree (check it just to be sure!). It also has information about their taxonomy.
Let's explore a little more the alignment file with JalView
, a free program for multiple sequence alignment editing, visualisation and analysis.
First, download JalView
(if you don't alreay have it):
- Check that Java is installed in your computer:
- Follow this instructions to check
- If Java is installed go ahead with the Jalview installation
- If not, follow this instructions
- Follow this instructions to check
- Go to this Download page
- Download an installer or archive for your platform
- Follow the instructions
Open JalView
and open the spike_proteins.clustal_num
file (File
-> Input alignment
-> From File
).
Try the following color schemes:
-
Colour
->Percentage Identity
Here you can see how each column is colored according to the percentage of the residues that agree with the consensus sequence.
Q25. Is there any particular region(s) where the percentage of Identity is evidently high? Include a screenshot
-
Colour
->Clustalx
This coloring scheme represents the category of the mayority of the residues in each column, for example:- Blue -> Hydrophobic
- Red -> Positive charged
- Magenta-> Negative charged
- Green-> Polar
- Pink-> Cysteines
- Orange -> Glycines
- Yellow-> Prolines
- Cyan-> Aromatic
Q26. Is there any particular region(s) where a specific residue category is more evident? Include a screenshot
-
Colour
->Hydrophobicity
This is color scale, where the most hydrophobic residues are colored red, while the most hydrophilic are colored blue
Q27. Is there any particular region(s) that are highly hydrophobic/hydrophilic? Include a screenshot
To help in the visualization, we can cluster our sequences based on their phylogeny:
- Select all the sequences ID (use <shift> on the ID column)
-
Calculate
->Calculate Tree or PCA ...
- Check
Neighbour Joining
as Tree - Click on
Calculate
In the Tree window, click within the screen. A red line will appear (where you clicked). This will color the different sequences depending on their branch. Draw a line as indicated in the figure, this will separate the Alphacoronavirus, Gammacoronavirus, Sarbecovirus (Betacoronavirus) and Merbecovirus (Betacoronavirus) by color.
Go to the Alignment window. You will see how the different sequences are now grouped and easier to identify (hopefully!).
In Extended Data Fig. 7 from the Wu F. et al. paper, a partial alignment of different SARS-like CoVs is shown. There are 5 key aminoacids and 2 motifs that are involved in the interaction with human ACE2, marked by orange squares (TGNYN
AND FSPDGKPCTP-PAL
) and red boxes respectively.
Let's identify the 2 motifs in our alignment. You could use Select
-> Find
, to search for short aminoacid patterns. Note: have in mind that the sequences the authors used to create the alignment are different from our sequences, so the positions of these key aminoacids/motifs will NOT be the same
Q28. In which sequences did you find the patterns? Are these conserved in other taxonomic groups? Include a screenshot
To visualize the spike protein structure, go to PDB. Search for 6VW1
. This entry will show us the region where the spike protein is binding the ACE2 receptor
.
Q29. Which method was used to elucidate this structure? Include a screenshot
Under Macromolecules, there are 2 molecules listed, one is the ACE2 receptor
and the other is the RBD region of the spike protein
. These are depicted in the 3D View, the ACE2 receptor
is the orange structure and the spike protein
is in green.
Q30. Some molecules/reagents are necessary when the 3D structure of molecules are elucidated. In our case, there are four small molecules included in the solution containing the spike protein. Which are they? Include a screenshot
Go to the Structure tab. ACE2
is depicted in green while the RBD portion in the spike protein
is in purple.
If you zoom in, you will also see the other molecules and their interactions, depicted with blue boxes ad green spheres.
- Select
2: SARS-CoV-2 chimeric RBD
under the Sequence of section - In the sequence window identify the
TGNYN
motif, around position 450 - Click in one of the residues
This will highlight the residue in the structure view - Check all the residues
Q31. Do all the residues interact with
ACE2
? Include a screenshot
Now let's investigate some clinical samples from the spike protein
.
Go to the NCBI Virus portal.
- Click on the
Find Data
drop down menu at the top of the page - Select
Up-to-date SARS-CoV-2
- Click on the
Protein
tab (since we are looking for the spike protein)
Under the Refine Results
menu (to your left)
-
Locate Has Proteins
- Click on the
+
sign - Type
surface glycoprotein
- Select the suggestion:
surface glycoprotein
- Click on the
-
Under Nucleotide Completeness
- Click on the
+
sign - Select
complete
- Click on the
All these sequences (more than 1,000,000) are from patients from different countries (Geographic Regions
) that have been obtained from different tissues or fluids (Isolation Sources
) at different time points (Collection Date
). To make this easier (and quicker!) do the following:
- Define what you want to compare, for instance, you may be interested in how the spike protein differs in samples from:
- lung vs placenta vs blood (Tissue/Specimen/Source), or
- swabs done in Asia vs Oceania vs Benin (Geographic Region), or
- oronasopharynx in Egypt through time (Collection Date)
- Using the
Refine Results
menu, select your samples according to your criteria. Make sure to have:- At least 3 groups and at least 4 sequences per group
- In a text file, make note of the protein accession numbers of your selected sequences
Once you have selected all your sequences:
- Reset the
Refine Results
menu - Open the
Accession
category - Paste your accession numbers
- Click
Submit
- Check that under the Protein tab you get the correct number of proteins to download
- Select all the proteins
- Click on the
Download
drop down menu (at the top left of the page) - Under Sequence data (FASTA format) select
Protein
- Select
Download Selected Records
- Select
Build custom
- Select
GenBank Title
on the right panel and clickRemove
- Select
Geo Location
,Isolation Source
andCollection Date
This will help you to identify your sequences in the alignment and phylogenetic tree. If for example, you don't care about the Collection Date you can skip it - Click
Add
- Click
Download
- Save the file as
spike_samples.fasta
Now you are ready to investigate the phylogenetic relationship of your selected sequences. Run Clustal Omega and answer the following questions. Note: You can have a look at the results from Clustal Omega or you can use JalView.
Q32. What is your interpretation of the phylogenetic tree? Does it make sense? Include a screenshot
Q33. Are there any mutations in any of the 2 motifs involved in the interaction with human
ACE2
? (GGNYN
ANDYQAGSTPCNGVEGF
). Include a screenshot
Q34. Are there a lot of differences among your sequences? or are they fairly similar?. Include a screenshot
Well done! now you have the general idea of databases and general tools used in sequence bioinformatics
Developed by Marcela Dávila, 2021. Modified by Marcela Dávila, 2022