Sequence alignment and similarity

Sequence alignment is a method to compare two or more DNA, RNA, or protein sequences to identify regions of local similarity or overall similarity between the sequences.

Sequence similarity is a measure of the degree of likeness between sequences, often expressed as the percentage of identical or similar DNA bases or amino acid residues within an alignment. The figure below shows a pairwise sequence alignment. Put simply, aligning means pairing up the two sequences so that as many of the letters as possible are the same. We can introduce gaps ("-") in either sequence to pair as many of the letters as possible.
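As a minimal sketch of how percent identity can be computed from an alignment that has already been made, the Python snippet below counts the columns where both sequences carry the same letter. The two sequences are made-up toy examples, not the ones in the figure, and dividing by the full alignment length (gaps included) is only one common convention; other tools may divide by the length of the shorter sequence instead.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences have the same letter."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if the letters match and it is not a gap.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# Made-up toy alignment: 10 columns, one gap ("-") in the first sequence.
seq_a = "ACGTTGCA-A"
seq_b = "ACGTTGCATA"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity")
```

Here nine of the ten columns are identical, and the gapped column counts against the identity.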

The identity between the two sequences above is 90%, as calculated using the method shown in the figure. Sequence similarity above a certain threshold might imply homology between the sequences. Homologs, orthologs, and paralogs all share a common ancestor, as broken down below:

Homologs

Definition: General term for genes/proteins that share a common evolutionary ancestor.
Includes both orthologs and paralogs.
Key point: Homology is binary — two sequences are either homologous or not.

Orthologs

Definition: Homologous genes in different species that diverged due to a speciation event.
Usually retain similar function across species.
Example: Human hemoglobin gene vs. mouse hemoglobin gene.

Paralogs

Definition: Homologous genes within the same species (or different species) that arose by gene duplication.
Often evolve new or specialized functions.
Example: Human hemoglobin alpha vs. beta chains.

Analogs

Definition: Genes or proteins with similar function or structure but no common evolutionary origin.
Result of convergent evolution.
Example: Insect wings vs. bird wings; or the serine proteases chymotrypsin (mammals) and subtilisin (bacteria), which evolved the same catalytic triad independently.

Term	Common Ancestor?	Origin Event	Function Similarity	Species Context
Homologs	Yes	Any (speciation or duplication)	Maybe	Any
Orthologs	Yes	Speciation	Often similar	Different species
Paralogs	Yes	Duplication	Often diverged	Same or related species
Analogs	No	Convergent evolution	Yes	Any

It is important to remember that sequences are either homologs or not. There is no degree of homology.

How much similarity is needed to imply homology? As so often, it depends.

Above ~35% identity: generally considered to be confidently homologous (especially if coverage is high)
Between ~20–35% identity for proteins (with full-length alignments): called the "twilight zone", where homology is possible but needs supporting evidence
Below ~20% identity: called the "midnight zone", where detecting homology is unreliable without additional data

However, context matters:

For short sequences, even 40% may be ambiguous.
For long sequences (>150 residues), even ~25% identity with good coverage can indicate true homology.
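The rule-of-thumb zones above can be written as a small Python helper. This is only a sketch: the function name is hypothetical, the approximate cutoffs from the text are treated as hard limits, and it deliberately ignores sequence length and coverage, which the text notes can change the interpretation. The 20–35% range is labelled with its commonly used name, the "twilight zone".

```python
def identity_zone(percent_identity: float) -> str:
    """Map a protein percent identity onto the rough zones described in the text."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity > 35.0:
        return "likely homologous"
    if percent_identity >= 20.0:
        # Homology is possible here but needs supporting evidence.
        return "twilight zone"
    return "midnight zone"


for value in (90.0, 28.0, 12.0):
    print(f"{value:5.1f}% identity: {identity_zone(value)}")
```

A label from this helper is a starting point for interpretation, not a verdict on homology.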

Conserved motifs/domains can also provide supporting evidence of functional/structural similarity.

BLAST

BLAST (Basic Local Alignment Search Tool) is a computer program widely used for comparing a sequence to the sequences in a database.

The BLAST algorithm is heuristic, which means that it produces a good-enough result quickly, in contrast to methods that are guaranteed to find the best-scoring alignment, such as the Smith-Waterman algorithm and other dynamic programming approaches.

This emphasis on speed is necessary to make the algorithm practical on the huge genome databases currently available.
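To make the contrast concrete, here is a minimal sketch of the Smith-Waterman local alignment in Python. It fills a full score matrix, so the work grows with the product of the two sequence lengths, which is what makes exhaustive database searches slow. The scoring values (match +2, mismatch -1, linear gap penalty -2) are arbitrary assumptions for illustration; real protein searches typically use a substitution matrix such as BLOSUM62 with separate gap-opening and gap-extension penalties.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            H[i][j] = max(0, diag, up, left)  # 0 lets the alignment restart
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy DNA sequences that share only the core "ACGT".
score, top, bottom = smith_waterman("GGGACGTCCC", "TTACGTAA")
print(score)
print(top)
print(bottom)
```

BLAST avoids filling this matrix for every database sequence by first looking for short word matches and extending only from those, which is where both its speed and its lack of an optimality guarantee come from.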

NCBI BLAST Exercise

The most commonly used BLAST service is the free service offered by NCBI to search the GenBank database. It can be found here. Select “BLAST” from the list on the right side of the screen (below).

After selecting BLAST, you should see a page similar to the one below.

There are four options: Nucleotide BLAST, blastx, tblastn and Protein BLAST. Although almost all original data in GenBank are DNA and RNA sequences, the database also contains translated protein sequences.

[! Protein vs DNA] If we have a protein and are interested in similar proteins, Protein BLAST is the most convenient and most sensitive option, because evolutionary pressure acts on the protein sequence.

Question 1

Select Protein BLAST and use it to search with the human protein sequence below:

>NP_061820.1
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE

The results from BLAST are always organized as a list, with the most significant similarities at the top and the less similar sequences at the bottom. The matching database sequences are also called “hits”.

There are four tabs (Descriptions, Graphic Summary, Alignments and Taxonomy) with different kinds of information about the same results. Each result in the Descriptions tab has nine columns of information; see the table below:

Column	Short description
Description	The description line from the database.
Scientific Name	Latin name of the organism that the sequence belongs to.
Max score	The alignment score of the best match (local alignment) between the query and the database hit.
Total score	The sum of alignment scores for all matches (alignments) between the query and the database hit (if there is only one match per hit, these two scores are identical).
Query cover	The percentage of the query sequence that is covered by the alignment(s).
E value	The Expect value calculated from the Max score (i.e. the number of hits with that score or better that you would expect to find by chance).
Per. Ident	The percent identity in the alignment(s).
Acc. Len.	Length of the sequence that produced the result.
Accession	The accession number of the database hit.
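The relation between Max score, Total score and Query cover can be sketched in Python for a hit that has more than one local alignment. The function name, the tuple layout and all numbers below are hypothetical and chosen only for illustration; the sketch follows the column descriptions in the table and is not a description of NCBI's exact implementation.

```python
from typing import Dict, List, Tuple

# One local alignment: (score, query_start, query_end), 1-based inclusive.
Alignment = Tuple[float, int, int]


def summarize_hit(alignments: List[Alignment], query_length: int) -> Dict[str, float]:
    """Derive Max score, Total score and Query cover for one database hit."""
    max_score = max(score for score, _, _ in alignments)
    total_score = sum(score for score, _, _ in alignments)

    # Query cover counts each query position once, even if alignments overlap.
    covered = 0
    current_end = 0
    for _, start, end in sorted(alignments, key=lambda item: item[1]):
        start = max(start, current_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            current_end = end

    return {
        "max_score": max_score,
        "total_score": total_score,
        "query_cover": 100.0 * covered / query_length,
    }


# Hypothetical hit with two overlapping alignments on a 100-residue query.
summary = summarize_hit([(100.0, 1, 50), (40.0, 41, 80)], query_length=100)
print(summary)
```

With a single alignment per hit, Max score and Total score come out equal, as noted in the table.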

The Description might tell us what the similar gene is called, in this case “cytochrome c”. Cytochrome c is a protein in the electron transport chain (see below).

What is the Expect or E value?

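The E value is the number of alignments with a given score or better that we would expect to find purely by chance. If we search a database of a certain size that consists of only random sequences, we would still find alignments that arise by chance alone. For a given score, the number of such chance alignments increases with the size of the database and with the length of the query. In practice, a long query that matches along its whole length reaches a much higher score, so its E value is far lower than that of a short match.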
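The dependence on database size can be illustrated with the standard relation between the E value and the normalized bit score S', E = m · n · 2^(-S'), where m is the query length and n the total length of the database. The Python sketch below uses raw lengths and made-up numbers as simplifying assumptions; BLAST itself applies corrected "effective" lengths, so the values it reports will differ.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance alignments scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers chosen only for illustration.
small_db = expect_value(bit_score=40.0, query_length=105, database_length=10**8)
large_db = expect_value(bit_score=40.0, query_length=105, database_length=10**9)
print(f"E in the smaller database:    {small_db:.2e}")
print(f"E in the 10x larger database: {large_db:.2e}")
```

For a fixed bit score, a ten times larger database gives a ten times larger E value, and each additional bit of score halves it.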

An analogy can be made with the so-called “Bible code”, where words have been extracted from the text of the Bible by, for example, taking every 50th letter (below).

Statisticians have shown that if a text is sufficiently long (like the Bible or some other long text), short words or phrases will appear by chance. This is why the E value is important for judging the significance of an alignment.
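The same reasoning can be put into numbers with a short Python sketch that estimates how often a specific word is expected to appear by chance in a random text. It assumes a 26-letter alphabet in which every letter is equally likely and independent of its neighbours, which is a strong simplification of real text; the function name and the example lengths are hypothetical.

```python
def expected_chance_hits(word_length: int, text_length: int, alphabet_size: int = 26) -> float:
    """Expected number of exact occurrences of one given word in a random text."""
    positions = max(text_length - word_length + 1, 0)  # possible start positions
    return positions * alphabet_size ** (-word_length)


# A specific 4-letter word in random texts of increasing length.
for length in (10_000, 1_000_000, 100_000_000):
    print(f"text of {length:>11,} letters: {expected_chance_hits(4, length):.2f} expected hits")
```

The expected count grows with the length of the text and shrinks quickly as the word gets longer, which mirrors how the E value behaves with database size and alignment score.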

Question 1 The human cytochrome c protein sequence is identical to the one in another species. Which species is this?

Question 2 Now go back, change the organism filter to the yeast Saccharomyces cerevisiae (taxid:4932) and redo the analysis. How similar is the yeast sequence to the human sequence? What is the percent identity between the sequences? What is your conclusion: are the human and yeast sequences homologs?

Interpretation of BLAST results

If you do not find any highly similar results, you may conclude that the type of protein (or protein family) that the sequence represents does not exist in the analyzed organism. This kind of negative conclusion is of course more robust if more proteins from the same family are tested.

Question 3 The protein NP_127463 is the human caspase-9, a protein involved in apoptosis (programmed cell death). Your task is to find out whether there are similar protein sequences in Saccharomyces cerevisiae.

Tip

You can enter the accession number NP_127463 in the Enter Query Sequence window on the BLAST page.

Make a Protein BLAST search with filtering for Saccharomyces cerevisiae (taxid:4932) as shown below.

Question 4

APAF-1 (Apoptotic Protease Activating Factor 1) is a human protein that is also involved in apoptosis (programmed cell death). We would like to see if there might be homologs of this protein in Saccharomyces cerevisiae.

The accession number of the protein sequence is O14727.

Make the same kind of BLAST analysis as before, filtering for Saccharomyces cerevisiae (taxid:4932).

Is the E value of the best hit for APAF-1 among the Saccharomyces cerevisiae sequences lower or higher than the limit listed in the first section above? Do we have a probable APAF-1 homolog in the Saccharomyces cerevisiae genome? Are the proteins probable orthologs or not?

TP10 - MetabolicEngineeringGroupCBMA/MetabolicEngineeringGroupCBMA.github.io GitHub Wiki
