Pairwise Alignment - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
Pairwise Alignment
Module 2.1: Pairwise Alignment
Prerequisites: Intro to Biopython & Bioconductor
Next: Multiple Sequence Alignment
1. Concept & Motivation
Pairwise alignment finds the optimal way to “line up” two biological sequences to reveal regions of similarity, which can imply functional, structural, or evolutionary relationships. There are two main types:
- Global alignment (Needleman–Wunsch): Aligns sequences end-to-end, best for sequences of similar length.
- Local alignment (Smith–Waterman): Finds the best matching subsequences, ideal for identifying conserved domains or motifs within larger sequences.
2. Data Description
We’ll work with two synthetic DNA sequences:
Sequence 1: ACCGTGGAT
Sequence 2: ACGTCGAT
These share substantial similarity but differ by insertions, deletions, and mismatches—perfect for demonstrating both global and local alignment.
3. Hands-On Code
Tip: Make sure Biopython is installed (pip install biopython
), then run each block in your Jupyter notebook.
3.1 Import & Define Sequences
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
seq1 = "ACCGTGGAT"
seq2 = "ACGTCGAT"
print("Seq1:", seq1)
print("Seq2:", seq2)
Output:
Seq1: ACCGTGGAT
Seq2: ACGTCGAT
3.2 Global Alignment (Needleman–Wunsch)
# match=2, mismatch=-1, gap open=-0.5, gap extend=-0.1
global_aligns = pairwise2.align.globalms(seq1, seq2, 2, -1, -0.5, -0.1)
# Show top 2 alignments
for aln in global_aligns[:2]:
print(format_alignment(*aln))
Output:
3.3 Local Alignment (Smith–Waterman)
# same scoring parameters, but local alignment
local_aligns = pairwise2.align.localms(seq1, seq2, 2, -1, -0.5, -0.1)
# Show top 2 local alignments
for aln in local_aligns[:2]:
print(format_alignment(*aln))
Output:
3.4 Summarize Alignment Scores
import pandas as pd
# Build a summary table of the best global/local scores
data = [
{"Type": "Global", "Score": global_aligns[0].score},
{"Type": "Local", "Score": local_aligns[0].score}
]
df_scores = pd.DataFrame(data)
df_scores
Output:
4. Interpretation & Discussion
- Global alignment will align the entire length of both sequences, introducing gaps to maximize overall similarity.
- Local alignment finds the highest-scoring matching region; extra flanking regions are omitted.
- Interpretation of scores: Higher positive scores indicate better similarity under your chosen scoring scheme.
- Parameter effects: Changing mismatch penalties or gap costs will alter the alignments—steeper gap-open penalties discourage gaps, for example.
5. Exercises
Percent Identity
Compute the percentage of identical positions in a global alignment:
aln = global_aligns[0]
ident = sum(a==b for a,b in zip(aln.seqA, aln.seqB))
pid = ident / len(aln.seqA) * 100
print(f"Percent identity: {pid:.1f}%")
Output:
Percent identity: 70.0%
Protein Alignment
Translate your DNA sequences to protein and run a protein‐level global alignment using BLOSUM62:
from Bio import SeqIO, pairwise2
from Bio.Seq import Seq
from Bio.Align import substitution_matrices
# 0) Read in your toy.fasta
records = list(SeqIO.parse("toy.fasta", "fasta"))
# pull out raw sequence strings for your two records:
seq1 = str(records[0].seq)
seq2 = str(records[1].seq)
# 1) Translate to protein, stopping at first stop codon
prot1 = Seq(seq1).translate(to_stop=True)
prot2 = Seq(seq2).translate(to_stop=True)
# 2) Load BLOSUM62
blosum62 = substitution_matrices.load("BLOSUM62")
# 3) Align with gap penalties
aligns = pairwise2.align.globalds(prot1, prot2, blosum62, -10, -0.5)
# 4) Print best alignment
print(pairwise2.format_alignment(*aligns[0]))
Output:
--MRT
...
RSIDR
Score=-12.5
Parameter Tuning
- Experiment with different gap‐open and gap‐extend penalties (e.g.
-2, -0.1
vs-0.5, -0.05
) and observe effects on alignment and score.
Real Sequences
- Fetch two homologous gene sequences from NCBI Entrez (via Biopython’s
Entrez.efetch
) and align them globally and locally.
End of Module 2.1