Pairwise Alignment - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

Pairwise Alignment

Module 2.1: Pairwise Alignment
Prerequisites: Intro to Biopython & Bioconductor
Next: Multiple Sequence Alignment

1. Concept & Motivation

Pairwise alignment finds the optimal way to “line up” two biological sequences to reveal regions of similarity, which can imply functional, structural, or evolutionary relationships. There are two main types:

Global alignment (Needleman–Wunsch): Aligns sequences end-to-end, best for sequences of similar length.
Local alignment (Smith–Waterman): Finds the best matching subsequences, ideal for identifying conserved domains or motifs within larger sequences.

2. Data Description

We’ll work with two synthetic DNA sequences:

Sequence 1: ACCGTGGAT
Sequence 2: ACGTCGAT

These share substantial similarity but differ by insertions, deletions, and mismatches—perfect for demonstrating both global and local alignment.

3. Hands-On Code

Tip: Make sure Biopython is installed (pip install biopython), then run each block in your Jupyter notebook.

3.1 Import & Define Sequences

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

seq1 = "ACCGTGGAT"
seq2 = "ACGTCGAT"
print("Seq1:", seq1)
print("Seq2:", seq2)

Output:

Seq1: ACCGTGGAT
Seq2: ACGTCGAT

3.2 Global Alignment (Needleman–Wunsch)

# match=2, mismatch=-1, gap open=-0.5, gap extend=-0.1
global_aligns = pairwise2.align.globalms(seq1, seq2, 2, -1, -0.5, -0.1)

# Show top 2 alignments
for aln in global_aligns[:2]:
    print(format_alignment(*aln))

Output:

3.3 Local Alignment (Smith–Waterman)

# same scoring parameters, but local alignment
local_aligns = pairwise2.align.localms(seq1, seq2, 2, -1, -0.5, -0.1)

# Show top 2 local alignments
for aln in local_aligns[:2]:
    print(format_alignment(*aln))

Output:

3.4 Summarize Alignment Scores

import pandas as pd

# Build a summary table of the best global/local scores
data = [
    {"Type": "Global", "Score": global_aligns[0].score},
    {"Type": "Local",  "Score": local_aligns[0].score}
]
df_scores = pd.DataFrame(data)
df_scores

Output:

4. Interpretation & Discussion

Global alignment will align the entire length of both sequences, introducing gaps to maximize overall similarity.
Local alignment finds the highest-scoring matching region; extra flanking regions are omitted.
Interpretation of scores: Higher positive scores indicate better similarity under your chosen scoring scheme.
Parameter effects: Changing mismatch penalties or gap costs will alter the alignments—steeper gap-open penalties discourage gaps, for example.

5. Exercises

Percent Identity
Compute the percentage of identical positions in a global alignment:

aln = global_aligns[0]
ident = sum(a==b for a,b in zip(aln.seqA, aln.seqB))
pid = ident / len(aln.seqA) * 100
print(f"Percent identity: {pid:.1f}%")

Output:

Percent identity: 70.0%

Protein Alignment

Translate your DNA sequences to protein and run a protein‐level global alignment using BLOSUM62:

from Bio import SeqIO, pairwise2
from Bio.Seq import Seq
from Bio.Align import substitution_matrices

# 0) Read in your toy.fasta
records = list(SeqIO.parse("toy.fasta", "fasta"))
# pull out raw sequence strings for your two records:
seq1 = str(records[0].seq)
seq2 = str(records[1].seq)

# 1) Translate to protein, stopping at first stop codon
prot1 = Seq(seq1).translate(to_stop=True)
prot2 = Seq(seq2).translate(to_stop=True)

# 2) Load BLOSUM62
blosum62 = substitution_matrices.load("BLOSUM62")

# 3) Align with gap penalties
aligns = pairwise2.align.globalds(prot1, prot2, blosum62, -10, -0.5)

# 4) Print best alignment
print(pairwise2.format_alignment(*aligns[0]))

Output:

--MRT
  ...
RSIDR
  Score=-12.5

Parameter Tuning

Experiment with different gap‐open and gap‐extend penalties (e.g. -2, -0.1 vs -0.5, -0.05) and observe effects on alignment and score.

Real Sequences

Fetch two homologous gene sequences from NCBI Entrez (via Biopython’s Entrez.efetch) and align them globally and locally.

End of Module 2.1