Multiple Sequence Alignment - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
Module 2.2: Multiple Sequence Alignment (in R)
Prerequisites: Pairwise Alignment
Next: Profile HMMs & Motif Finding
1. Purpose
Perform a multiple sequence alignment (MSA) of three related protein sequences in R, using Bioconductor’s msa package (which wraps Clustal Omega, MUSCLE and ClustalW). We will:
- Read an input FASTA (
msa.fasta
) containing three sequences. - Align them with Clustal Omega.
- Inspect the alignment object and derive a consensus sequence.
- Save the aligned FASTA for downstream use.
2. Dependencies & Installation
Run this once in an R console (or the first notebook cell):
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install(c("msa","Biostrings"))
- msa: performs MSA (ClustalW/Ω, MUSCLE)
- Biostrings: handles FASTA I/O and sequence object
3. Full R Code
# Create `msa.fasta` with Three Protein Sequences
Copy this R code into RStudio (or any R console) to generate a FASTA file named **`msa.fasta`** in your current working directory. You can also paste it directly into your GitHub Wiki (in a fenced R code block).
```r
# 1) Define three protein sequences as a named character vector
seqs <- c(
Seq1 = "MVLSPADKTNVKAAWGKLG",
Seq2 = "MVLSVADKTNVKAAWGKVG",
Seq3 = "AVLSPADKTNVKAGWGKVG"
)
# 2) Convert to FASTA-format lines
fasta_lines <- unlist(
lapply(names(seqs), function(name) {
c(paste0(">", name), seqs[name])
})
)
# 3) Write the lines out to 'msa.fasta'
writeLines(fasta_lines, "msa.fasta")
# 4) (Optional) Print the contents to verify
cat(readLines("msa.fasta"), sep = "\n")
Output:
library(msa)
library(Biostrings)
# 3.1 Read your input FASTA
# Ensure 'msa.fasta' lives in your working directory:
# >Seq1
# MVLSPADKTNVKAAWGKLG
# >Seq2
# MVLSVADKTNVKAAWGKVG
# >Seq3
# AVLSPADKTNVKAGWGKVG
seqs <- readAAStringSet("msa.fasta")
# 3.2 Perform the alignment with Clustal Omega
alignment <- msa(seqs, method="ClustalOmega", type="protein")
# 3.3 Inspect the alignment summary
alignment
Output:
5. Interpretation
- Gaps (
-
) represent insertions/deletions needed to maximize column‐wise similarity. - Conserved positions (e.g.,
SLPADKTNVKAAW
) highlight the core motif across all three sequences. - The consensus string picks the majority residue at each column (with threshold = 0.5).
- Exporting
msa_aligned.fasta
lets you feed the alignment into profile‐HMM builders or phylogenetic tools.
End of Module 2.2 (R version)