Multiple Sequence Alignment - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

Module 2.2: Multiple Sequence Alignment (in R)

Prerequisites: Pairwise Alignment
Next: Profile HMMs & Motif Finding


1. Purpose

Perform a multiple sequence alignment (MSA) of three related protein sequences in R, using Bioconductor’s msa package (which wraps Clustal Omega, MUSCLE and ClustalW). We will:

  1. Read an input FASTA (msa.fasta) containing three sequences.
  2. Align them with Clustal Omega.
  3. Inspect the alignment object and derive a consensus sequence.
  4. Save the aligned FASTA for downstream use.

2. Dependencies & Installation

Run this once in an R console (or the first notebook cell):

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install(c("msa","Biostrings"))

  • msa: performs MSA (ClustalW/Ω, MUSCLE)
  • Biostrings: handles FASTA I/O and sequence object

3. Full R Code

# Create `msa.fasta` with Three Protein Sequences

Copy this R code into RStudio (or any R console) to generate a FASTA file named **`msa.fasta`** in your current working directory. You can also paste it directly into your GitHub Wiki (in a fenced R code block).

```r
# 1) Define three protein sequences as a named character vector
seqs <- c(
  Seq1 = "MVLSPADKTNVKAAWGKLG",
  Seq2 = "MVLSVADKTNVKAAWGKVG",
  Seq3 = "AVLSPADKTNVKAGWGKVG"
)

# 2) Convert to FASTA-format lines
fasta_lines <- unlist(
  lapply(names(seqs), function(name) {
    c(paste0(">", name), seqs[name])
  })
)

# 3) Write the lines out to 'msa.fasta'
writeLines(fasta_lines, "msa.fasta")

# 4) (Optional) Print the contents to verify
cat(readLines("msa.fasta"), sep = "\n")

Output:

image

library(msa)
library(Biostrings)

# 3.1 Read your input FASTA
# Ensure 'msa.fasta' lives in your working directory:
# >Seq1
# MVLSPADKTNVKAAWGKLG
# >Seq2
# MVLSVADKTNVKAAWGKVG
# >Seq3
# AVLSPADKTNVKAGWGKVG
seqs <- readAAStringSet("msa.fasta")

# 3.2 Perform the alignment with Clustal Omega
alignment <- msa(seqs, method="ClustalOmega", type="protein")

# 3.3 Inspect the alignment summary
alignment

Output:

image

5. Interpretation

  • Gaps (-) represent insertions/deletions needed to maximize column‐wise similarity.
  • Conserved positions (e.g., SLPADKTNVKAAW) highlight the core motif across all three sequences.
  • The consensus string picks the majority residue at each column (with threshold = 0.5).
  • Exporting msa_aligned.fasta lets you feed the alignment into profile‐HMM builders or phylogenetic tools.

End of Module 2.2 (R version)