Alignment - Pas-Kapli/CoME-Tutorials GitHub Wiki

Background

Identifying homologous residues across genes entails aligning the genes, through the addition of gaps within the sequences so that, in the final multiple sequence alignment, the residues in each column of the alignment should have descended from the same ancestral residue.

The most commonly used methods adopt the progressive approach, including Muscle, Clustal and MAFFT. These methods first make a rough estimate of how similar each pair of sequences is and use this information to produce an approximate guide tree of relationships between sequences. They then build up the alignment by first aligning the most similar pair of sequences and progressively adding more distantly related sequences, according to the guide tree, to this fixed alignment. The major drawback of this approach is that the alignment errors early on propagate throughout the alignment of all sequences. To address this problem most of these aligners include refinement steps, in which they improve the guide tree and subsequently the alignment.

In this tutorial we will use mafft that has been shown across several independent studies to perform consistently better.

Alignment with mafft

make sure you are in the "alignments" subfolder. There are three major "accuracy-oriented" algorithms for aligning with Mafft:

L-INS-i (Large - Iterative Refinement method with local pairwise alignment): Combines the local pairwise alignment method with iterative refinement and is recommended for <200 sequences. L-INS-i can align a set of sequences containing sequences flanking around one alignable domain like in the alignment below where 'X's indicate alignable residues, 'o's indicate unalignable residues and '-'s indicate gaps. In several independent studies comparing different alignment methods, mafft L-INS-i has been shown to be the most accurate in variable conditions.

ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------

E-INS-i (Express - Iterative Refinement method with local pairwise alignment). This approach is suitable for sequences containing large unalignable regions in the middle of the sequences as shown in the alignment below. It is recommended for <200 sequences.

oooooooooXXX------XXXXooooooooooo---------------oooooooXXXXXooooooooooooooooo--oooooooooooooooo
---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-ooooooooooooooooooooooo-----------
oooooooo-XXXXX----XXXX---------------------------------XXXXX---oooooooooo--oooooooooooo--------
---------XXXXXX---XXXX---------------------------------XXXXX-----------------------------------
---------XXXXXXXXXXXXX---------------------------------XXXXX-----------------------------------
---------XX-------XXXX---------------------------------XXXXX-----------------------------------

G-INS-i (Global - Iterative Refinement method with global pairwise alignment)**: It assumes that the entire regions can be aligned and tries to align them globally using the Needleman-Wunsch algorithm. It is suitable for sequences of similar lengths as shown below.

XXXXXXXXXXX-XXXXXXXXXXXXXXX
XX-XXXXXXXXXXXXXXX-XXXXXXXX
XXXXX----XXXXXXXX---XXXXXXX
XXXXX-XXXXXXXXXX----XXXXXXX
XXXXXXXXXXXXXXXX----XXXXXXX

For more details check here and here.

We will use the L-INS-i algorithm for our alignments:

# Make sure you are in the "data" folder 
pwd
# This should return:
~/lizard-exercise/data

# If you are in the right folder run mafft

mafft --localpair --maxiterate 1000 --preservecase locus_1.fasta > locus_1.fasta.mafft

# with "--preservecase" we maintain the upper lower case of the sequences, if this is not defined mafft will convert all sequences to lower case.

Now let's run mafft for all our loci - this should take about 2-5 minutes:

for i in *fasta; do mafft --localpair --maxiterate 1000 --preservecase $i > $i.mafft; done
mv *mafft ../alignments
cd ../alignments

# Download the alignments if you prefer:
# wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/refs/heads/main/tutorial1/alignments.tar.gz
# tar -xvzf alignments.tar.gz
# rm alignments.tar.gz

Use alan to display one of the alignments in the terminal:

alan locus_1.fasta.mafft

You will now notice that the sequences are now aligned!

Alignment Filtering

In phylogenetics, we are interested in accurately reconstructing orthologous sequences and often aim to avoid misaligned areas that might assume false site homology that might affect phylogenetic inference. A common strategy is to filter ambiguously aligned regions. Filtering can be based on ad hoc criteria regarding alignment quality such as gappyness and sequence similarity or by retaining only the alignment positions that are robust to changes in alignment parameters. Reports on the impact of alignment trimming on the quality of downstream phylogenetic analysis vary (e.g., Tan et al., 2015, Talavera and Castresana, 2007), and hence trimming using automated pipelines should be used cautiously.

We will use a custom Python script for a mild trimming approach that removes highly divergent sequences (average pairwise differences >40%) and filters out columns and alignments with excessive missing data (>50% missing in individual sites, columns, or entire alignments), keeping only alignments with at least 100 non-missing positions.

Go to the "filtered_alignments" folder and download the filtering.py

cd ../filtered_alignments
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/refs/heads/main/tutorial1/filtering.py
chmod +x filtering.py

For filtering a single alignment:

./filtering.py ../alignments/locus_10.fasta.mafft test.fasta.filtered --diff_threshold 0.4 --missing_threshold 0.5 --alignment_missing_threshold 0.5 --column_missing_threshold 0.5 --format fasta --min_alignment_length 100

rm test.fasta.filtered

For filtering all of the alignments with a for loop:

for i in ../alignments/*mafft; do ./filtering.py $i $(basename "$i").filtered --diff_threshold 0.4 --missing_threshold 0.5 --alignment_missing_threshold 0.5 --column_missing_threshold 0.5 --format fasta --min_alignment_length 100; done

mv filtering.py ../

Q: Where any loci or sequences removed?

Next, Gene Tree reconstruction with Maximum Likelihood

Dataset | Alignment | Gene-Tree ML Inference