Sequence alignment - bpp/bpp-tutorial-geneflow GitHub Wiki

Identifying homologous residues across genes entails aligning the genes, through the addition of gaps within the sequences so that, in the final multiple sequence alignment, the residues in each column of the alignment should have descended from the same ancestral residue.

The most commonly used methods adopt the progressive approach, including Muscle, Clustal and MAFFT. These methods first make a rough estimate of how similar each pair of sequences is and use this information to produce an approximate guide tree of relationships between sequences. They then build up the alignment by first aligning the most similar pair of sequences and progressively adding more distantly related sequences, according to the guide tree, to this fixed alignment. The major drawback of this approach is that the alignment errors early on propagate throughout the alignment of all sequences. To address this problem most of these aligners include refinement steps, in which they improve the guide tree and subsequently the alignment.

In this tutorial we will use MAFFT that has been shown across several independent studies to perform consistently better.

Alignment with MAFFT

There are three major "accuracy-oriented" algorithms for aligning with MAFFT:

L-INS-i (Large - Iterative Refinement method with local pairwise alignment): Combines the local pairwise alignment method with iterative refinement and is recommended for <200 sequences. L-INS-i can align a set of sequences containing sequences flanking around one alignable domain like in the alignment below where 'X's indicate alignable residues, 'o's indicate unalignable residues and '-'s indicate gaps. In several independent studies comparing different alignment methods, MAFFT L-INS-i has been shown to be the most accurate in variable conditions.

ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------

E-INS-i (Express - Iterative Refinement method with local pairwise alignment). This approach is suitable for sequences containing large unalignable regions in the middle of the sequences as shown in the alignment below. It is recommended for <200 sequences.

oooooooooXXX------XXXXooooooooooo---------------oooooooXXXXXooooooooooooooooo--oooooooooooooooo
---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-ooooooooooooooooooooooo-----------
oooooooo-XXXXX----XXXX---------------------------------XXXXX---oooooooooo--oooooooooooo--------
---------XXXXXX---XXXX---------------------------------XXXXX-----------------------------------
---------XXXXXXXXXXXXX---------------------------------XXXXX-----------------------------------
---------XX-------XXXX---------------------------------XXXXX-----------------------------------

G-INS-i (Global - Iterative Refinement method with global pairwise alignment)**: It assumes that the entire regions can be aligned and tries to align them globally using the Needleman-Wunsch algorithm. It is suitable for sequences of similar lengths as shown below.

XXXXXXXXXXX-XXXXXXXXXXXXXXX
XX-XXXXXXXXXXXXXXX-XXXXXXXX
XXXXX----XXXXXXXX---XXXXXXX
XXXXX-XXXXXXXXXX----XXXXXXX
XXXXXXXXXXXXXXXX----XXXXXXX

For more details check here and here.

We will use the L-INS-i algorithm for our alignments:

Go to the lizard-exercise/data folder where the sequence alignments (fasta) are. Then run MAFFT on the first locus:

mafft --localpair --maxiterate 1000 --preservecase locus_1.fasta > locus_1.fasta.mafft

With --preservecase, we maintain the upper lower case of the sequences, if this is not defined mafft will convert all sequences to lower case.

Now let's run mafft for all our loci - this should take about 2-5 minutes:

for f in *.fasta; do mafft --localpair --maxiterate 1000 --preservecase $f > $f.mafft; done
mv *mafft ../alignments
cd ../alignments

# Download the alignments if you prefer:
# wget -cO - https://github.com/bpp/bpp-tutorial-geneflow/raw/main/first-day/alignments.tar.gz?raw=true > alignments.tar.gz
# tar -xvzf alignments.tar.gz
# rm alignments.tar.gz

Use alan to display one of the alignments in the terminal:

alan locus_1.fasta.mafft

You will now notice that the sequences are now aligned!

Alignment Filtering

In phylogenetics, we are interested in accurately reconstructing orthologous sequences and often aim to avoid misaligned areas that might assume false site homology that might affect phylogenetic inference. A common strategy is to filter ambiguously aligned regions. Filtering can be based on ad hoc criteria regarding alignment quality such as gappyness and sequence similarity or by retaining only the alignment positions that are robust to changes in alignment parameters. Reports on the impact of alignment trimming on the quality of downstream phylogenetic analysis vary (e.g., Tan et al., 2015, Talavera and Castresana, 2007), and hence trimming using automated pipelines should be used cautiously.

We will use a custom python script for a very mild trimming approach that aims to remove extremely divergent sequences (>40%) and loci with mostly missing data (>50%, individual sequences and entire alignments).

Go to the filtered_alignments folder and download the filtering.py script:

cd ../filtered_alignments
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/refs/heads/main/first-day/filtering.py
chmod u+x filtering.py

Let's do a test run on a single alignment:

./filtering.py ../alignments/locus_10.fasta.mafft test.fasta.filtered --diff_threshold 0.4 --missing_threshold 0.5 --alignment_missing_threshold 0.5

This creates a filtered alignment. Have a look and see if it makes sense. Then remove this test file:

rm test.fasta.filter

To see available options, type:

./filtering.py --help

To filter all of the alignments, run:

for i in ../alignments/*mafft; do ./filtering.py $i $(basename "$i").filtered --diff_threshold 0.4 --missing_threshold 0.5 --alignment_missing_threshold 0.5; done

mv filtering.py ../

Q: How many loci and sequences were removed?

Next, Inferring maximum likelihood gene trees

Previous, Dataset