Concatenation‐Partitioned Maximum Likelihood inference - Pas-Kapli/CoME-Tutorials GitHub Wiki

Concatenation of filtered data

Go back to your lizard-exercise/phylo folder from yesterday.

Using IQTree2 it is possible to concatenate multiple independent alignments into a super-alignment.

mkdir concatenation
cd concatenation
iqtree2 -p ../../filtered_alignments/ --out-aln concatenated.fas
ls
concatenated.fas  concatenated.fas.nex  concatenated.fas.partitions

This command stitches together all the filtered alignments found in the filtered_alignments folder into a single concatenated alignment (concatenated.fas). Alongside it, IQ-TREE writes the partition boundaries in two formats: NEXUS (concatenated.fas.nex) and RAxML-style (concatenated.fas.partitions). We will continue with the concatenated.fas file.
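A quick sanity check on a concatenated alignment is to confirm that every taxon has a sequence of the same length. The sketch below does this with awk on a tiny, invented FASTA file (toy.fas is a hypothetical stand-in; point the awk command at concatenated.fas to check the real output, assuming it is in FASTA format).

```shell
# Toy stand-in for concatenated.fas (invented sequences, for illustration only)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fas" <<'EOF'
>taxon_A
ACGTACGT
ACGT
>taxon_B
ACGTAC--
ACGT
>taxon_C
NNNNACGT
AC-T
EOF

# Print: <number of sequences> <number of distinct sequence lengths> <a length>
summary=$(awk '
    /^>/ { if (name != "") lens[len]++     # store the length of the previous record
           name = $0; len = 0; n++; next }
         { len += length($0) }             # sequences may span several lines
    END  { if (name != "") lens[len]++
           k = 0; for (l in lens) { k++; last = l }
           print n, k, last }
' "$tmpdir/toy.fas")
echo "$summary"

rm -rf "$tmpdir"
```

If the second number is larger than 1, the sequences do not all have the same length and the file is not a valid alignment.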

Partition file

The command also creates a partition file, concatenated.fas.partitions, that lists the loci in the order they were concatenated, together with the start and end position of each locus in the super-alignment. Note below that the names of the partitions correspond to the names of the fasta files used.

DNA, locus_1.fasta.mafft.filter = 1-1431
DNA, locus_10.fasta.mafft.filter = 1432-2277
DNA, locus_11.fasta.mafft.filter = 2278-3381
DNA, locus_12.fasta.mafft.filter = 3382-4203
DNA, locus_13.fasta.mafft.filter = 4204-6393
DNA, locus_15.fasta.mafft.filter = 6394-6990
DNA, locus_17.fasta.mafft.filter = 6991-9234
DNA, locus_18.fasta.mafft.filter = 9235-10068
DNA, locus_19.fasta.mafft.filter = 10069-10887
DNA, locus_20.fasta.mafft.filter = 10888-12333
...
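The partition listing can be checked mechanically: each locus should start exactly one position after the previous one ends. The following sketch, which assumes the "TYPE, name = start-end" layout shown above, prints the length of each partition and flags any gap or overlap. It runs on a copy of the first three lines (toy.partitions is a hypothetical file name); replace it with concatenated.fas.partitions for the full file.

```shell
tmpdir=$(mktemp -d)
# First three lines of the partition listing shown above
cat > "$tmpdir/toy.partitions" <<'EOF'
DNA, locus_1.fasta.mafft.filter = 1-1431
DNA, locus_10.fasta.mafft.filter = 1432-2277
DNA, locus_11.fasta.mafft.filter = 2278-3381
EOF

report=$(awk '
    NF == 0 { next }                        # skip empty lines
    {
        split($0, side, "=")                # side[1]: "DNA, name ", side[2]: " start-end"
        split(side[2], range, "-")
        first = range[1] + 0
        last  = range[2] + 0
        name = side[1]
        sub(/^[^,]*, */, "", name)          # drop the data type prefix
        sub(/ *$/, "", name)                # drop trailing spaces
        status = (first == prev + 1) ? "ok" : "GAP_OR_OVERLAP"
        print name, last - first + 1, status
        prev = last
    }
' "$tmpdir/toy.partitions")
echo "$report"

rm -rf "$tmpdir"
```

For the three loci above this reports lengths of 1431, 846 and 1104 sites, all contiguous.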

ML inference of concatenated alignment

There are three common ways to estimate branch lengths across the partitions of a concatenated alignment (sometimes called branch linkage modes):

linked (-q in IQ-TREE 2): all partitions share a common set of (global) branch lengths. This is the simplest model with the fewest parameters (#branches). However, it is often considered too unrealistic, since it is known that genes (or genome regions) evolve at different speeds.

scaled (proportional; -p in IQ-TREE 2, formerly -spp): a global set of branch lengths is estimated as in linked mode, but each partition has an individual scaling factor; per-partition branch lengths are obtained by multiplying the global branch lengths by these individual scalers. This approach is a compromise that allows modeling distinct evolutionary rates across partitions while introducing only a moderate number of free parameters (#branches + #partitions).

unlinked (-Q in IQ-TREE 2, formerly -sp): each partition has its own, independent set of branch lengths. This model allows for the highest flexibility, but it also introduces a huge number of free parameters (#branches * #partitions), which makes it prone to overfitting.

For closely related taxa it might make sense to use the linked option, while for more divergent taxa the proportional or unlinked options would be more sensible. The default in raxml-ng is the proportional (scaled) mode, which works well in most cases and is the most sensible choice for the example dataset.
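To get a feel for how quickly the number of branch-length parameters grows, the counts given above can be worked out with shell arithmetic. This is only a sketch: the numbers of taxa and partitions are hypothetical placeholders, not the values of the example dataset, and it relies on the fact that an unrooted, fully resolved tree with n taxa has 2n - 3 branches.

```shell
n_taxa=20          # hypothetical value, replace with your number of taxa
n_partitions=30    # hypothetical value, replace with your number of loci

n_branches=$(( 2 * n_taxa - 3 ))            # unrooted binary tree

linked=$n_branches                          # one shared set of branch lengths
scaled=$(( n_branches + n_partitions ))     # shared set plus one scaler per partition
unlinked=$(( n_branches * n_partitions ))   # one full set per partition

echo "linked:   $linked"
echo "scaled:   $scaled"
echo "unlinked: $unlinked"
```

With these placeholder values the unlinked mode needs 1110 branch-length parameters, compared with 67 for the scaled mode and 37 for the linked mode.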

# IQtree command for running under the "GTR+G" model (~25 minutes single thread)
iqtree2 -s concatenated.fas -spp concatenated.fas.partitions -m GTR+G -pre iqtree-T5 -b 100

# Or download the resulting files: 
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/refs/heads/main/tutorial2/concat-iqtree-gtr.tar.gz
tar -xvzf concat-iqtree-gtr.tar.gz

# IQtree command for running under the best fitting model per partition 
iqtree2 -s concatenated.fas -spp concatenated.fas.partitions -pre iqtree-T6 -b 100 -m TEST

# Or download the resulting files: 
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/refs/heads/main/tutorial2/concat-iqtree-model-sel.tar.gz
tar -xvzf concat-iqtree-model-sel.tar.gz

# RAxML-ng command:
raxml-ng --all --msa concatenated.fas --model concatenated.fas.partitions --bs-trees 100 --brlen scaled --prefix T5

Q: Check the average bootstrap support. Did it improve compared to the individual gene trees?
Q: Open the tree with FigTree or SeaView. Did all the support values improve?
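One possible way to compute the average bootstrap support is to pull the internal node labels out of the Newick string and average them. The sketch below assumes that support values are written as plain numbers between a closing parenthesis and the following colon, e.g. ")95:0.05". The tree is an invented toy example (toy.treefile is a hypothetical name); substitute your own tree file, for example iqtree-T5.treefile.

```shell
tmpdir=$(mktemp -d)
# Invented five-taxon tree with two supported internal branches (95 and 70)
echo '((A:0.1,B:0.2)95:0.05,(C:0.1,D:0.1)70:0.02,E:0.3);' > "$tmpdir/toy.treefile"

avg=$(grep -oE '\)[0-9.]+:' "$tmpdir/toy.treefile" |   # e.g. ")95:"
      tr -d '):' |                                      # keep only the number
      awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }')
echo "mean support: $avg"

rm -rf "$tmpdir"
```

The same pipeline can be run on each gene tree to compare against the concatenated tree.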

Compare your inferred trees to the published topology

# Download the published tree and compute its Robinson-Foulds distance to the concatenated tree:
cd ../phylo
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/refs/heads/main/tutorial2/figa-topology.tre
iqtree2 -rf figa-topology.tre concat-iqtree-gtr/iqtree-T5.treefile

Gene concordance factor (gCF)

Concordance analysis in IQ-TREE quantifies, for each branch of a reference tree, the percentage of gene trees that contain that branch (the gene concordance factor, gCF). To perform concordance analysis, you need a set of pre-computed gene trees (in Newick format) and a reference topology, in our case the topology inferred from the concatenated alignment.

mkdir concordance
cd concordance
cat ../all-GTR/*treefile > gene_trees-gtr.nwk
iqtree2 -t ../concat-iqtree-gtr/iqtree-T5.contree --gcf gene_trees-gtr.nwk --prefix concord
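Once the analysis has finished, the per-branch values can be screened for poorly supported branches. The sketch below assumes that the statistics file (expected here to be concord.cf.stat) is tab-separated, has comment lines starting with "#", and has a header row containing a column named gCF; check your own output before relying on this. The file contents, the 50% cutoff and the name toy.cf.stat are all invented for illustration.

```shell
tmpdir=$(mktemp -d)
# Toy stand-in for concord.cf.stat; layout and values are assumptions
{
  printf '# Concordance factor statistics (toy example)\n'
  printf 'ID\tgCF\tgCF_N\tgN\tLabel\n'
  printf '7\t90.0\t9\t10\t100\n'
  printf '8\t40.0\t4\t10\t85\n'
  printf '9\t20.0\t2\t10\t60\n'
} > "$tmpdir/toy.cf.stat"

# List branch ID and gCF for branches found in fewer than 50% of the gene trees
low_gcf=$(awk -F'\t' -v cutoff=50 '
    /^#/ { next }                                        # skip comment lines
    !col { for (i = 1; i <= NF; i++) if ($i == "gCF") col = i; next }  # header row
    $col + 0 < cutoff { print $1, $col }
' "$tmpdir/toy.cf.stat")
echo "$low_gcf"

rm -rf "$tmpdir"
```

Looking up the gCF column by name rather than by position keeps the sketch working if the column order differs from the one assumed here.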

Next: Species-Tree inference with Astral