Concatenation - bpp/bpp-tutorial-geneflow GitHub Wiki
We will infer a maximum-likelihood (ML) tree from a concatenated alignment of all loci.
Concatenated alignment
We can use IQ-TREE to concatenate multiple alignments into a single alignment. Make sure you are still in the phylo
folder.
cd ../../concatenation
iqtree2 -p ../filtered_alignments/ --out-aln concatenated.fas
Check the output files by typing ls
:
concatenated.fas concatenated.fas.nex concatenated.fas.partitions
This creates a concatenated alignment in fasta (concatenated.fas
) and nexus (concatenated.fas.nex
) formats by stitching together all the filtered alignments found in the filtered_alignments
folder. We will work with concatenated.fas
.
Partition file
The command also creates a partition file concatenated.fas.partitions
that provides a list of positions of individual loci.
DNA, locus_1.fasta.mafft.filter = 1-1431
DNA, locus_10.fasta.mafft.filter = 1432-2277
DNA, locus_11.fasta.mafft.filter = 2278-3381
DNA, locus_12.fasta.mafft.filter = 3382-4203
DNA, locus_13.fasta.mafft.filter = 4204-6393
DNA, locus_15.fasta.mafft.filter = 6394-6990
DNA, locus_17.fasta.mafft.filter = 6991-9234
DNA, locus_18.fasta.mafft.filter = 9235-10068
DNA, locus_19.fasta.mafft.filter = 10069-10887
DNA, locus_20.fasta.mafft.filter = 10888-12333
...
Concatenated ML tree
There are three common ways to estimate branch lengths of a partitioned alignment, sometimes called branch linkage modes:
linked -q
: all partitions share a common set of (global) branch lengths. This is the simplest model with the fewest parameters (#branches). However, it is often considered too unrealistic since it is known that different genes or genome regions can evolve at different speeds.
scaled (proportional) -p
: a global set of branch lengths is estimated like in linked mode, but each partition has an individual scaling factor; per-partition branch lengths are obtained by multiplying the global branch lengths with these individual scalers. This approach is a compromise that allows to model distinct evolutionary rates across partitions while introducing only a moderate number of free parameters (number of branches + number of partitions).
unlinked -Q
: each partition has its own independent set of branch lengths. This model allows for the highest flexibility, but it also introduces a huge number of free parameters (number of branches x number of partitions), which makes it prone to overfitting.
For closely related taxa, it maybe sensible to use the linked
option (-p
) while for more divergent taxa the proportional
and unlinked
can be more sensible. The default option in RAxML-NG is proportional
, which works well in most cases and would be the most sensible choice for the example dataset.
Here, we will run IQ-TREE in the proportional
model under the GTR+G model. This takes ~25 minutes on a single thread.
iqtree2 -s concatenated.fas -spp concatenated.fas.partitions -m GTR+G -pre iqtree-T5 -b 100
For a faster run using approximate bootstrap, use
iqtree2 -s concatenated.fas -spp concatenated.fas.partitions -m GTR+G -pre iqtree-T5 -B 1000
The resulting files:
wget https://github.com/bpp/bpp-tutorial-geneflow/raw/refs/heads/main/first-day/concat-iqtree-gtr.tar.gz
tar -xvzf concat-iqtree-gtr.tar.gz
Alternatively, we can use the best-fitting model for each partition:
iqtree2 -s concatenated.fas -spp concatenated.fas.partitions -pre iqtree-T6 -b 100 -m TEST
The resulting files:
wget https://github.com/bpp/bpp-tutorial-geneflow/raw/refs/heads/main/first-day/concat-iqtree-model-sel.tar.gz
tar -xvzf concat-iqtree-model-sel.tar.gz
RAxML-NG command:
raxml-ng --all --msa concatenated.fas --model concatenated.fas.partitions --bs-trees 100 --brlen scaled --prefix T5
For the linked and unlinked model, use --brlen linked
and --brlen unlinked
, respectively.
Q: Check the average bootstrap support, did it improve compared to the individual gene-trees?
Q: Open the tree with figtree or seaview, did all the support values improve?
Compare your inferred tree to the published tree
Download the published tree:
cd ../phylo
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/first-day/figa-topology.tre
Compute the RF distance:
iqtree2 -rf figa-topology.tre concat-iqtree-gtr/iqtree-T5.treefile
Gene concordance factor (gCF)
Concordance analysis in IQ-TREE quantifies the proportion of gene trees supporting a specific topology or clade on a reference tree. For this, we need a set of estimated gene trees in newick format and the reference topology, in our case the topology inferred based on the concatenated alignment.
mkdir concordance
cd concordance
cat ../../phylo/all-GTR/*treefile > gene_trees-gtr.nwk
iqtree2 -t ../concat-iqtree-gtr/iqtree-T5.contree --gcf gene_trees-gtr.nwk --prefix concord
Optional topics
Number of starting trees
--tree pars{N},rand{M}
option specifies the number of starting trees
Although the default settings is a reasonable choice for many practical cases, computational resources permitting, we might want to increase the number of starting trees to explore the tree space more thoroughly:
# RAxML-NG
raxml-ng --msa concatenated.fas --model partitions.txt --prefix T6 --tree pars{15},rand{15}
# IQ-TREE
iqtree -s concatenated.fas -spp partitions.txt -pre T6 -ntop 15
Conversely, we can perform a quick-and-dirty search from a single random starting tree using the --search1
command:
# RAxML-NG
raxml-ng --search1 --msa concatenated.fas --model partitions.txt --prefix T7
See the difference in final log-likelihood with the following command:
grep "Final LogLikelihood:" T*log
There are some differences but are they because of the topology?
--rfdist
computes topological Robinson-Foulds (RF) distances among the trees
We can check this by using the --rfdist
command to compute the topological RF distance among all trees:
cat *bestTree > mltrees
raxml-ng --rfdist --tree mltrees --prefix RF2
Use cat
to examine the .rfDistances
file.
Q: are there topological differences this time? what does this tell us?
Parallelization
By default, RAxML-NG will use a set of heuristics to determine the most efficient parallelization scheme given the specific dataset, analysis parameters and hardware configuration at hand. If multiple tree inferences are requested (e.g., with --search or --all command), then RAxML-NG will consider running multiple searches ("workers") in parallel (coarse-grained parallelization). If the alignment is long then it might be suitable to run with multiple threads (fine-grained parallelization).
It will automatically determine the "reasonable" number of threads based on the total number of tree searches, alignment length, available memory, and CPU cores.
Q Check in the concatenation.raxml.log file, how many threads did it decide to use in this case?
Try alternative parallelization schemes to test whether it will run faster.
Previous, Inferring maximum-likelihood gene trees