Concatenation - bpp/bpp-tutorial-geneflow GitHub Wiki

We will infer a maximum-likelihood (ML) tree from a concatenated alignment of all loci.

Concatenated alignment

We can use IQ-TREE to concatenate multiple alignments into a single alignment. Make sure you are still in the phylo folder.

cd ../../concatenation
iqtree2 -p ../filtered_alignments/ --out-aln concatenated.fas

Check the output files by typing ls:

concatenated.fas  concatenated.fas.nex  concatenated.fas.partitions

This creates a concatenated alignment in fasta (concatenated.fas) and nexus (concatenated.fas.nex) formats by stitching together all the filtered alignments found in the filtered_alignments folder. We will work with concatenated.fas.

Partition file

The command also creates a partition file concatenated.fas.partitions that provides a list of positions of individual loci.

DNA, locus_1.fasta.mafft.filter = 1-1431
DNA, locus_10.fasta.mafft.filter = 1432-2277
DNA, locus_11.fasta.mafft.filter = 2278-3381
DNA, locus_12.fasta.mafft.filter = 3382-4203
DNA, locus_13.fasta.mafft.filter = 4204-6393
DNA, locus_15.fasta.mafft.filter = 6394-6990
DNA, locus_17.fasta.mafft.filter = 6991-9234
DNA, locus_18.fasta.mafft.filter = 9235-10068
DNA, locus_19.fasta.mafft.filter = 10069-10887
DNA, locus_20.fasta.mafft.filter = 10888-12333
...

Concatenated ML tree

There are three common ways to estimate branch lengths of a partitioned alignment, sometimes called branch linkage modes:

linked -q: all partitions share a common set of (global) branch lengths. This is the simplest model with the fewest parameters (#branches). However, it is often considered too unrealistic since it is known that different genes or genome regions can evolve at different speeds.

scaled (proportional) -p: a global set of branch lengths is estimated like in linked mode, but each partition has an individual scaling factor; per-partition branch lengths are obtained by multiplying the global branch lengths with these individual scalers. This approach is a compromise that allows to model distinct evolutionary rates across partitions while introducing only a moderate number of free parameters (number of branches + number of partitions).

unlinked -Q: each partition has its own independent set of branch lengths. This model allows for the highest flexibility, but it also introduces a huge number of free parameters (number of branches x number of partitions), which makes it prone to overfitting.

For closely related taxa, it maybe sensible to use the linked option (-p) while for more divergent taxa the proportional and unlinked can be more sensible. The default option in RAxML-NG is proportional, which works well in most cases and would be the most sensible choice for the example dataset.

Here, we will run IQ-TREE in the proportional model under the GTR+G model. This takes ~25 minutes on a single thread.

iqtree2 -s concatenated.fas -spp concatenated.fas.partitions -m GTR+G -pre iqtree-T5 -b 100

For a faster run using approximate bootstrap, use

iqtree2 -s concatenated.fas -spp concatenated.fas.partitions -m GTR+G -pre iqtree-T5 -B 1000

The resulting files:

wget https://github.com/bpp/bpp-tutorial-geneflow/raw/refs/heads/main/first-day/concat-iqtree-gtr.tar.gz
tar -xvzf concat-iqtree-gtr.tar.gz

Alternatively, we can use the best-fitting model for each partition:

iqtree2 -s concatenated.fas -spp concatenated.fas.partitions -pre iqtree-T6 -b 100 -m TEST

The resulting files:

wget https://github.com/bpp/bpp-tutorial-geneflow/raw/refs/heads/main/first-day/concat-iqtree-model-sel.tar.gz
tar -xvzf concat-iqtree-model-sel.tar.gz

RAxML-NG command:

raxml-ng --all --msa concatenated.fas --model concatenated.fas.partitions --bs-trees 100 --brlen scaled --prefix T5

For the linked and unlinked model, use --brlen linked and --brlen unlinked, respectively.

Q: Check the average bootstrap support, did it improve compared to the individual gene-trees?

Q: Open the tree with figtree or seaview, did all the support values improve?

Compare your inferred tree to the published tree

Download the published tree:

cd ../phylo
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/first-day/figa-topology.tre

Compute the RF distance:

iqtree2 -rf figa-topology.tre concat-iqtree-gtr/iqtree-T5.treefile

Gene concordance factor (gCF)

Concordance analysis in IQ-TREE quantifies the proportion of gene trees supporting a specific topology or clade on a reference tree. For this, we need a set of estimated gene trees in newick format and the reference topology, in our case the topology inferred based on the concatenated alignment.

mkdir concordance
cd concordance
cat ../../phylo/all-GTR/*treefile > gene_trees-gtr.nwk
iqtree2 -t ../concat-iqtree-gtr/iqtree-T5.contree --gcf gene_trees-gtr.nwk --prefix concord

Optional topics

Number of starting trees

--tree pars{N},rand{M} option specifies the number of starting trees

Although the default settings is a reasonable choice for many practical cases, computational resources permitting, we might want to increase the number of starting trees to explore the tree space more thoroughly:

# RAxML-NG
raxml-ng --msa concatenated.fas --model partitions.txt --prefix T6 --tree pars{15},rand{15}

# IQ-TREE
iqtree -s concatenated.fas -spp partitions.txt -pre T6 -ntop 15

Conversely, we can perform a quick-and-dirty search from a single random starting tree using the --search1 command:

# RAxML-NG
raxml-ng --search1 --msa concatenated.fas --model partitions.txt --prefix T7

See the difference in final log-likelihood with the following command:

grep "Final LogLikelihood:" T*log

There are some differences but are they because of the topology? --rfdist computes topological Robinson-Foulds (RF) distances among the trees

We can check this by using the --rfdist command to compute the topological RF distance among all trees:

cat *bestTree > mltrees
raxml-ng --rfdist --tree mltrees --prefix RF2

Use cat to examine the .rfDistances file.

Q: are there topological differences this time? what does this tell us?

Parallelization

By default, RAxML-NG will use a set of heuristics to determine the most efficient parallelization scheme given the specific dataset, analysis parameters and hardware configuration at hand. If multiple tree inferences are requested (e.g., with --search or --all command), then RAxML-NG will consider running multiple searches ("workers") in parallel (coarse-grained parallelization). If the alignment is long then it might be suitable to run with multiple threads (fine-grained parallelization).

It will automatically determine the "reasonable" number of threads based on the total number of tree searches, alignment length, available memory, and CPU cores.

Q Check in the concatenation.raxml.log file, how many threads did it decide to use in this case?

Try alternative parallelization schemes to test whether it will run faster.

Previous, Inferring maximum-likelihood gene trees