Inferring maximum likelihood trees - bpp/bpp-tutorial-geneflow GitHub Wiki

We are now ready to perform maximum likelihood phylogenetic inference. There are two popular tools for this task RAxML-ng and IQ-TREE. These tools are very similar in their core functions and should provide identical results under identical parameters. The team of RAxML has been focusing more on developing techniques for accelerating the likelihood calculation and covering more efficiently the tree space. The team of IQ-TREE has invested in providing a wider variety of substitution models and auxiliary functions to phylogenetic inference.

In this tutorial, we will be using IQ-TREE but equivalent RAxML-ng commands will be provided when applicable.

Data and Model

Getting help

--help option displays the help menu for both IQ-TREE and RAxML-ng.

RAxML-NG supports alignments in FASTA and PHYLIP formats. IQ-TREE also accepts NEXUS

Selecting a suitable substitution model is not a straightforward task. The conventional approach involves employing automated model selection methods that maximize the likelihood of a given topology. Despite being a widely adopted practice in phylogenetic inference, some studies challenge this practice and show that for nucleotide data using the most complex model (GTR+G) is expected to be as efficient as using the best fitting model based on model selection (Abadi et al., 2019, Darriba and Posada, 2015).

For amino acid data, model selection is even less straightforward given our poor understanding of amino acid exchangeability rules and thus the use of empirical matrices. Recent studies show that automated aa model selection is not necessarily associated with phylogenetic tree accuracy (Spielman, 2020) while Del Amparo and Arenas, 2022 showed that using the best fitting aa model leads to more accurate reconstruction of ancestral states.

Our example sequences represent inter-genus relationships and are relatively divergent. In this case, using GTR+G as our model would be a reasonable option. In an empirical dataset like here we cannot possibly evaluate the effect of the model specification on the resulting phylogeny because we don't know the true relationships. We can however assess whether the model selection makes a difference.

Maximum Likelihood Tree Inference

Go to phylo folder:

cd ../phylo

Then create a new folder called first:

mkdir first
cd first
iqtree2 -s ../../filtered_alignments/locus_1.fasta.mafft.filtered -m GTR+G -pre T2iq

#RAxML-ng command:
raxml-ng --msa ../../filtered_alignments/locus_1.fasta.mafft.filtered --model GTR+G --prefix T2

Note that if you do not specify the -prefix option the output files will be named after the locus name. In this case this means that the results would go to the "../../filtered-alignments" folder!

This run produced six output files for IQ-TREE:

T2iq.treefile: Best-scoring ML tree
T2iq.bionj: Starting tree(s) used for ML inference
T2iq.log: Screen log
T2iq.iqtree: Provides information about the alignment, the optimised substitution model parameters and the ML tree - also includes information for simulating similar alignments with iqtree2
T2iq.mldist: Pairwise distance matrix
T2iq.ckp.gz: checkpoint file for resuming analysis

Q: Check out the T2iq.treefile with seaview or figtree, does it look like an easy or difficult phylogeny to resolve?

Let's run it for all our alignments (this should take 1-2 minutes):

cd ../
mkdir all-GTR
cd all-GTR
for i in ../../filtered_alignments/locus_*.filtered; do iqtree2 -s $i -m GTR+G; done
mv ../../filtered_alignments/*.filtered.* .

Model Choice

With IQ-TREE it is possible to run model selection before the phylogenetic inference. In fact, it is the default action of IQ-TREE if no model flag (-m) is specified.

Let's run ML search with model selection for all our alignments -- this should take 1-2 minutes:

cd ../
mkdir all-modeltest
cd all-modeltest
for i in ../../filtered_alignments/locus_*.filtered; do iqtree2 -s $i; done
mv ../../filtered_alignments/*.filtered.* .

Have a look at one of the *.log files:

less locus_1.fasta.mafft.filtered.log

Q: What is the best-fitting model?

The default option in IQ-TREE is to compare the likelihood or BIC/AIC scores among different models assuming a parsimony starting tree. The model with the smallest BIC score is selected as the best-fitting model.

Let's see what were the best-fitting model selected across the loci:

grep "Best-fit model" *.log | cut -f 3 -d":" | cut -f2 -d" " | sort | uniq -c

Check what each models is in the IQ-TREE documentation.

Comparison of inferred gene trees under GTR and best-fitting models

We will compare trees estimated under the GTR+G and the best-fitting model for each locus.

First, prepare a working directory.

cd ..
mkdir compare-gene-trees
cd compare-gene-trees

Get names of alignment files in one file:

ls ../../filtered_alignments/ | cut -f3 -d$'\t' > alignment_names.txt

head alignment_names.txt

This should give:

locus_10.fasta.mafft.filtered
locus_11.fasta.mafft.filtered
locus_12.fasta.mafft.filtered
locus_13.fasta.mafft.filtered
locus_15.fasta.mafft.filtered
locus_17.fasta.mafft.filtered
locus_18.fasta.mafft.filtered
locus_19.fasta.mafft.filtered
locus_1.fasta.mafft.filtered
locus_20.fasta.mafft.filtered

Then for each alignment, we use IQ-TREE to compute a pairwise Robinson-Foulds (RF) distance between the two inferred trees, using flag -rf:

for i in `cat alignment_names.txt`; do iqtree2 -rf ../all-GTR/$i*treefile ../all-modeltest/$i.treefile -pre $i; done
grep Tree0 *rfdist

Q: What possible reasons could be driving this result?

Bootstrap Support Values

To assess the robustness of the ML trees, we can calculate bootstrap support values. Bootstrap values are a measure of statistical support for branches in a phylogenetic tree. They are calculated by resampling the original alignment with replacement to create multiple bootstrap datasets, inferring a tree for each, and then determining how often each branch appears across all bootstrap trees. High bootstrap values (say, >70%) indicate a strong support for that branch, while low values suggest uncertainty.

Let's run ML with bootstrap support calculation for the first locus -- this should take ~2 minutes:

cd ../
mkdir bootstrap
cd bootstrap
iqtree2 -s ../../filtered_alignments/locus_1.fasta.mafft.filtered -m GTR+G -pre T4iq -b 100

#RAxML-ng command:
raxml-ng --all --msa ../../filtered-alignments/locus_1.fasta.mafft.filtered --model GTR+G --bs-trees 100 -prefix T4

Additional output files when including bootstrap:

prefix.boottrees: Trees inferred for each bootstrap replicate
prefix.contree: Best-scoring ML tree with bootstrap support values

Q: Check your T4iq.contree file does it seem like a well supported phylogeny?

Q: How do you think of the tree comparisons earlier with respect to the substitution model given this information?

OPTIONAL: Parallel running of Bootstrap analysis for multiple loci:

Running ML with bootstrap for each of the loci serially (i.e., one after the other with a for loop as we saw for the tree inference earlier) even though fast, will still take a long time.

If you have several CPUs available in your computer or server you can use the parallel command instead:

cd ../
mkdir bootstrap-all
cd bootstrap-all

# Store in a file the names of the input files
ls ../../filtered_alignments/ | cut -f3 -d$'\t' > alignment_names.txt

# Execute all the analyses in parallel
cp ../compare-gene-trees/alig-names.txt .
parallel --jobs 36 "iqtree2 -s ../../filtered_alignments/{} -m GTR+G -b 100" < alig-names.txt
mv ../../filtered_alignments/*.filtered.* .

Assuming you have a lot of CPUs available, the above parallel command will execute 36 of the analyses at the same time. BUT be careful, in the case of RAxML-ng, the default option is to optimize the number of threads to use for each dataset. if you do not specify the number of threads yourself it might overwhelm your machine. To restrict the number of threads use the options --threads and --workers as follows:

# Execute all the analyses in parallel with a single thread each

#parallel --jobs 36 "raxml-ng --all --msa {} --model GTR+G --bs-trees 100 --threads 1 --workers 1" < alig-names.txt

#mv ../../filtered-alignments/*.raxml.* .

Let's now see the average bootstrap support per locus:

# Download the data:
wget https://github.com/bpp/bpp-tutorial-geneflow/raw/refs/heads/main/first-day/bootstrap-all.tar.gz
tar -xvzf bootstrap-all.tar.gz
rm bootstrap-all.tar.gz

for i in *contree; do grep -o ")[0-9][0-9]*:" $i | sed "s/[):]//g" | awk '{ total += $1 } END { print total/NR }' ; done

# RAxML command:
# the same for loop as above but instead of `*contree` we would need to use `*support`

Q: How are the average bootstrap values? Can you think of some reasons why?

Next, Concatenation

Previous, Sequence alignment