Maximum Likelihood Gene Tree Inference - Pas-Kapli/CoME-Tutorials GitHub Wiki

In the next steps, we will perform phylogenetic inference under Maximum likelihood. There are two very popular tools for this task RAxML-ng and IQtree. The two tools are very similar in their core functions and should provide identical results under identical parameters. The team of RAxML has been focusing more on developing techniques for accelerating the likelihood calculation and covering more efficiently the tree space. The team of IQTree has invested in providing a wider variety of substitution models and auxiliary functions to phylogenetic inference.

We will be using IQtree as the main tool for the tutorial but the equivalent RAxML commands will also be provided throughout the exercise (except when there is no correspondence among the two tools).

Move into the "phylo" folder:

cd ../phylo

Data and Model

Getting help

--help option displays the help menu for both RAxML-ng and IQTree.

RAxML-NG supports alignments in FASTA and PHYLIP formats. IQTree also accepts NEXUS

Selecting a suitable substitution model is not a straightforward task. The conventional approach involves employing automated model selection methods that maximize the likelihood of a given topology. Despite being a widely adopted practice in phylogenetic inference, some studies challenge this practice and show that for nucleotide data using the most complex model (GTR+G) is expected to be as efficient as using the best fitting model based on model selection (Abadi et al., 2019, Darriba and Posada, 2015).

For amino acid data, model selection is even less straightforward given our poor understanding of amino acid exchangeability rules and thus the use of empirical matrices. Recent studies show that automated aa model selection is not necessarily associated with phylogenetic tree accuracy (Spielman, 2020) while Del Amparo and Arenas, 2022 showed that using the best fitting aa model leads to more accurate reconstruction of ancestral states.

Our example sequences represent inter-genus relationships and are relatively divergent. In this case, using GTR+G as our model would be a reasonable option. In an empirical dataset like here we cannot possibly evaluate the effect of the model specification on the resulting phylogeny because we don't know the true relationships. We can however assess whether the model selection makes a difference.

Maximum Likelihood Tree Inference

mkdir first
cd first
iqtree2 -s ../../filtered_alignments/locus_1.fasta.mafft.filtered -m GTR+G -pre T2iq

#RAxML-ng command:
raxml-ng --msa ../../filtered_alignments/locus_1.fasta.mafft.filtered --model GTR+G --prefix T2

Note that if you do not specify the -prefix option the output files will be named after the locus name. In this case this means that the results would go to the "../../filtered-alignments" folder!

This run produced six output files for IQTree:

T2iq.treefile: Best-scoring ML tree
T2iq.bionj: Starting tree(s) used for ML inference
T2iq.log: Screen log
T2iq.iqtree: Provides information about the alignment, the optimised substitution model parameters and the ML tree - also includes information for simulating similar alignments with iqtree2
T2iq.mldist: Pairwise distance matrix
T2iq.ckp.gz: checkpoint file for resuming analysis

Q: Check out the T2iq.treefile with seaview or figtree, does it look like an easy or difficult phylogeny to resolve?

Let's run it for all our alignments (this should take 1-2 minutes):

cd ../
mkdir all-GTR
cd all-GTR
for i in ../../filtered_alignments/locus_*.filtered; do iqtree2 -s $i -m GTR+G; done
mv ../../filtered_alignments/*.filtered.* .

Model Selection & ML Tree Inference

With IQtree it is possible to run model selection before the phylogenetic inference. In fact, it is the default action of the IQTree program if there is no model flag (-m) specified in the command.

Let's run ML search with model selection for all our alignments (this should take 1-2 minutes):

cd ../
mkdir all-modeltest
cd all-modeltest
for i in ../../filtered_alignments/locus_*.filtered; do iqtree2 -s $i; done
mv ../../filtered_alignments/*.filtered.* .

Check one of the *.log files with a terminal editor nano/pico/vi

nano locus_1.fasta.mafft.filtered.log

Q: What is the preferred model?

The default option in iqtree is to compare the likelihood or BIC/AIC scores among different models assuming a parsimony starting tree. The model with the highest BIC score is selected as the best-fitting model.

Let's see what were the best-fitting model selected across the loci:

grep "Best-fit model" *.log | cut -f 3 -d":" | cut -f2 -d" " | sort | uniq -c | sed "s/^ *//g" | sort -n -k1

Check what each models is in the iqtree manual page 139.

Comparison of gene-trees (GTR vs. Best Fitting)

We will perform a comparison among pairs of trees estimated for each locus using the GTR+G and the best fitting model as inferred above.

cd ..
mkdir compare-gene-trees
cd compare-gene-trees
ls ../../filtered_alignments/*filtered | cut -f4 -d'/' > alig-names.txt
head alig-names.txt
locus_10.fasta.mafft.filtered
locus_11.fasta.mafft.filtered
locus_12.fasta.mafft.filtered
locus_13.fasta.mafft.filtered
locus_15.fasta.mafft.filtered
locus_17.fasta.mafft.filtered
locus_18.fasta.mafft.filtered
locus_19.fasta.mafft.filtered
locus_1.fasta.mafft.filtered
locus_20.fasta.mafft.filtered

for i in `cat alig-names.txt`; do iqtree2 -rf ../all-GTR/$i*treefile ../all-modeltest/$i.treefile -pre $i; done
grep Tree0 *rfdist

Q: What possible reasons could be driving this result?

ML Tree Inference and Bootstrap Support Values

To assess the robustness of the ML trees we can use Bootstrap Support values. Bootstrap values are a measure of support for the branches in a phylogenetic tree. They are calculated by resampling the original alignment (with replacement) to create multiple bootstrap datasets, inferring a tree for each, and then determining how often each branch appears across all bootstrap trees. High bootstrap values (e.g., >70%) indicate strong support for a branch, while low values suggest uncertainty.

Let's run ML with bootstrap support calculation for the first locus (this should take ~2 minutes):

cd ../
mkdir bootstrap
cd bootstrap
iqtree2 -s ../../filtered_alignments/locus_1.fasta.mafft.filtered -m GTR+G -pre T4iq -b 100

#RAxML-ng command:
raxml-ng --all --msa ../../filtered_alignments/locus_1.fasta.mafft.filtered --model GTR+G --bs-trees 100 -prefix T4

Additional output files when including bootstrap:

prefix.boottrees: Trees inferred for each bootstrap replicate
prefix.contree: Best-scoring ML tree with bootstrap support values

Q: Check your T4iq.contree file does it seem like a well supported phylogeny? Q: How do you think of the tree comparisons earlier with respect to the substitution model given this information?

Dataset | Alignment | Gene-Tree ML Inference

OPTIONAL -Parallel running of Bootstrap analysis for multiple loci:

Running ML with bootstrap for each of the loci serially (i.e., one after the other with a for loop as we saw for the tree inference earlier) even though fast, will still take a long time.

If you have several CPUs available in your computer or server you can use the parallel command instead:

cd../
mkdir bootstrap-all
cd bootstrap-all

#Store in a file the names of the input files
ls ../../filtered_alignments/ | cut -f3 -d$'\t' > alignment_names.txt

#Execute all the analyses in parallel
cp ../compare-gene-trees/alig-names.txt .
parallel --jobs 36 "iqtree2 -s ../../filtered_alignments/{} -m GTR+G -b 100" < alig-names.txt
mv ../../filtered_alignments/*.filtered.* .

Assuming you have a lot of CPUs available, the above parallel command will execute 36 of the analyses at the same time. BUT be careful, in the case of RAxML-ng, the default option is to optimize the number of threads to use for each dataset.. if you do not specify the number of threads yourself it might overwhelm your machine. To restrict the number of threads use the options --threads and --workers as follows:

# Execute all the analyses in parallel with a single thread each

#parallel --jobs 36 "raxml-ng --all --msa {} --model GTR+G --bs-trees 100 --threads 1 --workers 1" < alig-names.txt

#mv ../../filtered-alignments/*.raxml.* .

Let's now see the average bootstrap support per locus:

# Download the data:
wget https://github.com/Pas-Kapli/phylophysalia2023/raw/refs/heads/main/first-day/bootstrap-all.tar.gz
tar -xvzf bootstrap-all.tar.gz
rm bootstrap-all.tar.gz

for i in *contree; do grep -o ")[0-9][0-9]*:" $i | sed "s/[):]//g" | awk '{ total += $1 } END { print total/NR }' ; done

# RAxML command:
# the same for loop as above but instead of "*contree" we would need to use "*support"

Q: How are the average bootstrap values? Can you think of some reasons why?

Dataset | Alignment | Gene-Tree ML Inference