Lab 6: Phylogenetic Reinference via Maximum Likelihood - ricket-sjtu/bioinformatics GitHub Wiki

Learning Objectives

After this lab, you will

Grasp the underlying theory of maximum likelihood approach for phylogenetic inference
Know how to choose the evolutionary models
Understand the input and output of phylogenetic analyses
Use phyml for conducting phylogenetic reinference
Draw trees using figtree.

1. PhyML Tutorial

1.1 Menu

First menu

[+] .................................... Next sub-menu
[-] ................................ Previous sub-menu
[Y] .............................. Launch the analysis

[D] .......................... Data type (DNA/AA/Generic)  DNA
[I] ......... Input sequences interleaved (or sequential)  interleaved
[M] .......................... Analyze multiple data sets  no
[R] .............................................. Run ID  none

Second menu

[+] .................................... Next sub-menu
[-] ................................ Previous sub-menu
[Y] .............................. Launch the analysis

[M] .................... Model of nucleotide substitution  HKY85
[F] .................... Optimise equilibrium frequencies  yes
[T] ....................... Ts/tv ratio (fixed/estimated)  estimated
[V] .... Proportion of invariable sites (fixed/estimated)  fixed (p-invar=0.00)
[R] .......... One category of substitution rate (yes/no)  no
[C] .............. Number of substitution rate categories  4
[G] ...... Gamma distribution parameter (fixed/estimated)  estimated
[O] .............................. Optimise tree topology  yes
[U] ........... Starting tree (BioNJ/parsimony/user tree)  BioNJ
[S] ..................... Tree toloLOGy search operations  NNI moves (fast, approximate)
[B] ................... Non parametric bootstrap analysis  no
[A] ................... Approximate likelihood ratio test  yes / SH-like supports

1.2 Model of nucleotide substitution

JC69
K80
F81
HKY85
F84
TN93
GTR
Custom (*): 000000, 012345

1.3 Equilibrium frequencies of nucletides

Estimated empirically from the used datasets
Optimized with a maximum likelihood (ML) approach

1.4 Transition/Transversion (ts/tv) ratio

Fixed
Estimated through maximum likelihood

1.5 Proportion of invariable sites

Estimated
Fixed (.01)

1.6 Tree topology search strategy

NNI moves (quick, approximate): nearest neighbor interchange
SPR moves (slow, accurate): subtree pruning and regrafting
Best of NNI and SPR

1.7 Non-parametric bootstrap analysis

no
Number of replicates

1.8 Approximate likelihood ratio test

aLRT statistics
Chi2-based supports
aBayes supports

2. PhyML Command Line

You can type in

phyml --help

to see the following phyml manual:

-i|--input seq_file_name: nt or aa file in PHYLIP format; (输入MSA文件)
-d|--datatype data_type: nt - nucleotide (default); aa - amino acids; generic (数据类型)
-q|--sequential: changes interleaved format (default) to sequential format (MSA的存在方式)
-n|--multiple nb_data_sets: number of datasets to analyze
-p|--pars: use minimum parsimony starting tree (用最大简约树作为初始树)
-b|--bootstrap int: number of bootstrap replicates (重采样次数)
- >0: number of bootstrap
- 0: neither aLRT nor boostrap values are computed
- -1: aLRT statistics for aLRT test
- -2: Chi2-based parametric branch supports for aLRT test
- -4: (default) SH-like branch supports alone
- -5: approximate Bayes branch supports
-m|--model model: substitution model name
- nt-based model: HKY85 (default), JC69, K80, F81, F84, TN93, GTR, custom (*)
- aa-based model: LG (default), WAG, JTT, MtREV, Dayhoff, DCMut, RtREV, CpREV, VT, Blosum62, MtMam, MtArt, HIVw, HIVb, custom
--aa_rate_file filename: aa subtitution rate matrix in PAML format.
(-f e|m) | (fA,fC,fG,fT): __e__mpirical base/aa frequencies, estimated by __m__l（稳态核苷酸比例）
-t|--ts/tv ratio: transition/transversion ratio, m to optimize, float to fixed.（转换/颠换速率比）
-v|--pinv prop_inv: proportion of invariable sites (保守位点的比例)
-c|--nclasses nb_subst_cat: number of relative substitution rate categories, default 4. (相对替换速率的分类)
-a|--alpha alpha: gamma shape parameter, e to mle, or fixed positive value. (伽马分布的参数alpha)
-s|--search move: NNI|SPR|BEST (搜索树拓扑结构的方法)
-u|--inputtree user_tree_file: starting tree file (初始树文件) in Newick format
-o params: optimizes specific parameters:
- tlr: tree topology (t, 拓扑结构) + branch length (l) + rate parameter (r)
- tl: tree topology + branch length
- lr: branch length (枝长) + rate parameters (速率参数)
- l: branch length
- r: rate parameters
- n: no parameter
--rand_start: sets the initial tree to random, only for SPR search (初始树随机产生)
n_rand_starts num: number of initial trees to be used. (随机初始树的个数)
--r_seed num: random number generator (随机数发生器的seed)
--print_site_lnl: print the likelihood for each site in file *_phyml_lk.txt (保存每个位点的似然值)
--print_trace: print each phylogeny during tree search process in file *_phyml_trace.txt

3. Figtree tutorial

4. Exercises

4.1 Exercise 1 - Parameter estimation

Dataset: primates-nt.phy

Inferring phylogenies using maximum likelihood

In this exercise we need to run PhyML twice in order to compare the effect of

estimating nt frequencies from the used dataset versus
optimizing the frequencies via maximum likelihood

First run: Set the model to HKY85+Gamma, estimating the transition/transversion (ts/tv ratio) and the alpha parameter of the Gamma distribution by maximum likelihood, nucleotide frequencies are estimated by ML.
Second run: Set the model to HKY85+Gamma, estimating the ts/tv ratio and alpha parameter by ML, however nucleotide frequencies are estimated empirically from the data.
Answer the following questions:

Do you see much difference in the resulting tree?
Do you observe much difference in the likelihood value (in stat file)?
Which option is better and why do you think so?

4.2 Exercise 2 - Tree topologies

In this exercise you will learn to optimize the tree topology on the substitution paramters obtained using ML performing a tree search (i.e., NNI, SPR, TBR) on the initial tree topology:

Compare the ML and the BioNJ trees and the model estimates (HKY85+Gamma) obtained for the two trees.
Compare the likelihood of the ML and BioNJ trees.
What do you observe and why?

4.3 Exercise 3 - Model comparision

In this exercise, you are asked to infer the phylogenetic tree on the same dataset using different subtitution models (and their variations). Use now GTR+Gamma, JC69+Gamma, GTR, HKY85 and JC69.

First run: Run PhyMl with the substitution model set to GTR, estimating the nt frequencies empirically from the dataset, and executing the tree search optimization routines.

GTR
GTR+Gamma with 4 discrete classes, estimate the shape parameter.
GTR+I: Adding invariable sites option
GTR+Gamma+I: with 4 classes, estimate alpha, and adding invariable sites

Second run: Run PhyML with the substitution model set to HKY85, estimating the ts/tv ratio, estimating the nt frequencies empirically from the dataset, and executing the tree search optimization routines.

HKY85
HKY85+Gamma: Gamma with 4 classes; Estimating alpha
HKY85+I: Adding invariable sites
HKY85+Gamma+I: Gamma distribution with 4 classes; Estimating Gamma alpha parameter; Adding invariable sites

Third run: Run PhyML with the substitution model set to JC69 and executing the tree search routines.

JC69
JC69+Gamma: Gamma with 4 classes; Estimating alpha
JC69+I: Adding invariable sites option
JC69+Gamma+I: Gamma sith 4 classes; Estimating Gamma shape; Adding invariable site

4.4 Exercise 4 - Branch support

In this exercise, you need to compute the branch support using 2 different approaches:

Nonparametric ML,
SH-aLRT

Compare supports inferred by nonparametric ML bootstrap to those obtain using SH-aLRT method. Are the results compatible?

4.5 Exercise 5 - Command line

Rewrite the command lines for the above exercises.

4.6 Exercise 6 - Display the tree

Display the above trees in a nice and meaningful way using Figtree.

4.7 (Extra)Exercsie 7 - Create a pipeline for phylogenetic reinference

MSA alignment is inferred with MUSCLE
FASTA files are converted into phylip format using fasta2phylip
Phylogenetic tree consutruction is performed using GTR+Gamma