Lab 6: Phylogenetic Reinference via Maximum Likelihood - ricket-sjtu/bioinformatics GitHub Wiki
Learning Objectives
After this lab, you will
- Grasp the underlying theory of maximum likelihood approach for phylogenetic inference
- Know how to choose the evolutionary models
- Understand the input and output of phylogenetic analyses
- Use
phyml
for conducting phylogenetic reinference - Draw trees using
figtree
.
1. PhyML Tutorial
1.1 Menu
First menu
[+] .................................... Next sub-menu
[-] ................................ Previous sub-menu
[Y] .............................. Launch the analysis
[D] .......................... Data type (DNA/AA/Generic) DNA
[I] ......... Input sequences interleaved (or sequential) interleaved
[M] .......................... Analyze multiple data sets no
[R] .............................................. Run ID none
Second menu
[+] .................................... Next sub-menu
[-] ................................ Previous sub-menu
[Y] .............................. Launch the analysis
[M] .................... Model of nucleotide substitution HKY85
[F] .................... Optimise equilibrium frequencies yes
[T] ....................... Ts/tv ratio (fixed/estimated) estimated
[V] .... Proportion of invariable sites (fixed/estimated) fixed (p-invar=0.00)
[R] .......... One category of substitution rate (yes/no) no
[C] .............. Number of substitution rate categories 4
[G] ...... Gamma distribution parameter (fixed/estimated) estimated
[O] .............................. Optimise tree topology yes
[U] ........... Starting tree (BioNJ/parsimony/user tree) BioNJ
[S] ..................... Tree toloLOGy search operations NNI moves (fast, approximate)
[B] ................... Non parametric bootstrap analysis no
[A] ................... Approximate likelihood ratio test yes / SH-like supports
1.2 Model of nucleotide substitution
- JC69
- K80
- F81
- HKY85
- F84
- TN93
- GTR
- Custom (*):
000000
,012345
1.3 Equilibrium frequencies of nucletides
- Estimated empirically from the used datasets
- Optimized with a maximum likelihood (ML) approach
1.4 Transition/Transversion (ts/tv) ratio
- Fixed
- Estimated through maximum likelihood
1.5 Proportion of invariable sites
- Estimated
- Fixed (.01)
1.6 Tree topology search strategy
- NNI moves (quick, approximate): nearest neighbor interchange
- SPR moves (slow, accurate): subtree pruning and regrafting
- Best of NNI and SPR
1.7 Non-parametric bootstrap analysis
- no
- Number of replicates
1.8 Approximate likelihood ratio test
- aLRT statistics
- Chi2-based supports
- aBayes supports
2. PhyML Command Line
You can type in
phyml --help
to see the following phyml
manual:
-i|--input seq_file_name
:nt
oraa
file inPHYLIP
format; (输入MSA文件)-d|--datatype data_type
:nt
- nucleotide (default);aa
- amino acids;generic
(数据类型)-q|--sequential
: changesinterleaved
format (default) tosequential
format (MSA的存在方式)-n|--multiple nb_data_sets
: number of datasets to analyze-p|--pars
: useminimum parsimony starting tree
(用最大简约树作为初始树)-b|--bootstrap int
: number of bootstrap replicates (重采样次数)>0
: number of bootstrap0
: neither aLRT nor boostrap values are computed-1
: aLRT statistics for aLRT test-2
: Chi2-based parametric branch supports for aLRT test-4
: (default) SH-like branch supports alone-5
: approximate Bayes branch supports
-m|--model model
: substitution model name- nt-based model:
HKY85
(default),JC69
,K80
,F81
,F84
,TN93
,GTR
,custom (*)
- aa-based model:
LG
(default),WAG
,JTT
,MtREV
,Dayhoff
,DCMut
,RtREV
,CpREV
,VT
,Blosum62
,MtMam
,MtArt
,HIVw
,HIVb
,custom
- nt-based model:
--aa_rate_file filename
: aa subtitution rate matrix in PAML format.(-f e|m) | (fA,fC,fG,fT)
: __e__mpirical base/aa frequencies, estimated by __m__l(稳态核苷酸比例)-t|--ts/tv ratio
: transition/transversion ratio,m
to optimize,float
to fixed.(转换/颠换速率比)-v|--pinv prop_inv
: proportion of invariable sites (保守位点的比例)-c|--nclasses nb_subst_cat
: number of relative substitution rate categories, default 4. (相对替换速率的分类)-a|--alpha alpha
: gamma shape parameter,e
to mle, or fixed positive value. (伽马分布的参数alpha)-s|--search move
: NNI|SPR|BEST (搜索树拓扑结构的方法)-u|--inputtree user_tree_file
: starting tree file (初始树文件) in Newick format-o params
: optimizes specific parameters:tlr
: tree topology (t, 拓扑结构) + branch length (l) + rate parameter (r)tl
: tree topology + branch lengthlr
: branch length (枝长) + rate parameters (速率参数)l
: branch lengthr
: rate parametersn
: no parameter
--rand_start
: sets the initial tree to random, only for SPR search (初始树随机产生)n_rand_starts num
: number of initial trees to be used. (随机初始树的个数)--r_seed num
: random number generator (随机数发生器的seed)--print_site_lnl
: print the likelihood for each site in file*_phyml_lk.txt
(保存每个位点的似然值)--print_trace
: print each phylogeny during tree search process in file*_phyml_trace.txt
3. Figtree tutorial
4. Exercises
4.1 Exercise 1 - Parameter estimation
- Dataset: primates-nt.phy
Inferring phylogenies using maximum likelihood
In this exercise we need to run PhyML twice in order to compare the effect of
- estimating nt frequencies from the used dataset versus
- optimizing the frequencies via maximum likelihood
- First run: Set the model to
HKY85+Gamma
, estimating the transition/transversion (ts/tv
ratio) and thealpha
parameter of the Gamma distribution by maximum likelihood,nucleotide frequencies
are estimated by ML. - Second run: Set the model to
HKY85+Gamma
, estimating thets/tv
ratio andalpha
parameter by ML, however nucleotide frequencies are estimated empirically from the data. - Answer the following questions:
- Do you see much difference in the resulting tree?
- Do you observe much difference in the likelihood value (in stat file)?
- Which option is better and why do you think so?
4.2 Exercise 2 - Tree topologies
In this exercise you will learn to optimize the tree topology on the substitution paramters obtained using ML performing a tree search (i.e., NNI, SPR, TBR) on the initial tree topology:
- Compare the ML and the BioNJ trees and the model estimates (HKY85+Gamma) obtained for the two trees.
- Compare the likelihood of the ML and BioNJ trees.
- What do you observe and why?
4.3 Exercise 3 - Model comparision
In this exercise, you are asked to infer the phylogenetic tree on the same dataset using
different subtitution models (and their variations). Use now GTR+Gamma
, JC69+Gamma
,
GTR
, HKY85
and JC69
.
- First run: Run PhyMl with the substitution model set to
GTR
, estimating the nt frequencies empirically from the dataset, and executing the tree search optimization routines.
GTR
GTR+Gamma
with 4 discrete classes, estimate the shape parameter.GTR+I
: Adding invariable sites optionGTR+Gamma+I
: with 4 classes, estimate alpha, and adding invariable sites
- Second run: Run PhyML with the substitution model set to
HKY85
, estimating thets/tv
ratio, estimating the nt frequencies empirically from the dataset, and executing the tree search optimization routines.
HKY85
HKY85+Gamma
: Gamma with 4 classes; Estimating alphaHKY85+I
: Adding invariable sitesHKY85+Gamma+I
: Gamma distribution with 4 classes; Estimating Gamma alpha parameter; Adding invariable sites
- Third run: Run PhyML with the substitution model set to
JC69
and executing the tree search routines.
JC69
JC69+Gamma
: Gamma with 4 classes; Estimating alphaJC69+I
: Adding invariable sites optionJC69+Gamma+I
: Gamma sith 4 classes; Estimating Gamma shape; Adding invariable site
4.4 Exercise 4 - Branch support
In this exercise, you need to compute the branch support using 2 different approaches:
- Nonparametric ML,
- SH-aLRT
Compare supports inferred by nonparametric ML bootstrap to those obtain using SH-aLRT method. Are the results compatible?
4.5 Exercise 5 - Command line
Rewrite the command lines for the above exercises.
4.6 Exercise 6 - Display the tree
Display the above trees in a nice and meaningful way using Figtree.
4.7 (Extra)Exercsie 7 - Create a pipeline for phylogenetic reinference
- MSA alignment is inferred with
MUSCLE
- FASTA files are converted into phylip format using
fasta2phylip
- Phylogenetic tree consutruction is performed using
GTR+Gamma