Lab 6: Phylogenetic Reinference via Maximum Likelihood - ricket-sjtu/bioinformatics GitHub Wiki
Learning Objectives
After this lab, you will
- Grasp the underlying theory of maximum likelihood approach for phylogenetic inference
- Know how to choose the evolutionary models
- Understand the input and output of phylogenetic analyses
- Use
phymlfor conducting phylogenetic reinference - Draw trees using
figtree.
1. PhyML Tutorial
1.1 Menu
First menu
[+] .................................... Next sub-menu
[-] ................................ Previous sub-menu
[Y] .............................. Launch the analysis
[D] .......................... Data type (DNA/AA/Generic) DNA
[I] ......... Input sequences interleaved (or sequential) interleaved
[M] .......................... Analyze multiple data sets no
[R] .............................................. Run ID none
Second menu
[+] .................................... Next sub-menu
[-] ................................ Previous sub-menu
[Y] .............................. Launch the analysis
[M] .................... Model of nucleotide substitution HKY85
[F] .................... Optimise equilibrium frequencies yes
[T] ....................... Ts/tv ratio (fixed/estimated) estimated
[V] .... Proportion of invariable sites (fixed/estimated) fixed (p-invar=0.00)
[R] .......... One category of substitution rate (yes/no) no
[C] .............. Number of substitution rate categories 4
[G] ...... Gamma distribution parameter (fixed/estimated) estimated
[O] .............................. Optimise tree topology yes
[U] ........... Starting tree (BioNJ/parsimony/user tree) BioNJ
[S] ..................... Tree toloLOGy search operations NNI moves (fast, approximate)
[B] ................... Non parametric bootstrap analysis no
[A] ................... Approximate likelihood ratio test yes / SH-like supports
1.2 Model of nucleotide substitution
- JC69
- K80
- F81
- HKY85
- F84
- TN93
- GTR
- Custom (*):
000000,012345
1.3 Equilibrium frequencies of nucletides
- Estimated empirically from the used datasets
- Optimized with a maximum likelihood (ML) approach
1.4 Transition/Transversion (ts/tv) ratio
- Fixed
- Estimated through maximum likelihood
1.5 Proportion of invariable sites
- Estimated
- Fixed (.01)
1.6 Tree topology search strategy
- NNI moves (quick, approximate): nearest neighbor interchange
- SPR moves (slow, accurate): subtree pruning and regrafting
- Best of NNI and SPR
1.7 Non-parametric bootstrap analysis
- no
- Number of replicates
1.8 Approximate likelihood ratio test
- aLRT statistics
- Chi2-based supports
- aBayes supports
2. PhyML Command Line
You can type in
phyml --help
to see the following phyml manual:
-i|--input seq_file_name:ntoraafile inPHYLIPformat; (输入MSA文件)-d|--datatype data_type:nt- nucleotide (default);aa- amino acids;generic(数据类型)-q|--sequential: changesinterleavedformat (default) tosequentialformat (MSA的存在方式)-n|--multiple nb_data_sets: number of datasets to analyze-p|--pars: useminimum parsimony starting tree(用最大简约树作为初始树)-b|--bootstrap int: number of bootstrap replicates (重采样次数)>0: number of bootstrap0: neither aLRT nor boostrap values are computed-1: aLRT statistics for aLRT test-2: Chi2-based parametric branch supports for aLRT test-4: (default) SH-like branch supports alone-5: approximate Bayes branch supports
-m|--model model: substitution model name- nt-based model:
HKY85(default),JC69,K80,F81,F84,TN93,GTR,custom (*) - aa-based model:
LG(default),WAG,JTT,MtREV,Dayhoff,DCMut,RtREV,CpREV,VT,Blosum62,MtMam,MtArt,HIVw,HIVb,custom
- nt-based model:
--aa_rate_file filename: aa subtitution rate matrix in PAML format.(-f e|m) | (fA,fC,fG,fT): __e__mpirical base/aa frequencies, estimated by __m__l(稳态核苷酸比例)-t|--ts/tv ratio: transition/transversion ratio,mto optimize,floatto fixed.(转换/颠换速率比)-v|--pinv prop_inv: proportion of invariable sites (保守位点的比例)-c|--nclasses nb_subst_cat: number of relative substitution rate categories, default 4. (相对替换速率的分类)-a|--alpha alpha: gamma shape parameter,eto mle, or fixed positive value. (伽马分布的参数alpha)-s|--search move: NNI|SPR|BEST (搜索树拓扑结构的方法)-u|--inputtree user_tree_file: starting tree file (初始树文件) in Newick format-o params: optimizes specific parameters:tlr: tree topology (t, 拓扑结构) + branch length (l) + rate parameter (r)tl: tree topology + branch lengthlr: branch length (枝长) + rate parameters (速率参数)l: branch lengthr: rate parametersn: no parameter
--rand_start: sets the initial tree to random, only for SPR search (初始树随机产生)n_rand_starts num: number of initial trees to be used. (随机初始树的个数)--r_seed num: random number generator (随机数发生器的seed)--print_site_lnl: print the likelihood for each site in file*_phyml_lk.txt(保存每个位点的似然值)--print_trace: print each phylogeny during tree search process in file*_phyml_trace.txt
3. Figtree tutorial
4. Exercises
4.1 Exercise 1 - Parameter estimation
- Dataset: primates-nt.phy
Inferring phylogenies using maximum likelihood
In this exercise we need to run PhyML twice in order to compare the effect of
- estimating nt frequencies from the used dataset versus
- optimizing the frequencies via maximum likelihood
- First run: Set the model to
HKY85+Gamma, estimating the transition/transversion (ts/tvratio) and thealphaparameter of the Gamma distribution by maximum likelihood,nucleotide frequenciesare estimated by ML. - Second run: Set the model to
HKY85+Gamma, estimating thets/tvratio andalphaparameter by ML, however nucleotide frequencies are estimated empirically from the data. - Answer the following questions:
- Do you see much difference in the resulting tree?
- Do you observe much difference in the likelihood value (in stat file)?
- Which option is better and why do you think so?
4.2 Exercise 2 - Tree topologies
In this exercise you will learn to optimize the tree topology on the substitution paramters obtained using ML performing a tree search (i.e., NNI, SPR, TBR) on the initial tree topology:
- Compare the ML and the BioNJ trees and the model estimates (HKY85+Gamma) obtained for the two trees.
- Compare the likelihood of the ML and BioNJ trees.
- What do you observe and why?
4.3 Exercise 3 - Model comparision
In this exercise, you are asked to infer the phylogenetic tree on the same dataset using
different subtitution models (and their variations). Use now GTR+Gamma, JC69+Gamma,
GTR, HKY85 and JC69.
- First run: Run PhyMl with the substitution model set to
GTR, estimating the nt frequencies empirically from the dataset, and executing the tree search optimization routines.
GTRGTR+Gammawith 4 discrete classes, estimate the shape parameter.GTR+I: Adding invariable sites optionGTR+Gamma+I: with 4 classes, estimate alpha, and adding invariable sites
- Second run: Run PhyML with the substitution model set to
HKY85, estimating thets/tvratio, estimating the nt frequencies empirically from the dataset, and executing the tree search optimization routines.
HKY85HKY85+Gamma: Gamma with 4 classes; Estimating alphaHKY85+I: Adding invariable sitesHKY85+Gamma+I: Gamma distribution with 4 classes; Estimating Gamma alpha parameter; Adding invariable sites
- Third run: Run PhyML with the substitution model set to
JC69and executing the tree search routines.
JC69JC69+Gamma: Gamma with 4 classes; Estimating alphaJC69+I: Adding invariable sites optionJC69+Gamma+I: Gamma sith 4 classes; Estimating Gamma shape; Adding invariable site
4.4 Exercise 4 - Branch support
In this exercise, you need to compute the branch support using 2 different approaches:
- Nonparametric ML,
- SH-aLRT
Compare supports inferred by nonparametric ML bootstrap to those obtain using SH-aLRT method. Are the results compatible?
4.5 Exercise 5 - Command line
Rewrite the command lines for the above exercises.
4.6 Exercise 6 - Display the tree
Display the above trees in a nice and meaningful way using Figtree.
4.7 (Extra)Exercsie 7 - Create a pipeline for phylogenetic reinference
- MSA alignment is inferred with
MUSCLE - FASTA files are converted into phylip format using
fasta2phylip - Phylogenetic tree consutruction is performed using
GTR+Gamma