Overview - abacus-gene/paml GitHub Wiki
General overview
PAML documentation
Besides this manual, please note that you can always consult the following additional resources:
- Ziheng Yang Lab's website: this website also has information about downloading and compiling PAML programs.
- PAML FAQ page: document that compiles various FAQs since PAML4 was released. Last update: 2005/01/05.
- PAML discussion group: if you have any questions about using PAML programs, please post them on this Google discussion group; do not open new issues on this GitHub repository. Issues should be used strictly for technical problems with PAML programs.
What PAML programs can do
The PAML package currently includes the following programs: BASEML, basemlg, CODEML, evolver, pamp, yn00, MCMCtree, and chi2. A brief overview of the most commonly used models and methods implemented in PAML is provided by Yang (2007). The book Yang (2006) describes the statistical and computational details. Examples of analyses that can be performed using the package include the following:
- Comparison and tests of phylogenetic trees (BASEML and CODEML).
- Estimation of parameters in sophisticated substitution models, including models of variable rates among sites and models for combined analysis of multiple genes or site partitions (BASEML and CODEML).
- Likelihood ratio tests (LRTs) of hypotheses through comparison of implemented models (BASEML, CODEML, chi2).
- Estimation of divergence times under global and local clock models (BASEML and CODEML).
- Likelihood (Empirical Bayes) reconstruction of ancestral sequences using nucleotide, amino acid, and codon models (BASEML and CODEML).
- Generation of datasets of nucleotide, codon, and amino acid sequences by Monte Carlo simulation (evolver).
- Estimation of synonymous and nonsynonymous substitution rates and detection of positive selection in protein-coding DNA sequences (yn00 and CODEML).
- Bayesian estimation of species divergence times incorporating uncertainties in fossil calibrations (MCMCtree).
The strength of PAML is its collection of sophisticated substitution models. The tree search algorithms implemented in BASEML and CODEML are rather primitive, so except for very small datasets (say, fewer than 10 species), you are better off using other software such as raxml-ng, IQ-TREE, PhyloBayes, or MrBayes to infer the tree topology or topologies, which you can then evaluate in BASEML or CODEML as input trees.
- BASEML and CODEML: The program BASEML is for maximum likelihood analysis of nucleotide sequences. The program CODEML is formed by merging two old programs: codonml, which implements the codon substitution model of Goldman and Yang (1994) for protein-coding DNA sequences, and aaml, which implements models for amino acid sequences. These two are now distinguished by the variable seqtype in the control file codeml.ctl, with 1 for codon sequences and 2 for amino acid sequences. In this document, I use codonml and aaml to refer to CODEML with seqtype = 1 and seqtype = 2, respectively. The programs BASEML and CODEML use similar algorithms to fit models by maximum likelihood, the main difference being that the unit of evolution in the Markov model, referred to as a "site" in the sequence, is a nucleotide, a codon, or an amino acid for BASEML, codonml, and aaml, respectively. Markov process models are used to describe substitutions between nucleotides, codons, or amino acids, with substitution rates assumed to be either constant or variable among sites.
- evolver: This program can be used to simulate sequences under nucleotide, codon, and amino acid substitution models. It also has some other options such as generating random trees and calculating the partition distances (Robinson and Foulds 1981) between trees.
- basemlg: This program implements the (continuous) gamma model of Yang (1993). It is very slow and infeasible for data sets of more than 6 or 7 species. Instead, the discrete-gamma model in BASEML described in Yang (1994) should be used.
- MCMCtree: This program implements the Bayesian MCMC algorithm of Yang and Rannala (2006) and Rannala and Yang (2007) for estimating species divergence times.
- pamp: This program implements the parsimony-based analysis of Yang and Kumar (1996).
- yn00: This program implements the method of Yang and Nielsen (2000) for estimating synonymous and nonsynonymous substitution rates (dS and dN) in pairwise comparisons of protein-coding DNA sequences.
- chi2: This program calculates the $\chi^{2}$ critical value and p-value for conducting the likelihood ratio test. Run the program by typing its name: chi2. The program will then print out the critical values for different d.f. (for example, the 5% critical value with d.f. = 1 is 3.84). If you run the program with one command-line argument, it enters a loop that asks you to input the d.f. and the test statistic and then calculates the p-value. A third way of running the program is to include both the d.f. and the test statistic as command-line arguments. For instance:
chi2
chi2 p
chi2 1 3.84
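The calculation performed by the last form can also be sketched outside PAML. The Python snippet below is a minimal illustration and not the code used by chi2: it computes the LRT statistic as twice the log-likelihood difference between two nested models and the upper-tail probability of the $\chi^{2}$ distribution for an integer d.f. The function names and the log-likelihood values in the usage lines are hypothetical.

```python
import math


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Return 2 * (lnL_alt - lnL_null), the likelihood ratio test statistic."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_pvalue(statistic: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer d.f."""
    if df < 1:
        raise ValueError("d.f. must be a positive integer")
    if statistic <= 0.0:
        return 1.0
    half = statistic / 2.0
    # Closed-form starting points: d.f. = 1 (odd case) or d.f. = 2 (even case).
    if df % 2 == 1:
        prob, a = math.erfc(math.sqrt(half)), 0.5
    else:
        prob, a = math.exp(-half), 1.0
    # Recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1),
    # applied until a reaches df / 2.
    while a < df / 2.0:
        prob += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return prob


# Hypothetical log-likelihoods of two nested models differing by one parameter.
stat = lrt_statistic(lnl_null=-1234.56, lnl_alt=-1231.20)
print(f"2*delta_lnL = {stat:.2f}, p = {chi2_pvalue(stat, df=1):.4f}")
```

With a statistic of 3.84 and d.f. = 1 the function returns a value close to 0.05, in agreement with the critical value quoted above.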
What PAML programs cannot do
There are many things that you might well expect a phylogenetics package to do but that PAML cannot. Below, you can find a partial list of such limitations, provided in the hope that it might help you avoid wasting time.
- Sequence alignment: You should use other programs such as Muscle5, mafft, or BAli-Phy (to name just a few; there are many more you can use!) to align the sequences automatically. Manual adjustment does not seem to have reached a stage of maturity at which it can be entirely trusted, so you should always do that with care. If you are constructing thousands of alignments in a genome-wide analysis, you should implement some quality control and, say, calculate some measure of sequence divergence as an indication of the unreliability of the alignment. For coding sequences, you might align the protein sequences and construct the DNA alignment based on the protein alignment. Note that, if cleandata = 0, both ambiguity characters and alignment gaps are treated as ambiguity characters in BASEML and CODEML. If cleandata = 1, all sites with ambiguity characters and alignment gaps are removed from all sequences before analysis.
- Gene prediction: The codon-based analysis implemented in CODEML (seqtype = 1) assumes that the sequences are pre-aligned exons, the sequence length is an exact multiple of 3, and the first nucleotide in the sequence is codon position 1. Introns, spacers, and other non-coding regions must be removed and the coding sequences must be aligned before running the program. The program cannot process sequences downloaded directly from GenBank, even though the CDS information is there, nor predict coding regions.
- Tree search in large data sets: As mentioned earlier, you should use another program to get a tree or some candidate trees and use them as user trees to fit models that might not be available in other packages.
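Since PAML does none of this preparation for you, the step of constructing a DNA alignment from a protein alignment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a PAML feature: gaps are written as "-", each coding sequence has had its stop codon removed and starts at codon position 1, and the code does not verify that the codons actually translate to the given residues. The function names and example sequences are hypothetical.

```python
from typing import Dict


def protein_to_codon_alignment(aligned_protein: str, cds: str) -> str:
    """Thread an unaligned coding sequence onto its aligned protein sequence."""
    residues = aligned_protein.replace("-", "")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    if len(cds) != 3 * len(residues):
        raise ValueError("coding sequence and protein lengths do not match")
    codons = iter(cds[i:i + 3] for i in range(0, len(cds), 3))
    # Each protein gap becomes a three-nucleotide gap; each residue takes the next codon.
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


def check_codon_alignment(alignment: Dict[str, str]) -> None:
    """Raise ValueError unless all sequences share one length divisible by 3."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences have different lengths")
    if lengths.pop() % 3 != 0:
        raise ValueError("alignment length is not a multiple of 3")


# Hypothetical two-species example.
proteins = {"sp1": "MK-L", "sp2": "MKTL"}
coding = {"sp1": "ATGAAACTG", "sp2": "ATGAAAACCCTG"}
codon_alignment = {
    name: protein_to_codon_alignment(proteins[name], coding[name]) for name in proteins
}
check_codon_alignment(codon_alignment)
print(codon_alignment["sp1"])  # ATGAAA---CTG
```

The length check mirrors the requirement that the sequence length be an exact multiple of 3; it is a sanity check before running CODEML, not a substitute for inspecting the alignment.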
Running PAML programs
Before running a PAML program, please make sure that you have followed the installation instructions for your operating system. Once the PAML programs have been added to the system's path, you can run a program by typing its name on the command line. If your working directory is not the one that contains your sequence file, tree file, and control file, you need to know the relative or absolute path to that folder. If you are inexperienced or are having trouble exporting paths (see Installation.md for tips on how to do this for different operating systems), you may copy the relevant executable file to the folder containing your data files and run the PAML program from this folder.
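If you launch PAML programs from a script, the two options described above (program on the path, or executable copied next to the data) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the helper names are hypothetical, the control file name is passed as a command-line argument, and the executable is named exactly as given (on Windows the copied file would carry an .exe extension, which this sketch does not handle).

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def resolve_executable(name: str, data_dir: Path) -> Optional[str]:
    """Find a program on the PATH, falling back to a copy inside data_dir."""
    found = shutil.which(name)
    if found is not None:
        return found
    local_copy = data_dir / name
    # Return an absolute path so that it stays valid after changing directory.
    return str(local_copy.resolve()) if local_copy.is_file() else None


def run_program(name: str, control_file: str, data_dir: Path) -> int:
    """Run the program inside data_dir so relative file names in the control file resolve."""
    executable = resolve_executable(name, data_dir)
    if executable is None:
        raise FileNotFoundError(f"{name} is neither on the PATH nor in {data_dir}")
    completed = subprocess.run([executable, control_file], cwd=data_dir)
    return completed.returncode


# Example call (commented out because it needs PAML and real data files):
# run_program("codeml", "codeml.ctl", Path("path/to/your/data"))
```

Running with the data folder as the working directory keeps the file names in the control file short and places the output files next to the input files.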
[!NOTE] When running CODEML, please note that you may need a data file such as grantham.dat, dayhoff.dat, jones.dat, wag.dat, mtREV24.dat, mtmam.dat, etc., so you should also copy these files to the directory where you have your input files and control file (and add the corresponding file name to the variable aaRatefile in the control file!). You can find these files in the dat directory, which you will have access to in your file system once you clone the repository or download the latest release. Alternatively, you can always type the relative path to the file you want to use in the variable aaRatefile.
[!IMPORTANT] Some PAML programs produce result files such as rub, lnf, rst, or rates. You should not use these names (or other names that PAML programs use to create output files) for your own files. Otherwise, they will be overwritten!
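A quick check before a run can catch such clashes. The Python sketch below is a minimal illustration: the set contains only the four names listed above and is therefore incomplete, and the function name and example file names are hypothetical.

```python
from typing import Iterable, List

# Output names mentioned in this section; PAML programs write other files too,
# so this set is illustrative and incomplete.
RESERVED_OUTPUT_NAMES = {"rub", "lnf", "rst", "rates"}


def find_name_clashes(file_names: Iterable[str]) -> List[str]:
    """Return the file names that a PAML run in the same folder could overwrite."""
    return sorted(name for name in file_names if name in RESERVED_OUTPUT_NAMES)


# Hypothetical contents of a working directory.
print(find_name_clashes(["lysin.nuc", "lysin.trees", "rst", "rates"]))  # ['rates', 'rst']
```

Renaming any file that is reported here before starting the analysis avoids losing it.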
Example data sets
The examples/ folder contains many example data sets. They were used in the original papers to test the new methods, and I included them so that you could duplicate our results in the papers. Sequence alignments, control files, and detailed readme files are included. They are intended to help you get familiar with the input data formats and with interpretation of the results, and also to help you discover bugs in the program. If you are interested in a particular analysis, get a copy of the paper that described the method and analyse the example dataset to duplicate the published results. This is particularly important because the manual, as it is written, describes the meanings of the control variables used by the programs but does not clearly explain how to set up the control file to conduct a particular analysis.
- examples/HIVNSsites/: This folder contains example data files for the HIV-1 env V3 region analysed in Yang et al. (2000b). The data set is for demonstrating the NSsites models described in that paper, that is, models of variable $\omega$ ratios among amino acid sites. Those models are called the “random-sites” models by Yang & Swanson (2002) since a priori we do not know which sites might be highly conserved and which under positive selection. They are also known as “fishing-expedition” models. The included data set is the 10th data set analysed by Yang et al. (2000b), and the results are in table 12 of that paper. Look at the README.txt file in that folder.
- examples/lysin/: This folder contains the sperm lysin genes from 25 abalone species analysed by Yang, Swanson & Vacquier (2000a) and Yang and Swanson (2002). The data set is for demonstrating both the “random-sites” models (as in Yang, Swanson & Vacquier (2000a)) and the “fixed-sites” models (as in Yang and Swanson (2002)). In the latter paper, we used structural information to partition amino acid sites in the lysin into the “buried” and “exposed” classes and assigned and estimated different $\omega$ ratios for the two partitions. The hypothesis is that the sites exposed on the surface are likely to be under positive selection. Look at the README.txt file in that folder.
- examples/lysozyme/: This folder contains the primate lysozyme c genes of Messier and Stewart (1997), re-analysed by Yang (1998). This is for demonstrating codon models that assign different $\omega$ ratios for different branches in the tree, useful for testing positive selection along lineages. Those models are sometimes called branch models or branch-specific models. Both the “large” and the “small” data sets in Yang (1998) are included. Those models require the user to label branches in the tree, and the readme file and included tree file explain the format in great detail. See also the section “Tree file and representations of tree topology” later about specifying branch/node labels. The lysozyme data set was also used by Yang and Nielsen (2002) to implement the so-called “branch-site” models, which allow the $\omega$ ratio to vary both among lineages and among sites. Look at the README.txt file to learn how to run those models.
- examples/MouseLemurs/: This folder includes the mtDNA alignment that Yang and Yoder (2003) analysed to estimate divergence dates in mouse lemurs. The data set is for demonstrating maximum likelihood estimation of divergence dates under models of global and local clocks. The most sophisticated model described in that paper uses multiple calibration nodes simultaneously, analyses multiple genes (or site partitions) while accounting for their differences, and also accounts for variable rates among branch groups. The README.txt file explains the input data format as well as model specification in detail. The README2.txt file explains the ad hoc rate smoothing procedure of Yang (2004).
- examples/mtCDNA/: This folder includes the alignment of 12 protein-coding genes on the same strand of the mitochondrial genome from seven ape species analysed by Yang, Nielsen, & Hasegawa (1998) under a number of codon and amino acid substitution models. The data set is the “small” data set referred to in that paper, and was used to fit both the “mechanistic” and empirical models of amino acid substitution as well as the “mechanistic” models of codon substitution. The model can be used, for example, to test whether the rates of conserved and radical amino acid substitutions are equal. See the README.txt file for details.
- examples/TipDate.HIV2/: This folder includes the alignment of 33 SIV/HIV-2 sequences, compiled and analysed by Lemey et al. (2003) and re-analysed by Stadler and Yang (2013). The README.txt file explains how to duplicate the ML and Bayesian results published in that paper. Note that the sample date is the last field in the sequence name.
Some other data files are included in the package as well. The details follow:
- brown.nuc and brown.trees: the 895-bp mtDNA data of Brown et al. (1982), used in Yang et al. (1994) and Yang (1994b) to test models of variable rates among sites.
- mtprim9.nuc and 9s.trees: mitochondrial segment consisting of 888 aligned sites from 9 primate species (Hayasaka et al. 1988), used by Yang (1994b) to test the discrete-gamma model and Yang (1995) to test the auto-discrete-gamma models.
- abglobin.nuc and abglobin.trees: the concatenated $\alpha$- and $\beta$-globin genes, used by Goldman and Yang (1994) in their description of the codon model. abglobin.aa is the alignment of the translated amino acid sequences.
- stewart.aa and stewart.trees: lysozyme protein sequences of six mammals (Stewart et al. 1987), used by Yang et al. (1995b) to test methods for reconstructing ancestral amino acid sequences.