Control File For Species Tree Inference - Pas-Kapli/bpp-tutorial GitHub Wiki
A detailed description of the variables in the control file is provided in the program's documentation. Here, we focus on a subset of the available parameters that are relevant to species tree inference.
Basic information:
-
Blank lines are not interpreted by the program you can add to add clarity among parameter sections.
-
Anything that follows an
*
in the same line is not interpreted by the software, it can be used to add comments or comment out a parameter. -
The control file is read line by line starting from the first on and until the end. If the same parameter is given twice with different options each time, the second one overwrites the first one.
Customizing the control file for our analysis
For the purposes of the demo we will gradually build the control file for species tree inference. However, the safest strategy when setting up an analysis is to start from an existing one and customize it to our data. Example control files for all types of analysis are provided with the BPP software download in the folder "data".
Seed
In the first line of the control file, we always provide the Seed, which is a random number for initializing the MCMC sampling. Using the same seed in different runs allows us to repeat the analysis in the same exact way. We can either provide a number of our choice or use seed = -1
to pick a random number.
Input/Output file names
Next, we provide the name of the file containing the sequences, the Imap file, as well as the name for the output information and the posterior sample as follows.
If all or some of the files are not in the current working directory we can provide complete paths before the file name. For example:
seed = -1
seqfile = bpp_seqfile.txt
Imapfile = Imap.txt
outfile = out.txt
mcmcfile = mcmc.txt
Analysis mode
To activate the species tree search with a fixed assignment of samples to species we add two lines in the control file:
seed = -1
seqfile = bpp_seqfile.txt
Imapfile = Imap.txt
outfile = out.txt
mcmcfile = mcmc.txt
speciesdelimitation = 0
speciestree = 1
Species Tree Prior
For the speciesmodelprior
we have two options
speciesmodelprior = 0
: Equal probabilities for labeled historiesspeciesmodelprior = 1
: Equal probabilities for rooted trees
In our example, we will use the second one:
seed = -1
seqfile = bpp_seqfile.txt
Imapfile = Imap.txt
outfile = out.txt
mcmcfile = mcmc.txt
speciesdelimitation = 0
speciestree = 1
speciesmodelprior = 1
Species, starting tree and data
In the next set of parameters, we define:
species&tree
: the number of species followed by their names, in the following line the number of samples per species and in the following a rooted binary tree with the species defined above. This tree serves as the starting point for the tree search.phase
: this parameter has two optionsphase = 0
which means the sequences are phased andphase = 1
when they are not. The parameter is defined per species. In the frogs' example, all sequences for all four species are unphased.usedata
: we setusedata = 1
to use the sequence data, ifusedata = 0
the analysis runs under the prior.nloci
: we set this tonloci = 4
if we want to use all loci for the particular example. If we setnloci = 2
the analysis will take into account the first two loci.cleandata
: this parameter takes two options,cleandata=0
means that all alignment sites are taken into account, withcleandata=1
the sites with gaps or degenerate characters are removed before the analysis. For the frog data this doesn't make a big difference as most of the sites are complete and unambiguous.model
: two model options are currently available in bpp JC69 and GTR, we will use JC69 for the example
seed = -1
seqfile = bpp_seqfile.txt
Imapfile = Imap.txt
outfile = out.txt
mcmcfile = mcmc.txt
speciesdelimitation = 0
speciestree = 1
speciesmodelprior = 1
species&tree = 4 K C L H
9 7 14 2
((K, C), (L, H));
phase = 1 1 1 1
nloci = 5
cleandata = 0
usedata = 1
model = JC69
Theta and Root-Tau priors
Theta is an estimate of the population size measured in expected number of mutations per site. We assign a single prior following the Inverse Gamma distribution (IG) for both the extant and the ancestral taxa. We select a distribution such that the mean roughly corresponds to what we assume to be mean divergence within a population. If we have little prior information we choose a diffused distribution by setting the a to 3. For the frog data we assign thetaprior = 3 0.002
.
In the species Tree inference, we are not usually interested in the estimate of theta, therefore, by setting "thetaprior = 3 0.002" we can integrate out analytically the theta parameters and we will not get a posterior estimate for them. As we will see in the second part of the exercise (A00 under MSci), we can also estimate the theta parameters.
The root tauprior, expresses our prior belief for he divergence we expect from the tips to the root of the phylogeny. Similar to the theta parameter it is also measured in expected number of mutations per site and it is modeled according to an inverse gamma distribution. For the frog data we assign tauprior = 3 0.004
which means we roughly expect the genomic diverge from the root to the tips to be around 0.2%.
seed = -1
seqfile = bpp_seqfile.txt
Imapfile = Imap.txt
outfile = out.txt
mcmcfile = mcmc.txt
speciesdelimitation = 0
speciestree = 1
speciesmodelprior = 1
species&tree = 4 K C L H
9 7 14 2
((K, C), (L, H));
phase = 1 1 1 1
nloci = 5
cleandata = 0
usedata = 1
model = JC69
!thetaprior = 3 0.002 # invgamma(a, b) for theta
tauprior = 3 0.004 # invgamma(a, b) for root tau & Dirichlet(a) for other tau's
MCMC sampling parameters
Here we specify how many MCMC steps we want to perform.
-
nsample
specifies how many MCMC steps are written in the "mcmc.txt" file -
sampfreq
is the frequency with which we write in the "mcmc.txt" file -
burnin
is the initial MCMC samples before sampling takes place.
The total number of MCMC iterations performed in this example is 68000 (i.e., burnin + samplefreq*nsample)
There is no rule of how long a run should be, however, after running
seed = -1
seqfile = bpp_seqfile.txt
Imapfile = Imap.txt
outfile = out.txt
mcmcfile = mcmc.txt
speciesdelimitation = 0 * fixed species delimitation
speciestree = 1
speciesmodelprior = 1 * 0: uniform LH; 1:uniform rooted trees; 2: uniformSLH; 3: uniformSRooted
species&tree = 4 K C L H
9 7 14 2
((K, C), (L, H));
phase = 1 1 1 1
nloci = 5 * number of data sets in seqfile
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)
usedata = 1 * 0: no data (prior); 1:seq like
model = JC69
thetaprior = 3 0.002 # invgamma(a, b) for theta. Does not estimate theta
tauprior = 3 0.004 # invgamma(a, b) for root tau & Dirichlet(a) for other tau's
finetune = 1
print = 1 0 0 0 * MCMC samples, locusrate, heredityscalars Genetrees
burnin = 8000
sampfreq = 2
nsample = 30000
Download the control file from this link or with wget
:
wget https://raw.githubusercontent.com/Pas-Kapli/bpp-tutorial/master/A01_Frogs/data/A01.bpp.ctl