Control File For Species Tree Inference - Pas-Kapli/bpp-tutorial GitHub Wiki

A detailed description of the variables in the control file is provided in the program's documentation. Here, we focus on a subset of the available parameters that are relevant to species tree inference.

Basic information:

  1. Blank lines are not interpreted by the program you can add to add clarity among parameter sections.

  2. Anything that follows an * in the same line is not interpreted by the software, it can be used to add comments or comment out a parameter.

  3. The control file is read line by line starting from the first on and until the end. If the same parameter is given twice with different options each time, the second one overwrites the first one.

Customizing the control file for our analysis

For the purposes of the demo we will gradually build the control file for species tree inference. However, the safest strategy when setting up an analysis is to start from an existing one and customize it to our data. Example control files for all types of analysis are provided with the BPP software download in the folder "data".

Seed

In the first line of the control file, we always provide the Seed, which is a random number for initializing the MCMC sampling. Using the same seed in different runs allows us to repeat the analysis in the same exact way. We can either provide a number of our choice or use seed = -1 to pick a random number.

Input/Output file names

Next, we provide the name of the file containing the sequences, the Imap file, as well as the name for the output information and the posterior sample as follows.

If all or some of the files are not in the current working directory we can provide complete paths before the file name. For example:

  seed =  -1

  seqfile = bpp_seqfile.txt
  Imapfile = Imap.txt
  outfile = out.txt
  mcmcfile = mcmc.txt

Analysis mode

To activate the species tree search with a fixed assignment of samples to species we add two lines in the control file:

  seed =  -1

  seqfile = bpp_seqfile.txt
  Imapfile = Imap.txt
  outfile = out.txt
  mcmcfile = mcmc.txt
  
  speciesdelimitation = 0
  speciestree = 1

Species Tree Prior

For the speciesmodelprior we have two options

  1. speciesmodelprior = 0: Equal probabilities for labeled histories
  2. speciesmodelprior = 1: Equal probabilities for rooted trees

In our example, we will use the second one:

  seed =  -1

  seqfile = bpp_seqfile.txt
  Imapfile = Imap.txt
  outfile = out.txt
  mcmcfile = mcmc.txt
  
  speciesdelimitation = 0
  speciestree = 1
  
  speciesmodelprior = 1

Species, starting tree and data

In the next set of parameters, we define:

  1. species&tree: the number of species followed by their names, in the following line the number of samples per species and in the following a rooted binary tree with the species defined above. This tree serves as the starting point for the tree search.
  2. phase: this parameter has two options phase = 0 which means the sequences are phased and phase = 1 when they are not. The parameter is defined per species. In the frogs' example, all sequences for all four species are unphased.
  3. usedata: we set usedata = 1 to use the sequence data, if usedata = 0 the analysis runs under the prior.
  4. nloci: we set this to nloci = 4 if we want to use all loci for the particular example. If we set nloci = 2 the analysis will take into account the first two loci.
  5. cleandata: this parameter takes two options, cleandata=0 means that all alignment sites are taken into account, with cleandata=1 the sites with gaps or degenerate characters are removed before the analysis. For the frog data this doesn't make a big difference as most of the sites are complete and unambiguous.
  6. model: two model options are currently available in bpp JC69 and GTR, we will use JC69 for the example
  seed =  -1

  seqfile = bpp_seqfile.txt
  Imapfile = Imap.txt
  outfile = out.txt
  mcmcfile = mcmc.txt
  
  speciesdelimitation = 0
  speciestree = 1
  
  speciesmodelprior = 1

  species&tree = 4  K  C  L  H
                    9  7 14  2
                 ((K, C), (L, H));
  phase =   1  1  1  1
  nloci = 5
  cleandata = 0
  usedata = 1
  model = JC69

Theta and Root-Tau priors

Theta is an estimate of the population size measured in expected number of mutations per site. We assign a single prior following the Inverse Gamma distribution (IG) for both the extant and the ancestral taxa. We select a distribution such that the mean roughly corresponds to what we assume to be mean divergence within a population. If we have little prior information we choose a diffused distribution by setting the a to 3. For the frog data we assign thetaprior = 3 0.002.

In the species Tree inference, we are not usually interested in the estimate of theta, therefore, by setting "thetaprior = 3 0.002" we can integrate out analytically the theta parameters and we will not get a posterior estimate for them. As we will see in the second part of the exercise (A00 under MSci), we can also estimate the theta parameters.

The root tauprior, expresses our prior belief for he divergence we expect from the tips to the root of the phylogeny. Similar to the theta parameter it is also measured in expected number of mutations per site and it is modeled according to an inverse gamma distribution. For the frog data we assign tauprior = 3 0.004 which means we roughly expect the genomic diverge from the root to the tips to be around 0.2%.

  seed =  -1

  seqfile = bpp_seqfile.txt
  Imapfile = Imap.txt
  outfile = out.txt
  mcmcfile = mcmc.txt
  
  speciesdelimitation = 0
  speciestree = 1
  
  speciesmodelprior = 1

  species&tree = 4  K  C  L  H
                    9  7 14  2
                 ((K, C), (L, H));
  phase =   1  1  1  1
  nloci = 5
  cleandata = 0
  usedata = 1
  model = JC69

  !thetaprior = 3 0.002   # invgamma(a, b) for theta
  tauprior = 3 0.004    # invgamma(a, b) for root tau & Dirichlet(a) for other tau's

MCMC sampling parameters

Here we specify how many MCMC steps we want to perform.

  1. nsample specifies how many MCMC steps are written in the "mcmc.txt" file

  2. sampfreq is the frequency with which we write in the "mcmc.txt" file

  3. burnin is the initial MCMC samples before sampling takes place.

The total number of MCMC iterations performed in this example is 68000 (i.e., burnin + samplefreq*nsample)

There is no rule of how long a run should be, however, after running

  seed =  -1

  seqfile = bpp_seqfile.txt
  Imapfile = Imap.txt
  outfile = out.txt
  mcmcfile = mcmc.txt
  
  speciesdelimitation = 0   * fixed species delimitation
  speciestree = 1
  
  speciesmodelprior = 1  * 0: uniform LH; 1:uniform rooted trees; 2: uniformSLH; 3: uniformSRooted

  species&tree = 4  K  C  L  H
                    9  7 14  2
                 ((K, C), (L, H));
  phase =   1  1  1  1
  nloci = 5         * number of data sets in seqfile
  cleandata = 0     * remove sites with ambiguity data (1:yes, 0:no)
  usedata = 1       * 0: no data (prior); 1:seq like
  model = JC69

  thetaprior = 3 0.002   # invgamma(a, b) for theta. Does not estimate theta
  tauprior = 3 0.004    # invgamma(a, b) for root tau & Dirichlet(a) for other tau's
  
  finetune = 1
  print = 1 0 0 0   * MCMC samples, locusrate, heredityscalars Genetrees
  burnin = 8000
  sampfreq = 2
  nsample = 30000

Download the control file from this link or with wget:

wget https://raw.githubusercontent.com/Pas-Kapli/bpp-tutorial/master/A01_Frogs/data/A01.bpp.ctl

Next, run the analysis and check the output files