Control file format and options - bpp/bpp-tutorial-geneflow GitHub Wiki

Imap file

The imap file consists of two-columns and is used to assign the samples to species (surjective mapping). In the first column of the file are the names of the samples followed by the species name they correspond to. Each sample name needs to be the same across loci.

The Imap file format is as follows:

SampleName1  SpeciesName
SampleName2  SpeciesName
SampleName3  SpeciesName
SampleName4  SpeciesName
SampleName5  SpeciesName
.
.

The delimiter between SampleName and SpeciesName may be space(s), tab(s) or a mixture of the two characters.

See the Imap file for the Baobab dataset.

Sequence file

BPP requires a single input file containing all aligned loci for a dataset. In this file, the alignments (in PHYLIP format) are arranged sequentially, one below the other, unlike other phylogenetic software that typically concatenate them side by side.

The sequence file format is as follows:

[#SEQUENCES_LOCUS_1] [#ALIGNMENT_SITES_LOCUS_1]

SequenceNameA^SampleName1     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameB^SampleName2     GGAGCCAACAGAGTTTAACATTCT...
SequenceNameC^SampleName3     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameD^SampleName4     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameE^SampleName5     GGAGCCAACAGAGTTTAACGTTCT...
.
.

[#SEQUENCES_LOCUS_2] [#ALIGNMENT_SITES_LOCUS_2]

SequenceNameA^SampleName1     TCCCTTTCTCGGGCATTG...
SequenceNameB^SampleName2     TCCCTTTCTCGGGCATTG...
SequenceNameB^SampleName3     TCCCTTTCTCRGGCATTG...
SequenceNameC^SampleName4     TCCCTTTCTCGGGCATTG...
SequenceNameD^SampleName5     TCCCTTTCTCGGGCATTG...
.
.

The SampleNameX tags refer to the particular individual animals sequenced for this locus and correspond to the SampleNameX in the Imap file. The SequenceNameX labels may be codes for the particular sequence and are ignored by the program.

Omitting sequence labels also constitutes a valid format:

[#SEQUENCES_LOCUS1] [#ALIGNMENT_SITES_LOCUS1]

^SampleName1     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName2     GGAGCCAACAGAGTTTAACATTCT...
^SampleName3     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName4     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName5     GGAGCCAACAGAGTTTAACGTTCT...
.
.

See an example sequence file consisting of two loci for the Baobab dataset.

Control file

# Seed for the random number generator (-1 selects a random seed)
seed =  -1

# input and output files
seqfile  = baobab.phy     # sequence file (per-locus alignments)
Imapfile = baobab.map     # assignments of samples to species
jobname  = bb1            # serves as the prefix for all output filenames, including MCMC sample file and output log file

# Selection of analysis type by setting options "speciesdelimitation" and "speciestree"
# A00 - estimation of parameters on fixed phylogeny
# A10 - species delimitation using a guide tree
# A01 - species tree inference
# A11 - joint species tree inference and species delimitation

# enable species delimitation (two available algorithms):
speciesdelimitation = 0             # species delimitation disabled (default)
# speciesdelimitation = 1 0 2       # species delimitation algorithm 0 finetune (e)
# speciesdelimitation = 1 1 2 0.5   # species delimitation algorithm 1 finetune (a m)

# enable species tree inference
speciestree = 1

# species model prior (four potential priors: 0,1,2,3)
# Method A01 (species tree inference) uses speciesmodelprior = 1
# Method A10 (species delimitation with a fixed guide tree) uses either speciesmodelprior=0,1
# Method A11 (joint species delimitation and species tree inference) uses either speciesmodelprior=0,1,2,3
speciesmodelprior = 1         * 0: uniform labeled histories; 1:uniform rooted trees

# specification of: # of species, whitespace-separated list of species
# followed by max number of sequences for each species at a locus,
# and a starting species tree (fixed for A00 and A10)
species&tree =  6  Adig Agra Agre Amad Arub Smic
                      3    2    1    2    2    1
                   (((Agra, (Smic,Amad)),(Adig,Agre)),Arub);


  usedata = 1      #  0: do no use data (prior); 1: use sequence data
    nloci = 100    # number of data sets (alignments) in seqfile

cleandata = 0      # remove sites with ambiguity data (1:yes, 0:no)?
phase = 0 0 0 0 0 0     # phase data

# gamma and inverse gamma priors on thetas and root tau
 thetaprior = 3 0.02           # Inverse-Gamma(a, b) for thetas
 tauprior   = 3 0.07           # Inverse-Gamma(a, b) for root tau

# equivalently, we could have used:
# thetaprior = invgamma 3 0.02  
# tauprior   = invgamma 3 0.07

# gamma priors are also available
#thetaprior = gamma 3 300     # Gamma(a,b) for theta
#tauprior   = gamma 3 85      # Gamma(a,b) for root tau

# auto-adjust step lengths during burnin (1: yes, 0: no)
# Potentially, followed by a list of key:val entries, where 
# key specifies the step length and val the corresponding value
# Default values are used for missing step length entries.
finetune = 1
# finetune = 1 gage:5 gspr:0.001 th1:0.001 tau:0.001 mix:0.3

# binary flags on what to log
print = 1 0 0 0    # MCMC samples, locusrate, heredity scalars, Gene trees

# MCMC chain information
# Total chain length is: burnin + sampfreq*nsample
# First burnin samples are discarded, then we log every sampfreq-th sample
  burnin = 50000       # discard first 50000 steps
sampfreq = 5           # log sample every 5th step (after burnin)
 nsample = 100000      # number of samples to log in mcmcfile

# threads
# threads = 2          # threads = threads_count
  threads = 2 1 1      # threads = threads_count starting_core step

For more information please check the BPP Documentation project or the BPP Manual

Priors on theta and tau

All $\theta$ parameters (both modern and extinct ancestral species) in the MSC model are assigned one of the following four available priors:

thetaprior = gamma a b          # Gamma(a,b) 
thetaprior = invgamma a b       # Inverse-Gamma(a,b) with estimation of theta
thetaprior = invgamma a b int   # Inverse-Gamma(a,b) with integrated-out theta

The prior for the the root node $\tau$ is:

tauprior = gamma a b           # Gamma(a,b)
tauprior = invgamma a b        # Inverse-Gamma(a,b)

Inverse-Gamma is used by default if the distribution is not specified (e.g. thetaprior = a b)

In our demo, we use a diffuse prior by setting the shape $a=2$, and we adjust the rate parameter ($b$) to reflect the mean ($a/b$).

You can use for example R to visualize the distributions and to calculate the 95% prior interval:

# For gamma distribution
a <- 3; b <- 300
curve(dgamma(x,a,b), from=0, to=0.03)
qgamma(c(0.025, 0.975), a, b)      # calculates the 95% interval

# For inverse gamma distribution

# install invgamma packages if not installed yet
install.packages("invgamma")
library("invgamma")
a <- 3; b <- 0.002;
curve(dinvgamma(x,a,b), from=0, to=0.01)
qinvgamma(c(0.025, 0.975), a, b)   # calculates the 95% interval

Small $\alpha$ values (e.g. 2,3) represent diffuse priors while larger values like 10,20,100 represent informative priors.

Next: Species Tree Inference with BPP

Species Tree Inference with Astral | BPP assumptions | BPP control file | Species Tree Inference with BPP | Parameter Estimation with BPP