Control file format and options - bpp/bpp-tutorial-geneflow GitHub Wiki
Imap file
The imap file consists of two-columns and is used to assign the samples to species (surjective mapping). In the first column of the file are the names of the samples followed by the species name they correspond to. Each sample name needs to be the same across loci.
The Imap file format is as follows:
SampleName1 SpeciesName
SampleName2 SpeciesName
SampleName3 SpeciesName
SampleName4 SpeciesName
SampleName5 SpeciesName
.
.
The delimiter between SampleName
and SpeciesName
may be space(s), tab(s) or a mixture of the two characters.
See the Imap file for the Baobab dataset.
Sequence file
BPP requires a single input file containing all aligned loci for a dataset. In this file, the alignments (in PHYLIP format) are arranged sequentially, one below the other, unlike other phylogenetic software that typically concatenate them side by side.
The sequence file format is as follows:
[#SEQUENCES_LOCUS_1] [#ALIGNMENT_SITES_LOCUS_1]
SequenceNameA^SampleName1 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameB^SampleName2 GGAGCCAACAGAGTTTAACATTCT...
SequenceNameC^SampleName3 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameD^SampleName4 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameE^SampleName5 GGAGCCAACAGAGTTTAACGTTCT...
.
.
[#SEQUENCES_LOCUS_2] [#ALIGNMENT_SITES_LOCUS_2]
SequenceNameA^SampleName1 TCCCTTTCTCGGGCATTG...
SequenceNameB^SampleName2 TCCCTTTCTCGGGCATTG...
SequenceNameB^SampleName3 TCCCTTTCTCRGGCATTG...
SequenceNameC^SampleName4 TCCCTTTCTCGGGCATTG...
SequenceNameD^SampleName5 TCCCTTTCTCGGGCATTG...
.
.
The SampleNameX
tags refer to the particular individual animals sequenced for this locus and correspond to the SampleNameX
in the Imap file. The SequenceNameX
labels may be codes for the particular sequence and are ignored by the program.
Omitting sequence labels also constitutes a valid format:
[#SEQUENCES_LOCUS1] [#ALIGNMENT_SITES_LOCUS1]
^SampleName1 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName2 GGAGCCAACAGAGTTTAACATTCT...
^SampleName3 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName4 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName5 GGAGCCAACAGAGTTTAACGTTCT...
.
.
See an example sequence file consisting of two loci for the Baobab dataset.
Control file
# Seed for the random number generator (-1 selects a random seed)
seed = -1
# input and output files
seqfile = baobab.phy # sequence file (per-locus alignments)
Imapfile = baobab.map # assignments of samples to species
jobname = bb1 # serves as the prefix for all output filenames, including MCMC sample file and output log file
# Selection of analysis type by setting options "speciesdelimitation" and "speciestree"
# A00 - estimation of parameters on fixed phylogeny
# A10 - species delimitation using a guide tree
# A01 - species tree inference
# A11 - joint species tree inference and species delimitation
# enable species delimitation (two available algorithms):
speciesdelimitation = 0 # species delimitation disabled (default)
# speciesdelimitation = 1 0 2 # species delimitation algorithm 0 finetune (e)
# speciesdelimitation = 1 1 2 0.5 # species delimitation algorithm 1 finetune (a m)
# enable species tree inference
speciestree = 1
# species model prior (four potential priors: 0,1,2,3)
# Method A01 (species tree inference) uses speciesmodelprior = 1
# Method A10 (species delimitation with a fixed guide tree) uses either speciesmodelprior=0,1
# Method A11 (joint species delimitation and species tree inference) uses either speciesmodelprior=0,1,2,3
speciesmodelprior = 1 * 0: uniform labeled histories; 1:uniform rooted trees
# specification of: # of species, whitespace-separated list of species
# followed by max number of sequences for each species at a locus,
# and a starting species tree (fixed for A00 and A10)
species&tree = 6 Adig Agra Agre Amad Arub Smic
3 2 1 2 2 1
(((Agra, (Smic,Amad)),(Adig,Agre)),Arub);
usedata = 1 # 0: do no use data (prior); 1: use sequence data
nloci = 100 # number of data sets (alignments) in seqfile
cleandata = 0 # remove sites with ambiguity data (1:yes, 0:no)?
phase = 0 0 0 0 0 0 # phase data
# gamma and inverse gamma priors on thetas and root tau
thetaprior = 3 0.02 # Inverse-Gamma(a, b) for thetas
tauprior = 3 0.07 # Inverse-Gamma(a, b) for root tau
# equivalently, we could have used:
# thetaprior = invgamma 3 0.02
# tauprior = invgamma 3 0.07
# gamma priors are also available
#thetaprior = gamma 3 300 # Gamma(a,b) for theta
#tauprior = gamma 3 85 # Gamma(a,b) for root tau
# auto-adjust step lengths during burnin (1: yes, 0: no)
# Potentially, followed by a list of key:val entries, where
# key specifies the step length and val the corresponding value
# Default values are used for missing step length entries.
finetune = 1
# finetune = 1 gage:5 gspr:0.001 th1:0.001 tau:0.001 mix:0.3
# binary flags on what to log
print = 1 0 0 0 # MCMC samples, locusrate, heredity scalars, Gene trees
# MCMC chain information
# Total chain length is: burnin + sampfreq*nsample
# First burnin samples are discarded, then we log every sampfreq-th sample
burnin = 50000 # discard first 50000 steps
sampfreq = 5 # log sample every 5th step (after burnin)
nsample = 100000 # number of samples to log in mcmcfile
# threads
# threads = 2 # threads = threads_count
threads = 2 1 1 # threads = threads_count starting_core step
For more information please check the BPP Documentation project or the BPP Manual
Priors on theta and tau
All $\theta$ parameters (both modern and extinct ancestral species) in the MSC model are assigned one of the following four available priors:
thetaprior = gamma a b # Gamma(a,b)
thetaprior = invgamma a b # Inverse-Gamma(a,b) with estimation of theta
thetaprior = invgamma a b int # Inverse-Gamma(a,b) with integrated-out theta
The prior for the the root node $\tau$ is:
tauprior = gamma a b # Gamma(a,b)
tauprior = invgamma a b # Inverse-Gamma(a,b)
Inverse-Gamma is used by default if the distribution is not specified (e.g. thetaprior = a b)
In our demo, we use a diffuse prior by setting the shape $a=2$, and we adjust the rate parameter ($b$) to reflect the mean ($a/b$).
You can use for example R to visualize the distributions and to calculate the 95% prior interval:
# For gamma distribution
a <- 3; b <- 300
curve(dgamma(x,a,b), from=0, to=0.03)
qgamma(c(0.025, 0.975), a, b) # calculates the 95% interval
# For inverse gamma distribution
# install invgamma packages if not installed yet
install.packages("invgamma")
library("invgamma")
a <- 3; b <- 0.002;
curve(dinvgamma(x,a,b), from=0, to=0.01)
qinvgamma(c(0.025, 0.975), a, b) # calculates the 95% interval
Small $\alpha$ values (e.g. 2,3) represent diffuse priors while larger values like 10,20,100 represent informative priors.
Species Tree Inference with BPP
Next:Species Tree Inference with Astral | BPP assumptions | BPP control file | Species Tree Inference with BPP | Parameter Estimation with BPP