Species tree inference (Baobab dataset) - bpp/bpp-tutorial-geneflow GitHub Wiki
For the purpose of this tutorial, our dataset is a subset of baobabs (Adansonia) from Karimi et al. (2020). In the original study they targeted 380 nuclear loci via hybrid enrichment but for practical reasons we will use a subset of 50 loci.

As an outgroup we use Scleronema micranthaum. Below is the subset of species along with the number of individuals. The dataset consists of 100 loci.
Species | Acronym | Sequences |
A. digitata | Adig | 3 |
A. gregorii | Agre | 1 |
A. grandidierii | Agra | 2 |
A. madagascariensis | Amad | 2 |
A. rubrostipa | Arub | 2 |
S. micranthaum | Smic | 1 |
If you are running BPP on your own PC please follow the Installation instructions to obtain a BPP binary for your operating system and computer architecture.
We will use the following folder structure:
`-- DAY-3
|-- baobab
| |-- A00
| | |-- baobab.A00.ctl
| | `-- baobab.A00.msci.ctl
| |-- A01
| | `-- baobab.A01.ctl
| |-- baobab.map.txt
| `-- baobab.phy
`-- BPP
`-- A01
|-- r1
| `-- baobab.A01.ctl
`-- r2
`-- baobab.A01.ctl
9 directories, 7 files
Below are the steps to create the folder structure, download the data files, and copy control files to the folders where we will run BPP:
# create folder structure
cd
mkdir -p DAY-3/BPP/A01/r{1,2}
# download and extract data files
cd DAY-3
wget https://github.com/bpp/bpp-tutorial-geneflow/raw/main/data/baobab.tar.gz
tar zxvf baobab.tar.gz
rm baobab.tar.gz
# copy control files to the right folders
cd ~/DAY-3/BPP/A01/
cp ../../baobab/A01/baobab.A01.ctl r1
cp ../../baobab/A01/baobab.A01.ctl r2
You can check if your folder structure resembles the one described above using the commands:
cd ~/DAY-3
tree --charset=ascii
Now check that the BPP control files in the DAY-3/BPP/A01/r1
and DAY-3/BPP/A01/r2
point to the right PHYLIP (sequence alignment) and map files. They should have the following form:
seqfile = ../../../baobab/baobab.phy
Imapfile = ../../../baobab/baobab.map.txt
You can check by:
cd ~/DAY-3/BPP/A01/
cat r1/baobab.A01.ctl
cat r2/baobab.A01.ctl
You can use the available editors to open the file, e.g.
# vi/vim editor
vi r1/baobab.A01.ctl
#emacs
emacs r1/baobab.A01.ctl
# nano or pico
nano r1/baobab.A01.ctl
pico r1/baobab.A01.ctl
Note: In case you open the vi/vim editor accidentally and don't know how to get out, press ESC a few times and then type :q!
.
Finally, run bpp:
cd ~/DAY-3/BPP/A01/r1
bpp --cfile baobab.A01.ctl
If you run into problems with core pinning, or BPP exits with the error: "Error while pinning thread to core." then try running BPP with the --no-pin
option:
bpp --no-pin --cfile baobab.A01.ctl
Typically, we need to make at least two independent runs (with different starting seeds) in order to check for convergence. We will compare runs when they finish.
For each A01 run, BPP will create the following files:
Filename | Description |
---|---|
JOBNAME.mcmc.txt |
MCMC sample file |
JOBNAME.txt |
Output file (log of screen output) |
JOBNAME.SeedUsed |
Seed used for this run |
where JOBNAME
is the label provided in the jobname
control file option.
The output displayed on the screen is also saved in a file named according to the jobname
option specified in the control file, with a .txt
extension. For instance, if jobname = long-run
, the output file will be named long-run.txt
. See a sample output file for the Baobab dataset.
The first part of the output summarizes the input data. In particular there is a short description of the data for each locus, followed by the alignments in compressed form : i.e. sites with the same likelihood according to the specified substitution model are collapsed into a single site and a weight (number of times observed) is set for each such cluster of sites.
Analysis started at: Fri Nov 29 15:28:22 2024
Using BPP version: 4.8.2
pver sha1: 1a053a3b61831dc4fb680235a5f6e63e8158ea54
Command: bpp --cfile baobab.A01.ctl
Seed: 2051843109 (randomly generated)
Initial species tree: (((Agra, (Smic,Amad)),(Adig,Agre)),Arub);
Locus | Model | Sequences | Length | Ambiguous sites | Compressed | Base freqs
-------+-------+-----------+--------+-----------------+------------+------------
1 | JC69 | 10 | 960 | 12 | 19 | Fixed
2 | JC69 | 10 | 933 | 72 | 18 | Fixed
3 | JC69 | 10 | 5945 | 4532 | 103 | Fixed
.
.
.
98 | JC69 | 10 | 1110 | 231 | 34 | Fixed
99 | JC69 | 10 | 2487 | 1310 | 65 | Fixed
100 | JC69 | 10 | 1443 | 612 | 33 | Fixed
COMPRESSED ALIGNMENTS
10 19 P
^Age001 TGGTCACAAT TCCTTAXXX
^Adi002 TGGTCACAAT TTTCACXXX
.
.
^Ama018 TGGTCCGTAT TCCCATXXX
^Aga002 TAXTTACTAT TCCCATXGX
905 4 3 17 1 2 1 1 2 2 1 1 3 1 6 1 3 3 3
10 18 P
^Age001 CGCGTACCTG CAACTXXX
^Adi002 CGCGTACCTG TGGGAAAC
.
.
^Ama018 CGTGTGCCTC CAGGAAAC
^Aga002 CACATACCTG CAGGAAAC
777 1 2 5 25 3 1 1 1 1 1 2 1 39 1 70 1 1
.
.
.
Per-locus sequences in data and 'species&tree' tag:
C.File | Data | Status | Population
-------+------+--------------------------------------+-----------
3 | 2 | [OK] | Adig
2 | 2 | [OK] | Agra
1 | 1 | [OK] | Agre
2 | 2 | [OK] | Amad
2 | 2 | [OK] | Arub
1 | 1 | [OK] | Smic
Initial MSC density and log-likelihood of observing data:
log-PG0 = 3832.506447 log-L0 = -483964.113249
Distributing workload to threads:
Thread 0 : loci [ 1 - 26), Patterns/Seqs/Load : 1063 / 250 / 10630
Thread 1 : loci [26 - 51), Patterns/Seqs/Load : 1034 / 250 / 10340
Thread 2 : loci [51 - 76), Patterns/Seqs/Load : 1011 / 250 / 10110
Thread 3 : loci [76 - 101), Patterns/Seqs/Load : 1007 / 250 / 10070
0:00 taken to read and process data..
Restarting timer...
The next part provides a description of the values printed on screen during the MCMC run. BPP prints many values on screen which we can cluster into three categories:
- The first column is a progress indicator with the percentage of the the completed MCMC run. Negative progress means we are still in the burn-in.
- The next block of columns shows the acceptance proportions for each activated MCMC move. In our case we have 8 active proposals.
- The remaining columns indicate information about the chain at that point in the run, e.g. running means for the root theta and tau, sum of gene tree log-densities, and mean log-likelihood.
If finetune
was set to 1, BPP will adjust step-length values during burn-in, such that acceptance proportions (Pjump) are around 30%.
BPP typically does four rounds of adjustments and prints on screen the Current Pjump (acceptance proportions for each enabled MCMC move at that step), Current finetune (step-lengths at that step) and New finetune (the new adjusted step-lengths).
-*- Terms index -*-
Prgs: progress of MCMC run (negative progress means burnin)
Gage: gene-tree age proposal
Gspr: gene-tree SPR proposal
th1: species tree theta proposal (tips) (sliding window)
th2: species tree theta proposal (inner) (sliding window)
thg: species tree theta proposal (inner) (gibbs sampler)
tau: species tree tau proposal
mix: mixing proposal
stree: Sspr (0.8000) & Ssnl (0.2000)
theta1: root node mean theta
tau1: root node mean tau
log-PG: log-probability of gene trees (MSC)
log-L: mean log-L of observing data
| Acceptance proportions |
Prgs | Gage Gspr th1 th2 thg tau mix Sspr Ssnl | theta1 tau1 log-PG log-L
---------------------------------------------------------------------------------------------------
-15% 0.33 0.41 0.82 0.57 0.96 0.08 0.01 0.0208 0.0000 0.0349 0.0127 3561.44170 -451094.96385 1:50
Gage Gspr th1 th2 tau mix
Current Pjump: 0.32866 0.41003 0.82230 0.56874 0.08137 0.00616
Current finetune: 5.00000 0.00100 0.00050 0.00200 0.00100 0.30000
New finetune: 5.56999 0.00147 0.00342 0.00488 0.00025 0.00570
=> 'finetune = 1 Gage:5.569988 Gspr:0.001474 th1:0.003424 th2:0.004880 tau:0.000252 mix:0.005697'
-10% 0.33 0.34 0.08 0.38 0.96 0.30 0.64 0.0282 0.0000 0.0347 0.0129 3551.28415 -451078.26223 3:39
Gage Gspr th1 th2 tau mix
Current Pjump: 0.32886 0.33729 0.07947 0.37578 0.30108 0.64096
Current finetune: 5.56999 0.00147 0.00342 0.00488 0.00025 0.00570
New finetune: 6.20945 0.00169 0.00084 0.00642 0.00025 0.01768
=> 'finetune = 1 Gage:6.209451 Gspr:0.001694 th1:0.000843 th2:0.006416 tau:0.000253 mix:0.017678'
-5% 0.33 0.31 0.51 0.27 0.96 0.30 0.27 0.0290 0.0000 0.0343 0.0132 3516.38393 -451079.71317 5:29
Gage Gspr th1 th2 tau mix
Current Pjump: 0.32848 0.31228 0.51429 0.27206 0.30176 0.27328
Current finetune: 6.20945 0.00169 0.00084 0.00642 0.00025 0.01768
New finetune: 6.91260 0.00178 0.00173 0.00573 0.00026 0.01588
=> 'finetune = 1 Gage:6.912598 Gspr:0.001776 th1:0.001731 th2:0.005735 tau:0.000255 mix:0.015881'
0% 0.33 0.30 0.27 0.28 0.96 0.30 0.31 0.0323 0.0000 0.0342 0.0133 3556.70876 -451081.63183 7:18
Gage Gspr th1 th2 tau mix
Current Pjump: 0.32882 0.30421 0.27106 0.27867 0.29685 0.31096
Current finetune: 6.91260 0.00178 0.00173 0.00573 0.00026 0.01588
New finetune: 7.70501 0.00180 0.00154 0.00527 0.00025 0.01656
=> 'finetune = 1 Gage:7.705014 Gspr:0.001805 th1:0.001541 th2:0.005268 tau:0.000252 mix:0.016563'
5% 0.33 0.30 0.28 0.40 0.96 0.31 0.30 0.0272 0.0000 0.0344 0.0132 3514.34290 -451080.55856 9:08
10% 0.33 0.30 0.29 0.39 0.96 0.31 0.30 0.0270 0.0000 0.0343 0.0132 3553.07286 -451080.50897 10:57
15% 0.33 0.30 0.30 0.39 0.96 0.30 0.30 0.0283 0.0000 0.0344 0.0131 3538.70714 -451079.84679 12:47
20% 0.33 0.30 0.30 0.39 0.96 0.30 0.30 0.0293 0.0000 0.0344 0.0131 3518.19284 -451079.83587 14:36
25% 0.33 0.30 0.30 0.39 0.96 0.30 0.29 0.0298 0.0000 0.0345 0.0131 3506.70432 -451079.59238 16:25
30% 0.33 0.30 0.30 0.39 0.96 0.30 0.29 0.0285 0.0000 0.0346 0.0130 3508.08477 -451079.14758 18:14
35% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0284 0.0000 0.0346 0.0130 3500.53754 -451079.29382 20:02
40% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0288 0.0000 0.0346 0.0130 3506.33342 -451079.19990 21:51
45% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0293 0.0000 0.0346 0.0130 3566.63903 -451079.05243 23:40
50% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0302 0.0000 0.0346 0.0130 3519.86549 -451078.99683 25:29
55% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0306 0.0000 0.0346 0.0130 3531.45622 -451078.75825 27:18
60% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0304 0.0000 0.0346 0.0130 3536.83618 -451078.95222 29:07
65% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0304 0.0000 0.0346 0.0130 3534.57121 -451078.85668 30:56
70% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0303 0.0000 0.0346 0.0130 3524.73794 -451078.88717 32:46
75% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0300 0.0000 0.0346 0.0130 3555.70437 -451078.95513 34:35
80% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0300 0.0000 0.0346 0.0130 3545.63726 -451078.79563 36:24
85% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0303 0.0000 0.0346 0.0130 3555.11558 -451078.82334 38:13
90% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0300 0.0000 0.0346 0.0130 3540.76753 -451078.89931 40:02
95% 0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0298 0.0000 0.0346 0.0130 3564.16339 -451078.86333 41:52
100% 0.33 0.30 0.32 0.39 0.96 0.30 0.29 0.0295 0.0000 0.0346 0.0130 3532.69089 -451078.84503 43:41
43:41 spent in MCMC
Note: After each round of step length optimization, BPP outputs a line with the optimizes step length values. This line can be directly incorporated into your control file for improved performance:
finetune = 1 Gage:7.705014 Gspr:0.001805 th1:0.001541 th2:0.005268 tau:0.000252 mix:0.016563
To improve convergence rate in subsequent runs, replace the existing finetune
line in your control file with the updated line containing the adjusted values, and re-run BPP.
In the last part of the output, BPP summarizes the MCMC sample. There are four sections:
Species in order:
1. Adig
2. Agra
3. Agre
4. Amad
5. Arub
6. Smic
(A) Best trees in the sample (10 distinct trees in all)
37879 0.37879 0.37879 (((Adig, (Agra, (Amad, Arub))), Agre), Smic);
36816 0.36816 0.74694 (((Adig, Agre), (Agra, (Amad, Arub))), Smic);
13306 0.13306 0.88000 ((Adig, ((Agra, (Amad, Arub)), Agre)), Smic);
5683 0.05683 0.93683 (((Adig, ((Agra, Amad), Arub)), Agre), Smic);
3542 0.03542 0.97225 (((Adig, Agre), ((Agra, Amad), Arub)), Smic);
1337 0.01337 0.98562 ((Adig, (((Agra, Amad), Arub), Agre)), Smic);
647 0.00647 0.99209 (((Adig, ((Agra, Arub), Amad)), Agre), Smic);
613 0.00613 0.99822 (((Adig, Agre), ((Agra, Arub), Amad)), Smic);
177 0.00177 0.99999 ((Adig, (((Agra, Arub), Amad), Agre)), Smic);
1 0.00001 1.00000 (((Adig, Agre), (Agra, (Amad, Smic))), Arub);
(B) Best splits in the sample of trees (11 splits in all)
100000 0.999990 010110
100000 0.999990 111110
88001 0.880001 000110
44209 0.442086 110110
40972 0.409716 101000
14820 0.148199 011110
10562 0.105619 010100
1437 0.014370 010010
1 0.000010 111101
1 0.000010 010101
1 0.000010 000101
(C) Majority-rule consensus tree
((Adig, (Agra, (Amad, Arub) #0.880001) #0.999990, Agre) #0.999990, Smic);
(D) Best tree (or trees from the mastertree file) with support values
(((Adig, (Agra, (Amad, Arub) #0.880001) #0.999990) #0.442086, Agre) #0.999990, Smic); [P = 0.378786]
The 'Species in order:' section will be useful for the Part (B).
Part (A) lists the best trees in the sample (species trees) sorted in descending order of their posterior probabilities. There are four columns for each record (tree):
- The number of times (frequency) the species tree appears in the MCMC sample
- The posterior probability of the tree, i.e. its frequency over the number of samples in the MCMC file (
nsample
+1). - The cumulative posterior probability of all previous (better) trees including the current one, i.e. the sum of posteriors upto that line.
- The species tree in newick notation.
The two best trees differ on the placement of Agre and are illustrated below:
37879 0.37879 0.37879 (((Adig, (Agra, (Amad, Arub))), Agre), Smic);
36816 0.36816 0.74694 (((Adig, Agre), (Agra, (Amad, Arub))), Smic);
The red annotations indicate the differences between the two trees.
Part (B) shows the best splits in the sample of trees along with their posterior probabilities. Each split, often referred to as a bipartition, is defined by a specific branch in the tree. Removing this branch divides the tree into two distinct components or subgraphs, thereby partitioning the taxa into two distinct sets. The figure below illustrates the concept of bipartitions (or splits):
Back to the BPP output. There are three columns:
- The number of times (frequency) the split appears in the MCMC sample. The frequency is computed by constructing bipartitions for each tree in the sample and tallying their occurences.
- The posterior probability of the split, calculated as the frequency over the number of samples in the MCMC file. It provides a measure of confidence in the observed split.
- The bipartition in bit-vector format. The bit-vector succinctly indicates the partition that each taxon falls into (either 0 or 1), following the order specified above the Part A output. In this case, the order is of taxa is: Adig, Agra, Agre, Amad, Arub, Smic. Below you can see the taxon-encoded bipartitions (taxa falling in partition 0 are on the left side of the bipartition).
Frequency | Posterior probability | Bit-encoded bipartition | Taxa encoded bipartition |
---|---|---|---|
100000 | 0.999990 | 010110 | Adig,Agre,Smic | Agra,Amad,Arub |
100000 | 0.999990 | 111110 | Smic | Adig,Agra,Agre,Amad,Arub |
88001 | 0.880001 | 000110 | Adig,Agra,Agre,Smic | Amad,Arub |
44209 | 0.442086 | 110110 | Agre,Smic | Adig,Agra,Amad,Arub |
40972 | 0.409716 | 101000 | Agra,Amad,Arub,Smic | Adig,Agre |
14820 | 0.148199 | 011110 | Adig,Smic | Agra,Agre,Amad,Arub |
10562 | 0.105619 | 010100 | Adig,Agre,Arub,Smic | Agra,Amad |
1437 | 0.014370 | 010010 | Adig,Agre,Amad,Smic | Agra,Arub |
1 | 0.000010 | 111101 | Adig,Agra,Agre,Amad,Smic | Arub |
1 | 0.000010 | 010101 | Adig,Agre,Arub | Agra,Amad,Smic |
1 | 0.000010 | 000101 | Adig,Agra,Agre,Arub | Amad,Smic |
Exercise: Calculate the RF distance between the two best trees (Tree 1 and Tree 2 in the figure above).
Part (C) presents the majority-rule consensus tree in newick format. This tree is constructed by amalgamating all splits with frequency exceeding 50%. This criterion ensures that only the most robust and well-supported splits contribute to the final consensus tree. However, in cases where the available data do not provide sufficient information, the majority-rule consensus tree might contain multifurcations, which indicate instances where the data lack the necessary detail to resolve the corresponding clades. For example, the first bipartition (010110) tells us that Agra, Amad, and Arub appeared in the same clade in all but one of the samples.
Part (D) shows the best tree from the MCMC sample file in newick format and its posterior probability from Part A. This tree is bifurcating as it comes from the sample file, and has support values on each inner node. The support value represents the frequency of the split (Part B) defined by its parental node.
Below are some results from our runs.
Species Tree Inference with Astral | BPP assumptions | BPP control file | Species Tree Inference with BPP | Parameter Estimation with BPP