Parameter estimation under the MSC (Baobab dataset) - bpp/bpp-tutorial-geneflow GitHub Wiki
If you are running BPP on your own PC please follow the Installation instructions to obtain a BPP binary for your operating system and computer architecture.
We will estimate divergence times and population size parameters using Tree 1 from the A01 analysis:
We will use the following folder structure:
|-- baobab
| |-- A00
| | |-- baobab.A00.ctl
| | `-- baobab.A00.msci.ctl
| |-- A01
| | `-- baobab.A01.ctl
| |-- baobab.map.txt
| `-- baobab.phy
`-- BPP
|-- A00
| |-- r1
| | `-- baobab.A00.ctl
| `-- r2
| `-- baobab.A00.ctl
`-- A01
|-- r1
| `-- baobab.A01.ctl
`-- r2
`-- baobab.A01.ctl
10 directories, 9 files
The steps below outline how to create the folder structure and copy the control files into the appropriate directories for running BPP. The section for downloading and extracting data files is commented out, as this step has already been completed.
# create folder structure
cd
mkdir -p DAY-3/BPP/A00/r{1,2}
# download and extract data files only if we haven't done this before!
cd DAY-3
wget https://github.com/bpp/bpp-tutorial-geneflow/raw/main/data/baobab.tar.gz
tar zxvf baobab.tar.gz
rm baobab.tar.gz
# copy control files to the right folders
cd ~/DAY-3/BPP/A00/
cp ../../baobab/A00/baobab.A00.ctl r1
cp ../../baobab/A00/baobab.A00.ctl r2
To verify that your folder structure matches the one described above, use the following commands:
cd ~/DAY-3
tree --charset=ascii
Ensure that the BPP control files in DAY-3/BPP/A00/r1
and DAY-3/BPP/A00/r2
reference the correct PHYLIP (sequence alignment) and map files. The seqfile
and imapfile
fields should follow the structure shown below:
seqfile = ../../../baobab/baobab.phy
Imapfile = ../../../baobab/baobab.map.txt
You can confirm this by running the following commands:
cd ~/DAY-3/BPP/A00/
cat r1/baobab.A00.ctl
cat r2/baobab.A00.ctl
You can open the file using any of the available editors, such as:
# vi/vim editor
vi baobab.A00.ctl
# emacs editor
emacs baobab.A00.ctl
# nano or pico editor
nano baobab.A00.ctl
pico baobab.A00.ctl
Note: If you accidentally open the vi
or vim
editor and are unsure how to exit, press ESC
a few times and then type :q!
to quit without saving.
Finally, to run BPP, use the following commands:
cd ~/DAY-3/BPP/A00/r1
bpp --cfile baobab.A00.ctl
If you encounter issues with core pinning or see the error message: "Error while pinning thread to core." try running BPP with the --no-pin
option:
bpp --no-pin --cfile baobab.A00.ctl
We need to run at least two independent analyses (with different starting seeds) in order to check for convergence. We will compare runs when they finish.
For each A00 run, BPP will create the following files:
Filename | Description |
---|---|
JOBNAME.mcmc.txt |
MCMC sample file |
JOBNAME.txt |
Output file (log of screen output) |
JOBNAME.SeedUsed |
Seed used for this run |
JOBNAME.conditional_a1b1.txt |
Theta estimates from using conditionals (NEW) |
JOBNAME.FigTree.tre |
Output tree with parameters to be used with FigTree |
JOBNAME.pdf |
Tree visualization in PDF format |
where JOBNAME
is the label provided in the jobname
control file option.
The output displayed on the screen is also saved in a file named according to the jobname
option specified in the control file, with a .txt
extension. For instance, if jobname = long-run
, the output file will be named long-run.txt
. See a sample output file for the Baobab dataset.
Important note: The output shown here will differ from the output of your run. The discrepancies can be attributed to several factors. One major reason is the variance due to different seeds used for the pseudorandom number generator. However, a more crucial distinction lies in the scale of the runs. The output presented here is from a run with the following settings:
burnin = 50000
sampfreq = 5
nsample = 100000
This configuration results in a total of burnin + sampfreq * nsample = 550,000
MCMC steps, with 100,000 samples stored in the MCMC file. In contrast, your current runs are smaller in scale, with settings:
burnin = 50000
sampfreq = 2
nsample = 50000
This corresponds to 150,000 MCMC steps, with 50,000 samples stored.
The first part of the output provides a summary of the input data. It includes a brief description of the data for each locus, followed by the alignments in compressed form. In this compressed format, sites with identical likelihoods according to the specified substitution model are collapsed into a single site, and a weight (the number of times observed) is assigned to each cluster of such sites.
Analysis started at: Fri Nov 29 17:48:52 2024
Using BPP version: 4.8.2
pver sha1: 1a053a3b61831dc4fb680235a5f6e63e8158ea54
Command: bpp --cfile baobab.A00.ctl
Seed: 1739527591 (randomly generated)
Initial species tree: (((Adig, (Agra, (Amad, Arub))), Agre), Smic);
Locus | Model | Sequences | Length | Ambiguous sites | Compressed | Base freqs
-------+-------+-----------+--------+-----------------+------------+------------
1 | JC69 | 10 | 960 | 12 | 19 | Fixed
2 | JC69 | 10 | 933 | 72 | 18 | Fixed
3 | JC69 | 10 | 5945 | 4532 | 103 | Fixed
.
.
98 | JC69 | 10 | 1110 | 231 | 34 | Fixed
99 | JC69 | 10 | 2487 | 1310 | 65 | Fixed
100 | JC69 | 10 | 1443 | 612 | 33 | Fixed
COMPRESSED ALIGNMENTS
10 19 P
^Age001 TGGTCACAAT TCCTTAXXX
^Adi002 TGGTCACAAT TTTCACXXX
.
.
^Ama018 TGGTCCGTAT TCCCATXXX
^Aga002 TAXTTACTAT TCCCATXGX
905 4 3 17 1 2 1 1 2 2 1 1 3 1 6 1 3 3 3
10 18 P
^Age001 CGCGTACCTG CAACTXXX
^Adi002 CGCGTACCTG TGGGAAAC
.
.
^Ama018 CGTGTGCCTC CAGGAAAC
^Aga002 CACATACCTG CAGGAAAC
777 1 2 5 25 3 1 1 1 1 1 2 1 39 1 70 1 1
.
.
.
Per-locus sequences in data and 'species&tree' tag:
C.File | Data | Status | Population
-------+------+--------------------------------------+-----------
3 | 2 | [OK] | Adig
2 | 2 | [OK] | Agra
1 | 1 | [OK] | Agre
2 | 2 | [OK] | Amad
2 | 2 | [OK] | Arub
1 | 1 | [OK] | Smic
Initial MSC density and log-likelihood of observing data:
log-PG0 = 3912.172931 log-L0 = -477118.465373
Distributing workload to threads:
Thread 0 : loci [ 1 - 26), Patterns/Seqs/Load : 1063 / 250 / 10630
Thread 1 : loci [26 - 51), Patterns/Seqs/Load : 1034 / 250 / 10340
Thread 2 : loci [51 - 76), Patterns/Seqs/Load : 1011 / 250 / 10110
Thread 3 : loci [76 - 101), Patterns/Seqs/Load : 1007 / 250 / 10070
0:00 taken to read and process data..
Restarting timer...
The next part provides a description of the values printed on screen during the MCMC run. BPP prints several values on the screen, which we can grouped into three categories:
- Progress Indicator: The first column shows the percentage of the completed MCMC run. A negative value indicates that the burn-in phase is still ongoing.
- Acceptance Proportions: The following set of columns displays the acceptance proportions for each active MCMC move. In our case, there are six active proposals.
- Chain Information: The remaining columns provide details about the chain at that point in the run. This includes running means for three thetas and three taus, the sum of gene tree log-densities, and the mean log-likelihood.
If the finetune
option is set to 1, BPP will optimize the step-length values during burn-in to ensure that acceptance proportions (Pjump) are around 30%.
BPP typically performs four rounds of optimization, during which it prints the following on the screen:
- Current Pjump: The acceptance proportions for each active MCMC move at that step.
- Current finetune: The step-length/tuning parameters at that step..
- New finetune: The newly optimized tuning parameter values.
-*- Terms index -*-
Prgs: progress of MCMC run (negative progress means burnin)
Gage: gene-tree age proposal
Gspr: gene-tree SPR proposal
th1: species tree theta proposal (tips) (sliding window)
th2: species tree theta proposal (inner) (sliding window)
thg: species tree theta proposal (inner) (gibbs sampler)
tau: species tree tau proposal
mix: mixing proposal
theta1: mean theta of node 1
theta2: mean theta of node 2
theta3: mean theta of node 4
tau1: mean tau of node 6
tau2: mean tau of node 7
tau3: mean tau of node 8
log-PG: log-probability of gene trees (MSC)
log-L: mean log-L of observing data
| Acceptance proportions |
Prgs | Gage Gspr th1 th2 thg tau mix | theta1 theta2 theta3 tau1 tau2 tau3 log-PG log-L
------------------------------------------------------------------------------------------------------------------
-15% 0.33 0.41 0.62 0.49 0.96 0.08 0.01 0.0060 0.0054 0.0045 0.0131 0.0050 0.0048 3507.06989 -451086.51198 1:06
Gage Gspr th1 th2 tau mix
Current Pjump: 0.32757 0.40979 0.61698 0.49120 0.07951 0.00544
Current finetune: 5.00000 0.00100 0.00050 0.00200 0.00100 0.30000
New finetune: 5.54774 0.00147 0.00143 0.00382 0.00025 0.00503
=> 'finetune = 1 Gage:5.547741 Gspr:0.001473 th1:0.001429 th2:0.003818 tau:0.000246 mix:0.005031'
-10% 0.33 0.34 0.54 0.42 0.96 0.31 0.68 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3520.02269 -451079.33554 2:10
Gage Gspr th1 th2 tau mix
Current Pjump: 0.32794 0.33754 0.53744 0.42493 0.31041 0.67604
Current finetune: 5.54774 0.00147 0.00143 0.00382 0.00025 0.00503
New finetune: 6.16384 0.00169 0.00316 0.00591 0.00026 0.01770
=> 'finetune = 1 Gage:6.163841 Gspr:0.001694 th1:0.003156 th2:0.005906 tau:0.000256 mix:0.017700'
-5% 0.33 0.31 0.09 0.32 0.96 0.30 0.27 0.0060 0.0054 0.0045 0.0133 0.0049 0.0048 3540.99036 -451080.49243 3:14
Gage Gspr th1 th2 tau mix
Current Pjump: 0.32784 0.31208 0.08610 0.31928 0.30288 0.26992
Current finetune: 6.16384 0.00169 0.00316 0.00591 0.00026 0.01770
New finetune: 6.84577 0.00177 0.00084 0.00636 0.00026 0.01568
=> 'finetune = 1 Gage:6.845772 Gspr:0.001774 th1:0.000843 th2:0.006355 tau:0.000259 mix:0.015680'
0% 0.33 0.30 0.53 0.22 0.96 0.30 0.32 0.0061 0.0054 0.0045 0.0133 0.0049 0.0048 3542.83781 -451081.60695 4:19
Gage Gspr th1 th2 tau mix
Current Pjump: 0.32796 0.30422 0.53035 0.22348 0.29797 0.31660
Current finetune: 6.84577 0.00177 0.00084 0.00636 0.00026 0.01568
New finetune: 7.60659 0.00180 0.00182 0.00457 0.00026 0.01670
=> 'finetune = 1 Gage:7.606587 Gspr:0.001804 th1:0.001820 th2:0.004568 tau:0.000257 mix:0.016704'
5% 0.33 0.30 0.31 0.36 0.96 0.29 0.30 0.0061 0.0054 0.0045 0.0132 0.0049 0.0048 3523.82665 -451080.24863 5:24
10% 0.33 0.30 0.31 0.36 0.96 0.30 0.30 0.0061 0.0054 0.0045 0.0131 0.0049 0.0048 3501.50808 -451079.73981 6:28
15% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3561.59390 -451080.04465 7:33
20% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3553.17454 -451079.61677 8:38
25% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3515.86993 -451079.36352 9:42
30% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3520.03204 -451079.21892 10:47
35% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3550.59561 -451078.98502 11:52
40% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3559.27028 -451079.07914 12:56
45% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3524.05250 -451079.14039 14:01
50% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3524.47610 -451079.09824 15:06
55% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3572.40557 -451079.20605 16:10
60% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3549.63680 -451079.08200 17:15
65% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3570.62197 -451079.14087 18:20
70% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3547.45763 -451079.21316 19:25
75% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3535.99662 -451079.14969 20:29
80% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3527.63186 -451079.15743 21:34
85% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3517.86305 -451079.14391 22:39
90% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3546.90930 -451079.14553 23:44
95% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3550.26861 -451079.17025 24:48
100% 0.33 0.30 0.31 0.37 0.96 0.30 0.30 0.0060 0.0054 0.0045 0.0131 0.0049 0.0048 3514.38788 -451079.23992 25:53
25:53 spent in MCMC
In the last part of the output, BPP processes the MCMC sample file and outputs summary statistics for each parameter in the model:
Node-Index Node-Type Node-Label
---------------------------------
1 Tip Adig
2 Tip Agra
3 Tip Agre
4 Tip Amad
5 Tip Arub
6 Tip Smic
7 Root MRCA( Adig,Agra,Amad,Arub,Agre,Smic )
8 Inner MRCA( Adig,Agra,Amad,Arub,Agre )
9 Inner MRCA( Adig,Agra,Amad,Arub )
10 Inner MRCA( Agra,Amad,Arub )
11 Inner MRCA( Amad,Arub )
param mean median S.D min max 2.5% 97.5% 2.5%HPD 97.5%HPD ESS* Eff* rho1
-------------------------------------------------------------------------------------------------------------------------------------
theta:1 0.006040 0.005988 0.000712 0.003648 0.010132 0.004797 0.007575 0.004697 0.007445 87769.698216 0.877697 0.017950
theta:2 0.005397 0.005346 0.000675 0.003217 0.009515 0.004218 0.006865 0.004139 0.006747 87406.657300 0.874067 0.024494
theta:4 0.004483 0.004439 0.000554 0.002743 0.007615 0.003524 0.005692 0.003465 0.005599 89122.536186 0.891225 0.018690
theta:5 0.006321 0.006250 0.000838 0.003805 0.012485 0.004878 0.008152 0.004770 0.007994 92785.708090 0.927857 0.015366
theta:7 0.034487 0.034377 0.002813 0.024894 0.049607 0.029301 0.040307 0.029073 0.040039 3920.488297 0.039205 0.212978
theta:8 0.022579 0.022524 0.001588 0.016717 0.030232 0.019635 0.025853 0.019512 0.025705 18315.786151 0.183158 0.188558
theta:9 0.011149 0.010185 0.005659 0.000252 0.071120 0.003018 0.024742 0.001992 0.022320 46696.033322 0.466960 0.132084
theta:10 0.014410 0.014168 0.003505 0.001975 0.037859 0.008175 0.022022 0.007827 0.021560 21092.992846 0.210930 0.383358
theta:11 0.011724 0.010982 0.005414 0.000478 0.054301 0.003445 0.024282 0.002460 0.022380 46811.296021 0.468113 0.215536
tau:7 0.013063 0.013067 0.000719 0.010220 0.015840 0.011668 0.014432 0.011656 0.014419 730.088862 0.007301 0.945131
tau:8 0.004934 0.004934 0.000175 0.004177 0.005628 0.004592 0.005275 0.004587 0.005269 5233.307753 0.052333 0.828382
tau:9 0.004810 0.004813 0.000167 0.004033 0.005406 0.004478 0.005133 0.004473 0.005127 7026.127833 0.070261 0.761671
tau:10 0.003677 0.003673 0.000200 0.002910 0.004609 0.003294 0.004079 0.003293 0.004077 15827.931860 0.158279 0.629175
tau:11 0.003259 0.003267 0.000215 0.002246 0.004089 0.002817 0.003660 0.002828 0.003668 14585.703639 0.145857 0.685058
lnL -451079.229628 -451078.981000 24.351237 -451191.935000 -450961.525000 -451127.817000 -451032.477000 -451127.412000 -451032.216000 8556.289635 0.085563 0.249959
List of nodes, taus and thetas:
Node (+1) Tau Theta Label
0 0.000000 0.006040 Adig [ Adig ]
1 0.000000 0.005397 Agra [ Agra ]
2 0.000000 -1.000000 Agre [ Agre ]
3 0.000000 0.004483 Amad [ Amad ]
4 0.000000 0.006321 Arub [ Arub ]
5 0.000000 -1.000000 Smic [ Smic ]
6 0.013063 0.034487 Adig,Agra,Amad,Arub,Agre,Smic [ Adig Agra Agre Amad Arub Smic ]
7 0.004934 0.022579 Adig,Agra,Amad,Arub,Agre [ Adig Agra Agre Amad Arub ]
8 0.004810 0.011149 Adig,Agra,Amad,Arub [ Adig Agra Amad Arub ]
9 0.003677 0.014410 Agra,Amad,Arub [ Agra Amad Arub ]
10 0.003259 0.011724 Amad,Arub [ Amad Arub ]
BPP versions starting v4.8.0 display an additional output section, displaying posterior summaries for parameters whose conditionals are analytically available or can be approximated (thetas, migration rates, introgression probabilities).
The parameters of the conditionals are stored in the file JOBNAME.conditional_a1b1.txt
which is summarized by BPP. Mixing efficiency is typically much higher.
Summarizing parameter estimates using file baobab.conditional_a1b1.txt ...
param mean S.D 2.5% 97.5% 2.5%HPD 97.5%HPD Effu Effy c
--------------------------------------------------------------------------------------------------
theta:1 0.006035 0.000687 0.004833 0.007517 0.004706 0.007368 3.897399 0.872228 9.083285
theta:2 0.005386 0.000648 0.004258 0.006789 0.004149 0.006654 3.275034 0.859708 7.034599
theta:4 0.004474 0.000534 0.003542 0.005625 0.003469 0.005531 3.229788 0.865578 6.480029
theta:5 0.006306 0.000802 0.004924 0.008048 0.004796 0.007888 4.254287 0.922688 6.610829
theta:7 0.034489 0.002784 0.029355 0.040224 0.029155 0.040001 0.039164 0.038079 3.666643
theta:8 0.022575 0.001575 0.019645 0.025806 0.019494 0.025642 0.207533 0.183332 2.747749
theta:9 0.010914 0.005403 0.003074 0.023760 0.001870 0.021386 0.586597 0.508896 1.351882
theta:10 0.014384 0.003458 0.008173 0.021784 0.007647 0.021203 0.217778 0.207091 1.310520
theta:11 0.011514 0.005206 0.003476 0.023462 0.002345 0.021522 0.492198 0.435441 1.360213
param | mean | median | S.D | min | max | 2.5% | 97.5% | 2.5%HPD | 97.5%HPD | ESS |
theta:1 | 0.006040 | 0.005988 | 0.000712 | 0.003648 | 0.010132 | 0.004797 | 0.007575 | 0.004697 | 0.007445 | 87769.698216 |
theta:2 | 0.005397 | 0.005346 | 0.000675 | 0.003217 | 0.009515 | 0.004218 | 0.006865 | 0.004139 | 0.006747 | 87406.657300 |
theta:4 | 0.004483 | 0.004439 | 0.000554 | 0.002743 | 0.007615 | 0.003524 | 0.005692 | 0.003465 | 0.005599 | 89122.536186 |
theta:5 | 0.006321 | 0.006250 | 0.000838 | 0.003805 | 0.012485 | 0.004878 | 0.008152 | 0.004770 | 0.007994 | 92785.708090 |
theta:7 | 0.034487 | 0.034377 | 0.002813 | 0.024894 | 0.049607 | 0.029301 | 0.040307 | 0.029073 | 0.040039 | 3920.488297 |
theta:8 | 0.022579 | 0.022524 | 0.001588 | 0.016717 | 0.030232 | 0.019635 | 0.025853 | 0.019512 | 0.025705 | 18315.786151 |
theta:9 | 0.011149 | 0.010185 | 0.005659 | 0.000252 | 0.071120 | 0.003018 | 0.024742 | 0.001992 | 0.022320 | 46696.033322 |
theta:10 | 0.014410 | 0.014168 | 0.003505 | 0.001975 | 0.037859 | 0.008175 | 0.022022 | 0.007827 | 0.021560 | 21092.992846 |
theta:11 | 0.011724 | 0.010982 | 0.005414 | 0.000478 | 0.054301 | 0.003445 | 0.024282 | 0.002460 | 0.022380 | 46811.296021 |
tau:7 | 0.013063 | 0.013067 | 0.000719 | 0.010220 | 0.015840 | 0.011668 | 0.014432 | 0.011656 | 0.014419 | 730.088862 |
tau:8 | 0.004934 | 0.004934 | 0.000175 | 0.004177 | 0.005628 | 0.004592 | 0.005275 | 0.004587 | 0.005269 | 5233.307753 |
tau:9 | 0.004810 | 0.004813 | 0.000167 | 0.004033 | 0.005406 | 0.004478 | 0.005133 | 0.004473 | 0.005127 | 7026.127833 |
tau:10 | 0.003677 | 0.003673 | 0.000200 | 0.002910 | 0.004609 | 0.003294 | 0.004079 | 0.003293 | 0.004077 | 15827.931860 |
tau:11 | 0.003259 | 0.003267 | 0.000215 | 0.002246 | 0.004089 | 0.002817 | 0.003660 | 0.002828 | 0.003668 | 14585.703639 |
Try completing 4 runs and analyze convergence. A helpful tool is Tracer, but BPP provides some helpful summary statistics too. Here, we provide a python script (scatter-params.py) to help plot posterior means and 95% CI for theta and tau parameters. If we are satisfied, we can use one MCMC analysis from BPP or combine them to report parameter estimates. The plotting script can be executed from the command line with
# download script
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/third-day/scripts/scatter-params.py
# add execute permission
chmod +x scatter-params.py
# run script on your two mcmc files
./scatter-params.py mcmc1.txt mcmc2.txt
Note: If you are using {ana,mini,''}conda or similar package managers and get an error when running the above command, then please run the script using:
python3 scatter-params.py mcmc1.txt mcmc2.txt
The script will produce scatter plots that visualise the posterior means and 95% CI for parameters τ and θ
The below plot shows a longer run
Below are some results from our runs.