Species tree inference (Baobab dataset) - bpp/bpp-tutorial-geneflow GitHub Wiki

Dataset

For the purpose of this tutorial, our dataset is a subset of baobabs (Adansonia) from Karimi et al. (2020). In the original study they targeted 380 nuclear loci via hybrid enrichment but for practical reasons we will use a subset of 50 loci.

As an outgroup we use Scleronema micranthaum. Below is the subset of species along with the number of individuals. The dataset consists of 100 loci.

Species Acronym Sequences
A. digitata Adig 3
A. gregorii Agre 1
A. grandidierii Agra 2
A. madagascariensis Amad 2
A. rubrostipa Arub 2
S. micranthaum Smic 1

Running BPP

If you are running BPP on your own PC please follow the Installation instructions to obtain a BPP binary for your operating system and computer architecture.

We will use the following folder structure:

`-- DAY-3
    |-- baobab
    |   |-- A00
    |   |   |-- baobab.A00.ctl
    |   |   `-- baobab.A00.msci.ctl
    |   |-- A01
    |   |   `-- baobab.A01.ctl
    |   |-- baobab.map.txt
    |   `-- baobab.phy
    `-- BPP
        `-- A01
            |-- r1
            |   `-- baobab.A01.ctl
            `-- r2
                `-- baobab.A01.ctl

9 directories, 7 files

Below are the steps to create the folder structure, download the data files, and copy control files to the folders where we will run BPP:

# create folder structure
cd
mkdir -p DAY-3/BPP/A01/r{1,2}

# download and extract data files
cd DAY-3
wget https://github.com/bpp/bpp-tutorial-geneflow/raw/main/data/baobab.tar.gz
tar zxvf baobab.tar.gz
rm baobab.tar.gz

# copy control files to the right folders
cd ~/DAY-3/BPP/A01/
cp ../../baobab/A01/baobab.A01.ctl r1
cp ../../baobab/A01/baobab.A01.ctl r2

You can check if your folder structure resembles the one described above using the commands:

cd ~/DAY-3
tree --charset=ascii

Now check that the BPP control files in the DAY-3/BPP/A01/r1 and DAY-3/BPP/A01/r2 point to the right PHYLIP (sequence alignment) and map files. They should have the following form:

 seqfile = ../../../baobab/baobab.phy
Imapfile = ../../../baobab/baobab.map.txt

You can check by:

cd ~/DAY-3/BPP/A01/
cat r1/baobab.A01.ctl
cat r2/baobab.A01.ctl

You can use the available editors to open the file, e.g.

# vi/vim editor
vi r1/baobab.A01.ctl

#emacs
emacs r1/baobab.A01.ctl

# nano or pico
nano r1/baobab.A01.ctl
pico r1/baobab.A01.ctl

Note: In case you open the vi/vim editor accidentally and don't know how to get out, press ESC a few times and then type :q!.

Finally, run bpp:

cd ~/DAY-3/BPP/A01/r1
bpp --cfile baobab.A01.ctl

If you run into problems with core pinning, or BPP exits with the error: "Error while pinning thread to core." then try running BPP with the --no-pin option:

bpp --no-pin --cfile baobab.A01.ctl

Typically, we need to make at least two independent runs (with different starting seeds) in order to check for convergence. We will compare runs when they finish.

For each A01 run, BPP will create the following files:

Filename Description
JOBNAME.mcmc.txt MCMC sample file
JOBNAME.txt Output file (log of screen output)
JOBNAME.SeedUsed Seed used for this run

where JOBNAME is the label provided in the jobname control file option.

💤 Synchronization point: Make sure everybody reached this stage


Output screen

The output displayed on the screen is also saved in a file named according to the jobname option specified in the control file, with a .txt extension. For instance, if jobname = long-run, the output file will be named long-run.txt. See a sample output file for the Baobab dataset.

The first part of the output summarizes the input data. In particular there is a short description of the data for each locus, followed by the alignments in compressed form : i.e. sites with the same likelihood according to the specified substitution model are collapsed into a single site and a weight (number of times observed) is set for each such cluster of sites.

Analysis started at: Fri Nov 29 15:28:22 2024
Using BPP version: 4.8.2
pver sha1: 1a053a3b61831dc4fb680235a5f6e63e8158ea54
Command: bpp --cfile baobab.A01.ctl 

Seed: 2051843109 (randomly generated)
Initial species tree: (((Agra, (Smic,Amad)),(Adig,Agre)),Arub);

 Locus | Model | Sequences | Length | Ambiguous sites | Compressed | Base freqs 
-------+-------+-----------+--------+-----------------+------------+------------
     1 |  JC69 |        10 |    960 |              12 |         19 |      Fixed 
     2 |  JC69 |        10 |    933 |              72 |         18 |      Fixed 
     3 |  JC69 |        10 |   5945 |            4532 |        103 |      Fixed 
.
.
.
    98 |  JC69 |        10 |   1110 |             231 |         34 |      Fixed 
    99 |  JC69 |        10 |   2487 |            1310 |         65 |      Fixed 
   100 |  JC69 |        10 |   1443 |             612 |         33 |      Fixed 

COMPRESSED ALIGNMENTS

10 19 P
^Age001     TGGTCACAAT TCCTTAXXX
^Adi002     TGGTCACAAT TTTCACXXX
.
.
^Ama018     TGGTCCGTAT TCCCATXXX
^Aga002     TAXTTACTAT TCCCATXGX
905 4 3 17 1 2 1 1 2 2 1 1 3 1 6 1 3 3 3

10 18 P
^Age001     CGCGTACCTG CAACTXXX
^Adi002     CGCGTACCTG TGGGAAAC
.
.
^Ama018     CGTGTGCCTC CAGGAAAC
^Aga002     CACATACCTG CAGGAAAC
777 1 2 5 25 3 1 1 1 1 1 2 1 39 1 70 1 1

.
.
.

Per-locus sequences in data and 'species&tree' tag:
C.File | Data |                Status                | Population
-------+------+--------------------------------------+-----------
     3 |    2 | [OK]                                 | Adig      
     2 |    2 | [OK]                                 | Agra      
     1 |    1 | [OK]                                 | Agre      
     2 |    2 | [OK]                                 | Amad      
     2 |    2 | [OK]                                 | Arub      
     1 |    1 | [OK]                                 | Smic      


Initial MSC density and log-likelihood of observing data:
log-PG0 = 3832.506447   log-L0 = -483964.113249

Distributing workload to threads:
 Thread 0 : loci [ 1 -  26), Patterns/Seqs/Load : 1063 / 250 / 10630
 Thread 1 : loci [26 -  51), Patterns/Seqs/Load : 1034 / 250 / 10340
 Thread 2 : loci [51 -  76), Patterns/Seqs/Load : 1011 / 250 / 10110
 Thread 3 : loci [76 - 101), Patterns/Seqs/Load : 1007 / 250 / 10070
0:00 taken to read and process data..
Restarting timer...

The next part provides a description of the values printed on screen during the MCMC run. BPP prints many values on screen which we can cluster into three categories:

  1. The first column is a progress indicator with the percentage of the the completed MCMC run. Negative progress means we are still in the burn-in.
  2. The next block of columns shows the acceptance proportions for each activated MCMC move. In our case we have 8 active proposals.
  3. The remaining columns indicate information about the chain at that point in the run, e.g. running means for the root theta and tau, sum of gene tree log-densities, and mean log-likelihood.

If finetune was set to 1, BPP will adjust step-length values during burn-in, such that acceptance proportions (Pjump) are around 30%.

BPP typically does four rounds of adjustments and prints on screen the Current Pjump (acceptance proportions for each enabled MCMC move at that step), Current finetune (step-lengths at that step) and New finetune (the new adjusted step-lengths).

-*- Terms index -*-

  Prgs: progress of MCMC run (negative progress means burnin)
  Gage: gene-tree age proposal
  Gspr: gene-tree SPR proposal
   th1: species tree theta proposal (tips)   (sliding window)
   th2: species tree theta proposal (inner)  (sliding window)
   thg: species tree theta proposal (inner)  (gibbs sampler)
   tau: species tree tau proposal
   mix: mixing proposal
 stree: Sspr (0.8000) & Ssnl (0.2000)
theta1: root node mean theta
  tau1: root node mean tau
log-PG: log-probability of gene trees (MSC)
 log-L: mean log-L of observing data

     |              Acceptance proportions              |
Prgs | Gage Gspr  th1  th2  thg  tau  mix   Sspr   Ssnl | theta1    tau1       log-PG          log-L
---------------------------------------------------------------------------------------------------
 -15%  0.33 0.41 0.82 0.57 0.96 0.08 0.01 0.0208 0.0000   0.0349  0.0127   3561.44170  -451094.96385  1:50

                     Gage    Gspr     th1     th2     tau     mix  
Current Pjump:     0.32866 0.41003 0.82230 0.56874 0.08137 0.00616
Current finetune:  5.00000 0.00100 0.00050 0.00200 0.00100 0.30000
New finetune:      5.56999 0.00147 0.00342 0.00488 0.00025 0.00570

=> 'finetune = 1 Gage:5.569988 Gspr:0.001474 th1:0.003424 th2:0.004880 tau:0.000252 mix:0.005697'

 -10%  0.33 0.34 0.08 0.38 0.96 0.30 0.64 0.0282 0.0000   0.0347  0.0129   3551.28415  -451078.26223  3:39

                     Gage    Gspr     th1     th2     tau     mix  
Current Pjump:     0.32886 0.33729 0.07947 0.37578 0.30108 0.64096
Current finetune:  5.56999 0.00147 0.00342 0.00488 0.00025 0.00570
New finetune:      6.20945 0.00169 0.00084 0.00642 0.00025 0.01768

=> 'finetune = 1 Gage:6.209451 Gspr:0.001694 th1:0.000843 th2:0.006416 tau:0.000253 mix:0.017678'

  -5%  0.33 0.31 0.51 0.27 0.96 0.30 0.27 0.0290 0.0000   0.0343  0.0132   3516.38393  -451079.71317  5:29

                     Gage    Gspr     th1     th2     tau     mix  
Current Pjump:     0.32848 0.31228 0.51429 0.27206 0.30176 0.27328
Current finetune:  6.20945 0.00169 0.00084 0.00642 0.00025 0.01768
New finetune:      6.91260 0.00178 0.00173 0.00573 0.00026 0.01588

=> 'finetune = 1 Gage:6.912598 Gspr:0.001776 th1:0.001731 th2:0.005735 tau:0.000255 mix:0.015881'

   0%  0.33 0.30 0.27 0.28 0.96 0.30 0.31 0.0323 0.0000   0.0342  0.0133   3556.70876  -451081.63183  7:18

                     Gage    Gspr     th1     th2     tau     mix  
Current Pjump:     0.32882 0.30421 0.27106 0.27867 0.29685 0.31096
Current finetune:  6.91260 0.00178 0.00173 0.00573 0.00026 0.01588
New finetune:      7.70501 0.00180 0.00154 0.00527 0.00025 0.01656

=> 'finetune = 1 Gage:7.705014 Gspr:0.001805 th1:0.001541 th2:0.005268 tau:0.000252 mix:0.016563'

   5%  0.33 0.30 0.28 0.40 0.96 0.31 0.30 0.0272 0.0000   0.0344  0.0132   3514.34290  -451080.55856  9:08
  10%  0.33 0.30 0.29 0.39 0.96 0.31 0.30 0.0270 0.0000   0.0343  0.0132   3553.07286  -451080.50897  10:57
  15%  0.33 0.30 0.30 0.39 0.96 0.30 0.30 0.0283 0.0000   0.0344  0.0131   3538.70714  -451079.84679  12:47
  20%  0.33 0.30 0.30 0.39 0.96 0.30 0.30 0.0293 0.0000   0.0344  0.0131   3518.19284  -451079.83587  14:36
  25%  0.33 0.30 0.30 0.39 0.96 0.30 0.29 0.0298 0.0000   0.0345  0.0131   3506.70432  -451079.59238  16:25
  30%  0.33 0.30 0.30 0.39 0.96 0.30 0.29 0.0285 0.0000   0.0346  0.0130   3508.08477  -451079.14758  18:14
  35%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0284 0.0000   0.0346  0.0130   3500.53754  -451079.29382  20:02
  40%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0288 0.0000   0.0346  0.0130   3506.33342  -451079.19990  21:51
  45%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0293 0.0000   0.0346  0.0130   3566.63903  -451079.05243  23:40
  50%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0302 0.0000   0.0346  0.0130   3519.86549  -451078.99683  25:29
  55%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0306 0.0000   0.0346  0.0130   3531.45622  -451078.75825  27:18
  60%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0304 0.0000   0.0346  0.0130   3536.83618  -451078.95222  29:07
  65%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0304 0.0000   0.0346  0.0130   3534.57121  -451078.85668  30:56
  70%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0303 0.0000   0.0346  0.0130   3524.73794  -451078.88717  32:46
  75%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0300 0.0000   0.0346  0.0130   3555.70437  -451078.95513  34:35
  80%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0300 0.0000   0.0346  0.0130   3545.63726  -451078.79563  36:24
  85%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0303 0.0000   0.0346  0.0130   3555.11558  -451078.82334  38:13
  90%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0300 0.0000   0.0346  0.0130   3540.76753  -451078.89931  40:02
  95%  0.33 0.30 0.31 0.39 0.96 0.30 0.29 0.0298 0.0000   0.0346  0.0130   3564.16339  -451078.86333  41:52
 100%  0.33 0.30 0.32 0.39 0.96 0.30 0.29 0.0295 0.0000   0.0346  0.0130   3532.69089  -451078.84503  43:41

43:41 spent in MCMC

Note: After each round of step length optimization, BPP outputs a line with the optimizes step length values. This line can be directly incorporated into your control file for improved performance:

finetune = 1 Gage:7.705014 Gspr:0.001805 th1:0.001541 th2:0.005268 tau:0.000252 mix:0.016563

To improve convergence rate in subsequent runs, replace the existing finetune line in your control file with the updated line containing the adjusted values, and re-run BPP.

In the last part of the output, BPP summarizes the MCMC sample. There are four sections:

Species in order:
   1. Adig
   2. Agra
   3. Agre
   4. Amad
   5. Arub
   6. Smic

(A) Best trees in the sample (10 distinct trees in all)
    37879  0.37879  0.37879 (((Adig, (Agra, (Amad, Arub))), Agre), Smic);
    36816  0.36816  0.74694 (((Adig, Agre), (Agra, (Amad, Arub))), Smic);
    13306  0.13306  0.88000 ((Adig, ((Agra, (Amad, Arub)), Agre)), Smic);
     5683  0.05683  0.93683 (((Adig, ((Agra, Amad), Arub)), Agre), Smic);
     3542  0.03542  0.97225 (((Adig, Agre), ((Agra, Amad), Arub)), Smic);
     1337  0.01337  0.98562 ((Adig, (((Agra, Amad), Arub), Agre)), Smic);
      647  0.00647  0.99209 (((Adig, ((Agra, Arub), Amad)), Agre), Smic);
      613  0.00613  0.99822 (((Adig, Agre), ((Agra, Arub), Amad)), Smic);
      177  0.00177  0.99999 ((Adig, (((Agra, Arub), Amad), Agre)), Smic);
        1  0.00001  1.00000 (((Adig, Agre), (Agra, (Amad, Smic))), Arub);

(B) Best splits in the sample of trees (11 splits in all)
100000 0.999990  010110
100000 0.999990  111110
 88001 0.880001  000110
 44209 0.442086  110110
 40972 0.409716  101000
 14820 0.148199  011110
 10562 0.105619  010100
  1437 0.014370  010010
     1 0.000010  111101
     1 0.000010  010101
     1 0.000010  000101

(C) Majority-rule consensus tree
((Adig, (Agra, (Amad, Arub) #0.880001) #0.999990, Agre) #0.999990, Smic);

(D) Best tree (or trees from the mastertree file) with support values
(((Adig, (Agra, (Amad, Arub) #0.880001) #0.999990) #0.442086, Agre) #0.999990, Smic);   [P = 0.378786]

The 'Species in order:' section will be useful for the Part (B).

Part (A) lists the best trees in the sample (species trees) sorted in descending order of their posterior probabilities. There are four columns for each record (tree):

  1. The number of times (frequency) the species tree appears in the MCMC sample
  2. The posterior probability of the tree, i.e. its frequency over the number of samples in the MCMC file (nsample+1).
  3. The cumulative posterior probability of all previous (better) trees including the current one, i.e. the sum of posteriors upto that line.
  4. The species tree in newick notation.

The two best trees differ on the placement of Agre and are illustrated below:

    37879  0.37879  0.37879 (((Adig, (Agra, (Amad, Arub))), Agre), Smic);
    36816  0.36816  0.74694 (((Adig, Agre), (Agra, (Amad, Arub))), Smic);

The red annotations indicate the differences between the two trees.

Part (B) shows the best splits in the sample of trees along with their posterior probabilities. Each split, often referred to as a bipartition, is defined by a specific branch in the tree. Removing this branch divides the tree into two distinct components or subgraphs, thereby partitioning the taxa into two distinct sets. The figure below illustrates the concept of bipartitions (or splits):

Back to the BPP output. There are three columns:

  1. The number of times (frequency) the split appears in the MCMC sample. The frequency is computed by constructing bipartitions for each tree in the sample and tallying their occurences.
  2. The posterior probability of the split, calculated as the frequency over the number of samples in the MCMC file. It provides a measure of confidence in the observed split.
  3. The bipartition in bit-vector format. The bit-vector succinctly indicates the partition that each taxon falls into (either 0 or 1), following the order specified above the Part A output. In this case, the order is of taxa is: Adig, Agra, Agre, Amad, Arub, Smic. Below you can see the taxon-encoded bipartitions (taxa falling in partition 0 are on the left side of the bipartition).
Frequency Posterior probability Bit-encoded bipartition Taxa encoded bipartition
100000 0.999990 010110 Adig,Agre,Smic | Agra,Amad,Arub
100000 0.999990 111110 Smic | Adig,Agra,Agre,Amad,Arub
88001 0.880001 000110 Adig,Agra,Agre,Smic | Amad,Arub
44209 0.442086 110110 Agre,Smic | Adig,Agra,Amad,Arub
40972 0.409716 101000 Agra,Amad,Arub,Smic | Adig,Agre
14820 0.148199 011110 Adig,Smic | Agra,Agre,Amad,Arub
10562 0.105619 010100 Adig,Agre,Arub,Smic | Agra,Amad
1437 0.014370 010010 Adig,Agre,Amad,Smic | Agra,Arub
1 0.000010 111101 Adig,Agra,Agre,Amad,Smic | Arub
1 0.000010 010101 Adig,Agre,Arub | Agra,Amad,Smic
1 0.000010 000101 Adig,Agra,Agre,Arub | Amad,Smic

Exercise: Calculate the RF distance between the two best trees (Tree 1 and Tree 2 in the figure above).

Part (C) presents the majority-rule consensus tree in newick format. This tree is constructed by amalgamating all splits with frequency exceeding 50%. This criterion ensures that only the most robust and well-supported splits contribute to the final consensus tree. However, in cases where the available data do not provide sufficient information, the majority-rule consensus tree might contain multifurcations, which indicate instances where the data lack the necessary detail to resolve the corresponding clades. For example, the first bipartition (010110) tells us that Agra, Amad, and Arub appeared in the same clade in all but one of the samples.

Part (D) shows the best tree from the MCMC sample file in newick format and its posterior probability from Part A. This tree is bifurcating as it comes from the sample file, and has support values on each inner node. The support value represents the frequency of the split (Part B) defined by its parental node.

Sample result files

Below are some results from our runs.

# Long run Short run
Output Tarball Output Tarball
1 out.long.A01.r1.txt A01.long.r1.tar.gz out.short.A01.r1.txt A01.short.r1.tar.gz
2 out.long.A01.r2.txt A01.long.r2.tar.gz out.short.A01.r2.txt A01.short.r2.tar.gz
3 out.long.A01.r3.txt A01.long.r3.tar.gz out.short.A01.r3.txt A01.short.r3.tar.gz
4 out.long.A01.r4.txt A01.long.r4.tar.gz out.short.A01.r4.txt A01.short.r4.tar.gz

Species Tree Inference with Astral | BPP assumptions | BPP control file | Species Tree Inference with BPP | Parameter Estimation with BPP

⚠️ **GitHub.com Fallback** ⚠️