Hypothesis testing with BPP
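In model selection we want to choose between two models, $M_0$ and $M_1$. We compare them using the Bayes factor

$$B_{10} = \frac{f(X \mid M_1)}{f(X \mid M_0)},$$

where $f(X \mid M_i)$ is the marginal likelihood for model $M_i$.

(Note: If you want to consider the probability of a model given the data, you could use Bayes' theorem:

$$P(M_i \mid X) = \frac{P(M_i)\, f(X \mid M_i)}{P(M_0)\, f(X \mid M_0) + P(M_1)\, f(X \mid M_1)},$$

where $P(M_i)$ is the prior probability of model $M_i$.)

For $B_{10}$, the strength of evidence against $M_0$ is commonly described using the following scale:

$2 \ln B_{10}$ | $B_{10}$ | Evidence against $M_0$ |
---|---|---|
0 to 2 | 1 to 3 | not worth more than a bare mention |
2 to 6 | 3 to 20 | positive |
6 to 10 | 20 to 150 | strong |
>10 | >150 | very strong |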
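To approximate the marginal likelihood, we will use the path-sampling or thermodynamic integration approach, which requires us to generate a series of so-called power posteriors, defined as

$$f_\beta(\theta \mid X) \propto f(\theta)\, f(X \mid \theta)^\beta, \quad 0 \le \beta \le 1,$$

which becomes the prior if $\beta = 0$ and the posterior if $\beta = 1$. The log marginal likelihood is the integral of the expected log-likelihood under the power posterior:

$$\ln f(X) = \int_0^1 E_\beta[\ln f(X \mid \theta)]\, d\beta.$$

We use Gaussian quadrature to approximate the integral over $\beta$, which requires an MCMC run at each of a small number of $\beta$ values (the quadrature points or steps).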
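A short derivation may help to see why averaging log-likelihoods under the power posteriors recovers the log marginal likelihood. This is the standard thermodynamic integration argument, written with a normalising constant $z(\beta)$ that is introduced here only for illustration and assumes a proper prior $f(\theta)$:

$$z(\beta) = \int f(\theta)\, f(X \mid \theta)^\beta \, d\theta, \qquad z(0) = 1, \qquad z(1) = f(X)$$

$$\frac{d}{d\beta} \ln z(\beta) = \frac{1}{z(\beta)} \int \ln f(X \mid \theta)\, f(\theta)\, f(X \mid \theta)^\beta \, d\theta = E_\beta[\ln f(X \mid \theta)]$$

$$\ln f(X) = \ln z(1) - \ln z(0) = \int_0^1 E_\beta[\ln f(X \mid \theta)]\, d\beta$$

Each MCMC run at a fixed $\beta$ therefore supplies one value of the integrand, namely the average log-likelihood sampled under that power posterior.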
Determining the number of steps required can be challenging, but we can repeat the analysis with increasing numbers of steps to check the stability of the marginal lnL estimate. We will use 8 steps here to expedite the process, but we will also examine some results with 32 steps to assess the impact of the number of steps on marginal lnL estimates and model selection outcomes.
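The quadrature step can be sketched in a few lines of Python. The sketch below assumes Gauss-Legendre quadrature on [-1, 1] mapped to [0, 1], which is consistent with the divide-by-two rule used later in this tutorial but is not taken from the BPP source code. The function names (`gauss_legendre`, `beta_points`, `integrate_over_beta`) are hypothetical, and the toy integrand simply stands in for the average log-likelihood that BPP would report at each $\beta$.

```python
import math

def gauss_legendre(n, tol=1e-14):
    """Nodes and weights of n-point Gauss-Legendre quadrature on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        # standard initial guess for the i-th root of the Legendre polynomial P_n
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(100):
            # evaluate P_n(x) and P_{n-1}(x) with the three-term recurrence
            p_prev, p = 1.0, x
            for k in range(2, n + 1):
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            # derivative P_n'(x), used for the Newton step and for the weight
            dp = n * (x * p - p_prev) / (x * x - 1.0)
            dx = p / dp
            x -= dx
            if abs(dx) < tol:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights

def beta_points(n):
    """Map the nodes from [-1, 1] to beta values in (0, 1); weights still sum to 2."""
    nodes, weights = gauss_legendre(n)
    betas = [(1.0 + x) / 2.0 for x in nodes]
    return betas, weights

def integrate_over_beta(g, n):
    """Approximate the integral of g(beta) over [0, 1] as half the weighted sum."""
    betas, weights = beta_points(n)
    return 0.5 * sum(w * g(b) for b, w in zip(betas, weights))

# toy integrand standing in for E_beta[ln f(X)]; compare 8 and 32 points
for n in (8, 32):
    print(n, integrate_over_beta(math.exp, n))
```

For a smooth integrand the two estimates agree closely. With real data the integrand can change rapidly near $\beta = 0$, which is why repeating the analysis with more points is a useful stability check.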
Here, we will compare the following six models:
Model | Description | Control file |
---|---|---|
1 | The backbone species tree from Karimi et al. (2020) | model1.ctl |
2 | Episodic introgression from A. digitata into A. grandidieri | model2.ctl |
3 | Episodic introgression from A. rubrostipa into A. madagascariensis | model3.ctl |
4 | Two episodic introgression events | model4.ctl |
5 | Species tree we inferred on Day 3 | model5.ctl |
6 | An alternative hybridisation hypothesis | model6.ctl |
Download the six control files to your folder:
```bash
# you can download each file separately using wget
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model1.ctl
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model2.ctl
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model3.ctl
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model4.ctl
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model5.ctl
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model6.ctl
# *OR* download all of them using one command (for loop)
for i in {1..6}; do wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model${i}.ctl; done
```
For each model we will generate the necessary folder structure and the control files for each of the 8 $\beta$ values (quadrature points):
```
|-- baobab
|   |-- A00
|   |   |-- baobab.A00.ctl
|   |   `-- baobab.A00.msci.ctl
|   |-- A01
|   |   `-- baobab.A01.ctl
|   |-- baobab.map.txt
|   `-- baobab.phy
`-- BPP
    `-- bf
        |-- model1
        |   |-- 1
        |   |   `-- model1.ctl.1
        |   |-- 2
        |   |   `-- model1.ctl.2
        |   |-- 3
        |   |   `-- model1.ctl.3
        |   |-- 4
        |   |   `-- model1.ctl.4
        |   |-- 5
        |   |   `-- model1.ctl.5
        |   |-- 6
        |   |   `-- model1.ctl.6
        |   |-- 7
        |   |   `-- model1.ctl.7
        |   |-- 8
        |   |   `-- model1.ctl.8
        |   |-- betaweights.csv
        |   |-- model1.ctl
        ....
        `-- model6
            |-- 1
            |   `-- model6.ctl.1
            |-- 2
            |   `-- model6.ctl.2
            |-- 3
            |   `-- model6.ctl.3
            |-- 4
            |   `-- model6.ctl.4
            |-- 5
            |   `-- model6.ctl.5
            |-- 6
            |   `-- model6.ctl.6
            |-- 7
            |   `-- model6.ctl.7
            |-- 8
            |   `-- model6.ctl.8
            |-- betaweights.csv
            |-- model6.ctl
```
To generate the above folder structure, run the commands below:
```bash
# create folder structure
cd
mkdir -p DAY-4/BPP/bf/model{1..6}/{1..8}
# download control files
cd ~/DAY-4/BPP/bf
for i in {1..6}; do cd model$i; wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model${i}.ctl; cd ../; done
# create control files with different beta values
cd ~/DAY-4/BPP/bf
for i in {1..6}; do cd model$i; bpp --bfdriver model${i}.ctl --points 8; cd ../; done
# move each of the 8 files of each model to separate subfolders
cd ~/DAY-4/BPP/bf
for i in {1..6}; do cd model$i; for j in {1..8}; do mv model$i.ctl.$j $j; done; cd ../; done
```
We now have 48 control files (6 models × 8 $\beta$ values).
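As a quick sanity check, the expected layout can also be verified programmatically. The following is a minimal sketch, assuming the folder and file naming shown in the diagram above; the function name `missing_control_files` is hypothetical and not part of BPP.

```python
from pathlib import Path

def missing_control_files(root, n_models=6, n_points=8):
    """List the expected per-beta control files that are not present under root."""
    root = Path(root).expanduser()
    missing = []
    for i in range(1, n_models + 1):
        for j in range(1, n_points + 1):
            # expected location: modelI/J/modelI.ctl.J
            path = root / f"model{i}" / str(j) / f"model{i}.ctl.{j}"
            if not path.is_file():
                missing.append(path)
    return missing

# assumes the tutorial's default location; adjust the path if yours differs
missing = missing_control_files("~/DAY-4/BPP/bf")
print(f"{48 - len(missing)} of 48 control files found")
for path in missing:
    print("missing:", path)
```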
You can check that your directory and file structure resembles the diagram above (Note: the above diagram is shortened and does not show the directories for models 2 to 5).
```bash
cd ~/DAY-4
tree
# or: tree --charset=ascii
```
Select a model (say X) and one of the 8 files (say Y), and run bpp. IMPORTANT: Change X and Y in the following command to the numbers you picked:
```bash
cd ~/DAY-4/BPP/bf/modelX/Y
bpp --cfile modelX.ctl.Y
```
Each `modelX` folder contains a file called `betaweights.csv` (depending on the BPP version, it may be called `modelX.ctrl.betaweights.csv`; if so, simply rename it to `betaweights.csv`, which is the name we use below). We can use that file to calculate the marginal log-likelihood once the 8 runs for model X finish. The file has three columns: `beta` (beta value), `weight` (quadrature weight), and `ElnfX` (average log-likelihood for power beta). The first two columns are already filled in by BPP, and we need to fill in the last column with the output from BPP.
At the end of each run you will notice a line:
```
BFbeta = B E_b(lnf(X)) = L
```
You can also obtain the value by grepping the output file for `BFbeta` once your run finishes:
```bash
grep BFbeta out.modelX.txt
```
Copy the value L into the corresponding line in `betaweights.csv`. Make sure the value in the beta column matches the value B in the BPP output.
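Copying 8 values by hand for each model is error-prone, so the step above can be sketched as a small script. This is only an illustration: it assumes the `BFbeta` line has exactly the form shown above and that the header of `betaweights.csv` is `beta,weight,ElnfX`. The function names `parse_bfbeta` and `fill_betaweights` are hypothetical, and the numbers in the example line are made up.

```python
import csv
import math
import re

# pattern for a line of the form: BFbeta = B E_b(lnf(X)) = L
BF_LINE = re.compile(r"BFbeta\s*=\s*(\S+)\s+E_b\(lnf\(X\)\)\s*=\s*(\S+)")

def parse_bfbeta(text):
    """Return (beta, average log-likelihood) from the last BFbeta line in text."""
    matches = BF_LINE.findall(text)
    if not matches:
        raise ValueError("no BFbeta line found")
    beta, elnfx = matches[-1]
    return float(beta), float(elnfx)

def fill_betaweights(csv_path, beta, elnfx, rel_tol=1e-4):
    """Write elnfx into the row of csv_path whose beta matches the BPP output."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        fields = reader.fieldnames
        rows = list(reader)
    hits = [r for r in rows if math.isclose(float(r["beta"]), beta, rel_tol=rel_tol)]
    if len(hits) != 1:
        raise ValueError(f"expected one row with beta={beta}, found {len(hits)}")
    hits[0]["ElnfX"] = repr(elnfx)
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)

# made-up example line, for illustration only
print(parse_bfbeta("BFbeta = 0.5  E_b(lnf(X)) = -1234.5"))
```

Matching rows by beta value rather than by run number mirrors the advice above to check that the beta column agrees with the BPP output.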
As we might not be able to finish all runs in time during the tutorial, the below table contains the output files of the 48 runs:
Model | 8 output files |
---|---|
1 | bf-small-out1.tar.gz |
2 | bf-small-out2.tar.gz |
3 | bf-small-out3.tar.gz |
4 | bf-small-out4.tar.gz |
5 | bf-small-out5.tar.gz |
6 | bf-small-out6.tar.gz |
Once `betaweights.csv` is complete, we can calculate the marginal log-likelihood for the model with the equation

$$\ln f(X \mid M) \approx \frac{1}{2} \sum_{b=1}^{8} w_b\, E_{\beta_b}[\ln f(X \mid \theta)],$$

where $w_b$ is the quadrature weight for $\beta_b$. In other words, multiply the columns weight and ElnfX pairwise, add up the products and divide by two. You can do that with the following command:
```bash
cut -f 2,3 -d',' betaweights.csv | tail -n +2 | awk -F ',' '{printf "%.6f\n", $1 * $2}' | paste -sd+ | bc -l | awk '{printf "%.6f\n", $1/2}'
```
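The same calculation can be sketched in Python, which may be easier to read and to reuse across models. Like the shell command, this sketch takes the weight and the average log-likelihood from the second and third columns and skips the header line; the function name `marginal_lnl` is hypothetical.

```python
import csv

def marginal_lnl(csv_path):
    """Half the sum of weight * ElnfX over all rows of a completed betaweights.csv."""
    total = 0.0
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header line
        for row in reader:
            if not row:
                continue  # ignore blank lines
            # columns: beta, weight, ElnfX; float('') fails loudly if ElnfX is still empty
            total += float(row[1]) * float(row[2])
    return total / 2.0

# usage (after all 8 runs are entered): print(marginal_lnl("betaweights.csv"))
```

An unfilled ElnfX cell raises a `ValueError` rather than being silently treated as zero, which guards against computing the estimate before all runs have been entered.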
Create a table of marginal log-likelihoods for the six models:

Model | Marginal log-likelihood |
---|---|
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
We should then be able to calculate log-scale Bayes factors by subtracting marginal log-likelihoods. For example, to compare models 3 and 6:

$$\ln B_{36} = \ln f(X \mid M_3) - \ln f(X \mid M_6)$$
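Once the table is filled in, the comparison can be sketched as follows. The marginal log-likelihoods in the example are hypothetical placeholders, not results from the tutorial. The posterior model probabilities assume equal prior probabilities for all models, and the verbal labels follow the evidence scale tabulated at the top of this page. The function names are illustrative only.

```python
import math

def evidence_label(two_ln_b):
    """Verbal category for 2 ln B on the evidence scale used in this tutorial."""
    if two_ln_b <= 2:
        return "not worth more than a bare mention"
    if two_ln_b <= 6:
        return "positive"
    if two_ln_b <= 10:
        return "strong"
    return "very strong"

def compare_models(lnl):
    """lnl maps model name -> marginal log-likelihood; returns rows sorted best-first.

    Each row is (model, 2 ln B of the best model against it, label, posterior prob).
    """
    best = max(lnl, key=lnl.get)
    # subtract the maximum before exponentiating to avoid underflow
    rel = {m: math.exp(v - lnl[best]) for m, v in lnl.items()}
    norm = sum(rel.values())
    rows = []
    for m in sorted(lnl, key=lnl.get, reverse=True):
        two_ln_b = 2.0 * (lnl[best] - lnl[m])
        label = "best model" if m == best else evidence_label(two_ln_b)
        rows.append((m, two_ln_b, label, rel[m] / norm))
    return rows

# hypothetical marginal log-likelihoods, for illustration only
example = {"model3": -100.0, "model6": -104.0}
for row in compare_models(example):
    print(row)
```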
Exercise: Below is a table with our completed larger runs with 16 points. You can download the completed `betaweights.csv` and try to calculate the marginal log-likelihood for each model, and then decide which model is the most probable. Also compare the marginal likelihoods with the ones produced from the smaller runs with 8 points.
Marginal likelihood calculation results.
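Bayes factors can be approximated using the Savage-Dickey density ratio when the models being compared are nested, which avoids the expensive quadrature algorithm.

Suppose $M_0$ is nested within $M_1$ and is obtained by fixing a parameter of $M_1$ at a null value, for example the introgression probability at $\varphi = 0$.
We define a null region or region of null effects, $\varnothing = \{\varphi < \epsilon\}$, where $\epsilon$ is a small cutoff. The Bayes factor in favour of $M_1$ is then approximated by

$$B_{10} \approx \frac{P(\varnothing \mid M_1)}{P(\varnothing \mid X, M_1)}.$$

The above equation expresses the Bayes factor as a ratio of the prior and posterior probabilities of the null region under model $M_1$. Note: this is not a log Bayes factor.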
To compare Model 2 (introgression from A. digitata into A. grandidieri) against Model 1 (no gene flow), we have:
```python
import math
import pandas as pd

# set our cutoff (the upper bound of the null region)
cutoff = 0.01
# name of the phi column; it must match the header of your MCMC file exactly
column = 'phi_x<-w'
# read the MCMC file for model 2 as a tab-separated table
data = pd.read_table('mcmc.model2.txt')
# posterior is the proportion of phi samples smaller than cutoff in the MCMC file
posterior = (data[column] < cutoff).mean()
# calculate prior
# Note: the prior can be calculated using: scipy.stats.beta.cdf(cutoff,a=1,b=1)
# but since Beta(a=1,b=1) corresponds to the uniform distribution, then the cdf = cutoff.
prior = cutoff
# calculate Bayes factor (infinite if no sample falls inside the null region)
bf = prior / posterior if posterior > 0 else math.inf
print(bf)
```
You can download the following script file:
```bash
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/scripts/savage-dickey.py
chmod +x savage-dickey.py
./savage-dickey.py 0.01 "phi:14<-10:x<-w" 1 1 mcmc.model2.txt
```
Note: it is important to enclose the column name (here `phi:14<-10:x<-w`) in quotes, because bash would otherwise interpret the `<` character as input redirection.
The syntax is: `./savage-dickey.py cutoff column beta_prior_a beta_prior_b mcmcfile`
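To illustrate what the two Beta prior arguments do, here is a stdlib-only sketch of the same calculation for a general Beta(a, b) prior on the introgression probability. It is not the contents of `savage-dickey.py`; the function names are hypothetical, the Beta CDF is approximated with a simple midpoint rule that is only adequate for a, b >= 1, and the MCMC file is assumed to be tab-separated with a header line. The sample values are made up.

```python
import csv
import math

def beta_cdf(x, a, b, steps=10000):
    """Beta(a, b) CDF at x via the midpoint rule (assumes a, b >= 1 and 0 < x <= 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h  # midpoint of the k-th subinterval, strictly inside (0, x)
        total += math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log1p(-t))
    return total * h

def read_column(path, column):
    """Read one named column of a tab-separated MCMC sample file as floats."""
    with open(path, newline="") as handle:
        return [float(row[column]) for row in csv.DictReader(handle, delimiter="\t")]

def savage_dickey(samples, cutoff, a=1.0, b=1.0):
    """B10 as prior / posterior probability of the null region phi < cutoff."""
    prior = beta_cdf(cutoff, a, b)
    posterior = sum(1 for s in samples if s < cutoff) / len(samples)
    # no posterior sample in the null region: the estimate is unbounded
    return math.inf if posterior == 0 else prior / posterior

# made-up phi samples, for illustration only
print(savage_dickey([0.005, 0.2, 0.3, 0.4], cutoff=0.01))
```

When no posterior sample falls inside the null region, the ratio is unbounded; in practice this means the MCMC sample is too small to estimate the Bayes factor, only to bound it from below.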
Below is a table of completed long runs for the six models. You can use those to calculate the Savage-Dickey density ratio.
Model |
---|
model1-large.tar.gz |
model2-large.tar.gz |
model3-large.tar.gz |
model4-large.tar.gz |
model5-large.tar.gz |
model6-large.tar.gz |
Exercise: Compare models 2 and 3 against model 1. Use the MCMC files from the above table and the savage-dickey.py script.
Below is an example for comparing model 2 against model 1.
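```bash
# create the working directory if it does not exist yet
mkdir -p ~/DAY-4/SD
cd ~/DAY-4/SD
wget https://github.com/bpp/bpp-tutorial-geneflow/raw/main/fourth-day/bf/model2-large.tar.gz
tar zxvf model2-large.tar.gz
cd model2-large
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/scripts/savage-dickey.py
# make the freshly downloaded script executable
chmod +x savage-dickey.py
./savage-dickey.py 0.01 "phi:14<-10:x<-w" 1 1 mcmc.model2.txt
```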