Hypothesis testing with BPP - bpp/bpp-tutorial-geneflow GitHub Wiki

Background

In model selection we want to choose between two models, $M_1$ and $M_0$, on the basis of observed data $X$. The parameters of the two models are represented by vectors $\theta_1$ and $\theta_0$, respectively. We assess the two models using the Bayes factor, i.e. the ratio of their marginal likelihood values:

$$ B_{10} = \frac{f(X\ |\ M_1)}{f(X\ |\ M_0)} $$

where

$$ f(X\ |\ M) = \int_\theta f(X\ |\ \theta, M) f(\theta\ |\ M)d\theta $$

is the marginal likelihood for model $M$. It is the likelihood function integrated over the parameter space, i.e. the average probability of generating the observed sample over all possible parameter values, weighted by the prior. You may think of it as the probability of the model (but it's not).

(Start of note: If you want to consider the probability of a model given the data, you could use Bayes' theorem:

$$ P(M_k\ |\ X) = \frac{P(X\ |\ M_k)P(M_k)}{P(X)} $$

where $P(X) = \sum_j P(X\ |\ M_j)P(M_j)$ is the total probability of the data considering all possible models. End of note)
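To make the marginal-likelihood integral concrete, here is a small toy example (not part of BPP): for one observation $x$ from $N(\theta, 1)$ with prior $\theta \sim N(0, 1)$, the marginal likelihood is known exactly as the $N(0, 2)$ density at $x$, and a Monte Carlo average of the likelihood over draws from the prior recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy model: one observation x from N(theta, 1), prior theta ~ N(0, 1);
# the marginal likelihood f(x) = N(x; 0, 2) is known analytically
x = 0.5

def norm_pdf(v, mean, var):
    return np.exp(-(v - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Monte Carlo: average the likelihood over draws from the prior
theta = rng.normal(0.0, 1.0, size=1_000_000)
mc = norm_pdf(x, theta, 1.0).mean()

# both values are close to ~0.265
print(mc, norm_pdf(x, 0.0, 2.0))
```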

For $B_{10}>1$ we have evidence in favor of $M_1$, while $B_{10}<1$ means evidence in favor of $M_0$. We typically operate on the log scale, and we consider the test significant when $|\log B_{10}| > 4.6$ (i.e., $B_{10} > 100$ or $B_{10} < 1/100$). Below is a table from Kass and Raftery (1995):

| $2\log_e(B_{10})$ | $B_{10}$ | Evidence against $M_0$ |
|-------------------|----------|------------------------|
| 0 to 2 | 1 to 3 | not worth more than a bare mention |
| 2 to 6 | 3 to 20 | positive |
| 6 to 10 | 20 to 150 | strong |
| >10 | >150 | very strong |
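As a minimal sketch, the table can be turned into a small helper that maps a Bayes factor onto the Kass and Raftery scale (the function name here is ours, not part of BPP):

```python
import math

def evidence_against_M0(B10):
    """Kass & Raftery (1995) scale, based on 2 * ln(B10)."""
    two_lnB = 2.0 * math.log(B10)
    if two_lnB > 10:
        return 'very strong'
    if two_lnB > 6:
        return 'strong'
    if two_lnB > 2:
        return 'positive'
    return 'not worth more than a bare mention'

print(evidence_against_M0(150))  # 2*ln(150) ~ 10.02 -> very strong
```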

To approximate the marginal likelihood, we will use the path-sampling or thermodynamic integration approach, which requires us to generate a series of so-called power posteriors, defined as

$$ f(\theta\ |\ X) \propto f(X\ |\ \theta)^\beta f(\theta), \quad \beta \in [0,1] $$

which becomes the prior when $\beta = 0$ and the posterior when $\beta=1$, so that intermediate values of $\beta$ form a path from the prior to the posterior: varying $\beta$ smoothly interpolates between the two distributions. The logarithm of the marginal likelihood $f(X\ |\ M) = \int_\theta f(\theta)f(X\ |\ \theta)\,d\theta$ is then given by

$$ \log f(X\ |\ M) = \int_0^1 {\cal E}_\beta \{ \log f(X\ |\ \theta) \}\, d\beta $$

We use Gaussian quadrature to approximate the integral over $\beta$, with $n=8$ points in the Gauss-Legendre rule. The $\beta$ values are given by $\beta_i = \frac{1}{2}(x_i + 1)$, for $i=1,\ldots,n$, where the $x_i$, with $-1 < x_i < 1$, are the Gauss-Legendre points. For each $\beta_i$, we run an MCMC algorithm to generate a sample from the corresponding power posterior, and then calculate the average of $\log f(X\ |\ \theta)$ over the MCMC sample as an approximation to the expectation ${\cal E}_{\beta_i}\{ \log f(X\ |\ \theta)\}$. The integral is then approximated by

$$ \log f(X\ |\ M) \approx \frac{1}{2} \sum_{i=1}^n w_i \times {\cal E}_{\beta_i} \{\log f(X\ |\ \theta)\} $$
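The quadrature recipe above can be sketched with NumPy's Gauss-Legendre routine; the sanity check at the end uses a made-up, constant mean log-likelihood rather than real BPP output:

```python
import numpy as np

# Gauss-Legendre abscissae x_i and weights w_i on (-1, 1), n = 8 points
n = 8
x, w = np.polynomial.legendre.leggauss(n)

# transform abscissae to beta values on (0, 1): beta_i = (x_i + 1) / 2
betas = 0.5 * (x + 1.0)

def marginal_lnL(weights, mean_lnf):
    """Approximate log f(X|M) as 0.5 * sum_i w_i * E_{beta_i}{log f(X|theta)}."""
    return 0.5 * np.sum(weights * np.asarray(mean_lnf))

# sanity check: if every power posterior gave the same mean log-likelihood c,
# the estimate equals c, because the Gauss-Legendre weights sum to 2
print(marginal_lnL(w, np.full(n, -1234.5)))
```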

Determining the number of steps required can be challenging, but we can repeat the analysis with increasing numbers of steps to check the stability of the marginal lnL estimate. We will use 8 steps here to expedite the process, but we will also examine some results with 32 steps to assess the impact of the number of steps on marginal lnL estimates and model selection outcomes.

Practical

Here, we will compare the following six models:

| Model | Description | Control file |
|-------|-------------|--------------|
| 1 | The backbone species tree from Karimi et al. (2020) | model1.ctl |
| 2 | Episodic introgression from *A. digitata* into *A. grandidierii* | model2.ctl |
| 3 | Episodic introgression from *A. rubrostipa* into *A. madagascariensis* | model3.ctl |
| 4 | Two episodic introgression events | model4.ctl |
| 5 | Species tree we inferred on Day 3 | model5.ctl |
| 6 | An alternative hybridisation hypothesis | model6.ctl |

Download the six control files to your folder:

# you can download each file separately using wget
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model1.ctl
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model2.ctl
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model3.ctl
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model4.ctl
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model5.ctl
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model6.ctl

# *OR* download all of them using one command (for loop)
for i in {1..6}; do wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model${i}.ctl; done

For each model we will generate the necessary folder structure and the control files for each of the $n=8$ points. The control files for each model are identical except for one line that specifies the $\beta$ value of the power posterior. To run our analyses, we will use the following folder structure:

|-- baobab
|   |-- A00
|   |   |-- baobab.A00.ctl
|   |   `-- baobab.A00.msci.ctl
|   |-- A01
|   |   `-- baobab.A01.ctl
|   |-- baobab.map.txt
|   `-- baobab.phy
`-- BPP
    `-- bf
        |-- model1
        |   |-- 1
        |   |   `-- model1.ctl.1
        |   |-- 2
        |   |   `-- model1.ctl.2
        |   |-- 3
        |   |   `-- model1.ctl.3
        |   |-- 4
        |   |   `-- model1.ctl.4
        |   |-- 5
        |   |   `-- model1.ctl.5
        |   |-- 6
        |   |   `-- model1.ctl.6
        |   |-- 7
        |   |   `-- model1.ctl.7
        |   |-- 8
        |   |   `-- model1.ctl.8
        |   |-- betaweights.csv
        |   |-- model1.ctl
  ....
        `-- model6
            |-- 1
            |   `-- model6.ctl.1
            |-- 2
            |   `-- model6.ctl.2
            |-- 3
            |   `-- model6.ctl.3
            |-- 4
            |   `-- model6.ctl.4
            |-- 5
            |   `-- model6.ctl.5
            |-- 6
            |   `-- model6.ctl.6
            |-- 7
            |   `-- model6.ctl.7
            |-- 8
            |   `-- model6.ctl.8
            |-- betaweights.csv
            |-- model6.ctl

To generate the above folder structure and populate it with control files, run the commands below:

# create folder structure
cd
mkdir -p DAY-4/BPP/bf/model{1..6}/{1..8}

# download control files
cd ~/DAY-4/BPP/bf
for i in {1..6}; do cd model$i; wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/adansonia/bayes-factors/model${i}.ctl; cd ../; done

# create control files with different beta values
cd ~/DAY-4/BPP/bf
for i in {1..6}; do cd model$i; bpp --bfdriver model${i}.ctl --points 8; cd ../; done

# copy each of the 8 files of each model to separate subfolders
cd ~/DAY-4/BPP/bf
for i in {1..6}; do cd model$i; for j in {1..8}; do mv model$i.ctl.$j $j; done; cd ../; done

We now have 48 control files (6 models $\times$ 8 points). Each one of us picks 1-2 control files and runs them.

You can check that your directory and file structure resembles the diagram above (Note: the above diagram is shortened and does not show the directories for models 2 to 5).

cd ~/DAY-4
tree
# or:  tree --charset=ascii

Select a model (say X) and one of the 8 files (say Y), and run bpp. IMPORTANT: Change X and Y in the following command to the numbers you picked:

cd ~/DAY-4/BPP/bf/modelX/Y
bpp --cfile modelX.ctl.Y

Each modelX folder contains a file called betaweights.csv (depending on the BPP version, it may be called modelX.ctl.betaweights.csv; if so, simply rename it to betaweights.csv, which is the name we will use below). We can use that file to calculate the marginal log-likelihood once the 8 runs for model X have finished. The file has three columns: beta (the $\beta$ value), weight (the quadrature weight), and ElnfX (the average log-likelihood under that power posterior). The first two columns are already filled in by BPP; we need to fill the last column with the output from BPP. At the end of each run you will notice a line of the form (where B is the beta value and L is the average log-likelihood):

BFbeta = B E_b(lnf(X)) = L

You can also obtain the value by grepping the output file for BFbeta once your run finishes:

grep BFbeta out.modelX.txt

You can then copy the value L into the corresponding line of betaweights.csv. Make sure the value in the beta column matches the one in the BPP output.

As we might not be able to finish all runs in time during the tutorial, the table below contains the output files of the 48 runs:

| Model | 8 output files |
|-------|----------------|
| 1 | bf-small-out1.tar.gz |
| 2 | bf-small-out2.tar.gz |
| 3 | bf-small-out3.tar.gz |
| 4 | bf-small-out4.tar.gz |
| 5 | bf-small-out5.tar.gz |
| 6 | bf-small-out6.tar.gz |

Once the betaweights.csv is complete, we can calculate the marginal log-likelihood for the model with the Equation:

$$ \log f(X\ |\ M) \approx \frac{1}{2} \sum_{i=1}^n w_i \times {\cal E}_{\beta_i} \{\log f(X\ |\ \theta)\} $$

In other words, multiply the weight and ElnfX columns element-wise, sum the products, and divide by two. You can do that with the following command:

cut -f 2,3 -d',' betaweights.csv | tail -n +2 | awk -F ',' '{printf "%.6f\n", $1 * $2}' | paste -sd+ - | bc -l | awk '{printf "%.6f\n", $1/2}'
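If you prefer Python, the same calculation can be done with pandas. The two rows below are made-up placeholders so the snippet runs on its own; in practice you would read your completed betaweights.csv instead:

```python
import io
import pandas as pd

# in practice: bw = pd.read_csv('betaweights.csv')
# here we use two made-up rows so the snippet is self-contained
example = io.StringIO("beta,weight,ElnfX\n0.0199,0.0506,-4100.0\n0.9801,0.0506,-3900.0\n")
bw = pd.read_csv(example)

# multiply the weight and ElnfX columns row by row, sum, and halve
lnL = 0.5 * (bw['weight'] * bw['ElnfX']).sum()
print(f'{lnL:.6f}')  # -> -202.400000
```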

Create a table of marginal log-likelihoods for the six models:

| Model | $\log f(X \mid M)$ |
|-------|--------------------|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |

and we can calculate log-scale Bayes factors by subtracting marginal log-likelihoods. For example, to compare models 3 and 6:

$$ \log B_{36} = \log f(X \mid M_3) - \log f(X \mid M_6) $$

Exercise: Below is a table of our completed larger runs with 16 points. Download each completed betaweights.csv, calculate the marginal log-likelihood for each model, and decide which model is best supported. Also compare these marginal likelihoods with the ones produced from the smaller runs with 8 points.

| Model | Beta weights file |
|-------|-------------------|
| model-1.tar.gz | betaweights.csv |
| model-2.tar.gz | betaweights.csv |
| model-3.tar.gz | betaweights.csv |
| model-4.tar.gz | betaweights.csv |
| model-5.tar.gz | betaweights.csv |
| model-6.tar.gz | betaweights.csv |

Marginal likelihood calculation results.

Bayes factor approximation with the Savage-Dickey density ratio

When the models we are comparing are nested, Bayes factors can be approximated using the Savage-Dickey density ratio, thus avoiding the expensive quadrature algorithm.

Suppose $(\phi^{(1)},\phi^{(2)},\ldots,\phi^{(N)})$ is an MCMC sample from the posterior $\pi(\phi\ |\ X)$. These are the $\phi$ values (introgression probability) sampled during the MCMC, with the values for other parameters ignored.

We define a null region or region of null effects, $ø : |\phi - \phi_0| < \varepsilon$, inside which $\phi$ is very close to $\phi_0$. The null region is a small part of the parameter space for $M_1$ that represents $M_0$. We then define a Bayes factor to represent the evidence for $M_1$:

$$ B_{10,\varepsilon} = \frac{1 - \mathbb{P}(ø\ |\ X)}{\mathbb{P}(ø\ |\ X)} / \frac{1 - \mathbb{P}(ø)}{\mathbb{P}(ø)} \approx \frac{\mathbb{P}(ø)}{\mathbb{P}(ø\ |\ X)}$$

The above equation expresses the Bayes factor as a ratio of the prior and posterior probabilities of the null region under model $M_1$. Note that this is not a log Bayes factor.
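To see why the final approximation is reasonable when the null region has small probability under both the prior and the posterior, here is a quick numerical check with made-up probabilities:

```python
# made-up null-region probabilities, both much smaller than 1
p_prior, p_post = 0.01, 0.001

# exact odds-ratio form of B_10
exact = ((1 - p_post) / p_post) / ((1 - p_prior) / p_prior)

# small-probability approximation P(null) / P(null | X)
approx = p_prior / p_post

print(round(exact, 3), approx)  # 10.091 10.0
```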

For comparing model 2 (introgression from A. digitata into A. grandidierii) against model 1 (no gene flow), we have $\phi_0 = 0$ and use a cutoff value of 0.01. We estimate the posterior probability of the null region as the proportion of $\phi$ samples in the model 2 MCMC file that are smaller than the cutoff. The prior probability is the value of the prior cumulative distribution function at the cutoff; in our case the prior on $\phi$ is $\mathrm{Beta}(a,b)$. The Bayes factor is then the ratio of the prior to the posterior probability. The Python code below illustrates the calculation.

import pandas as pd

# set our cutoff
cutoff = 0.01

# read the MCMC file for model 2 as a table
data = pd.read_table('mcmc.model2.txt')

# posterior: proportion of phi samples in the MCMC file smaller than the cutoff
posterior = (data['phi_x<-w'] < cutoff).mean()

# prior: the Beta(a, b) CDF at the cutoff, e.g. scipy.stats.beta.cdf(cutoff, a=1, b=1);
# since Beta(a=1, b=1) is the uniform distribution, the CDF equals the cutoff itself
prior = cutoff

# Bayes factor: ratio of prior to posterior probability of the null region
bf = prior / posterior

You can download the following script file:

wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/scripts/savage-dickey.py
chmod +x savage-dickey.py
./savage-dickey.py 0.01 "phi:14<-10:x<-w" 1 1 mcmc.model2.txt

Note: it is important to enclose the column name in quotes, because the characters < and - may otherwise be interpreted by the Linux shell (bash), e.g. as a redirection, and cause issues.

The syntax is: ./savage-dickey.py cutoff column beta_prior_a beta_prior_b mcmcfile

Below is a table of completed long runs for the six models. You can use those to calculate the Savage-Dickey density ratio.

| Model |
|-------|
| model1-large.tar.gz |
| model2-large.tar.gz |
| model3-large.tar.gz |
| model4-large.tar.gz |
| model5-large.tar.gz |
| model6-large.tar.gz |

Exercise: Compare models 2 and 3 against model 1. Use the MCMC files from the above table and the savage-dickey.py script.

Below is an example for comparing model 2 against model 1.

cd ~/DAY-4/SD
wget https://github.com/bpp/bpp-tutorial-geneflow/raw/main/fourth-day/bf/model2-large.tar.gz
tar zxvf model2-large.tar.gz
cd model2-large
wget https://raw.githubusercontent.com/bpp/bpp-tutorial-geneflow/main/fourth-day/scripts/savage-dickey.py
./savage-dickey.py 0.01 "phi:14<-10:x<-w" 1 1 mcmc.model2.txt