Bootstrap - ganong-noel/lab_manual GitHub Wiki

Written by Tim Cejka

1. Goal

The purpose of this wiki page is to explain

  1. how a popular and intuitive but "naive" approach to computing bootstrap C.I.s is incorrect, and
  2. how to compute bootstrap C.I.s correctly.

As an example, I set the object of interest to be the confidence interval. The general approach in 3.2 is applicable to other objects of interest.

Task: We want to construct a $100(1-\alpha)\%$ C.I. for some parameter $\theta$, based on an estimate $\hat{\theta}$.

2. Definitions

A bootstrap confidence interval, say $C_1$, has a good coverage property iff ${Pr}(\theta \in C_1) = 1 - \alpha$.
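Coverage is a statement about repeated sampling, so it can be checked by simulation. The sketch below is an illustration only. It does not use a bootstrap interval; it uses the textbook interval $\bar{x} \pm 1.96/\sqrt{n}$ for a normal mean with known variance 1, and the values of `theta`, `n`, `reps`, and `seed` are arbitrary assumptions.

```python
import random

def coverage_of_z_interval(theta=2.0, n=25, reps=4000, seed=0):
    """Monte Carlo estimate of Pr(theta in C) for C = xbar +/- 1.96/sqrt(n)."""
    rng = random.Random(seed)
    half_width = 1.96 / n ** 0.5  # sigma = 1 is assumed known, alpha = 0.05
    hits = 0
    for _ in range(reps):
        # Draw a fresh sample from the (assumed) population N(theta, 1)
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        xbar = sum(sample) / n
        # Count the replication as a hit if the interval contains theta
        if xbar - half_width <= theta <= xbar + half_width:
            hits += 1
    return hits / reps

estimated_coverage = coverage_of_z_interval()
print(estimated_coverage)  # should be close to the nominal 0.95
```

An interval has good coverage in the sense above when this fraction matches $1-\alpha$.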

3. Methods

3.1 Naive bootstrap method: Efron's percentile interval

The procedure

  • This is known as Efron's percentile interval. It is very intuitive but, alas, incorrect.

  • Efron's original approach sets the "bootstrap object of interest" to $$\tilde{T}_n=\hat{\theta}$$ which is an estimator for $\theta$.

  • For this $\tilde{T}_n$ we can obtain $\tilde{q}_n^*(\alpha)$, the bootstrap estimate of the $\alpha$-th quantile of $\tilde{T}_n$, by simulation.

  • Efron's $100(1-\alpha)\%$ percentile C.I. for $\theta$ is then defined as

$$ C_1=\left[\tilde{q}_n^{*}(\alpha / 2), \tilde{q}_n^{*}(1-\alpha / 2)\right] $$
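The procedure above can be sketched as follows. This is a minimal illustration, not the lab's code. It assumes i.i.d. resampling with replacement and a simple order-statistic quantile rule, and the helper names (`empirical_quantile`, `bootstrap_estimates`, `efron_percentile_interval`) and the example data are hypothetical.

```python
import math
import random

def empirical_quantile(values, p):
    """Smallest draw such that at least a fraction p of the draws are <= it."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered)))
    return ordered[k - 1]

def bootstrap_estimates(data, estimator, B, rng):
    """Recompute the estimator on B resamples drawn with replacement from data."""
    n = len(data)
    return [estimator([rng.choice(data) for _ in range(n)]) for _ in range(B)]

def efron_percentile_interval(data, estimator, alpha=0.05, B=2000, seed=0):
    """C_1: the alpha/2 and 1 - alpha/2 quantiles of the bootstrap estimates."""
    rng = random.Random(seed)
    theta_star = bootstrap_estimates(data, estimator, B, rng)
    return (empirical_quantile(theta_star, alpha / 2),
            empirical_quantile(theta_star, 1 - alpha / 2))

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical example data: 50 draws from a skewed distribution
data_rng = random.Random(1)
data = [data_rng.expovariate(1.0) for _ in range(50)]

lo, hi = efron_percentile_interval(data, mean)
print(lo, hi)
```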

The problem

  • To see the poor coverage property of this estimator, it is instructive to change the object of interest to

$$T_n=\hat{\theta}-\theta$$

  • Let $q_n(\alpha)$ be the $\alpha$-th quantile of $T_n$ and $q^*_n(\alpha)$ be its bootstrap estimate.

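  • In the bootstrap world $\hat{\theta}$ is fixed, so the quantiles of $\hat{\theta}^*$ are those of $\hat{\theta}^*-\hat{\theta}$ shifted by $\hat{\theta}$, i.e. $\tilde{q}_n^*(\alpha)=\hat{\theta}+q_n^*(\alpha)$. Using this notation, Efron's percentile C.I. can be written as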

$$ C_1=\left[\hat{\theta}+q_n^{*}(\alpha / 2), \hat{\theta}+q_n^{*}(1-\alpha / 2)\right] $$

  • The infeasible counterpart of $C_1$, which replaces the bootstrap quantiles with the true quantiles, would then be

$$C_1^0=\left[\hat{\theta}+q_n(\alpha / 2), \hat{\theta}+q_n(1-\alpha / 2)\right]$$

  • (It is infeasible since by definition the population object $q_n(\alpha)$ is unobserved.)

  • Recall that $C^0_1$ has good coverage iff ${Pr}(\theta \in C_1^0) = 1 - \alpha$.

  • Consider then:

$${Pr}(\theta \in C_1^0)$$

$$ = {Pr}(\hat{\theta}+q_n(\alpha / 2) \leq \theta \leq \hat{\theta}+q_n(1-\alpha / 2))$$

$$ = {Pr}(-q_n(\alpha/2)\geq\hat{\theta}-\theta\geq-q_n(1-\alpha/2)) $$

$$ \neq 1-\alpha \quad \text{in general} $$

  • So ${Pr}(\theta \in C_1^0) = 1 - \alpha$ is not true in general. It does hold if the distribution of $T_n=\hat{\theta}-\theta$ is symmetric around 0, because then $q_n(\alpha/2) = -q_n(1-\alpha/2)$.

  • This means that Efron's (naive) C.I. performs poorly unless the pdf of $(\hat{\theta}-\theta)$ is symmetric around 0.

  • To see this method in code, see this link to unmerged development code.

  • Note also that this is perhaps the most popular bootstrap C.I. method, and you may see it on Stack Exchange.

  • Lesson: Clearly define your population object, then use bootstrap to approximate it.
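The coverage failure of $C_1^0$ derived above can be checked in closed form for a skewed example. The sketch below assumes, purely for illustration, that $T_n = X - 1$ with $X \sim \text{Exponential}(1)$. This choice is not from the source; it is used because its quantiles and CDF have simple formulas.

```python
import math

def q(p):
    """p-th quantile of T = X - 1 with X ~ Exponential(1) (illustrative assumption)."""
    return -math.log(1.0 - p) - 1.0

def cdf(t):
    """CDF of T = X - 1; T never falls below -1."""
    return 0.0 if t < -1.0 else 1.0 - math.exp(-(t + 1.0))

alpha = 0.05

# From the derivation: theta in C_1^0  <=>  -q(1 - alpha/2) <= T <= -q(alpha/2)
coverage_c1 = cdf(-q(alpha / 2)) - cdf(-q(1 - alpha / 2))
print(round(coverage_c1, 3))  # 0.861, well below the nominal 0.95
```

In this example the lower limit $-q_n(1-\alpha/2)$ lies below the support of $T_n$, so all of the missing coverage comes from the upper limit $-q_n(\alpha/2)$ cutting off the long right tail.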

3.2 Correct bootstrap method: Percentile interval

  • Let

$$T_n=\hat{\theta}-\theta$$

and let $q_n(\alpha)$ be the $\alpha$-th quantile of $T_n$.

  • If we knew $q_n(\alpha)$, then

$$1-\alpha={Pr}(q_n(\alpha / 2) \leq T_n \leq q_n(1-\alpha / 2))$$

$$={Pr}(\hat{\theta}-q_n(\alpha / 2) \geq \theta \geq \hat{\theta}-q_n(1-\alpha / 2))$$

  • This means that the exact (infeasible) C.I. is given by

$$C_2^0=\left[\hat{\theta}-q_n(1-\alpha / 2), \hat{\theta}-q_n(\alpha / 2)\right]$$

  • Now that we have clarified which population object we want to estimate, we can go to the bootstrap world and estimate $q_n(\alpha)$ by $q^*_n(\alpha)$, which is the $\alpha$-th quantile of the sorted bootstrap draws

$$T_{n(1)}^* \leq T_{n(2)}^* \leq \cdots \leq T_{n(B)}^*$$

where

$$T^{*}_{n(b)}=\hat{\theta}^{*}_{(b)}-\hat{\theta}$$

and $\hat{\theta}^{*}_{(b)}$ is the $b$-th smallest of the $B$ bootstrap estimates.

  • Then the bootstrap estimate of $C_2^0$ is

$$C_2=\left[\hat{\theta}-q_n^{*}(1-\alpha / 2), \hat{\theta}-q_n^{*}(\alpha / 2)\right]$$

  • To see how this method is implemented in code, see script 9_rdfo_analysis_at_cp_lookup.R on branch 257 in RDFO inside the firewall. This wiki focuses solely on the (asymptotic) coverage properties. The code and my February 10 2023 comment in JMPCIRDFO-268 also show details such as how to draw the bootstrap sample.
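For readers without access to that script, the steps in 3.2 can be sketched as follows. This is not the RDFO implementation. It assumes simple i.i.d. resampling with replacement, which may differ from how the bootstrap sample is drawn there, and the function names and example data are hypothetical.

```python
import math
import random

def empirical_quantile(values, p):
    """Smallest draw such that at least a fraction p of the draws are <= it."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered)))
    return ordered[k - 1]

def percentile_interval_c2(data, estimator, alpha=0.05, B=2000, seed=0):
    """C_2 = [theta_hat - q*(1 - alpha/2), theta_hat - q*(alpha/2)]."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = estimator(data)
    # Bootstrap analogue of T_n: T*_b = theta*_b - theta_hat
    t_star = []
    for _ in range(B):
        resample = [rng.choice(data) for _ in range(n)]  # assumed i.i.d. resampling
        t_star.append(estimator(resample) - theta_hat)
    q_lo = empirical_quantile(t_star, alpha / 2)
    q_hi = empirical_quantile(t_star, 1 - alpha / 2)
    # The quantiles swap ends: the upper quantile sets the lower limit
    return (theta_hat - q_hi, theta_hat - q_lo)

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical example data: 50 draws from a skewed distribution
data_rng = random.Random(1)
data = [data_rng.expovariate(1.0) for _ in range(50)]

lo, hi = percentile_interval_c2(data, mean)
print(lo, hi)
```

The only difference from Efron's interval is the last step: the quantiles of $\hat{\theta}^*-\hat{\theta}$ are subtracted from $\hat{\theta}$ in reversed order, rather than added to it in their original order.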