Bootstrap - ganong-noel/lab_manual GitHub Wiki
Written by Tim Cejka
1. Goal
The purpose of this wiki page is to explain
- why a popular and intuitive but "naive" approach to computing bootstrap C.I.s is incorrect, and
- how to compute bootstrap C.I.s correctly.
As an example, I set the object of interest to be the confidence interval. The general approach in 3.2 is applicable to other objects of interest.
Task: We want to construct a $100(1-\alpha)$% C.I. for some parameter $\theta$, based on its estimate $\hat{\theta}$.
2. Definitions
A bootstrap confidence interval, say $C_1$, has a good coverage property iff ${Pr}(\theta \in C_1) = 1 - \alpha$.
3. Methods
3.1 Naive bootstrap method: Efron's percentile interval
The procedure
- This is known as Efron's percentile interval. It is very intuitive but, alas, incorrect.
- Efron's original approach was to set the "bootstrap object of interest" as $$\tilde{T}_n=\hat{\theta}$$ which is an estimator for $\theta$.
- For this $\tilde{T}_n$ we can obtain $\tilde{q}_n^*(\alpha)$, the $\alpha$-th quantile of its bootstrap distribution, by simulation.
- Efron's $100(1-\alpha)$% percentile C.I. for $\theta$ is then defined as
$$ C_1=\left[\tilde{q}_n^{*}(\alpha / 2), \tilde{q}_n^{*}(1-\alpha / 2)\right] $$
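The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's code: the helper names `quantile` and `efron_percentile_ci`, the i.i.d. resampling with replacement, the sample mean as $\hat{\theta}$, and the simulated exponential data are all assumptions made for the example.

```python
import random
import statistics


def quantile(values, p):
    """Empirical p-quantile, interpolating linearly between order statistics."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def efron_percentile_ci(sample, estimator, alpha=0.05, n_boot=2000, seed=0):
    """Efron's interval C_1: quantiles of the bootstrap estimates themselves."""
    rng = random.Random(seed)
    n = len(sample)
    # Assumption: observations are i.i.d., so we resample rows with replacement.
    theta_stars = [estimator(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return quantile(theta_stars, alpha / 2), quantile(theta_stars, 1 - alpha / 2)


# Hypothetical data: 50 draws from a skewed (exponential) distribution.
data_rng = random.Random(1)
sample = [data_rng.expovariate(1.0) for _ in range(50)]
lower, upper = efron_percentile_ci(sample, statistics.fmean)
print(f"Efron percentile C.I.: [{lower:.3f}, {upper:.3f}]")
```

The interval is shown only to make the procedure concrete; its coverage problem is discussed next.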
The problem
- To see the poor coverage property of this interval, let's instructively change the object of interest to be
$$T_n=\hat{\theta}-\theta$$
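- Let $q_n(\alpha)$ be the $\alpha$-th quantile of $T_n$ and $q^*_n(\alpha)$ be its bootstrap estimate.
- The bootstrap analogue of $T_n$ is $T_n^*=\hat{\theta}^*-\hat{\theta}$, so $\tilde{q}_n^*(\alpha)=\hat{\theta}+q_n^*(\alpha)$. Using this notation, Efron's percentile C.I. can be written as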
$$ C_1=\left[\hat{\theta}+q_n^{*}(\alpha / 2), \hat{\theta}+q_n^{*}(1-\alpha / 2)\right] $$
- The infeasible counterpart of $C_1$, which uses the true quantiles, would then be
$$C_1^0=\left[\hat{\theta}+q_n(\alpha / 2), \hat{\theta}+q_n(1-\alpha / 2)\right]$$
- (It is infeasible since by definition the population object $q_n(\alpha)$ is unobserved.)
- Recall that $C^0_1$ has good coverage iff ${Pr}(\theta \in C_1^0) = 1 - \alpha$.
- Consider then:
$${Pr}(\theta \in C_1^0) = $$
$$ = {Pr}(\hat{\theta}+q_n(\alpha / 2) \leq \theta \leq \hat{\theta}+q_n(1-\alpha / 2))$$
$$ = {Pr}(-q_n(\alpha/2)\geq\hat{\theta}-\theta\geq-q_n(1-\alpha/2)) $$
$$ \neq 1-\alpha $$
- So ${Pr}(\theta \in C_1^0) = 1 - \alpha$ is not true in general. It does hold if the distribution of $T_n$ is symmetric around 0, since then $-q_n(\alpha/2) = q_n(1-\alpha/2)$ and the last probability equals ${Pr}(q_n(\alpha/2) \leq T_n \leq q_n(1-\alpha/2)) = 1-\alpha$.
- This means that Efron's (naive) C.I. performs poorly unless the pdf of $\hat{\theta}-\theta$ is symmetric around 0.
- To see this method in code, see this link to unmerged development code.
- Note also that this is perhaps the most popular bootstrap C.I. method, and you may see it on Stack Exchange.
- Lesson: Clearly define your population object, then use bootstrap to approximate it.
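To make the coverage failure concrete, the coverage of the infeasible interval $C_1^0$ can be computed exactly in a toy case. The sketch below assumes, purely for illustration, that $T_n=\hat{\theta}-\theta$ follows a centered exponential distribution ($T_n = X - 1$ with $X \sim \text{Exp}(1)$), which is skewed; the function names are hypothetical and nothing here comes from the lab's code.

```python
import math

alpha = 0.05


def quantile_T(p):
    """p-quantile of T = X - 1 with X ~ Exp(1) (assumed toy distribution)."""
    return -math.log(1.0 - p) - 1.0


def cdf_T(t):
    """CDF of T = X - 1 with X ~ Exp(1); the support starts at -1."""
    return 0.0 if t < -1.0 else 1.0 - math.exp(-(t + 1.0))


q_lo = quantile_T(alpha / 2)
q_hi = quantile_T(1 - alpha / 2)

# theta lies in C_1^0 exactly when -q_n(1 - alpha/2) <= T_n <= -q_n(alpha/2).
coverage_c1 = cdf_T(-q_lo) - cdf_T(-q_hi)
print(f"nominal coverage: {1 - alpha:.3f}, coverage of C_1^0: {coverage_c1:.3f}")
```

Under this assumed distribution the coverage is about 0.86 rather than the nominal 0.95, even though the true quantiles are used. The shortfall comes from the skewness of $T_n$, not from bootstrap approximation error.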
3.2 Correct bootstrap method: Percentile interval
- Let
$$T_n=\hat{\theta}-\theta$$
and let $q_n(\alpha)$ be the $\alpha$-th quantile of $T_n$.
- If we knew $q_n(\alpha)$, then
$$1-\alpha={Pr}(q_n(\alpha / 2) \leq T_n \leq q_n(1-\alpha / 2))$$
$$={Pr}(\hat{\theta}-q_n(\alpha / 2) \geq \theta \geq \hat{\theta}-q_n(1-\alpha / 2))$$
- This means that the exact (infeasible) C.I. is given by
$$C_2^0=\left[\hat{\theta}-q_n(1-\alpha / 2), \hat{\theta}-q_n(\alpha / 2)\right]$$
- Now that we have clarified which population object we want to estimate, we can go to the bootstrap world and estimate $q_n(\alpha)$ by $q^*_n(\alpha)$, which is the $\alpha$-th quantile of the ordered bootstrap draws
$$T_{n(1)}^* \leq T_{n(2)}^* \leq \cdots \leq T_{n(B)}^*$$
where $B$ is the number of bootstrap replications and each draw is computed from a bootstrap estimate $\hat{\theta}^{*}$ as
$$T^{*}_{n}=\hat{\theta}^{*}-\hat{\theta}$$
- Then the bootstrap estimate of $C_2^0$ is
$$C_2=\left[\hat{\theta}-q_n^{*}(1-\alpha / 2), \hat{\theta}-q_n^{*}(\alpha / 2)\right]$$
- To see how this method is implemented in code, see script 9_rdfo_analysis_at_cp_lookup.R on branch 257 in RDFO inside the firewall.
This wiki focuses solely on the (asymptotic) coverage properties. The code and my February 10, 2023 comment in JMPCIRDFO-268 also show details such as how to draw the bootstrap sample.
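For readers without access to the script, the construction of $C_2$ in 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in the script above: the names `quantile` and `bootstrap_ci` are hypothetical, the data are simulated, $\hat{\theta}$ is taken to be the sample mean, and the bootstrap sample is drawn i.i.d. with replacement, which may differ from how the real analysis draws it.

```python
import random
import statistics


def quantile(values, p):
    """Empirical p-quantile, interpolating linearly between order statistics."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def bootstrap_ci(sample, estimator, alpha=0.05, n_boot=2000, seed=0):
    """Interval C_2 = [theta_hat - q*(1 - alpha/2), theta_hat - q*(alpha/2)]."""
    rng = random.Random(seed)
    n = len(sample)
    theta_hat = estimator(sample)
    # Bootstrap draws of T_n: estimate on a resample minus the original estimate.
    # Assumption: i.i.d. data, resampled with replacement.
    t_stars = [estimator(rng.choices(sample, k=n)) - theta_hat
               for _ in range(n_boot)]
    q_lo = quantile(t_stars, alpha / 2)
    q_hi = quantile(t_stars, 1 - alpha / 2)
    # Note the swap: the upper quantile sets the lower endpoint.
    return theta_hat - q_hi, theta_hat - q_lo


# Hypothetical data: 50 draws from a skewed (exponential) distribution.
data_rng = random.Random(1)
sample = [data_rng.expovariate(1.0) for _ in range(50)]
theta_hat = statistics.fmean(sample)
lower, upper = bootstrap_ci(sample, statistics.fmean)
print(f"theta_hat = {theta_hat:.3f}, C_2 = [{lower:.3f}, {upper:.3f}]")
```

Since $q_n^*(\alpha)=\tilde{q}_n^*(\alpha)-\hat{\theta}$, the interval $C_2$ is the reflection of Efron's interval about $\hat{\theta}$: both have the same width, but any asymmetry in the bootstrap distribution is applied on the opposite side.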