
CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Jacob Abernethy

Notes for Lecture 5 - Deviation Bounds

January 24, 2023

Scribes: Shaan Ul Haque, Neelkamal Bhuyan

Statistical Inequalities

Theorem (Markov's Inequality): For any non-negative random variable $X$ with $\mathbb{E}[X]>0$, $$P(X \geq t\mathbb{E}[X])\leq \frac{1}{t} \ \ \forall \ \ t>0$$ Proof (written for a continuous $X$ with density $f$; the general case is analogous):

$$ \begin{align*} \mathbb{E}[X]&=\int_{0}^\infty xf(x)dx\\ &\geq \int_{t\mathbb{E}[X]}^\infty xf(x)dx\\ &\geq t\mathbb{E}[X]\int_{t\mathbb{E}[X]}^\infty f(x)dx = t\mathbb{E}[X]P(X \geq t\mathbb{E}[X])\\ \Rightarrow \frac{1}{t}&\geq P(X \geq t\mathbb{E}[X]) \end{align*} $$
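As a quick numerical sanity check (not part of the lecture; the Exponential(1) distribution below is just an illustrative choice), a minimal simulation shows the empirical tail probability staying below $1/t$:

```python
import numpy as np

# Illustrative check of Markov's inequality (not from the lecture).
# Exponential(1) is a non-negative R.V. with E[X] = 1.
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)

for t in [2, 5, 10]:
    empirical = np.mean(X >= t * X.mean())  # estimate of P(X >= t * E[X])
    print(f"t = {t:2d}: empirical tail {empirical:.4f} <= Markov bound {1 / t:.4f}")
```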

Theorem (Chebyshev's Inequality): For any random variable $X$ such that $\sigma^2=\mathrm{var}(X)<\infty$, $$P(|X-\mathbb{E}[X]| \geq t\sigma)\leq \frac{1}{t^2} \ \ \forall \ \ t>0$$ Proof: Consider the R.V. $Y=(X-\mathbb{E}[X])^2 \geq 0$. Applying Markov's inequality to $Y$ (with $t^2$ in place of $t$), we get:

$$ \begin{align*} P(Y \geq t^2\mathbb{E}[Y])&\leq \frac{1}{t^2}\\ P((X-\mathbb{E}[X])^2 \geq t^2\mathbb{E}[(X-\mathbb{E}[X])^2])&\leq \frac{1}{t^2}\\ P((X-\mathbb{E}[X])^2 \geq t^2\sigma^2)&\leq \frac{1}{t^2}\\ \Rightarrow P(|X-\mathbb{E}[X]| \geq t\sigma)&\leq \frac{1}{t^2} & &\text{since both sides of the event are non-negative, we can take square roots} \\ \end{align*} $$
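Similarly (again an added illustration, with a standard normal chosen arbitrarily), one can check empirically that the two-sided tail stays below $1/t^2$, though the bound is typically far from tight:

```python
import numpy as np

# Illustrative check of Chebyshev's inequality (not from the lecture).
rng = np.random.default_rng(1)
X = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # sigma = 1

for t in [1.5, 2.0, 3.0]:
    empirical = np.mean(np.abs(X - X.mean()) >= t)  # P(|X - E[X]| >= t * sigma)
    print(f"t = {t}: empirical tail {empirical:.4f} <= Chebyshev bound {1 / t**2:.4f}")
```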

Definition 5.1 ($\sigma$-subgaussian R.V.s): Let $X$ be a R.V. with $\mathbb{E}[X]=0$ that satisfies, for all $\lambda \in \mathbb{R}$, $$\mathbb{E}[e^{\lambda X}] \leq e^{\lambda^2\sigma^2/2}$$ for some $\sigma>0$. Then we say that $X$ is $\sigma$-subgaussian.

Example: $X\sim\mathcal{N}(0,1)$ is 1-subgaussian.
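Indeed, the claim follows from the standard Gaussian MGF computation (worked out here for completeness by completing the square):

$$\mathbb{E}[e^{\lambda X}] = \int_{-\infty}^{\infty} e^{\lambda x}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx = e^{\lambda^2/2}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-(x-\lambda)^2/2}dx = e^{\lambda^2/2},$$

so the definition is satisfied (with equality) for $\sigma = 1$.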

Fact (Hoeffding's Lemma): For any R.V. $X \in [a,b]$ with $\mathbb{E}[X]=0$ we have: $$\mathbb{E}[e^{\lambda X}] \leq e^{\lambda^2(b-a)^2/8}$$ i.e., $X$ is $\frac{|b-a|}{2}$-subgaussian.
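As a quick special case (added for intuition; not proved in lecture), take a Rademacher variable $X$ that is $\pm 1$ with probability $1/2$ each, so $a = -1$, $b = 1$:

$$\mathbb{E}[e^{\lambda X}] = \tfrac{1}{2}e^{\lambda} + \tfrac{1}{2}e^{-\lambda} = \sum_{k\geq 0}\frac{\lambda^{2k}}{(2k)!} \leq \sum_{k\geq 0}\frac{(\lambda^2/2)^k}{k!} = e^{\lambda^2/2} = e^{\lambda^2(b-a)^2/8},$$

using $(2k)! \geq 2^k k!$, which matches the lemma's bound.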

Hoeffding's Theorem

Let $X_1, \ldots, X_n$ be independent random variables with $\mathbb{E}[X_i] = 0$ and $X_i \in [a_i,b_i]$. Then, for all $t > 0$, $$P \left (\sum_{i=1}^n X_i \geq t \right) \leq \exp \left(\frac{-2t^2}{\sum_i (b_i - a_i)^2} \right)$$

Proof: We use the Chernoff technique.

For any $\lambda > 0$ we have,

$$P\left(\sum_i X_i - \mathbb{E}[\sum_i X_i] \geq t\right) = P\left(\exp\left(\lambda \left(\sum_i X_i - \mathbb{E}[\sum_i X_i]\right)\right) \geq \exp(\lambda t)\right)$$

$$\leq \frac{\mathbb{E} \left[\exp\left(\lambda \left(\sum_i X_i - \mathbb{E}[\sum_i X_i]\right)\right)\right]}{\exp(\lambda t)}\quad (\text{Markov's Inequality})$$

$$= \frac{\mathbb{E} \left[\prod_i \exp\left(\lambda \left(X_i - \mathbb{E} X_i\right)\right)\right]}{\exp(\lambda t)}$$

$$= \frac{\prod_i\mathbb{E} \left[ \exp\left(\lambda \left(X_i - \mathbb{E} X_i\right)\right)\right]}{\exp(\lambda t)}\quad\quad(\text{Independence})$$

$$\leq \frac{\prod_i \exp \left (\lambda^2(b_i-a_i)^2/8 \right )}{\exp(\lambda t)}\quad\quad(\text{Hoeffding's Lemma})$$

$$= \exp \left (\frac{\lambda^2}{8}\sum_i (b_i-a_i)^2 - \lambda t\right )$$

$$\leq \exp \left(\frac{-2t^2}{\sum_i (b_i - a_i)^2} \right)$$

where the last inequality follows by choosing $\lambda$ to minimize the quadratic in the exponent, as worked out below.
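Concretely, writing $S = \sum_i (b_i-a_i)^2$, the exponent $\frac{\lambda^2}{8}S - \lambda t$ is a convex quadratic in $\lambda$ minimized at $\lambda^* = \frac{4t}{S} > 0$, which gives

$$\frac{(\lambda^*)^2}{8}S - \lambda^* t = \frac{2t^2}{S} - \frac{4t^2}{S} = -\frac{2t^2}{S},$$

matching the bound in the theorem statement.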

Combining Hoeffding and Union Bound

(NOTE: below was added after lecture by Prof. Abernethy)

It is very common that we want to use Hoeffding's Inequality combined with a Union Bound in order to show that large deviations are very unlikely even when the experiment is run a number of times. For example, let's say we have a fair coin (lands heads with probability 1/2) and we toss the coin $n$ times. What is the probability that the coin comes up tails at most 25% of the time? Let's call this a "bad event." We can bound this probability using Hoeffding. Let $X_i \in \{0, 1\}$ be the $i$-th coin toss, where 1 means heads and 0 means tails. Then

$$ \begin{align*} \text{Pr}(\text{num. heads } \geq 3n/4) &= \text{Pr}\left(\sum_{i=1}^n (X_i - 1/2) \geq n/4\right) \\ &\leq \exp\left( - 2 (n/4)^2 / n \right) = \exp(-n/8) \end{align*} $$

where we applied Hoeffding's Theorem with $X_i - 1/2 \in [-1/2, 1/2]$, so that $\sum_i (b_i - a_i)^2 = n$, and with $t = n/4$.
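As an added illustration (the values of $n$ and the number of trials below are arbitrary), a small simulation compares the empirical frequency of this bad event with the bound $e^{-n/8}$:

```python
import numpy as np

# Illustrative simulation (not from the lecture): frequency of "at least 3n/4
# heads" in n fair coin tosses, compared with the Hoeffding bound exp(-n/8).
rng = np.random.default_rng(0)
n, trials = 40, 200_000  # arbitrary illustrative values
heads = rng.integers(0, 2, size=(trials, n)).sum(axis=1)

empirical = np.mean(heads >= 3 * n / 4)
print(f"empirical = {empirical:.2e}, Hoeffding bound = {np.exp(-n / 8):.2e}")
```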

OK great, that's an exponentially small probability! But what if we repeat this experiment 1000 times, which means tossing the coin $1000n$ times? What's the likelihood that this bad event occurs at least once? Here, we can use the Union Bound in addition to Hoeffding! This, by the way, is an incredibly common trick, and we will use it many times in the course. Let the experiment number be $j$, so $j \in [1000]$, and let $X_i^j \in \{0, 1\}$ be the $i$-th coin toss in the $j$-th experiment. Now let's bound the probability that the bad event occurs at least once:

$$ \begin{align*} \text{Pr}(\text{num. heads } \geq 3n/4 \text{ for some } j \in [1000]) &= \text{Pr}\left(\exists j \in [1000] : \sum_{i=1}^n (X_i^j - 1/2) \geq n/4\right) \\ &\leq \sum_{j=1}^{1000} \text{Pr}\left(\sum_{i=1}^n (X_i^j - 1/2) \geq n/4\right) \\ &\leq \sum_{j=1}^{1000} \exp\left( - 2 (n/4)^2 / n \right) = 1000 \exp(-n/8) \end{align*} $$

Notice how we used the union bound: it lets us convert the "exists a $j$" into a sum over $j$. This may be a loose bound! But as long as the individual probabilities are very small (i.e., $\exp(-n/8)$), this is a very useful probability bounding technique.
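To get a feel for the numbers (an added illustration; the values of $n$ are arbitrary), the union-bound guarantee $1000\,e^{-n/8}$ becomes tiny very quickly:

```python
import numpy as np

# Illustration: how fast the union-bound guarantee 1000 * exp(-n/8) shrinks.
for n in [100, 200, 400]:
    print(f"n = {n}: Pr(bad event in any of 1000 experiments) <= {1000 * np.exp(-n / 8):.2e}")
```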
