CS7545_Sp24_Lecture_05: Probability Basics

CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Jacob Abernethy

Notes for Lecture 05

January 23, 2024

Scribes: Stephanie Henderson, Rishit Mohan Ahuja

Pre-recorded lecture on YouTube!

Basics of Probability

Definition 5.1 (Random Variable)

A random variable $X$ is a function from a sample space $\Omega$ to $\mathbb{R}$. Rigorously defining a random variable requires measure theory.

Definition 5.2 (Cumulative Distribution Function)

Given a random variable $X$, the cumulative distribution function (CDF) of $X$ is defined as

$F(x) = P(X \leq x)$.

Definition 5.3 (Probability Density Function)

Given a random variable $X$ with a differentiable CDF $F(x)$, the probability density function (PDF) of $X$ is defined as $f(x) = F'(x)$.

Note: it is possible to have a random variable without a PDF (when the CDF is not differentiable).

Definition 5.4 (Expectation)

Given a random variable $X$, we define the expectation $E[X]$ as follows:

$$E[X] = \int_{-\infty}^{+\infty} x \ dF = \int_{-\infty}^{+\infty} xf(x)dx$$

where $f$ is the PDF of $X$, so that $dF = f(x)\,dx$; the second integral applies when the PDF exists.

Note: $E$ is linear, meaning

$$E[\alpha X + \beta Y] = \alpha E[X] + \beta E[Y],$$

where $\alpha, \beta \in \mathbb{R}$ and $X$ and $Y$ are random variables. This is often referred to as the principle of linearity of expectation, and it holds even when $X$ and $Y$ are not independent.
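
As a quick sanity check, here is a minimal Monte Carlo sketch (using NumPy; the specific distributions and constants are arbitrary illustrative choices, not from the lecture) showing that linearity holds even when the two random variables are dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)  # arbitrary example distribution, E[X] = 1
Y = X ** 2                                      # Y depends on X -- linearity still holds
alpha, beta = 2.0, -3.0

lhs = np.mean(alpha * X + beta * Y)             # Monte Carlo estimate of E[aX + bY]
rhs = alpha * np.mean(X) + beta * np.mean(Y)    # aE[X] + bE[Y]
print(lhs, rhs)                                 # the two estimates agree up to sampling error
```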

Definition 5.5 (Independence)

We call two random variables $X, Y$ independent when

$$P[X \in A \wedge Y \in B] = P[X \in A]P[Y \in B]$$

for all measurable $A, B \subseteq \mathbb{R}$.

Fact: When two random variables $X$ and $Y$ are independent, $E[XY] = E[X]E[Y]$

Note: It should be carefully noted that $E[XY] = E[X]E[Y]$ does not generally imply that $X$ and $Y$ are independent.
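
A classical counterexample: take $X$ standard normal and $Y = X^2$. Then $E[XY] = E[X^3] = 0 = E[X]E[Y]$, yet $Y$ is a deterministic function of $X$. A short numerical sketch (NumPy; the distribution choice is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(2_000_000)
Y = X ** 2                     # deterministic function of X, so clearly not independent of X

print(np.mean(X * Y), np.mean(X) * np.mean(Y))  # both approximately 0: E[XY] = E[X]E[Y]
# Yet X and Y are dependent, e.g. knowing |X| < 1 forces Y < 1.
```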

Definition 5.6 (Variance): Variance measures the spread of a distribution; it is defined as $Var(X) := E[(X - E[X])^2]$.

Fact: If $X, Y$ are independent random variables then

$Var(X+Y) = Var(X) + Var(Y)$

The proof of this fact is straightforward: write $(X+Y) - E[X+Y] = (X - E[X]) + (Y - E[Y])$, square it, and note that the cross term has expectation $E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] = 0$ when $X$ and $Y$ are independent.
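
A short numerical illustration of this fact (NumPy; the distributions below are arbitrary choices with known variances):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)   # Var(X) = 1
Y = rng.uniform(0.0, 1.0, size=1_000_000)        # Var(Y) = 1/12, drawn independently of X

print(np.var(X + Y), np.var(X) + np.var(Y))      # both approximately 1 + 1/12
```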

Theorem (Markov's Inequality): For any non-negative random variable $X$,

$$P(X \geq tE[X]) \leq \frac{1}{t} \quad \forall t > 0$$

Motivation: The probability that a non-negative random variable takes a value far above its mean can be upper bounded, and the bound shrinks as the threshold moves farther from the mean. Example: if I eat $2$ burgers per day on average, then the probability that I eat at least $10$ burgers on a given day is at most $\frac{2}{10} = 20\%$.

Proof: We will show $P(X \geq t) \leq \frac{E[X]}{t}$.

This is an equivalent statement: when $E[X] \neq 0$, replacing $t$ with $\frac{t}{E[X]}$ in the theorem's statement yields exactly this form.

When $E[X] = 0$, non-negativity forces $X = 0$ almost surely, so $P(X \geq t) = 0$ and the bound holds trivially. Note that the substituted value $\frac{t}{E[X]}$ is still positive, because $X$ is non-negative and the case $E[X] = 0$ has already been handled, so $E[X] > 0$.

Let $Z_t := t\mathbb{1}_{[X \geq t]}$. (The indicator function is $1$ if the statement in brackets is true and $0$ otherwise.)

Note: $Z_t \leq X$. This is easy to show using the two cases below:

Case I) $X \in [0, t)$, $Z_t = 0$

Case II) $X \in [t, \infty)$, $Z_t = t$

Therefore,

$$E[X] \geq E[Z_t] = E[t\mathbb{1}_{[X \geq t]}] = tP(X \geq t)$$

The last equality holds because $\mathbb{1}_{[X \geq t]}$ is a Bernoulli random variable with $E[\mathbb{1}_{[X \geq t]}] = P(X \geq t)$. Dividing both sides by $t$ gives $P(X \geq t) \leq \frac{E[X]}{t}$, which completes the proof.
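
To see the bound in action, here is a minimal sketch (NumPy; the exponential distribution and thresholds are just example choices) comparing the empirical tail probability with the bound $\frac{E[X]}{t}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=1_000_000)   # non-negative, E[X] = 2

for t in [2.0, 5.0, 10.0]:
    empirical = np.mean(X >= t)                  # estimate of P(X >= t)
    bound = 2.0 / t                              # Markov's bound E[X] / t with E[X] = 2
    print(f"t={t}: P(X >= t) ~ {empirical:.4f} <= {bound:.4f}")
```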

Theorem (Chebyshev's Inequality): For any random variable $X$ such that $\sigma^2 = Var(X) < \infty$,

$$P(|X - E[X]| \geq t\sigma) \leq \frac{1}{t^2} \quad \forall t > 0$$

Motivation: This is the first true deviation bound we encounter: it controls how likely the random variable is to deviate from its expected value, and in that sense it resembles Markov's inequality. It says that the probability that $X$ deviates from its mean by more than $t$ multiples of $\sigma$ is at most $\frac{1}{t^2}$.

Proof: Let $\mu = E[X]$, $Z = (X - \mu)^2$

$$P(|X - \mu| \geq t\sigma) = P((X - \mu)^2 \geq t^2\sigma^2)$$

We can square both sides because both sides are non-negative: squaring is monotone on non-negative numbers, so the two events are identical.

$$P(Z \geq t^2\sigma^2) \leq \frac{E[Z]}{t^2\sigma^2} = \frac{\sigma^2}{t^2\sigma^2} = \frac{1}{t^2}$$

Here we used the equivalent form of Markov's inequality discussed in its proof, together with $E[Z] = E[(X - \mu)^2] = \sigma^2$.
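
A minimal numerical sketch of Chebyshev's inequality (NumPy; the standard normal is just an example distribution, with true mean $0$ and $\sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)               # example distribution with finite variance
mu, sigma = 0.0, 1.0                             # true mean and standard deviation

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(X - mu) >= t * sigma)    # estimate of P(|X - E[X]| >= t*sigma)
    print(f"t={t}: {empirical:.4f} <= {1 / t**2:.4f}")  # Chebyshev bound 1/t^2
```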

Definition 5.7 (Moment Generating Function): The moment generating function (MGF) of a random variable $X$ is

$M_X(t) = E[e^{tX}],$

provided this expectation exists for $t$ in some open neighborhood of $0$. The MGF, if it exists, uniquely determines the probability distribution of the random variable. It encapsulates all the moments (mean, variance, skewness, kurtosis, etc.) of a distribution in one function: by taking derivatives of the MGF and evaluating them at $t=0$, we can recover all the moments of the distribution.
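
As an illustration of recovering moments from the MGF, consider an Exponential random variable with rate $2$, whose MGF is $M(t) = \frac{2}{2-t}$ for $t < 2$ (this specific distribution is just an example, not from the lecture). Finite differences at $t = 0$ recover the mean and variance:

```python
# Sketch: approximate M'(0) and M''(0) by finite differences for an Exponential(rate=2),
# whose MGF is M(t) = 2 / (2 - t) for t < 2.  True moments: E[X] = 1/2, Var(X) = 1/4.
def M(t):
    return 2.0 / (2.0 - t)

h = 1e-4
first_moment = (M(h) - M(-h)) / (2 * h)               # central difference ~ M'(0)  = E[X]
second_moment = (M(h) - 2 * M(0.0) + M(-h)) / h**2    # second difference  ~ M''(0) = E[X^2]
print(first_moment, second_moment - first_moment**2)  # ~ 0.5 and ~ 0.25
```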

Definition 5.8 ($\sigma$-subgaussian):

Let $X$ be a random variable with $E[X] = 0$. If, for all $\lambda \in \mathbb{R}$,

$$E[e^{\lambda X}] \leq e^{\frac{\lambda^2\sigma^2}{2}}$$

for some $\sigma > 0$, then we say $X$ is $\sigma$-subgaussian.

In other words, a random variable is $\sigma$-subgaussian if its moment generating function is bounded by $e^{\frac{\lambda^2\sigma^2}{2}}$.

We use the word subgaussian because Gaussian random variables satisfy the above inequality with exact equality. Gaussians are known for very fast tail decay: as you move several multiples of the standard deviation away from the mean, the probability mass drops off very quickly. For our purposes, we tend to work not with Gaussians but with bounded random variables, which are very common in machine learning.

Theorem (Hoeffding's Lemma): It turns out that any bounded random variable is going to be subgaussian, and we see this in Hoeffding's Lemma. For any random variable $X \in [a, b]$ with $E[X] = 0$ we have: $$E[e^{\lambda X}] \leq e^{\frac{\lambda^2 (b - a)^2}{8}}$$

That is, $X$ is $\frac{|b - a|}{2}$-subgaussian; in this case $\sigma = \frac{|b - a|}{2}$. The assumption $E[X] = 0$ is merely a convenience and is not required for the lemma; it just makes the statement simpler (in general, one applies the lemma to $X - E[X]$).
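
As a concrete check (a sketch with NumPy; the Rademacher variable is just an illustrative choice), take $X = \pm 1$ with probability $\frac{1}{2}$ each, so $E[X] = 0$ and $[a, b] = [-1, 1]$, and the lemma says $X$ is $1$-subgaussian. Its MGF is exactly $\cosh(\lambda)$, which indeed stays below $e^{\lambda^2/2}$:

```python
import numpy as np

lambdas = np.linspace(-3.0, 3.0, 121)
mgf = np.cosh(lambdas)                          # exact E[exp(lambda*X)] for X = +/-1 w.p. 1/2
bound = np.exp(lambdas**2 * (1 - (-1))**2 / 8)  # Hoeffding's Lemma bound exp(lambda^2 (b-a)^2 / 8)
print(np.all(mgf <= bound))                     # True: the Rademacher variable is 1-subgaussian
```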

Theorem (Hoeffding's Inequality): Let $X_1, ..., X_n$ be independent random variables with $E[X_i] = 0$ and $X_i \in [a_i,b_i]$. Then for all $t > 0$: $$P\left(\sum\limits_{i=1}^{n}X_i \geq t\right) \leq \exp \left(\frac{-2t^2}{\sum_{i=1}^{n} (a_i - b_i)^2}\right)$$

Motivation: This is considered a core concept in statistical learning theory. We often assume that the random variables $X_1, ..., X_n$ are identically distributed. We prove a general case where the mean of each $X_i$ is $0$, implying $a_i \leq 0$ and $b_i \geq 0$. We want to show that the probability of the sum of random variables exceeding a threshold $t$ decreases exponentially with $t$. Our approach will involve techniques such as Hoeffding's Lemma, Markov's inequality, and exponential bounds. By treating these variables as independent and using a suitable tuning parameter $\lambda$, we derive what is known as the Chernoff bound.

Fun Fact: The professor was asked to speak at Chernoff's 98th birthday celebration on why the Chernoff bound is important in machine learning.

Proof: Let's analyze the probability that the sum of the random variables $X_i$ exceeds some threshold $t$. For any $\lambda > 0$, the following holds: $$Pr(\sum\limits_{i=1}^n X_i \geq t) = Pr(\exp(\lambda \sum\limits_{i=1}^n X_i) \geq \exp(\lambda t))$$

Here we applied the strictly increasing transformation $x \mapsto \exp(\lambda x)$ to both sides of the inequality, which preserves the event; the parameter $\lambda$ will be tuned later.

Now we want to apply Markov's inequality. Inside the probability we have a non-negative random variable, so we can treat $\exp(\lambda \sum\limits_{i=1}^n X_i)$ as the random variable in Markov's inequality.

$$Pr(\exp(\lambda \sum\limits_{i=1}^n X_i) \geq \exp(\lambda t)) \leq \frac{E[\exp(\lambda \sum\limits_{i=1}^n X_i)]}{\exp(\lambda t)} = \frac{E[\prod\limits_{i=1}^{n} \exp(\lambda X_i)]}{\exp(\lambda t)}$$

In the above, we rewrote the exponential of the sum as a product of exponentials.

Let's use the fact that the $\exp(\lambda X_i)$ are still independent, since they are functions of the independent $X_i$. Then the expectation of the product equals the product of the expectations.

$$\frac{E[\prod\limits_{i=1}^{n} \exp(\lambda X_i)]}{\exp(\lambda t)} = \exp(-\lambda t) \prod\limits_{i=1}^{n} E[\exp(\lambda X_i)]$$

Now we have $n$ expectations. We can apply Hoeffding's Lemma. Recall that Hoeffding's lemma gives us control over the moment generating function for $X_i$. So, we can plug in an upper bound for the moment generating function for $X_i$.

$$\exp(-\lambda t) \prod\limits_{i=1}^{n} E[\exp(\lambda X_i)] \leq \exp(-\lambda t) \prod\limits_{i=1}^{n} \exp\left(\frac{\lambda^2 (a_i - b_i)^2}{8}\right)$$

We can further rewrite

$$\exp(-\lambda t) \prod\limits_{i=1}^{n} \exp\left(\frac{\lambda^2 (a_i - b_i)^2}{8}\right) = \exp\left(\frac{\sum\limits_{i=1}^{n} (a_i - b_i)^2 \lambda^2}{8} - t\lambda\right)$$

Notice that we've rewritten the right hand side as $e$ raised to a quadratic function of $\lambda$. Recall that the chain of inequalities above holds for any $\lambda > 0$, so we may plug in any $\lambda$ we like. The best choice comes from minimizing the exponent: setting $\frac{d}{d\lambda}\left[\frac{\lambda^2 \sum_{i=1}^{n} (a_i - b_i)^2}{8} - t\lambda\right] = \frac{\lambda \sum_{i=1}^{n} (a_i - b_i)^2}{4} - t = 0$ gives

$$\lambda = \frac{4t}{\sum\limits_{i=1}^{n} (a_i - b_i)^2}$$

Then, by plugging in this value for $\lambda$, we find:

$$\exp(\frac{\sum\limits_{i=1}^{n} (a_i - b_i)^2 \lambda^2}{8} - t\lambda) = \exp \biggl( \frac{-2t^2}{\sum\limits_{i=1}^{n} (a_i - b_i)^2} \biggr)$$

This is what we wanted to prove! We found the right hand side of Hoeffding's Inequality through the above sequence of steps. The result is complex, and the professor recommends that you make sure you understand the proof and the key ideas that lead to this result. We will be using this inequality as a fact, but an understanding of the proof is helpful.
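
To get a feel for the bound, here is a minimal Monte Carlo sketch (NumPy; the uniform distribution, $n$, and $t$ are arbitrary example choices). With $X_i$ uniform on $[-1, 1]$ we have $E[X_i] = 0$ and $\sum_i (a_i - b_i)^2 = 4n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 200_000, 15.0
# Each row is one realization of (X_1, ..., X_n) with X_i uniform on [-1, 1].
sums = rng.uniform(-1.0, 1.0, size=(trials, n)).sum(axis=1)

empirical = np.mean(sums >= t)                 # estimate of P(sum_i X_i >= t)
hoeffding = np.exp(-2 * t**2 / (4 * n))        # bound exp(-2 t^2 / sum_i (a_i - b_i)^2)
print(f"P(sum >= {t}) ~ {empirical:.5f} <= {hoeffding:.5f}")
```

The bound is typically quite loose for moderate $t$, but the exponential decay in $t^2$ is exactly the behavior the theory needs.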

Homework Problem

Hoeffding's Inequality is often stated in a different way. Make sure to use Hoeffding's Inequality to prove this version.

Let $X_1, \ldots, X_m$ be $m$ independent random variables sampled from the same distribution $D$, where $D$ has support on $[-1,1]$ and the mean of $D$ is $\mu$. Then for some $\alpha, \beta, \gamma > 0$ we have the following statement: for any $\delta \in (0,1)$, with probability at least $1-\delta$,

$$ \left|\frac 1 m \sum_{i=1}^m X_i - \mu \right| \leq \alpha \sqrt{\frac{\log (\beta /\delta)}{\gamma m}}. $$

When you solve this problem, make sure to get the best values of $\alpha, \beta, \gamma$!
