CS7545_Sp24_Lecture_06: Binary Classification - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Jacob Abernethy, Tyler LaBonte

Notes for Lecture 06

January 25, 2024

Scribes: Daniel Zhang and Jade Lintott

The Connection to Machine Learning

Our first task is going to be to connect Hoeffding's inequality to machine learning. But to do that we need a couple concepts and a problem set-up.

Definition 6.1 Independent and Identically Distributed (IID): A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.

For example, $X_1,\ldots,X_n\overset{i.i.d.}{\sim}D$ means that $X_1,\ldots,X_n$ are mutually independent and drawn from the same distribution $D$. Note that the independence here is stronger than pairwise independence: a collection of random variables can be pairwise independent without being mutually independent.
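The gap between pairwise and mutual independence can be checked directly on a standard toy example (not from the lecture): two fair coin flips and their XOR. Each pair is independent, but the triple is not, since the third variable is determined by the first two.

```python
import itertools

# X, Y are fair coin flips; Z = X XOR Y. The four (X, Y) outcomes are
# equally likely, so probabilities are just counts over this outcome space.
outcomes = list(itertools.product([0, 1], repeat=2))

def prob(event):
    """Probability of an event over the 4 equally likely (x, y) outcomes."""
    return sum(1 for x, y in outcomes if event(x, y)) / len(outcomes)

# Pairwise independence: P(X=1, Z=1) = P(X=1) P(Z=1).
p_x1 = prob(lambda x, y: x == 1)
p_z1 = prob(lambda x, y: (x ^ y) == 1)
p_x1_z1 = prob(lambda x, y: x == 1 and (x ^ y) == 1)
assert p_x1_z1 == p_x1 * p_z1  # 0.25 == 0.5 * 0.5

# But not mutual independence: P(X=1, Y=1, Z=1) = 0, while the product
# of the marginals is 1/8.
p_all = prob(lambda x, y: x == 1 and y == 1 and (x ^ y) == 1)
assert p_all == 0.0
```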

Now we introduce a regime that is clearly related to sampling, and thus gets us closer to machine learning.

Definition 6.2 Sample mean:

$\hat{\mu}_n :=\frac{1}{n}\sum_{i=1}^n X_i$

Suppose a distribution $D$ has mean $\mu$ and variance $\sigma^2$, and we draw samples $X_1,\ldots,X_n\overset{i.i.d.}{\sim}D$.

An extremely natural question to ask is: how big is $|\hat{\mu}_n-\mu|$? In other words, how far is the mean we observed from the actual mean? We do the following calculation in an attempt to answer that question.

By Jensen's inequality applied to the convex function $f(z)=z^2$ and the random variable $Z=|\hat{\mu}_n-\mu|$,

$$f(\mathbb{E} [Z])\le \mathbb{E}[f(Z)]$$

which expands to

$$(\mathbb{E}[|\hat{\mu}_n-\mu|])^2\le \mathbb{E}[(\hat{\mu}_n-\mu)^2]$$

Taking square roots,

$$\begin{align*} \mathbb{E}[|\hat{\mu}_n-\mu|]&\le \sqrt{\mathbb{E}[(\hat{\mu}_n-\mu)^2]}\\\ &=\sqrt{\mathrm{Var}(\hat{\mu}_n)}\\\ &=\sqrt{\mathrm{Var}\left(\dfrac{1}{n}\sum_i X_i\right)}\\\ &=\sqrt{\dfrac{1}{n^2}\sum_i \mathrm{Var}(X_i)}=\sqrt{\dfrac{1}{n}\mathrm{Var}(X_1)}=\dfrac{\sigma}{\sqrt{n}}\\\ \end{align*}$$

So we see that the expected error varies with $O\left(\frac{1}{\sqrt{n}}\right)$. Now this is great in expectation, but what is the probability that we are close to that expectation? Without that, we may feel very uncertain about our estimates. For this, we can finally use Hoeffding's inequality to show that the tails decay very quickly. This means we don't expect the random variable to deviate very far from its mean.
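The $O\left(\frac{1}{\sqrt{n}}\right)$ decay can be observed empirically. Below is a minimal Monte Carlo sketch, assuming (as an arbitrary choice) samples uniform on $[0,1]$, so $\mu = 1/2$ and $\sigma = 1/\sqrt{12}$:

```python
import random
import statistics

# Estimate E|mu_hat_n - mu| for uniform samples on [0, 1] and check that
# the expected error sits below the Jensen bound sigma / sqrt(n).
random.seed(0)
mu, sigma = 0.5, (1 / 12) ** 0.5

def mean_abs_error(n, trials=2000):
    """Monte Carlo estimate of E|mu_hat_n - mu| over many resampled datasets."""
    errs = []
    for _ in range(trials):
        mu_hat = statistics.fmean(random.random() for _ in range(n))
        errs.append(abs(mu_hat - mu))
    return statistics.fmean(errs)

for n in [100, 400, 1600]:
    # Jensen bound from the derivation above: E|mu_hat_n - mu| <= sigma / sqrt(n)
    assert mean_abs_error(n) <= sigma / n ** 0.5
```

Quadrupling $n$ roughly halves the estimated error, matching the $1/\sqrt{n}$ rate.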

Taking $c$ as the correct constant (see HW) we get that

$$\mathbb{P}(|\underbrace{\dfrac{1}{n}\sum_i X_i}_{\hat{\mu}_n}-\mu|\ge\dfrac{t}{\sqrt{n}})\le c\exp(-\dfrac{n}{2}(\dfrac{t}{\sqrt{n}})^2)=c\exp(-t^2/2)$$

So we get that exponential drop-off in the tail that we want.
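This tail behavior can also be checked numerically. The sketch below again assumes uniform $[0,1]$ samples and takes $c = 2$ as a plausible constant (the exact constant is left to the homework), comparing the estimated deviation probability against $c\exp(-t^2/2)$:

```python
import math
import random

# Estimate P(|mu_hat_n - mu| >= t / sqrt(n)) for uniform [0, 1] samples and
# compare against the exponential tail c * exp(-t^2 / 2) with c = 2 assumed.
random.seed(0)
n, trials, mu = 200, 5000, 0.5

def tail_prob(t):
    """Fraction of trials where the sample mean deviates by at least t/sqrt(n)."""
    hits = 0
    for _ in range(trials):
        mu_hat = sum(random.random() for _ in range(n)) / n
        if abs(mu_hat - mu) >= t / math.sqrt(n):
            hits += 1
    return hits / trials

for t in [0.5, 1.0, 2.0]:
    assert tail_prob(t) <= 2 * math.exp(-t ** 2 / 2)
```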

Now we get to the undeniably machine learning part of this. Let $f_{\theta}:\chi\to[0,1]$. Here, we can think of $\theta$ as some arbitrary parameters, $\chi$ as the input space, and $f_{\theta}$ as our prediction function. We also define a loss function $l(\hat{y},y)$, bounded in $[0,1]$ (for example $|\hat{y}-y|$ or $(\hat{y}-y)^2$), which measures how far our predicted answer is from the true answer.

Let $(x_1,y_1),\ldots,(x_n,y_n)\overset{i.i.d.}{\sim} D$ and let $Z_i:=l(f_{\theta}(x_i),y_i)$ be the loss on one example. Then, we get that the test loss is

$$\hat{L}_n(\theta)=\dfrac{1}{n}\sum_{i=1}^n Z_i$$

However, the test loss only approximates the quantity we really want: the loss with respect to the actual distribution,

$$\overline{L}(\theta)=\mathbb{E}_{(x,y)\sim D}[l(f_{\theta}(x),y)]$$

Then, Hoeffding's inequality says that with probability $\ge 1-\delta$,

$$|\hat{L}_n(\theta)-\overline{L}(\theta)|\le O\left(\sqrt{\dfrac{\log 1/\delta}{n}}\right)$$
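This concentration is easy to see in simulation. The sketch below uses an assumed toy setup (not from the notes): a *fixed* threshold predictor with 0-1 loss on data whose label is determined by a different threshold, so the losses $Z_i$ are i.i.d. and the empirical loss concentrates around $\overline{L}(\theta)$:

```python
import math
import random

# Fixed predictor f_theta(x) = 1 if x >= theta else 0 (theta chosen before
# seeing any data), with 0-1 loss, on data where y = 1 iff x >= 0.3 and
# x is uniform on [0, 1]. The true loss is P(0.3 <= x < 0.5) = 0.2.
random.seed(0)
theta = 0.5
true_loss = 0.2

def empirical_loss(n):
    """Average 0-1 loss of the fixed threshold predictor on n fresh samples."""
    errs = 0
    for _ in range(n):
        x = random.random()
        y = 1 if x >= 0.3 else 0
        y_hat = 1 if x >= theta else 0
        errs += (y_hat != y)
    return errs / n

n, delta = 10000, 0.01
# Hoeffding-style radius sqrt(log(2/delta) / (2n)) for [0, 1]-bounded losses.
radius = math.sqrt(math.log(2 / delta) / (2 * n))
assert abs(empirical_loss(n) - true_loss) <= radius
```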

Now, as an illustrative example, we propose something wrong. What if we let $\theta=\hat{\theta}=\text{argmin}_{\theta\in\Theta}\dfrac{1}{n}\sum_i l(f_{\theta}(x_i),y_i)$? It looks right, since our $\theta$ was arbitrary, but it does not work. The issue is that the $Z_i=l(f_{\hat{\theta}}(x_i),y_i)$ are no longer independent, because $\hat{\theta}$ depends on all of the data, so we cannot use Hoeffding's inequality directly. (To see why, imagine you knew the loss on $x_1$ was actually pretty high. Then you would expect the loss on the other data points to be that much lower, since otherwise the predictor would not be fitting the data very well.) Moreover, it is no longer clear that $\mathbb{E}[\hat{L}_n(\hat{\theta})]=\overline{L}(\hat{\theta})$. So we need more than this to get the kinds of results we want.
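The failure of independence can be made concrete with a small assumed experiment: when labels are pure coin flips, every classifier has true loss exactly $1/2$, yet the classifier selected to minimize training loss looks much better than $1/2$ on the very data used to select it.

```python
import random

# Labels are fair coin flips independent of x, so no classifier can do
# better than true loss 1/2. We pick the empirical minimizer over many
# random hypotheses and watch its training loss dip well below 1/2.
random.seed(0)
n, num_hypotheses = 100, 2000
xs = list(range(n))
ys = [random.choice([-1, 1]) for _ in xs]

# Each "hypothesis" is just a random labeling of the n points.
hypotheses = [[random.choice([-1, 1]) for _ in xs] for _ in range(num_hypotheses)]

def train_loss(h):
    """Fraction of training points the labeling h gets wrong."""
    return sum(h[i] != ys[i] for i in xs) / n

best = min(hypotheses, key=train_loss)
# The selected hypothesis appears to beat chance on its own training data...
assert train_loss(best) < 0.45
# ...but on fresh labels from the same distribution it is back to ~1/2.
fresh = [random.choice([-1, 1]) for _ in xs]
fresh_loss = sum(best[i] != fresh[i] for i in xs) / n
assert abs(fresh_loss - 0.5) < 0.2
```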

Binary Classification and the Bayes Classifier

The ultimate goal is to show that $\hat{\theta}_n$ generalizes well; we will get to this in a later lecture. This is a difference between statistics and learning theory. In statistics, we would be happy, since we have a provably good estimate of the mean. But, as the example above shows, that is different from having a provably good predictor that we learned from the data.

To start, we will pare down our example from a general predictor to binary classification. Let $\chi$ be the input space. Let $D$ be a distribution on $\chi\times\{-1,1\}$. Let $h:\chi\to\{-1,1\}$ be a hypothesis. We then define the following:

Definition 6.3: Generalization Error

$$L(h)=\mathbb{P}_{(x,y)\sim D}(h(x)\ne y)=\mathbb{E}_{(x,y)\sim D}[\mathbb{1}(h(x)\ne y)]$$

where

$$ \mathbb{1}(h(x)\ne y) = \begin{cases} 1 & h(x)\ne y\\ 0 & \text{otherwise} \end{cases} $$

This gives a metric for how good our hypothesis is on average over the distribution $D$. It is a probability, so it always lies between $0$ and $1$: if the hypothesis gets everything wrong, the generalization error is $1$; if it gets everything right, the generalization error is $0$.
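On a finite distribution the expectation of the indicator can be computed exactly. The sketch below uses a small made-up distribution on $\chi\times\{-1,1\}$ (an assumed example, not from the notes):

```python
# Explicit finite distribution on chi x {-1, 1}: each entry is ((x, y), prob).
# chi = {0, 1}; probabilities sum to 1.
D = [((0, 1), 0.3), ((0, -1), 0.1), ((1, 1), 0.2), ((1, -1), 0.4)]

def h(x):
    """A hypothesis: predict 1 on x = 0 and -1 on x = 1."""
    return 1 if x == 0 else -1

# L(h) = E[1(h(x) != y)] = sum over outcomes of probability * indicator.
L_h = sum(p * (h(x) != y) for (x, y), p in D)
# h errs on (0, -1) with prob 0.1 and on (1, 1) with prob 0.2, so L(h) = 0.3.
assert abs(L_h - 0.3) < 1e-12
```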

We would like a hypothesis that gives low generalization error. If every $x$ yields a unique $y$, it is possible to get a generalization error of $0$. But if the same input can yield different labels, then the generalization error will necessarily be nonzero. We define the following to capture this idea about the uncertainty of the distribution:

Definition 6.4: Label Noise Function

$$\eta(x)=\mathbb{P}(y=1|x)$$

If we are in a deterministic setting, $\eta(x)$ will be $0$ or $1$ for all $x$.

Now it turns out, we already know the best possible predictor. For any distribution, the best binary classifier is the Bayes classifier:

Definition 6.5: Bayes Classifier

$$ h_{\text{Bayes}}(x) = \begin{cases} 1 & \text{if } \eta(x) \geq \frac{1}{2}\\ -1 & \text{if } \eta(x) < \frac{1}{2} \end{cases} $$

Definition 6.6: Bayes Error

The Bayes error is $L(h_{Bayes})$.

We can then translate our (yet unproved) statement about it being the best to:

The Bayes error is the minimum possible error achievable by any classifier.

In other words,

$$L(h_{Bayes})=\min_h L(h)$$

Now we will prove this. For that we will use the following fact:

Fact 6.7: Law of Total Expectation

$\mathbb{E}[Y]=\mathbb{E}_X[\mathbb{E}[Y\mid X]]$

Here we will do an example:

Let $Y$ be the number of pepperonis on pizza. Let $X$ be the brand of pizza. Then, the expected number of pepperonis is the sum over all brands of the expected number of pepperonis for each brand, times the probability we pick that brand: $\mathbb{E}[Y]=\sum_x\mathbb{E}[Y|X=x]\mathbb{P}(X=x)$
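The pepperoni example can be made numeric with made-up values (the brands, probabilities, and conditional means below are arbitrary choices):

```python
# Two brands with assumed selection probabilities and assumed conditional
# expected pepperoni counts, combined via the law of total expectation.
brands = {
    "A": {"prob": 0.6, "mean_pepperonis": 20.0},
    "B": {"prob": 0.4, "mean_pepperonis": 35.0},
}

# E[Y] = sum_x E[Y | X = x] * P(X = x) = 20 * 0.6 + 35 * 0.4 = 26.
expected = sum(b["mean_pepperonis"] * b["prob"] for b in brands.values())
assert abs(expected - 26.0) < 1e-9
```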

We now return to the proof that the Bayes classifier is optimal.

$$\begin{align*} L(h)&=\mathbb{E}[\mathbb{1}[h(x)\ne y]]\\\ &=\mathbb{E}_x[\mathbb{E}_y[\mathbb{1}[h(x)\ne y]|x]]\\\ &=\mathbb{E}_x(\mathbb{P}(y=1|x)\mathbb{1}[h(x)\ne 1]+\mathbb{P}(y=-1|x)\mathbb{1}[h(x)\ne -1])\\\ &=\mathbb{E}_x(\eta(x) \mathbb{1}[h(x)=-1]+(1-\eta(x))\mathbb{1}[h(x)=1]) \end{align*}$$

To minimize this, for each $x\in \chi$, we pick $h(x)\in\{1,-1\}$ so that

$$ h(x)= \begin{cases} 1 &\text{if }\eta(x)\ge 1-\eta(x)\\ -1 &\text{otherwise} \end{cases} $$

But this is exactly the Bayes classifier, which gives us the desired result.
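On a small finite input space we can verify the optimality of the Bayes classifier by brute force, enumerating every classifier $h:\chi\to\{-1,1\}$. The distribution below ($p_x$ and $\eta$ values) is an assumed example:

```python
import itertools

# Small assumed example: marginal p(x) and label noise function eta(x).
chi = [0, 1, 2]
p_x = {0: 0.5, 1: 0.3, 2: 0.2}
eta = {0: 0.9, 1: 0.4, 2: 0.5}

def L(h):
    """Generalization error via the formula from the proof:
    L(h) = E_x[ eta(x) 1[h(x) = -1] + (1 - eta(x)) 1[h(x) = 1] ]."""
    return sum(p_x[x] * (eta[x] if h[x] == -1 else 1 - eta[x]) for x in chi)

# Bayes classifier: predict 1 exactly when eta(x) >= 1/2.
h_bayes = {x: (1 if eta[x] >= 0.5 else -1) for x in chi}

# Enumerate all 2^|chi| classifiers and take the minimum error.
best = min(L(dict(zip(chi, labels)))
           for labels in itertools.product([1, -1], repeat=len(chi)))
assert abs(L(h_bayes) - best) < 1e-12  # no classifier beats the Bayes classifier
```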

Homework

In this problem we will study an example of the Bayes classifier in a binary classification setting where $\mathcal{X}=\mathbb{R}$. Let $Y=1$ with probability $1/2$ and $Y=-1$ otherwise. Then, we draw $X$ as follows: if $Y=1$ let $X\sim \mathcal{N}(\mu/2,\sigma^2)$ and if $Y=-1$ let $X\sim \mathcal{N}(-\mu/2,\sigma^2)$ where $\mathcal{N}(\mu,\sigma^2)$ is the Gaussian distribution with mean $\mu$ and variance $\sigma^2$.

  1. Graded. Recall that $\eta(x)=\Pr[Y=1|X=x]$. Show that $$\eta(x)=\frac{1}{1+\exp(\frac{-x\mu}{\sigma^2})}.$$ Hint. Use Bayes' rule.

  2. Graded. Compute an analytical expression for $h_{\textnormal{Bayes}}$ (i.e., substitute $\eta(x)$ and simplify the resulting expression).

  3. Graded. Compute the Bayes error $L(h_{\textnormal{Bayes}})$. You can leave your answer in terms of the Gaussian cdf $\Phi$.

  4. Graded. On which point(s) $x\in \mathbb{R}$ is $h_{\textnormal{Bayes}}$ most likely to make a mistake? Why?
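A Monte Carlo sketch of the homework setup can be used to sanity-check your answers numerically (the values of $\mu$ and $\sigma$ below are arbitrary choices). It classifies by thresholding the $\eta(x)$ given in part 1, without assuming the simplified form asked for in part 2:

```python
import math
import random

# Simulate the homework distribution: Y is a fair sign flip, and X is drawn
# from N(mu/2, sigma^2) if Y = 1 and N(-mu/2, sigma^2) if Y = -1. Classify
# with h_Bayes(x) = 1 iff eta(x) >= 1/2, using eta(x) from part 1.
random.seed(0)
mu, sigma, n = 2.0, 1.0, 100000

def eta(x):
    return 1.0 / (1.0 + math.exp(-x * mu / sigma ** 2))

errors = 0
for _ in range(n):
    y = 1 if random.random() < 0.5 else -1
    x = random.gauss(mu / 2 if y == 1 else -mu / 2, sigma)
    h = 1 if eta(x) >= 0.5 else -1
    errors += (h != y)

print("estimated Bayes error:", errors / n)
```

Comparing this estimate against your closed-form answer to part 3 (evaluated at the same $\mu$ and $\sigma$) is a good check.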
