CS7545_Sp23_Lecture_19: Prediction with Expert Advice - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Guanghui Wang

Notes for Lecture 19

March 14, 2023

Scribes: Brook Eyob, Yifan Wang

Prediction with Expert Advice

Framework

Suppose we have $N$ experts. Consider the following $T$-day game. On day $t \in [T]$:

  1. Each expert $i \in [N]$ predicts $z_t^i \in [0,1]$.
  2. Based on the experts' advice, the learner predicts $\hat y_t \in [0,1]$.
  3. The learner observes $y_t \in [0,1]$ and suffers the loss $\ell(\hat y_t, y_t)$.

Our goal is to use the experts' advice to minimize the total loss. Specifically, we have the following definition:

Definition 1: For the Prediction with Expert Advice problem, we define the algorithm loss

$$L_T^A :=\sum_{t=1}^{T}\ell(\hat y_t,y_t),$$

the expert loss

$$L_T^i :=\sum_{t=1}^{T}\ell(z_t^i,y_t),$$

and the performance measure

$$\text{Reg}_{T}^{A} := L_{T}^{A} - L_{T}^{i^{\star}}, \text{ where }i^{\star}:= \text{arg}\min_{i \in [N]}{L_{T}^{i}}.$$
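These three quantities are straightforward to compute from recorded predictions and outcomes. A minimal sketch (the absolute loss and the toy data below are our own choices, not from the lecture):

```python
# Compute the regret of Definition 1 from predictions and outcomes.
# Loss here is ell(p, y) = |p - y|; any loss mapping into [0, 1] works.

def regret(alg_preds, expert_preds, outcomes):
    """alg_preds[t] = y_hat_t, expert_preds[i][t] = z_t^i, outcomes[t] = y_t.
    Returns Reg_T = L_T^A - L_T^{i*}."""
    ell = lambda p, y: abs(p - y)
    L_alg = sum(ell(p, y) for p, y in zip(alg_preds, outcomes))
    L_experts = [sum(ell(z, y) for z, y in zip(zs, outcomes))
                 for zs in expert_preds]
    return L_alg - min(L_experts)

# Two experts over T = 3 days; expert 2 is perfect, so L_T^{i*} = 0.
reg = regret([0.5, 0.5, 0.5],
             [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
             [1.0, 1.0, 1.0])   # -> 1.5
```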

Our goal is to design an algorithm that achieves regret sublinear in $T$. For a completely general loss function $\ell$, no nontrivial guarantee is possible. Therefore, we make the reasonable assumption that $\ell(\hat y, y) : [0,1] \times [0,1] \rightarrow [0,1]$ is convex with respect to $\hat y$ (examples: $\ell(\hat y, y) = |\hat y - y|$, $(\hat y - y)^2, \dots$).

Remark: In class we only required $\ell$ to be a convex real-valued function. However, the assumption $\ell(\hat y_t, y_t) \in [0, 1]$ is critical for the proof of the EWA regret bound.

Follow-The-Leader (FTL)

We first look at the Follow-The-Leader algorithm:

Let $i_{t}^{\star} := \arg \min_{i \in [N]} L_{t-1}^{i}$. On day $t$, the FTL algorithm follows the current leader $i^{\star}_t$, i.e., it plays $\hat y_t = z_t^{i^{\star}_t}$.

The performance of the FTL algorithm can be bad, because it may suffer linear regret:

Consider an example with $N = 2$.

On day $1$, we have $\ell(z_1^1, y_1) = 0.51$ and $\ell(z_1^2, y_1) = 0.49$.

Starting from the second day, we have $\ell(z_t^1, y_t) = 0,\ \ell(z_t^2, y_t) = 1$ when $t$ is even, and $\ell(z_t^1, y_t) = 1,\ \ell(z_t^2, y_t) = 0$ when $t$ is odd.

It's easy to verify that $L_T^{FTL} \approx T$ (on each day after the first, FTL follows yesterday's leader, who then loses today), while $L_T^1 \approx \frac{T}{2}$, so the regret is roughly $\frac{T}{2}$.
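The example above can be simulated directly from the loss values (a sketch; the tie-breaking rule toward expert 1 is our own assumption, and any fixed rule gives the same linear-regret behavior):

```python
# Simulate FTL on the adversarial two-expert sequence from the text:
# day-1 losses (0.51, 0.49); afterwards expert 1 loses on odd days and
# expert 2 loses on even days, so FTL always follows yesterday's leader
# and that leader loses today.
def ftl_losses(T):
    L = [0.0, 0.0]                      # cumulative expert losses
    total = 0.0                         # FTL's cumulative loss
    for t in range(1, T + 1):
        if t == 1:
            day = (0.51, 0.49)
        elif t % 2 == 0:
            day = (0.0, 1.0)
        else:
            day = (1.0, 0.0)
        leader = 0 if L[0] <= L[1] else 1   # ties broken toward expert 1
        total += day[leader]
        L[0] += day[0]
        L[1] += day[1]
    return total, min(L)

L_ftl, L_best = ftl_losses(1000)   # L_ftl ~ 999.5, L_best ~ 499.5
```

As predicted, FTL's loss grows like $T$ while the best expert's loss grows like $T/2$, so the regret is linear in $T$.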

Follow-The-Leader is unstable since it is a greedy algorithm and is sensitive to "leader changes". We need a smoother algorithm that is not as sensitive.

Exponential Weights Algorithm (EWA)

EWA is less sensitive to changes in leader and provides sub-linear regret. We first introduce the algorithm:

Let $w_t^i \geq 0$ denote the weight assigned to expert $i$ on day $t$, and define the total weight

$$ W_t := \sum_{j=1}^{N}w_{t}^{j}. $$

EWA Algorithm:

Initialization: $w_1 = [1,\dots,1]$

For t = $1,\dots,T$ do

  1. Each expert predicts $z_t^i \in [0,1]$
  2. Learner plays

$$ \hat y_t := \sum_{i=1}^{N}\frac{w_{t}^{i}}{W_t} \cdot z_{t}^{i}. $$

  3. Learner observes $y_t\in[0,1]$ and updates $w_{t+1}^{i} = w_{t}^{i} \cdot \exp(-\eta \ell(z_{t}^{i},y_{t}))$ for all $i \in [N]$.

End For

The last update can be applied recursively to obtain $$w_{t+1}^{i} = w_{1}^{i} \cdot \exp\left(-\eta \sum_{j=1}^{t}\ell(z_{j}^{i},y_{j}) \right)$$
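The three steps above admit a short implementation (a minimal sketch with the absolute loss; the step size $\eta$ and the example data are placeholders):

```python
import math

def ewa(expert_preds, outcomes, eta):
    """expert_preds[t][i] = z_t^i and outcomes[t] = y_t, all in [0, 1].
    Returns the list of EWA predictions y_hat_1, ..., y_hat_T."""
    ell = lambda p, y: abs(p - y)        # any convex loss into [0, 1] works
    w = [1.0] * len(expert_preds[0])     # w_1 = [1, ..., 1]
    preds = []
    for z_t, y_t in zip(expert_preds, outcomes):
        W = sum(w)                       # W_t
        preds.append(sum(wi / W * zi for wi, zi in zip(w, z_t)))
        w = [wi * math.exp(-eta * ell(zi, y_t)) for wi, zi in zip(w, z_t)]
    return preds

# Expert 2 always predicts the true outcome, so its weight dominates and
# the EWA predictions move from 0.5 (uniform weights) toward 1.
preds = ewa([[0.0, 1.0]] * 50, [1.0] * 50, eta=0.5)
```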

Now we show that the EWA algorithm has good performance:

Theorem 2: $\forall i \in [N]$, we have $$L_T^{EWA} \leq \frac{\log N + \eta L_T^i}{1 - \exp(-\eta)}$$

Following Theorem 2, we have

Corollary 3: By carefully choosing the parameter $\eta$,

$$ \text{Reg}_T^{EWA} = L_T^{EWA} - L_T^{i^{\star}} \leq \log N + \sqrt{2L_T^{i^{\star}} \log N} = O(\sqrt{T \log N}). $$
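The lecture does not spell out the tuning; one standard choice (an assumption on our part, following the usual analysis of this bound) is $\eta = \log\big(1 + \sqrt{2 \log N / L_T^{i^{\star}}}\big)$. The following sketch numerically checks that, with this $\eta$, the Theorem 2 bound is at most $L_T^{i^{\star}} + \log N + \sqrt{2 L_T^{i^{\star}} \log N}$:

```python
import math

def theorem2_bound(L_star, N, eta):
    # Upper bound on L_T^{EWA} from Theorem 2, with L_T^i = L_star.
    return (math.log(N) + eta * L_star) / (1 - math.exp(-eta))

def corollary3_bound(L_star, N):
    # L_star plus the regret bound claimed in Corollary 3.
    return L_star + math.log(N) + math.sqrt(2 * L_star * math.log(N))

for L_star in [0.01, 1.0, 10.0, 100.0, 1000.0]:
    for N in [2, 10, 100]:
        eta = math.log(1 + math.sqrt(2 * math.log(N) / L_star))
        assert theorem2_bound(L_star, N, eta) <= corollary3_bound(L_star, N)
```

Note that this tuning requires knowing $L_T^{i^{\star}}$ in advance; a doubling trick can remove that assumption.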

Now we prove Theorem 2:

Proof of Theorem 2: Define potential function

$$ \Phi(t) := -\log W_t = -\log \big(\sum_{i = 1}^N w_t^i\big). $$

The key idea of the proof is the following claim:

Claim 4: For all $t \in [T]$, we have

$$ \ell(\hat y_t, y_t) \leq (\Phi(t+1) - \Phi(t)) \cdot \frac{1}{1 - \exp(-\eta)}. $$

Assume Claim 4 is true. Then summing the inequality above over all $t \in [T]$ and telescoping, we have

$$ \begin{aligned} L_T^{EWA} = \sum_{t = 1}^T \ell(\hat y_t, y_t) &\leq \frac{1}{1 - \exp(-\eta)} \sum_{t = 1}^T (\Phi(t+1) - \Phi(t)) \\ &= \frac{1}{1 - \exp(-\eta)} \cdot (\Phi(T+1) - \Phi(1)). \end{aligned} $$

We plug in the following two facts and continue the proof (the inequality below holds for any fixed $i \in [N]$ since all weights are nonnegative):

$$ \begin{aligned} -\Phi(1) &= \log \left( \sum_{i=1}^{N}w_{1}^{i} \right)= \log{N}, \\ \Phi(T+1) &= -\log \left( \sum_{i=1}^{N}w_{T+1}^{i} \right) \leq -\log{w_{T+1}^{i}}. \\ \end{aligned} $$

Then

$$ \begin{aligned} L_T^{EWA} &\leq \frac{1}{1 - \exp(-\eta)} \cdot \left(\log N - \log(w_{T+1}^i) \right) \\ &= \frac{1}{1 - \exp(-\eta)} \cdot \left(\log N - \log\left(w_1^i \cdot \exp\left(-\eta \cdot \sum_{t = 1}^T \ell(z_t^i, y_t)\right)\right) \right) \\ &= \frac{\log N + \eta L_T^i}{1 - \exp(-\eta)}. \end{aligned} $$

Therefore, it only remains to prove Claim 4.

Proof of Claim 4: It suffices to show that $\Phi(t+1) - \Phi(t) \geq (1 - e^{-\eta}) \cdot \ell(\hat y_t, y_t)$. By the definition of $\Phi(t)$, we have

$$ \begin{aligned} \Phi(t+1) - \Phi(t) &= -\log \left(\frac{ W_{t+1}}{W_t}\right) \\ &= -\log \left(\sum_{i = 1}^N \frac{ w_{t}^i \cdot \exp(-\eta \cdot \ell(z_t^i, y_t))}{W_t}\right) && (\text{Step 3 of EWA})\\ \\ &= -\log \left(\sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot \exp(-\eta \cdot \ell(z_t^i, y_t))\right). \end{aligned} $$

To continue the proof, we introduce the following inequality: For all $x \in [0, 1]$ and $s \in \mathbb{R}$, we have

$$ e^{sx} \leq x\cdot e^s + (1-x) \cdot e^0 = 1 + (e^s - 1) x, $$

where the inequality follows from the convexity of $x \mapsto e^{sx}$: on $[0, 1]$, the function lies below the chord connecting its values at $x = 0$ and $x = 1$. Now applying this inequality to each $\ell(z_t^i, y_t)$, we have

$$ \begin{aligned} \Phi(t+1) - \Phi(t) &= -\log \left(\sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot \exp(-\eta \cdot \ell(z_t^i, y_t))\right) \\ \\ & \geq -\log \left(\sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot \left(1 + (e^{-\eta} - 1) \ell(z_t^i, y_t)\right)\right) \\ \\ & = -\log \left(1 + (e^{-\eta} - 1) \cdot \sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot \ell(z_t^i, y_t)\right) \\ & \geq (1 - e^{-\eta}) \cdot \sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot \ell(z_t^i, y_t) && (\log(1+x) \leq x) \\ &\geq (1 - e^{-\eta}) \cdot \ell\left(\sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot z_t^i, y_t \right) &&(\text{Jensen's inequality}) \\ &= (1 - e^{-\eta})\cdot \ell(\hat y_t, y_t). \end{aligned} $$
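As a sanity check on this chain of inequalities, the per-round statement of Claim 4 can be tested numerically (a sketch with the absolute loss and random weights; all variable names are ours):

```python
import math, random

# Per-round check of Claim 4 with the absolute loss: for arbitrary positive
# weights w_t^i, predictions z_t^i, and outcome y_t,
#   ell(y_hat_t, y_t) <= (Phi(t+1) - Phi(t)) / (1 - e^{-eta}).
random.seed(1)
eta, N = 0.5, 5
for _ in range(1000):
    w = [random.uniform(0.1, 2.0) for _ in range(N)]   # current weights w_t
    z = [random.random() for _ in range(N)]            # expert predictions
    y = random.random()                                # outcome y_t
    W = sum(w)
    y_hat = sum(wi / W * zi for wi, zi in zip(w, z))   # EWA prediction
    w_next = [wi * math.exp(-eta * abs(zi - y)) for wi, zi in zip(w, z)]
    d_phi = math.log(W) - math.log(sum(w_next))        # Phi(t+1) - Phi(t)
    assert abs(y_hat - y) <= d_phi / (1 - math.exp(-eta)) + 1e-12
```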

Remark: In class we used a different way to prove Claim 4, by introducing the following claim without proof: given a random variable $X \in [0, 1]$ and $s \in \mathbb{R}$, we have

$$ \log (\mathbb{E}[\exp(sX)]) \leq (e^s - 1) \cdot \mathbb{E}[X]. $$

In fact this inequality is implicitly proved in the lecture notes, but only for the special random variable $X_t$ with $X_t = \ell(z_t^i, y_t)$ w.p. $\frac{w_{t}^i}{W_t}$. Generalizing the procedure gives a formal proof of the claim: for a random variable $X \in [0, 1]$ and $s \in \mathbb{R}$, the convexity of $e^{sX}$ gives

$$ e^{sX} \leq 1 + (e^s - 1)X. $$

Taking the expectation of both sides, we have

$$ \mathbb{E}[e^{sX}] \leq 1 + (e^s - 1)\mathbb{E}[X]. $$

Therefore,

$$ \begin{aligned} \log(\mathbb{E}[e^{sX}]) &\leq \log (1 + (e^s - 1)\mathbb{E}[X]) \\ &\leq (e^s - 1)\mathbb{E}[X]. \end{aligned} $$
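The generalized claim is easy to probe numerically over random distributions supported on $[0, 1]$ (a sketch; the discrete distributions below are arbitrary choices of ours):

```python
import math, random

# Numeric probe of log(E[exp(sX)]) <= (e^s - 1) * E[X] for X in [0, 1].
random.seed(0)
for _ in range(1000):
    s = random.uniform(-3.0, 3.0)
    xs = [random.random() for _ in range(5)]       # support points in [0, 1]
    ps = [random.random() for _ in range(5)]
    Z = sum(ps)
    ps = [p / Z for p in ps]                       # a probability vector
    lhs = math.log(sum(p * math.exp(s * x) for p, x in zip(ps, xs)))
    rhs = (math.exp(s) - 1) * sum(p * x for p, x in zip(ps, xs))
    assert lhs <= rhs + 1e-12
```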
