CS7545_Sp24_Lecture_18 - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Guanghui Wang

Notes for Lecture 18: Online Learning

March 26, 2024

Scribes: Anton Lavrouk, Wentao Yue

Framework

Assume we have $N$ experts.

For $t = 1, \ldots, T$, do:

  1. Each expert $i \in \{1, \ldots, N\}$ predicts $z_t^i \in [0,1]$

  2. Learner predicts $\hat{y_t} \in [0,1]$

  3. The environment reveals $y_t \in [0,1]$, and the learner suffers loss $l\left(\hat{y_t}, y_t\right)$

End-for

Goal: Minimize the regret $R_T$: $$R_T = L_T - \min_{i \in [N]}L_T^i$$

where

$$L_T = \sum_{t=1}^T l\left(\hat{y_t}, y_t\right), \qquad L_T^i = \sum_{t=1}^T l\left(z_t^i, y_t\right).$$

We want to achieve sublinear regret, $R_T = o(T)$. If this is the case, then

$$\frac{R_T}{T} = \frac{1}{T}L_T - \frac{1}{T}\min_{i \in [N]}L_T^i \longrightarrow 0 \quad \text{as } T \rightarrow \infty,$$

i.e., the learner's average loss approaches that of the best expert in hindsight. We also make one assumption: $l\left(\hat{y}, y\right)$ is convex in its first argument $\hat{y}$.
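
As a concrete illustration of this protocol (not part of the original notes), here is a minimal Python sketch that runs the loop for a generic learner and computes the regret $R_T$; the absolute loss $l(\hat{y}, y) = |\hat{y} - y|$ and the simple averaging learner are assumed, illustrative choices.

```python
import numpy as np

def run_protocol(expert_preds, outcomes, learner):
    """expert_preds: (T, N) array of z_t^i in [0, 1]; outcomes: (T,) array of y_t;
    learner: callable mapping the current expert predictions to y_hat_t in [0, 1]."""
    loss = lambda y_hat, y: abs(y_hat - y)    # an example convex, bounded loss
    T, N = expert_preds.shape
    L_T = 0.0                                 # learner's cumulative loss
    L_T_i = np.zeros(N)                       # each expert's cumulative loss
    for t in range(T):
        y_hat = learner(expert_preds[t])      # learner predicts
        L_T += loss(y_hat, outcomes[t])       # learner suffers loss
        L_T_i += np.abs(expert_preds[t] - outcomes[t])   # experts suffer losses
    return L_T - L_T_i.min()                  # regret R_T

# Placeholder learner: uniform average of the expert predictions.
rng = np.random.default_rng(0)
Z, y = rng.uniform(size=(50, 5)), rng.uniform(size=50)
print("R_T =", run_protocol(Z, y, learner=lambda z: z.mean()))
```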

Follow the Leader (FTL)

Define $i_t^* := \arg\min_{i \in [N]} L_{t-1}^i$, the expert with the smallest cumulative loss so far. Follow the Leader (FTL) simply follows this expert, so $\hat{y_t} = z_t^{i_t^*}$.

Unfortunately, FTL suffers linear regret:

Consider an example with $N = 2$. For the first day, we have $l(z_1^1, y_1)=0.51$ and $l(z_1^2, y_1) = 0.49$.

Starting from the second day, the losses oscillate: $l(z_t^1, y_t) = 0$ and $l(z_t^2, y_t) = 1$ when $t$ is even, and the other way around when $t$ is odd. On every round $t \geq 2$, the current leader is exactly the expert that suffers loss $1$ on that round, so FTL's cumulative loss grows like $T$ while the best expert's grows like $T/2$. Hence $R_T \approx T/2$, which is linear in $T$, not sublinear.
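
As a sanity check (not from the notes), here is a small Python simulation of this two-expert example; it works directly with the per-round losses described above rather than with predictions and outcomes, and ties on the first round are broken toward expert 1.

```python
import numpy as np

# Hypothetical illustration: simulate FTL on the two-expert counterexample above.
T = 10_000
losses = np.zeros((T, 2))
losses[0] = [0.51, 0.49]                                                 # day 1
losses[1:, 0] = [0.0 if t % 2 == 0 else 1.0 for t in range(2, T + 1)]    # expert 1
losses[1:, 1] = [1.0 if t % 2 == 0 else 0.0 for t in range(2, T + 1)]    # expert 2

cum = np.zeros(2)        # cumulative losses L_{t-1}^i
learner_loss = 0.0
for t in range(T):
    leader = int(np.argmin(cum))         # i_t^* = argmin_i L_{t-1}^i
    learner_loss += losses[t, leader]    # FTL suffers the leader's loss
    cum += losses[t]

regret = learner_loss - cum.min()
print(f"T = {T}, FTL regret = {regret:.1f} (roughly T/2 = {T / 2:.1f})")
```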

Thus, we need to explore other options.

Exponential Weights Algorithm (EWA)

Idea

Maintain a score (weight) for each expert. In FTL, the learner effectively scores experts by their cumulative loss and follows only the single best one. Here we do something smoother: $\hat{y_t}$ is a weighted combination of all expert predictions $z_t^i$, so every expert contributes to the final decision, with better-performing experts contributing more.

Algorithm

Initialize: For each expert $i$, set its initial weight $w_1^i = 1$.

For $t = 1, \ldots, T$, do:

  1. Each expert predicts $z_t^i \in [0,1]$.

  2. Learner predicts $$\hat{y_t} = \sum_{i=1}^N \dfrac{w_t^i}{W_t}z_t^i, \quad \text{where} \quad W_t = \sum_{i=1}^N w_t^i.$$

  3. Submit the decision $\hat{y_t}$, observe $y_t$, and suffer the loss $l(\hat{y_t}, y_t)$.

  4. Update weights: $$\forall i \in [N], w_{t+1}^i = w_t^i \exp(-\eta \cdot l(z_t^i, y_t)).$$

You may notice that since the weight calculation is recursive, it has a closed form, which is

$$w_{t+1}^i = w_1^i \cdot \exp \left( -\eta\sum_{j=1}^t l\left( z_j^i, y_j \right) \right)$$
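
To make the algorithm concrete, here is a minimal Python sketch of EWA (an illustration, not from the notes); the absolute loss $l(\hat{y}, y) = |\hat{y} - y|$ and the step size $\eta = \sqrt{2\log N / T}$ used in the usage example are assumed choices.

```python
import numpy as np

def ewa(expert_preds, outcomes, eta):
    """Exponential Weights sketch.  expert_preds: (T, N) array of z_t^i in [0, 1];
    outcomes: (T,) array of y_t in [0, 1]; eta: learning rate."""
    T, N = expert_preds.shape
    w = np.ones(N)                                        # w_1^i = 1
    total_loss = 0.0
    for t in range(T):
        y_hat = np.dot(w / w.sum(), expert_preds[t])      # weighted prediction
        total_loss += abs(y_hat - outcomes[t])            # l(y_hat_t, y_t)
        expert_losses = np.abs(expert_preds[t] - outcomes[t])
        w *= np.exp(-eta * expert_losses)                 # w_{t+1}^i = w_t^i e^{-eta l}
    return total_loss, w

# Usage example with random experts and binary outcomes (illustrative only).
rng = np.random.default_rng(0)
T, N = 1000, 10
Z = rng.uniform(size=(T, N))
y = rng.integers(0, 2, size=T).astype(float)
L_ewa, _ = ewa(Z, y, eta=np.sqrt(2 * np.log(N) / T))
best = np.abs(Z - y[:, None]).sum(axis=0).min()
print(f"EWA loss = {L_ewa:.1f}, best expert = {best:.1f}, regret = {L_ewa - best:.1f}")
```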

Theorem

For Exponential Weights, we have, for all $i \in [N]$, $$L_T \leq \frac{\eta L_T^i + \log N}{1 - e^{- \eta}}.$$ By tuning $\eta$ appropriately, $$L_T - L_T^i \leq \sqrt{T\log N} + \log N = \mathcal{O}(\sqrt{T \log N}).$$

Proof

This proof was taken from Spring 2023 Lecture 19 scribed by Brook Eyob and Yifan Wang.

Theorem 2: $\forall i \in [N]$, we have $$L_T^{EWA} \leq \frac{\log N + \eta L_T^i}{1 - \exp(-\eta)}$$

From Theorem 2, we have the following corollary.

Corollary 3: By carefully choosing the parameter $\eta$,

$$ \text{Reg}_T^{EWA} = L_T^{EWA} - L_T^{i^{\star}} \leq \log N + \sqrt{2L_T^{i^{\star}} \log N} = O(\sqrt{T \log N}). $$
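
As a sketch (not in the original notes) of how a bound of this order follows from Theorem 2, one can use the elementary inequalities $e^{-\eta} \leq 1 - \eta + \frac{\eta^2}{2}$ and $\frac{1}{1 - \eta/2} \leq 1 + \eta$, both valid for $\eta \in (0, 1]$; the resulting constant is slightly looser than the one in Corollary 3, which requires a more careful choice of $\eta$:

$$ \begin{aligned} L_T^{EWA} \leq \frac{\log N + \eta L_T^i}{1 - e^{-\eta}} &\leq \frac{\log N + \eta L_T^i}{\eta\left(1 - \eta/2\right)} \leq (1 + \eta)\left(\frac{\log N}{\eta} + L_T^i\right) \\ &= L_T^i + \eta L_T^i + \frac{\log N}{\eta} + \log N \leq L_T^i + \eta T + \frac{\log N}{\eta} + \log N, \end{aligned} $$

where the last step uses $L_T^i \leq T$ (losses lie in $[0,1]$). Taking $\eta = \sqrt{\log N / T}$ (assuming $T \geq \log N$, so that $\eta \leq 1$) yields

$$ L_T^{EWA} - L_T^i \leq 2\sqrt{T \log N} + \log N = O(\sqrt{T \log N}). $$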

Now we prove Theorem 2:

Proof of Theorem 2: Define potential function

$$ \Phi(t) := -\log W_t = -\log \big(\sum_{i = 1}^N w_t^i\big). $$

The key idea of the proof is the following claim:

Claim 4: For all $t \in [T]$, we have

$$ l(\hat y_t, y_t) \leq(\Phi(t+1) - \Phi(t)) \cdot \frac{1}{1 - \exp(-\eta)}. $$

Assume Claim 4 is true. Then summing the inequality above for all $t \in [T]$, we have

$$ \begin{aligned} L_T^{EWA} = \sum_{t = 1}^T l(\hat y_t, y_t) &\leq \frac{1}{1 - \exp(-\eta)} \sum_{t = 1}^T (\Phi(t+1) - \Phi(t)) \\ &= \frac{1}{1 - \exp(-\eta)} \cdot (\Phi(T+1) - \Phi(1)) \\ \end{aligned} $$

We will plug in the following and continue the proof

$$ \begin{aligned} -\Phi(1) &= \log \left( \sum_{i=1}^{N}w_{1}^{i} \right)= \log{N}, \\ \Phi(T+1) &= -\log \left( \sum_{i=1}^{N}w_{T+1}^{i} \right) \leq -\log{w_{T+1}^{i}}. \\ \end{aligned} $$

Then

$$ \begin{aligned} L_T^{EWA} &\leq \frac{1}{1 - \exp(-\eta)} \cdot \left(\log N - \log(w_{T+1}^i) \right) \\ &= \frac{1}{1 - \exp(-\eta)} \cdot \left(\log N - \log\left(w_1^i \cdot \exp\left(-\eta \cdot \sum_{t = 1}^T l(z_t^i, y_t)\right)\right) \right) \\ &= \frac{\log N + \eta L_T^i}{1 - \exp(-\eta)}. && (w_1^i = 1) \end{aligned} $$

Therefore, it only remains to prove Claim 4.

Proof of Claim 4: It's sufficient to show that $\Phi(t+1) - \Phi(t) \geq (1 - e^{-\eta}) \cdot l(\hat y_t, y_t)$. According to the definition of $\Phi(t)$, we have

$$ \begin{aligned} \Phi(t+1) - \Phi(t) &= -\log \left(\frac{ W_{t+1}}{W_t}\right) \\ &= -\log \left(\sum_{i = 1}^N \frac{ w_{t}^i \cdot \exp(-\eta \cdot l(z_t^i, y_t))}{W_t}\right) && (\text{Step 4 of EWA})\\ &= -\log \left(\sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot \exp(-\eta \cdot l(z_t^i, y_t))\right). \end{aligned} $$

To continue the proof, we introduce the following inequality: For all $x \in [0, 1]$ and $s \in \mathbb{R}$, we have

$$ e^{sx} \leq x\cdot e^s + (1-x) \cdot e^0 = 1 + (e^s - 1) x, $$

where the inequality follows from the convexity of the exponential function, since $sx = (1-x) \cdot 0 + x \cdot s$ is a convex combination of $0$ and $s$. Now, applying this inequality with $x = l(z_t^i, y_t)$ and $s = -\eta$, we have

$$ \begin{aligned} \Phi(t+1) - \Phi(t) &= -\log \left(\sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot \exp(-\eta \cdot l(z_t^i, y_t))\right) \\ \\ & \geq -\log \left(\sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot \left(1 + (e^{-\eta} - 1) l(z_t^i, y_t)\right)\right) \\ \\ & = -\log \left(1 + (e^{-\eta} - 1) \cdot \sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot l(z_t^i, y_t)\right) \\ & \geq (1 - e^{-\eta}) \cdot \sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot l(z_t^i, y_t) && (\log(1+x) \leq x) \\ &\geq (1 - e^{-\eta}) \cdot l\left(\sum_{i = 1}^N \frac{w_{t}^i}{W_t} \cdot z_t^i, y_t \right) &&(\text{Jensen's Inequality}) \\ &= (1 - e^{-\eta})\cdot l(\hat y_t, y_t). \end{aligned} $$

Remark: In class we used a different way to prove Claim 4, by introducing the following claim without proof: given a random variable $X \in [0, 1]$ and $s \in \mathbb{R}$, we have

$$ \log (\mathbb{E}[\exp(sX)]) \leq (e^s - 1) \cdot \mathbb{E}[X]. $$

In fact, this inequality is implicitly proved in the argument above, but only for the particular random variable $X_t$ that takes the value $l(z_t^i, y_t)$ with probability $\frac{w_{t}^i}{W_t}$. Generalizing the procedure gives a formal proof of the claim: for a random variable $X \in [0, 1]$ and $s \in \mathbb{R}$, the convexity of the exponential function gives

$$ e^{sX} \leq 1 + (e^s - 1)X. $$

Taking the expectation on both sides, we have

$$ \mathbb{E}[e^{sX}] \leq 1 + (e^s - 1)\mathbb{E}[X]. $$

Therefore,

$$ \begin{aligned} \log(\mathbb{E}[e^{sX}]) &\leq \log (1 + (e^s - 1)\mathbb{E}[X]) \\ &\leq (e^s - 1)\mathbb{E}[X]. \end{aligned} $$
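
As a quick numerical sanity check of this claim (not part of the notes), the following Python snippet compares both sides for random distributions supported on $[0, 1]$ and random values of $s$:

```python
import numpy as np

rng = np.random.default_rng(0)
for trial in range(5):
    # Random discrete X supported on [0, 1]: values x_k with probabilities p_k.
    k = rng.integers(2, 6)
    x = rng.uniform(size=k)                  # values in [0, 1]
    p = rng.dirichlet(np.ones(k))            # probabilities summing to 1
    s = rng.normal() * 3                     # arbitrary real s
    lhs = np.log(np.sum(p * np.exp(s * x)))  # log E[exp(sX)]
    rhs = (np.exp(s) - 1) * np.sum(p * x)    # (e^s - 1) E[X]
    print(f"s = {s:+.2f}:  log E[e^(sX)] = {lhs:+.4f}  <=  (e^s - 1) E[X] = {rhs:+.4f}")
```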

Homework 3-4: Online Non-Convex Optimization

Sometimes our nice assumptions don't hold, but maybe things will still work out just fine. For the rest of this problem, assume that $W \subset \mathbb{R}^n$ is the learner's decision set, and the learner observes a sequence of functions $f_1, f_2, \ldots, f_T$ mapping $W \rightarrow \mathbb{R}$. The regret of an algorithm choosing a sequence $w_1, w_2, \dots$ is defined in the usual way:

$$R_T:=\sum_{t=1}^T f_t\left(w_t\right)-\min_{w \in W} \sum_{t=1}^T f_t(w)$$

Wouldn't it ruin your lovely day if the functions $f_t$ were not convex? Maybe the only two conditions you can guarantee are that the functions $f_t$ are bounded (say in $[0,1]$) and 1-Lipschitz: they satisfy $|f_t(w)-f_t(w')| \leq \|w-w'\|_2$. Prove that, assuming $W$ is convex and bounded, there exists a randomized algorithm with a reasonable expected-regret bound. Something like $\mathbb{E}[R_T] \leq O(\sqrt{n T \log T})$ would be admirable. (Hint: It is always good to ask the experts for ideas. And you needn't worry about efficiency.)
