$$L_T = \sum_{t=1}^T l\left(\hat{y_t}, y_t\right)$$
$$L_T^i = \sum_{t=1}^T l\left(z_t^i, y_t\right)$$
However, we still want to achieve sublinear regret, $R_T = o(T)$. If this is the case, then
$$\frac{R_T}{T} = \frac{1}{T}L_T - \frac{1}{T}\min_{i \in [N]}L_T^i \longrightarrow 0 \quad \text{as } T \rightarrow \infty$$
We also make one assumption: $l\left(\hat{y}, y\right)$ is convex with respect to $\hat{y}$.
Define $i_t^* := \arg\min_{i \in [N]} L_{t-1}^i$. Follow the Leader (FTL) in this context entails following $i_t^*$, so $\hat{y_t} = z_t^{i_t^*}$.
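As a quick illustration (not part of the original notes), here is a minimal Python sketch of FTL in this experts setting; the function name `ftl_regret` and the loss-matrix representation are just for this example.

```python
import numpy as np

def ftl_regret(expert_losses):
    """Run FTL on a (T x N) matrix where expert_losses[t, i] = l(z_t^i, y_t).

    Each round, FTL copies the prediction of the expert with the smallest
    cumulative loss so far (ties broken toward the lower index), so its loss
    that round equals that expert's loss. Returns (L_T, R_T).
    """
    T, N = expert_losses.shape
    cumulative = np.zeros(N)                   # L_{t-1}^i for each expert i
    learner_loss = 0.0                         # running L_T for FTL
    for t in range(T):
        leader = int(np.argmin(cumulative))    # i_t^* = argmin_i L_{t-1}^i
        learner_loss += expert_losses[t, leader]
        cumulative += expert_losses[t]         # losses revealed after predicting
    regret = learner_loss - cumulative.min()   # R_T = L_T - min_i L_T^i
    return learner_loss, regret
```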
Unfortunately, FTL suffers linear regret:
Consider an example with $N = 2$. For the first day, we have $l(z_1^1, y_1)=0.51$ and $l(z_1^2, y_1) = 0.49$.
Now, starting from the second day, the losses oscillate: $l(z_t^1, y_t) = 0$ and $l(z_t^2, y_t) = 1$ when $t$ is even, and the other way around when $t$ is odd. The leader therefore flips every day, and FTL always follows the expert that is about to incur loss $1$. Hence $L_T \approx T$, while the better expert has $\min_{i} L_T^i \approx T/2$, so the regret grows like $T/2$, which is linear rather than sublinear.
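Plugging this loss sequence into the FTL sketch above confirms the linear growth numerically (the horizon $T = 1000$ is arbitrary):

```python
T = 1000
losses = np.zeros((T, 2))
losses[0] = [0.51, 0.49]                     # day 1
for day in range(2, T + 1):                  # days 2, 3, ..., T
    losses[day - 1] = [0.0, 1.0] if day % 2 == 0 else [1.0, 0.0]

L_T, R_T = ftl_regret(losses)
print(L_T, R_T)   # L_T is about 1000, R_T is about 500: regret grows linearly in T
```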
Thus, we need to explore other options.
Exponential Weights Algorithm (EWA)
Idea
Compute a score for each expert. In FTL, the score is the cumulative loss, and we follow only the single expert with the lowest score. Here we do something smoother: $\hat{y_t}$ is a weighted combination of all the experts' predictions $z_t^i$, so every expert contributes to the final decision.
Algorithm
Initialize: for each expert $i$, set its score $w_1^i = 1$.
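The rest of the loop is sketched below in Python, assuming the standard exponential update $w_{t+1}^i = w_t^i\, e^{-\eta\, l(z_t^i, y_t)}$ and the weighted-average prediction $\hat{y_t} = \sum_i \frac{w_t^i}{W_t} z_t^i$; the learning rate $\eta$ and the function names are illustrative, not taken from the notes.

```python
import numpy as np

def ewa(expert_predictions, outcomes, loss, eta=0.5):
    """Exponential Weights sketch.

    expert_predictions[t, i] = z_t^i, outcomes[t] = y_t, and loss(y_hat, y)
    is convex in its first argument with values in [0, 1].
    Returns the learner's predictions y_hat_1, ..., y_hat_T.
    """
    T, N = expert_predictions.shape
    w = np.ones(N)                               # initialize w_1^i = 1
    y_hats = np.empty(T)
    for t in range(T):
        p = w / w.sum()                          # normalized weights w_t^i / W_t
        y_hats[t] = p @ expert_predictions[t]    # weighted combination of the z_t^i
        per_expert = np.array([loss(z, outcomes[t]) for z in expert_predictions[t]])
        w = w * np.exp(-eta * per_expert)        # exponential (multiplicative) update
    return y_hats
```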
Proof of Claim 4: It suffices to show that $\Phi(t+1) - \Phi(t) \geq (1 - e^{-\eta}) \cdot l(\hat y_t, y_t)$, which follows from the definition of $\Phi(t)$.
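Assuming $\Phi(t) := -\ln W_t$ with $W_t = \sum_{i=1}^N w_t^i$, and the exponential update $w_{t+1}^i = w_t^i\, e^{-\eta\, l(z_t^i, y_t)}$ (the standard choices; the notes' exact normalization may differ), the step can be seen as follows:
$$\frac{W_{t+1}}{W_t} = \sum_{i=1}^N \frac{w_t^i}{W_t}\, e^{-\eta\, l(z_t^i, y_t)} \leq \sum_{i=1}^N \frac{w_t^i}{W_t}\left(1 - \left(1 - e^{-\eta}\right) l(z_t^i, y_t)\right) = 1 - \left(1 - e^{-\eta}\right)\sum_{i=1}^N \frac{w_t^i}{W_t}\, l(z_t^i, y_t),$$
where the inequality uses $e^{-\eta x} \leq 1 - (1 - e^{-\eta})x$ for $x \in [0, 1]$ (convexity of $e^{-\eta x}$). By convexity of $l$ in its first argument (Jensen's inequality), $\sum_i \frac{w_t^i}{W_t}\, l(z_t^i, y_t) \geq l(\hat y_t, y_t)$, so
$$\Phi(t+1) - \Phi(t) = -\ln\frac{W_{t+1}}{W_t} \geq -\ln\left(1 - \left(1 - e^{-\eta}\right) l(\hat y_t, y_t)\right) \geq \left(1 - e^{-\eta}\right) l(\hat y_t, y_t),$$
using $-\ln(1 - x) \geq x$.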
Remark: In class, Claim 4 was proved in a different way, using the following claim, which was stated without proof: given a random variable $X \in [0, 1]$ and $s \in \mathbb{R}$, we have
$$\mathbb{E}\left[e^{sX}\right] \leq \exp\left(\left(e^s - 1\right)\mathbb{E}[X]\right).$$
In fact, this inequality is implicitly proved in the lecture notes, but only for the particular random variable $X_t$ that takes value $l(z_t^i, y_t)$ with probability $\frac{w_t^i}{W_t}$. Generalizing that argument gives a formal proof of the claim above: for a random variable $X \in [0, 1]$ and $s \in \mathbb{R}$, the convexity of $e^{sx}$ in $x$ gives the chord bound
$$e^{sX} \leq (1 - X)\, e^{0} + X\, e^{s} = 1 + X\left(e^s - 1\right),$$
and taking expectations together with $1 + u \leq e^u$ yields
$$\mathbb{E}\left[e^{sX}\right] \leq 1 + \left(e^s - 1\right)\mathbb{E}[X] \leq \exp\left(\left(e^s - 1\right)\mathbb{E}[X]\right).$$
Our nice assumptions don't always hold, but maybe things will still work out just fine. For the rest of this problem, assume that $W \subset \mathbb{R}^n$ is the learner's decision set, and that the learner observes a sequence of functions $f_1, f_2, \ldots, f_T$ mapping $W \rightarrow \mathbb{R}$. The regret of an algorithm choosing a sequence $w_1, w_2, \dots$ is defined in the usual way:
$$R_T = \sum_{t=1}^T f_t(w_t) - \min_{w \in W} \sum_{t=1}^T f_t(w)$$
Wouldn't it ruin your lovely day if the functions $f_t$ were not convex? Maybe the only two conditions you can guarantee are that the functions $f_t$ are bounded (say, in $[0,1]$) and 1-Lipschitz: they satisfy $|f_t(w)-f_t(w')| \leq \left\lVert w-w'\right\rVert_2$. Prove that, assuming $W$ is convex and bounded, there exists a randomized algorithm with a reasonable expected-regret bound. Something like $\mathbb{E}[R_T] \leq O(\sqrt{n T \log T})$ would be admirable. (Hint: Always good to ask the experts for ideas. And you needn't worry about efficiency.)
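One way to read the hint (a sketch of a possible approach, not a verified solution; every name below is made up for illustration): discretize $W$ into a finite $\epsilon$-grid, treat each grid point as an expert whose round-$t$ loss is $f_t$ evaluated at that point, and run exponential weights with randomized play; the Lipschitz condition controls the gap between the best grid point and the best point of $W$.

```python
import itertools
import numpy as np

def gridded_ewa(box_low, box_high, eval_losses, T, eps):
    """Sketch: randomized EWA over an eps-grid of the box [box_low, box_high]^n.

    eval_losses(t, points) should return f_t evaluated (values in [0, 1]) at
    every row of `points`. Each grid point plays the role of one expert; at
    round t the learner samples a point proportionally to the exponential
    weights, then all grid-point losses are revealed and the weights update.
    """
    axes = [np.arange(lo, hi + eps, eps) for lo, hi in zip(box_low, box_high)]
    experts = np.array(list(itertools.product(*axes)))   # the eps-grid of W
    N = len(experts)
    eta = np.sqrt(8.0 * np.log(N) / T)                    # usual EWA tuning
    w = np.ones(N)
    plays = []
    for t in range(T):
        p = w / w.sum()
        idx = np.random.choice(N, p=p)                    # randomized decision w_t
        plays.append(experts[idx])
        losses = eval_losses(t, experts)                  # full-information feedback
        w = w * np.exp(-eta * losses)
    return plays
```

With $\epsilon$ as small as $1/T$ (efficiency is not a concern), the total discretization error is only $O(\sqrt{n})$ by the Lipschitz condition, while $\log N = O(n \log T)$, so the usual EWA guarantee of $O(\sqrt{T \log N})$ expected regret lands at the $O(\sqrt{nT\log T})$ target.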