Bounded Gradients: $\forall \text{ } w \in W, \forall \text{ } t \in \lbrace 1,...,T \rbrace, ||\nabla f_t(w)|| \leq G$
Goals and Algorithms
The goal of Online Convex Optimization (OCO) is to minimize the regret $R_T$. We compare the cumulative loss of our decisions with the loss of the best fixed decision $w^*$, which minimizes the cumulative loss:
$$ R_T = \sum_{t=1}^T f_t(w_t) - \min_{w \in W} \sum_{t=1}^T f_t(w) $$
In the last lecture, we discussed the Projected Online Gradient Descent (P-OGD) algorithm to solve the problem of minimizing regret. The update rule for P-OGD is as follows:
$$ w'_{t+1} = w_t - \eta \nabla f_t(w_t)$$
$$ w_{t+1} = \Pi_w (w'_{t+1})$$
We also proved a bound on the P-OGD regret. Assume that the bounded set and bounded gradient assumptions hold. Then, P-OGD with $\eta = \frac{D}{G \sqrt{T}}$ has a regret as follows:
$$ R_T^{P-OGD} \leq DG \sqrt{T} $$
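As a concrete illustration, here is a minimal NumPy sketch of the P-OGD update on a toy one-dimensional problem; the function names, the toy losses, and the specific constants are our own choices, not part of the lecture:

```python
import numpy as np

def projected_ogd(grads, project, w0, eta):
    """P-OGD: take a gradient step on f_t, then project back onto W."""
    ws = [w0]
    w = w0
    for g in grads:
        w = project(w - eta * g(w))   # w'_{t+1} = w_t - eta * grad f_t(w_t)
        ws.append(w)                  # w_{t+1} = Pi_W(w'_{t+1})
    return ws

# Toy run: W = [-1, 1], f_t(w) = 0.5 * (w - 0.5)^2, so grad f_t(w) = w - 0.5.
T = 100
D, G = 2.0, 1.5                       # diameter of W and a gradient bound on W
eta = D / (G * np.sqrt(T))            # the step size from the regret theorem
project = lambda w: np.clip(w, -1.0, 1.0)
ws = projected_ogd([lambda w: w - 0.5] * T, project, w0=-1.0, eta=eta)
```

With these (stationary) losses the iterates contract toward the minimizer $0.5$ while the projection keeps every iterate inside $W$.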
Follow the Leader
Introduction
We introduce a new approach to defining the regret, as a function of a competitor $U \in W$. We compare our decisions with those of the competitor to calculate the regret:
$$ R_T(U) = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(U) $$
This definition of regret is equivalent to our original definition. If we can guarantee that the regret w.r.t. $w^*$ is sublinear w.r.t. $T$, then we can equivalently say that the regret of our algorithm is sublinear w.r.t. $T$ for any competitor $U$. Essentially, we are able to state the following:
$$ R_T(w^*) = o(T) \iff R_T(U) = o(T), \forall \text{ } U \in W $$
Using the definition of regret, we can relate $R_T(w^*)$ and $R_T(U)$. By definition, $w^*$ is the minimizer of the cumulative loss, so it maximizes the regret. Thus, for any competitor $U \in W$, the cumulative loss of the competitor can only be equal to or worse than that of $w^*$:
$$ R_T(U) = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(U) \leq \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(w^*) = R_T(w^*) $$
Now, we want to define a new algorithm that solves the problem of minimizing regret, but from a learning perspective.
We know that in an offline setting, we have the following sets:
Training set on which we perform Empirical Risk Minimization (ERM)
Testing set to which we hope the ERM can generalize well
We want to utilize this same framework in an online setting. We define the following sets with loss functions $f_t$:
Training set: $f_1(\cdot), \dots, f_t(\cdot)$
Testing set: $f_{t+1}(\cdot)$
We want to pick $w_{t+1}$ such that it minimizes the cumulative loss from rounds $1$ to $t$; this is the Follow The Leader (FTL) algorithm. This is a computationally expensive process. Potential workarounds are to use first-order approximations or to transform efficient offline algorithms to work online, which will be discussed in future lectures. The FTL algorithm chooses $w_{t+1}$ as follows:
$$ w_{t+1} = argmin_{w \in W} \Sigma_{i=1}^t f_i(w) $$
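The FTL rule $w_{t+1} = argmin_{w \in W} \Sigma_{i=1}^t f_i(w)$ can be sketched numerically. Since the argmin generally has no closed form, this toy one-dimensional version discretizes $W$ to a grid (a simplification of ours, not part of the lecture):

```python
import numpy as np

def ftl_step(losses, W_grid):
    """FTL: pick the point of W minimizing the cumulative loss so far.
    W is discretized to a grid for this 1-d illustration."""
    cum = sum(f(W_grid) for f in losses)   # cumulative loss evaluated on the grid
    return W_grid[np.argmin(cum)]

# Example: W = [-1, 1], quadratic losses f_i(w) = 0.5 * (w - z_i)^2.
grid = np.linspace(-1.0, 1.0, 2001)
zs = [0.2, 0.8, -0.4]
losses = [lambda w, z=z: 0.5 * (w - z) ** 2 for z in zs]
w_next = ftl_step(losses, grid)            # close to mean(zs) = 0.2
```

For quadratic losses the leader is the running mean of the $z_i$'s, which the grid search recovers up to discretization error.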
How do we know that FTL is a good algorithm? To evaluate this, we introduce the Be The Leader (BTL) lemma, which bounds the regret of FTL. We have, $\forall \text{ } U \in W$:
$$ R_T^{FTL}(U) = \Sigma_{t=1}^T \left( f_t(w_t) - f_t(U) \right) \leq \Sigma_{t=1}^T \left( f_t(w_t) - f_t(w_{t+1}) \right) $$
$$\text{where $f_t(w_t)$ is the loss at round $t$ with $w_t$ and $f_t(w_{t+1})$ is the loss at round $t$ with $w_{t+1}$ }$$
The stability of the FTL algorithm is given by $f_t(w_t) - f_t(w_{t+1})$. The regret of an algorithm is based on how stable it is.
If decisions are changed often, then the algorithm is unstable. Thus, the regret is hard to control.
If decisions are fairly consistent, then the algorithm is stable because decisions do not vary much. Thus, the regret will be small or 0.
Proof of BTL Lemma
We will use induction for this proof. Our goal is to show, $\forall \text{ } U \in W$,
$$\sum_{t=1}^T f_t(w_{t+1}) \leq \sum_{t=1}^T f_t(U) \text{ which is equivalent to saying } \sum_{t=1}^T f_t(w_{t+1}) - \sum_{t=1}^T f_t(U) \leq 0$$
There is an algorithm with regret at most $0$: "Be The Leader," which plays $w_{t+1}$ at round $t$. But this is cheating, because choosing $w_{t+1}$ requires knowing $f_t$ before the round is played. Still, it suggests that forcing $w_t$ to be close to $w_{t+1}$ makes the regret smaller; if $w_t$ and $w_{t+1}$ are far apart, the regret is larger.
For $T=1$, we know that $f_1(w_2) \leq f_1(U)$. This holds by the definition of $w_2$, which minimizes $f_1$ over $W$: $f_1(w_2) \leq f_1(U)$ $\forall \text{ } U \in W$.
Now assume the claim holds for $T-1$ rounds; this is our inductive hypothesis:
$$ \Sigma_{t=1}^{T-1} f_t(w_{t+1}) \leq \Sigma_{t=1}^{T-1} f_t(U), \text{ } \forall \text{ } U \in W $$
Use the inductive hypothesis to replace $U$ with $w_{T+1}$, which is valid because the hypothesis applies $\forall \text{ } U \in W$ and $w_{T+1} \in W$; then add $f_T(w_{T+1})$ to both sides:
$$ \Sigma_{t=1}^{T} f_t(w_{t+1}) \leq \Sigma_{t=1}^{T} f_t(w_{T+1}) $$
We know that by our earlier definition of $w_{T+1}$, which is the decision from FTL, $w_{T+1}$ is the minimizer of the cumulative loss over rounds $1$ to $T$. Thus, any competitor $U$ has cumulative loss at least that of $w_{T+1}$. We state the final step of this proof using this property:
$$ \Sigma_{t=1}^{T} f_t(w_{t+1}) \leq \Sigma_{t=1}^{T} f_t(w_{T+1}) \leq \Sigma_{t=1}^{T} f_t(U), \text{ } \forall \text{ } U \in W $$
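As a sanity check on the lemma, here is a small numeric experiment (our own toy setup: unconstrained quadratic losses, for which the exact leader is the running mean) confirming that playing $w_{t+1}$ at round $t$ is at least as good as any fixed competitor:

```python
import numpy as np

rng = np.random.default_rng(0)
zs = rng.uniform(-1, 1, size=50)          # quadratic losses f_t(w) = 0.5*(w - z_t)^2
f = lambda w, z: 0.5 * (w - z) ** 2

# FTL leaders on quadratics: w_{t+1} = argmin_w sum_{i<=t} f_i(w) = mean(z_1..z_t).
leaders = [np.mean(zs[: t + 1]) for t in range(len(zs))]   # leaders[t] = w_{t+2-1}

# "Be The Leader": suffer f_t evaluated at w_{t+1}.
btl_loss = sum(f(leaders[t], zs[t]) for t in range(len(zs)))
U = rng.uniform(-1, 1)                    # an arbitrary fixed competitor
comp_loss = sum(f(U, z) for z in zs)      # BTL lemma predicts btl_loss <= comp_loss
```

Any other choice of $U$ gives the same conclusion, since the lemma holds uniformly over competitors.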
Leader does not switch too much: if $\Sigma_{t=1}^T \mathbb{1} (w_t \neq w_{t+1}) \leq k$ for a constant $k$, then $R_T^{FTL} = O(1)$. Every round in which the decision does not change contributes $f_t(w_t) - f_t(w_{t+1}) = 0$ to the stability sum, and each of the at most $k$ switching rounds contributes a bounded amount under our boundedness assumptions.
Online Quadratic Optimization: OCO + adversary picks quadratic loss function $f_t(w) = \frac{1}{2} ||w - z_t||_2^2$.
Theorem for Online Quadratic Optimization: Assume $||z_t||_2 \leq L$; then $R_T^{FTL} = O(L^2 \log T)$.
Proof: Recall that we define $w_{t+1} = argmin_{w \in W} \Sigma_{i =1}^t f_i(w)$. In this case, we have that $w_{t + 1} = argmin_{w \in \mathbb{R}^d} \frac{1}{2} \Sigma_{i =1}^t ||w - z_i||_2^2$.
We solve the optimization problem using the first derivative test: setting the gradient $\Sigma_{i=1}^t (w - z_i)$ to zero, we find that
$$ w_{t+1} = \frac{1}{t} \Sigma_{i=1}^t z_i $$
i.e., $w_{t+1}$ is the average of $z_1, \dots, z_t$.
By the BTL lemma, $R_T^{FTL} \leq \Sigma_{t=1}^T \left( f_t(w_t) - f_t(w_{t+1}) \right) = \frac{1}{2} \Sigma_{t=1}^T \left( ||w_t - z_t||_2^2 - ||w_{t+1} - z_t||_2^2 \right)$. Let's consider the two terms inside the sum separately.
Let $A = ||w_t - z_t||_2^2$. To bound this term, we rely on our initial assumption that $||z_t||_2 \leq L$ and the fact that each $w_t$ is an average of the $z_i$'s, which gives $||w_t||_2 \leq L$ as well. By the triangle inequality, $||w_t - z_t||_2 \leq ||w_t||_2 + ||z_t||_2 \leq 2L$, hence $A \leq 4L^2$.
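A quick empirical check on a random instance (our own construction) shows the realized regret staying below one standard explicit form of the bound, $4L^2(1 + \log T)$, which follows from summing the per-round stability terms:

```python
import numpy as np

rng = np.random.default_rng(1)
T, L = 1000, 1.0
zs = rng.uniform(-L, L, size=T)

# FTL on quadratics plays the running mean; w_1 is arbitrary (we take 0).
# w[t] (0-indexed) is the decision at round t+1, i.e. mean(z_1..z_t).
w = np.concatenate(([0.0], np.cumsum(zs)[:-1] / np.arange(1, T)))
losses = 0.5 * (w - zs) ** 2                       # f_t(w_t) at each round
best = 0.5 * np.sum((np.mean(zs) - zs) ** 2)       # best fixed decision: overall mean
regret = np.sum(losses) - best                     # should be O(L^2 log T)
```

With $T = 1000$ the bound $4L^2(1+\log T) \approx 31.6$ is far below linear growth, illustrating the theorem.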
We do not achieve the same bound as we did for the P-OGD algorithm if our loss function $f_t$ is unstable. Consider the following example. Let $W = [-1, 1]$, $f_1(w) = \frac{1}{2}w$, and $\forall t \in \lbrace 2 \dots T \rbrace$
$$ f_t(w) = \begin{cases} -w & \text{if $t$ is even} \\ w & \text{if $t$ is odd} \end{cases} $$
Now, to compute the regret, we see that $\Sigma_{t=1}^T f_t(w_t) = T$ (the leader flips between $-1$ and $+1$ and suffers loss $1$ at every round after the first) and $\min_{w \in W} \Sigma_{t=1}^T f_t(w) = -\frac{1}{2}$. Therefore, $R_T^{FTL} = T + \frac{1}{2}$, which is linear in $T$: over enough iterations, our algorithm does not converge to the best decision.
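This failure mode is easy to reproduce numerically. The sketch below uses the standard alternating construction ($f_t(w) = -w$ for even $t$, $+w$ for odd $t \geq 3$, a common choice consistent with the numbers above) with $w_1 = 0$, and shows the regret growing linearly in $T$:

```python
import numpy as np

def f(t, w):
    """Losses on W = [-1, 1]: f_1(w) = w/2, then alternating -w / +w."""
    if t == 1:
        return 0.5 * w
    return -w if t % 2 == 0 else w

T = 1000
grid = np.linspace(-1.0, 1.0, 201)        # discretized W for the argmin
w, total = 0.0, 0.0                       # FTL with w_1 = 0
cum = np.zeros_like(grid)                 # cumulative loss on the grid
for t in range(1, T + 1):
    total += f(t, w)                      # suffer the loss, then observe f_t
    cum += f(t, grid)
    w = grid[np.argmin(cum)]              # leader flips between -1 and +1
best = np.min(cum)                        # best fixed decision in hindsight
regret = total - best                     # grows like T, not like sqrt(T) or log T
```

The cumulative loss alternates between $\frac{1}{2}w$ and $-\frac{1}{2}w$, so the leader jumps between the endpoints and loses $1$ every round while the best fixed decision loses only $-\frac{1}{2}$ in total.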
Foreshadowing
In the next lecture, we will demonstrate a way to handle this bad case by "stabilizing" the function $f_t$. We will call this new algorithm Follow The Regularized Leader (FTRL) and choose each next decision $w_{t+1}$ such that $w_{t+1} = argmin_{w \in W} \left( \Sigma_{i=1}^t f_i(w) + \eta R(w) \right)$, where $R(w)$ is the regularization term.
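A minimal sketch of this update, with a quadratic regularizer $R(w) = \frac{1}{2}w^2$ of our own choosing, hints at how regularization damps the jumpy behavior from the bad example above:

```python
import numpy as np

def ftrl_step(losses, grid, eta, R=lambda w: 0.5 * w ** 2):
    """FTRL: minimize cumulative loss plus eta * R(w) over W (a grid here)."""
    obj = sum(f(grid) for f in losses) + eta * R(grid)
    return grid[np.argmin(obj)]

grid = np.linspace(-1.0, 1.0, 2001)       # W = [-1, 1], discretized
losses = [lambda w: 0.5 * w, lambda w: -w]  # f_1, f_2 from the unstable example
w_ftl = ftrl_step(losses, grid, eta=0.0)    # eta = 0 recovers plain FTL: jumps to +1
w_ftrl = ftrl_step(losses, grid, eta=10.0)  # regularized leader stays near 0
```

With $\eta = 0$ the leader snaps to the boundary ($w = 1$), while the regularized objective $-\frac{1}{2}w + 5w^2$ keeps the decision near $0.05$, trading a little loss for stability.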
Homework: Dynamic Regret
(Graded) In class, we learned how to obtain regret guarantees for convex functions via P-OGD and FTL. Note that in regret, the competitor $w^*$ is fixed. In this problem, we consider a generalized notion of regret: dynamic regret, which is defined as
$$ R_T^{dynamic} = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(w_t^*) $$
where $w_t^*\in W$ is a sequence of competitors and $$\sum_{t=2}^T ||w_t^*-w_{t-1}^*||_2\leq P_T.$$ For convenience we assume $0\in W$. Can you get the regret guarantee in terms of $G,D,T$, and $P_T$? What is the optimal $\eta$ in this scenario?
Hint: 1) Start from the P-OGD analysis by extending the square norm, and consider how to use the definitions of $D$ and $P_T$ to finish the telescoping sum. 2) When tuning $\eta$ in the final step, consider the upper bound as a function of $\eta$, and then try to find the optimal configuration.