Notes for Lecture 15: Online Convex Optimization & Gradient Descent
March 7, 2024
Scribes: Zach Perlman
Online Learning Framework
Consider an online learning process over discrete time steps $t = 1, 2, \ldots, T$. At each time step $t$, the learner selects a decision $w_t \in W \subset \mathbb{R}^d$. Subsequently, an adversary selects a loss function $f_t: W \rightarrow \mathbb{R}$. The learner then observes $f_t$, incurs the loss $f_t(w_t)$, and updates its decision to $w_{t+1}$.
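To make the interaction concrete, here is a minimal Python sketch of the protocol; the `learner` and `adversary` objects and their methods (`predict`, `choose_loss`, `update`) are hypothetical interfaces introduced only for illustration, not anything defined in the lecture.

```python
def run_online_learning(learner, adversary, T):
    """Simulate T rounds of the learner-adversary protocol described above."""
    total_loss = 0.0
    for t in range(1, T + 1):
        w_t = learner.predict()           # learner commits to a decision w_t in W
        f_t = adversary.choose_loss(w_t)  # adversary then reveals a loss function f_t
        total_loss += f_t(w_t)            # learner incurs the loss f_t(w_t)
        learner.update(f_t)               # learner updates its decision for the next round
    return total_loss
```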
Performance Measure: Regret
The performance of the learner is measured using the concept of regret, defined as
$$
R_T = \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(w^*),
$$
where $w^* = \arg\min_{w \in W} \sum_{t=1}^{T} f_t(w)$ represents the best fixed decision in hindsight. The goal is to achieve sublinear regret, such as $R_T = O(\sqrt{T})$ or $O(\log T)$, so that the average regret per time step vanishes as $T \rightarrow \infty$.
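As a toy illustration (with hypothetical names, and with a finite candidate set standing in for a general convex $W$), regret can be computed by comparing the learner's cumulative loss to that of the best fixed candidate in hindsight:

```python
def estimate_regret(loss_fns, played, candidates):
    """Regret of the played sequence against the best fixed decision from `candidates`.

    loss_fns[t] is a callable implementing f_t; played[t] is the learner's decision w_t.
    """
    learner_total = sum(f(w) for f, w in zip(loss_fns, played))        # sum_t f_t(w_t)
    best_fixed = min(sum(f(w) for f in loss_fns) for w in candidates)  # min_w sum_t f_t(w)
    return learner_total - best_fixed
```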
Online Convex Optimization
Assumptions
We make the following assumptions in the context of Online Convex Optimization (OCO):
The decision set $W$ is convex, and for all $t \in \{1, 2, \ldots, T\}$, the loss functions $f_t$ are convex.
The set $W$ is bounded, i.e., for all $w, w' \in W$, $\|w - w'\|_2 \leq D$.
The gradients of the loss functions are bounded, i.e., for all $t \in \{1, 2, \ldots, T\}$ and all $w \in W$, $\|\nabla f_t(w)\|_2 \leq G$.
Online Gradient Descent (OGD)
In the offline (convex) optimization setting, to minimize a function $g(w)$, we iteratively update our decision based on the gradient descent rule: $w_{t+1} = w_t - \eta_t \nabla g(w_t)$. Inspired by this, in the online setting, we update the decision as
$$
w'_{t+1} = w_t - \eta \nabla f_t(w_t),
$$
and then project $w'_{t+1}$ back into the decision set $W$, if necessary, using the Euclidean projection
$$
w_{t+1} = \Pi_W(w'_{t+1}) = \arg\min_{w \in W} \|w - w'_{t+1}\|_2.
$$
This projected variant is referred to as Projected Online Gradient Descent (P-OGD).
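For concreteness, the following Python sketch implements P-OGD in the illustrative case where $W$ is an $\ell_2$ ball; the names (`projected_ogd`, `project_onto_ball`, `grad_fns`) and the choice of $W$ are assumptions made for this example, not part of the lecture.

```python
import numpy as np

def project_onto_ball(w, radius):
    """Euclidean projection onto the L2 ball of the given radius centered at the origin."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def projected_ogd(grad_fns, d, eta, radius):
    """Run P-OGD for T = len(grad_fns) rounds.

    grad_fns[t](w) is assumed to return the gradient of f_t at w;
    W is taken to be the L2 ball of the given radius (so its diameter is D = 2 * radius).
    """
    w = np.zeros(d)                       # initial decision w_1 (lies in W)
    iterates = [w.copy()]
    for grad in grad_fns:
        w = w - eta * grad(w)             # gradient step: w'_{t+1} = w_t - eta * grad f_t(w_t)
        w = project_onto_ball(w, radius)  # projection step: w_{t+1} = Pi_W(w'_{t+1})
        iterates.append(w.copy())
    return iterates
```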
Regret Analysis of P-OGD
The analysis proceeds in two steps.
Step 1: Show that for all $t \in \{1, 2, \ldots, T\}$, $f_t(w_t) - f_t(w^*) \leq m(t)$, where $m(t)$ is defined using the convexity of $f_t$ and the update rule of P-OGD.
Step 2: Demonstrate that $R_T^{\text{P-OGD}} \leq \sum_{t=1}^{T} m(t) \leq GD\sqrt{T}$ for an appropriately tuned step size $\eta$.
Detailed Proof
Step 1: Convexity and Update Rule
Given the convexity of $f_t$, for any $x, y \in W$, we have:
$$
f_t(y) \geq f_t(x) + \langle \nabla f_t(x), y - x \rangle.
$$
In particular, taking $x = w_t$ and $y = w^*$,
$$
f_t(w_t) - f_t(w^*) \leq \langle \nabla f_t(w_t), w_t - w^* \rangle.
$$
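The remaining steps follow the standard argument; here is a sketch, using only the assumptions above together with the fact that the Euclidean projection onto a convex set is non-expansive. Since $w^* \in W$ and $w_{t+1} = \Pi_W(w'_{t+1})$,
$$
\|w_{t+1} - w^*\|_2^2 \leq \|w'_{t+1} - w^*\|_2^2 = \|w_t - w^*\|_2^2 - 2\eta \langle \nabla f_t(w_t), w_t - w^* \rangle + \eta^2 \|\nabla f_t(w_t)\|_2^2.
$$
Rearranging and using $\|\nabla f_t(w_t)\|_2 \leq G$ gives
$$
f_t(w_t) - f_t(w^*) \leq \langle \nabla f_t(w_t), w_t - w^* \rangle \leq \frac{\|w_t - w^*\|_2^2 - \|w_{t+1} - w^*\|_2^2}{2\eta} + \frac{\eta G^2}{2} =: m(t).
$$
Summing over $t = 1, \ldots, T$ (Step 2), the first term telescopes; using $\|w_1 - w^*\|_2 \leq D$ and dropping the nonpositive term $-\|w_{T+1} - w^*\|_2^2$ yields
$$
R_T^{\text{P-OGD}} \leq \sum_{t=1}^{T} m(t) \leq \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2},
$$
which is the bound referred to in the graded problem below.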
(Graded) In class, we proved the regret bound of P-OGD. In the last step of the proof (the last line of Step 2), we showed that the regret is upper bounded by
$$
R_T^{\text{P-OGD}} \leq \frac{\eta G^2 T}{2} + \frac{D^2}{2\eta}.
$$
By optimally tuning $\eta$ as $\eta = \frac{D}{G\sqrt{T}}$, we obtain a bound of the form $DG\sqrt{T}$. The problem with this approach is that it requires knowing $T$ in advance. Is there a way around this issue?
Imagine constructing a modified algorithm $\mathcal{A}$ that proceeds as follows. $\mathcal{A}$ starts with an initial step size $\eta_1$ and runs P-OGD with this step size for $T_1$ rounds, then switches to step size $\eta_2$ and runs P-OGD for $T_2$ rounds, and so forth. Suppose there are $k$ such phases in total, with $T_1 + T_2 + \ldots + T_k = T$.
Can you construct a schedule for the sequence of $(\eta_i, T_i)$ that achieves the same order of bound as the optimally tuned bound (namely, $O(DG\sqrt{T})$), even though $T$ is unknown in advance? In other words, you want to choose the sequence of $T_1, T_2, \ldots$, with the associated $\eta_1, \eta_2, \ldots$ so that whenever the online learning procedure truly ends, at a previously unknown $T$, the bound $R_T^{\mathcal{A}} = O(DG\sqrt{T})$ will always hold. (Hint: this is referred to as a ``doubling trick''.)