CS7545_Sp24_Lecture_15: Online Convex Optimization & Gradient Descent

CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Guanghui Wang

Notes for Lecture 15: Online Convex Optimization & Gradient Descent

March 7, 2024

Scribes: Zach Perlman

Online Learning Framework

Consider an online learning process over discrete time steps $t = 1, 2, \ldots, T$. At each time step $t$, the learner selects a decision $w_t \in W \subset \mathbb{R}^d$. An adversary then selects a loss function $f_t: W \rightarrow \mathbb{R}$. The learner observes $f_t$, incurs the loss $f_t(w_t)$, and then updates its decision for the next round.

Performance Measure: Regret

The performance of the learner is measured using the concept of regret, defined as

$$ R_T = \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(w^*), $$

where $w^* = \arg\min_{w \in W} \sum_{t=1}^{T} f_t(w)$ is the best fixed decision in hindsight. The goal is to achieve sublinear regret, e.g., $R_T = O(\sqrt{T})$ or $O(\log T)$, so that the average regret per round vanishes as $T \rightarrow \infty$.
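
To make the definition concrete, here is a minimal sketch in Python of how one could measure the regret of a decision sequence. The function name `empirical_regret` and the finite grid `candidate_ws` standing in for $W$ are illustrative assumptions, not part of the lecture:

```python
def empirical_regret(losses, decisions, candidate_ws):
    """Regret R_T of a decision sequence (illustrative sketch).

    losses       -- list of loss functions f_1, ..., f_T (each maps w -> float)
    decisions    -- the learner's decisions w_1, ..., w_T
    candidate_ws -- a finite grid standing in for W; the best fixed decision
                    in hindsight is searched over this grid (a simplification
                    made here for illustration only)
    """
    learner_loss = sum(f(w) for f, w in zip(losses, decisions))
    best_fixed_loss = min(sum(f(u) for f in losses) for u in candidate_ws)
    return learner_loss - best_fixed_loss
```

In general $w^*$ is the minimizer of the summed losses over all of $W$; the finite grid here only approximates that search for the sake of a runnable example.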

Online Convex Optimization

Assumptions

We make the following assumptions in the context of Online Convex Optimization (OCO):

  1. The decision set $W$ is convex, and for all $t \in$ { $1, 2, \ldots, T$ }, the loss function $f_t$ is convex.
  2. The set $W$ is bounded, i.e., for all $w, w' \in W$, $\|w - w'\|_2 \leq D$.
  3. The gradients of the loss functions are bounded, i.e., for all $t \in$ { $1, 2, \ldots, T$ } and all $w \in W$, $\|\nabla f_t(w)\|_2 \leq G$.

Online Gradient Descent (OGD)

In the offline (convex) optimization setting, to minimize a function $g(w)$, we iteratively update our decision based on the gradient descent rule: $w_{t+1} = w_t - \eta_t \nabla g(w_t)$. Inspired by this, in the online setting, we update the decision as

$$ w'_{t+1} = w_t - \eta \nabla f_t(w_t), $$

and then project $w'_{t+1}$ back onto the decision set $W$ if necessary, using the projection

$$ \Pi_W(w'_{t+1}) = \arg\min_{u \in W} \|u - w'_{t+1}\|_2. $$
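
For some decision sets the projection has a simple closed form. Below is a hedged sketch (assuming NumPy; the helper name `project_onto_ball` is hypothetical and not from the lecture) of the Euclidean projection onto a ball of radius $R$ centered at the origin, one common choice of $W$:

```python
import numpy as np

def project_onto_ball(z, radius=1.0):
    """Euclidean projection of z onto W = {u : ||u||_2 <= radius}.

    For this particular W the projection has a closed form: points
    inside the ball are left unchanged, and points outside are
    rescaled back onto the boundary.
    """
    norm = np.linalg.norm(z)
    if norm <= radius:
        return z
    return (radius / norm) * z
```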

Lemma: Let $K \subset \mathbb{R}^d$ be a convex set, $w \in K$, and $z \in \mathbb{R}^d$, and let $z' = \Pi_K(z)$ be the projection of $z$ onto $K$. Then

$$ \|z - w\|_2 \geq \|z' - w\|_2. $$

Projected Online Gradient Descent (P-OGD)

The update rule for P-OGD is given by

$$ w_{t+1} = \Pi_W(w'_{t+1}), $$

where $w'_{t+1} = w_t - \eta \nabla f_t(w_t)$. With $\eta = \frac{D}{G\sqrt{T}}$, the regret bound is

$$ R_T^{\text{P-OGD}} \leq GD\sqrt{T}. $$
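
Putting the update and the projection together, here is a minimal sketch of the P-OGD loop (assuming NumPy; `grad` and `project` are user-supplied oracles, and the function name `projected_ogd` is illustrative):

```python
import numpy as np

def projected_ogd(grad, project, w1, eta, T):
    """Projected Online Gradient Descent (minimal sketch).

    grad(t, w) -- gradient of the loss f_t at w, revealed after w_t is played
    project(z) -- Euclidean projection Pi_W(z) onto the decision set W
    w1         -- initial decision, assumed to lie in W
    eta        -- fixed step size, e.g. eta = D / (G * sqrt(T))
    T          -- number of rounds

    Returns the sequence of decisions w_1, ..., w_T.
    """
    w = np.asarray(w1, dtype=float)
    decisions = []
    for t in range(1, T + 1):
        decisions.append(w.copy())       # play w_t and suffer f_t(w_t)
        g = grad(t, w)                   # observe f_t via its gradient at w_t
        w = project(w - eta * g)         # w_{t+1} = Pi_W(w_t - eta * grad f_t(w_t))
    return decisions
```

For instance, with `project_onto_ball` from above and losses such as $f_t(w) = \|w - z_t\|_2^2$ for adversarially chosen targets $z_t$, the cumulative regret of this loop can be checked numerically against the $GD\sqrt{T}$ bound proved below.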

Proof of Regret Bound for P-OGD

Goal

We aim to demonstrate that the regret of P-OGD, denoted as $R_T^{\text{P-OGD}}$, satisfies:

$$ R_T^{\text{P-OGD}} = \sum_{t=1}^{T} (f_t(w_t) - f_t(w^*)) \leq GD\sqrt{T}. $$

Proof Outline

The proof consists of two main steps:

Step 1: Show that for all $t \in$ { $1, 2, \ldots, T$ }, we have $f_t(w_t) - f_t(w^*) \leq m(t)$, where $m(t)$ will be defined based on the convexity of $f_t$ and the update rule of P-OGD.

Step 2: Demonstrate that $R_T^{\text{P-OGD}} \leq \sum_{t=1}^{T} m(t) \leq GD\sqrt{T}$.

Detailed Proof

Step 1: Convexity and Update Rule

Given the convexity of $f_t$, for any $x, y \in W$ we have:

$$ f_t(y) \geq f_t(x) + \langle \nabla f_t(x), y - x \rangle. $$

From the P-OGD update rule, it follows that:

$$ \nabla f_t(w_t) = \frac{1}{\eta} (w_t - w'_{t+1}). $$

Hence, we obtain:

$$\begin{align*} f_t(w_t) - f_t(w^{*}) &\leq \langle \nabla f_t(w_t), w_t - w^{*} \rangle \\ &= \frac{1}{\eta} \langle w_t - w'_{t+1}, w_t - w^{*} \rangle \\ &= \frac{1}{2\eta} \left( ||w_t - w'_{t+1}||^2 + ||w_t - w^{*}||^2 - ||w'_{t+1} - w^{*}||^2 \right), \end{align*}$$

where the last equality uses the identity $2 \langle a, b \rangle = ||a||^2 + ||b||^2 - ||a - b||^2$ with $a = w_t - w'_{t+1}$ and $b = w_t - w^{*}$.

By the update rule, $||w_t - w'_{t+1}||^2 = \eta^2 ||\nabla f_t(w_t)||^2 \leq \eta^2 G^2$, and by the lemma (with $K = W$, $w = w^*$, $z = w'_{t+1}$, and $z' = w_{t+1}$), $||w'_{t+1} - w^{*}||^2 \geq ||w_{t+1} - w^{*}||^2$. Hence:

$$ f_t(w_t) - f_t(w^*) \leq \frac{\eta G^2}{2} + \frac{1}{2\eta} \left( ||w_t - w^{*}||^2 - ||w_{t+1} - w^{*}||^2 \right) =: m(t). $$

Step 2: Summation and Regret Bound

Summing $m(t)$ over all $T$ steps and applying the telescoping property, we get:

$$\begin{align*} R_T &\leq \sum_{t=1}^{T} m(t) \\ &\leq \frac{\eta G^2 T}{2} + \frac{1}{2\eta} ( ||w_1 - w^*||^2 - ||w_{T+1} - w^*||^2) \\ &\leq \frac{\eta G^2 T}{2} + \frac{D^2}{2\eta}. \end{align*} $$

Choosing $\eta = \frac{D}{G\sqrt{T}}$, which balances the two terms so that each equals $\frac{GD\sqrt{T}}{2}$, we conclude:

$$R_T^{\text{P-OGD}} \leq GD\sqrt{T}.$$

Homework: The Doubling trick

(Graded) In class, we proved the regret bound of P-OGD. In the last step of the proof (the last line of Step 2), we showed that the regret is upper bounded by $$R_T^{\text{P-OGD}} \leq \frac{\eta G^2 T}{2} + \frac{D^2}{2\eta}.$$ By optimally tuning $\eta$ as $\eta = \frac{D}{G\sqrt{T}}$, we obtain a bound of the form $DG\sqrt{T}$. The problem with this approach is that it requires us to know $T$ in advance. Is there a way around this issue?

Imagine constructing a modified algorithm $\mathcal{A}$ that performs the following iterative procedure. $\mathcal{A}$ starts with an initial step size $\eta_1$ and runs P-OGD with it for $T_1$ rounds, then adjusts the step size to $\eta_2$ and runs P-OGD for another $T_2$ rounds, and so forth. Suppose $\eta$ gets updated $k$ times in total, where $T_1 + T_2 + \ldots + T_k = T$.

Can you construct a schedule for the sequence of $(\eta_i, T_i)$ that achieves the same order of bound as the optimally tuned one (namely, $O(DG\sqrt{T})$), even though $T$ is unknown in advance? In other words, choose the sequence $T_1, T_2, \ldots$, with the associated step sizes $\eta_1, \eta_2, \ldots$, so that whenever the online learning procedure truly ends, at a previously unknown time $T$, the bound $R_T^{\mathcal{A}} = O(DG\sqrt{T})$ always holds. (Hint: this is referred to as the "doubling trick".)
