CS 7545: Machine Learning Theory -- Spring 2023
Instructor: Guanghui Wang
Notes for Lecture 17: Online Gradient Descent (Cont'd)
Scribes: Junchen Lu, NAME2
Online Convex Optimization
For t = 1,2,...,T do:
Learner picks a decision $w_t$ from a convex set $K \subseteq \mathbb{R}^d$ .
Simultaneously, the environment chooses a convex loss function $f_t: K \rightarrow \mathbb{R}$ .
Learner observes $f_t$ , suffers $f_t(w_t)$ , and updates $w_t$ .
End For
The regret of an algorithm $A$ is measured against the best fixed decision in hindsight:
$$
Reg_T^A = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^Tf_t(w^\ast)
$$
where
$$w^\ast = \arg \min_{w\in K}\left( \sum_{t=1}^Tf_t(w) \right).$$
Assumption 1 : $\forall w,w' \in K, ||w-w'||_2 \leq D$
Assumption 2 : $\forall t \in [T], ||\nabla f_t(w_t)||_2 \leq G$
Online Gradient Descent (OGD):
Base case: $w_1$ is any point in $K$
For t = 1, 2, ..., T do:
$w_{t+1}' = w_t - \eta_t \nabla f_t(w_t)$
$w_{t+1} = \Pi_K(w_{t+1}')$
End For
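The update above can be sketched in a few lines of Python. This is a minimal illustration (mine, not from the lecture): it assumes $K$ is a Euclidean ball centered at the origin, for which the projection $\Pi_K$ has the closed form $w \cdot \min(1, R / \Vert w \Vert_2)$; the function names are hypothetical.

```python
import numpy as np

def project_ball(w, radius):
    """Pi_K for K = {w : ||w||_2 <= radius}: rescale w if it left the ball."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def ogd(grads, eta, radius, d):
    """Projected online gradient descent.

    grads[t-1](w) returns the gradient of f_t at w; eta(t) is the step size.
    Returns the iterates w_1, ..., w_T.
    """
    w = np.zeros(d)                          # w_1: any point in K
    iterates = []
    for t, grad in enumerate(grads, start=1):
        iterates.append(w.copy())
        w_prime = w - eta(t) * grad(w)       # gradient step
        w = project_ball(w_prime, radius)    # project back onto K
    return iterates
```

For a linear loss with a constant gradient, the iterates slide in the negative gradient direction until the projection pins them to the boundary of the ball.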
Theorem : OGD with $\eta_t = \frac{D}{G \sqrt{T}}$ also achieves the regret bound $Reg_T^{\text{OGD}} \leq DG\sqrt{T}$ .
Theorem : For any algorithm $A$ , there exists an OCO problem which satisfies Assumptions 1 and 2, such that the regret of $A$ on this problem satisfies $Reg_T^A \geq \frac{1}{2} D G \sqrt{T}$ .
Proof:
Consider an online linear optimization problem:
In each round t:
Learner picks a decision $w_t \in K \subseteq \mathbb{R}^d$
Let $K = \{ w \mid \Vert w \Vert_2 \leq \frac{D}{2} \}$ (Assumption 1)
Environment chooses $l_t \in \mathbb{R}^d$ with $\Vert l_t \Vert_2 = G$ (Assumption 2), $l_t \perp w_t$ , and $l_t \perp \sum_{i=1}^{t-1} l_i$ (such an $l_t$ exists whenever $d \geq 3$ , since the orthogonal complement of the span of two vectors is nontrivial).
Learner observes $l_t$ and suffers a loss $f_t(w_t)=w_t^T l_t$ and updates $w_t$ .
$$Reg^A_T = \sum_{t=1}^T w_t^T l_t - \min_{w \in K} \sum_{t=1}^T w^T l_t$$
Since $l_t \perp w_t$ , we have:
$$\sum_{t=1}^T w_t^T l_t = 0$$
Now we look at the second term:
$$
\begin{align*}
{-}\min_{w \in K} \sum_{t=1}^T w^T l_t & = {-}\min_{w \in K} w^T \sum_{t=1}^T l_t \\
& = \max_{w \in K} w^T (- \sum_{t=1}^T l_t) \\
& = \left \Vert \sum_{t=1}^T l_t \right \Vert_2 \frac{D}{2} && \text{maximum is achieved at $w^\ast = \frac{D}{2} \frac{- \sum_{t=1}^{T} l_t}{\Vert \sum_{t=1}^{T} l_t \Vert_2}$} \\
\end{align*}
$$
Thus,
$$
\begin{align*}
Reg_T^A & = \frac{D}{2} \left \Vert \sum_{t=1}^T l_t \right \Vert_2 \\
\end{align*}
$$
Next, consider $\left \Vert \sum_{t=1}^T l_t \right \Vert_2^2$ :
$$
\begin{align*}
\Vert \sum_{t=1}^T l_t \Vert_2^2 & = \Vert \sum_{t=1}^{T-1} l_t + l_T \Vert_2^2 \\
& = \left \Vert \sum_{t=1}^{T-1} l_t \right\Vert_2^2 + 2 l_T^T \left(\sum_{t=1}^{T-1} l_t\right) + \Vert l_T \Vert_2^2 \\
& = \left \Vert \sum_{t=1}^{T-1} l_t \right\Vert_2^2 + \Vert l_T \Vert_2^2 && \left(l_t \perp \sum_{i=1}^{t-1} l_i\right) \\
& \ldots \\
& = \sum_{t=1}^T \Vert l_t \Vert_2^2 && (\text{By induction}) \\
& = TG^2 && (\Vert l_t \Vert_2 = G) \\
\end{align*}
$$
We then have:
$$\sqrt{\left \Vert \sum_{t=1}^T l_t \right \Vert_2^2} = \sqrt{ \sum_{t=1}^T \Vert l_t \Vert_2^2} = \sqrt{T}G$$
and thus $Reg_T^A = \frac{D}{2} \sqrt{T}G$ .
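This construction is easy to check numerically. Below is a small sketch (mine, not from the lecture) in $d = 3$, the smallest dimension in which a vector of norm $G$ orthogonal to both $w_t$ and $\sum_{i=1}^{t-1} l_i$ always exists; the helper `orth_to` is hypothetical and picks such a vector from a null space computed via an SVD. Whatever the learner plays, the regret comes out to exactly $\frac{D}{2}G\sqrt{T}$.

```python
import numpy as np

def orth_to(vectors, G):
    """Return a vector of norm G orthogonal to every vector in `vectors`."""
    A = np.array(vectors)
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > 1e-10))
    return G * Vt[rank]     # null-space direction; exists since rank <= 2 < d

rng = np.random.default_rng(0)
D, G, T, d = 2.0, 1.0, 100, 3
S = np.zeros(d)             # running sum of the l_t
cum_loss = 0.0
for t in range(T):
    # The learner may play anything in K; the construction defeats every
    # strategy, so here it simply plays a random point on the boundary.
    w = rng.normal(size=d)
    w *= (D / 2) / np.linalg.norm(w)
    l = orth_to([w, S], G)  # ||l|| = G, l perp w_t, l perp sum of past l_i
    cum_loss += w @ l       # = 0 up to rounding, since l perp w_t
    S += l

regret = cum_loss + (D / 2) * np.linalg.norm(S)
```

By the orthogonality argument above, `np.linalg.norm(S)` equals $G\sqrt{T} = 10$ here, so `regret` equals $\frac{D}{2}G\sqrt{T} = 10$ up to floating-point error.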
Read section 4.2 of this paper for more details.
To get better bounds, we need more assumptions.
Upper bound of OGD for strongly convex functions
Assumption 3 : Strong convexity
Recall that if $f: K \rightarrow \mathbb{R}$ is $\lambda$-strongly convex ($\lambda$-SC), then $\forall x,y \in K$ ,
$$f(y) \geq f(x) + \nabla f(x)^T (y-x) + \frac{\lambda}{2} \Vert y-x\Vert_2^2$$
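As a quick sanity check (my own, not from the notes), the canonical example $f(w) = \frac{\lambda}{2}\Vert w \Vert_2^2$ satisfies this inequality with equality at every pair of points, which is easy to verify numerically:

```python
import numpy as np

lam = 2.0
f = lambda w: 0.5 * lam * np.sum(w ** 2)   # the canonical lambda-SC function
grad = lambda w: lam * w                   # its gradient

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lower = f(x) + grad(x) @ (y - x) + 0.5 * lam * np.sum((y - x) ** 2)
    # For this f the strong-convexity inequality holds with equality:
    # expanding the right-hand side gives (lam/2)||x + (y - x)||^2 = f(y).
    assert abs(f(y) - lower) < 1e-9
```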
Theorem :
Suppose Assumption 2 holds and all loss functions are $\lambda$-SC. Then OGD with $\eta_t = \frac{1}{\lambda t}$ achieves $Reg_T^{\text{OGD}} \leq \frac{G^2}{\lambda}(1+ \log T)$ .
Proof :
Step 1 : $\forall t \in [T]$ , find upper bound for $f_t(w_t) - f_t(w^\ast)$
$$
\begin{align*}
f_t(w_t) - f_t(w^\ast) & \leq \nabla f_t(w_t)^T (w_t - w^\ast) - \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 && (\lambda - SC)\\
& \leq \frac{\Vert w_t - w^\ast \Vert^2_2 - \Vert w_{t+1} - w^\ast \Vert^2_2}{2 \eta_t} + \frac{\eta_t}{2}G^2 - \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 \\
\end{align*}
$$
The last inequality follows from our previous result for convex functions.
Step 2: find $Reg^{\text{OGD}}_T$
$$
\begin{align*}
Reg_T^{\text{OGD}} & = \sum_{t=1}^T (f_t(w_t) - f_t(w^\ast)) \\
& = \sum_{t=1}^T \left( \frac{\Vert w_t - w^\ast \Vert^2_2 - \Vert w_{t+1} - w^\ast \Vert^2_2}{2 \eta_t} + \frac{\eta_t}{2}G^2 - \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 \right) \\
& = \sum_{t=2}^T \Vert w_t - w^\ast \Vert_2^2 \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}\right) + \Vert w_1 - w^\ast \Vert_2^2 \frac{1}{2\eta_1} - \Vert w_{T+1} - w^\ast \Vert_2^2 \frac{1}{2\eta_T} + G^2\sum_{t=1}^T \frac{\eta_t}{2} - \sum_{t=1}^T \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 \\
& = \sum_{t=2}^T \Vert w_t - w^\ast \Vert_2^2 \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}-\frac{\lambda}{2}\right) + \Vert w_1 - w^\ast \Vert_2^2 \frac{1}{2\eta_1} - \Vert w_{T+1} - w^\ast \Vert_2^2 \frac{1}{2\eta_T} + G^2\sum_{t=1}^T \frac{\eta_t}{2} -\frac{\lambda}{2}\Vert w_1 - w^\ast \Vert^2_2
\end{align*}
$$
Note that $\eta_t = 1/(\lambda t)$ . Thus, $\frac{1}{2\eta_t}-\frac{1}{2\eta_{t-1}}-\frac{\lambda}{2}=(\lambda t - \lambda (t-1) -\lambda)/2=0.$ Moreover, $\frac{1}{2\eta_1}-\frac{\lambda}{2}=0.$ Thus, we get
$$
\begin{align*}
Reg_T^{\text{OGD}} & = \sum_{t=1}^T (f_t(w_t) - f_t(w^\ast)) \\
& \leq G^2\sum_{t=1}^T\frac{\eta_t}{2}=\frac{G^2}{2\lambda}\sum_{t=1}^T\frac{1}{t}\leq \frac{G^2}{\lambda}(\log T+1).
\end{align*}
$$
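The logarithmic rate is easy to observe empirically. The sketch below (mine; the choice of losses is an assumption) runs OGD with $\eta_t = \frac{1}{\lambda t}$ on quadratics $f_t(w) = \frac{\lambda}{2}\Vert w - z_t \Vert_2^2$. For these losses the update $w_{t+1} = (1 - \frac{1}{t}) w_t + \frac{1}{t} z_t$ is a running average, so the iterates stay in the convex hull of the targets and no projection is needed, and the unconstrained minimizer $w^\ast$ is the mean of the $z_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, d, T = 1.0, 5, 1000
Z = rng.uniform(-1.0, 1.0, size=(T, d))    # targets z_t

def f(w, z):
    """lambda-strongly-convex loss f_t(w) = (lam/2) * ||w - z||^2."""
    return 0.5 * lam * np.sum((w - z) ** 2)

w = np.zeros(d)                            # w_1
G = 0.0                                    # observed gradient-norm bound
total = 0.0
for t in range(1, T + 1):
    z = Z[t - 1]
    total += f(w, z)
    g = lam * (w - z)                      # gradient of f_t at w_t
    G = max(G, np.linalg.norm(g))
    w = w - g / (lam * t)                  # eta_t = 1 / (lambda * t)

w_star = Z.mean(axis=0)                    # minimizer of the total loss
regret = total - sum(f(w_star, z) for z in Z)
bound = (G ** 2 / lam) * (1 + np.log(T))   # the theorem's guarantee
```

On this instance `regret` lands well under `bound`, and the final iterate coincides with $w^\ast$, since the updates compute exactly the running mean of the targets.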
Read section 3.3 of this book for more.