CS 7545: Machine Learning Theory -- Spring 2023
Instructor: Guanghui Wang
Notes for Lecture 17: Online Gradient Descent (Cont'd)
Scribes: Junchen Lu, NAME2
Online Convex Optimization
For t = 1,2,...,T do:
Learner picks a decision $w_t$ from a convex set $K \subseteq \mathbb{R}^d$ .
Simultaneously, the environment chooses a convex loss function $f_t: K \rightarrow \mathbb{R}$ .
Learner observes $f_t$ , suffers $f_t(w_t)$ , and updates $w_t$ .
End For
The regret of an algorithm $A$ is measured against the best fixed decision in hindsight:
$$
Reg_T^A = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^Tf_t(w^\ast)
$$
where
$$w^\ast = \arg \min_{w\in K}\left( \sum_{t=1}^Tf_t(w) \right).$$
Assumption 1 : $\forall w,w' \in K, ||w-w'||_2 \leq D$
Assumption 2 : $\forall t \in [T], ||\nabla f_t(w_t)||_2 \leq G$
Online Gradient Descent (OGD):
Base case: $w_1$ is any point in $K$
For t = 1, 2, ..., T do:
$w_{t+1}' = w_t - \eta_t \nabla f_t(w_t)$
$w_{t+1} = \Pi_K(w_{t+1}')$
End For
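The update above can be sketched in a few lines of Python. This is a minimal illustration (mine, not from the lecture): it assumes $K$ is a Euclidean ball centered at the origin, for which the projection $\Pi_K$ has the closed form $w \cdot \min(1, R / \Vert w \Vert_2)$; the function names are hypothetical.

```python
import numpy as np

def project_ball(w, radius):
    """Pi_K for K = {w : ||w||_2 <= radius}: rescale w if it left the ball."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def ogd(grads, eta, radius, d):
    """Projected online gradient descent.

    grads[t-1](w) returns the gradient of f_t at w; eta(t) is the step size.
    Returns the iterates w_1, ..., w_T.
    """
    w = np.zeros(d)                          # w_1: any point in K
    iterates = []
    for t, grad in enumerate(grads, start=1):
        iterates.append(w.copy())
        w_prime = w - eta(t) * grad(w)       # gradient step
        w = project_ball(w_prime, radius)    # project back onto K
    return iterates
```

For a linear loss with a constant gradient, the iterates slide in the negative gradient direction until the projection pins them to the boundary of the ball.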
Theorem : OGD with $\eta_t = \frac{D}{G \sqrt{T}}$ also achieves the regret bound $Reg_T^{\text{OGD}} \leq DG\sqrt{T}$ .
Theorem : For any algorithm $A$ , there exists an OCO problem which satisfies Assumptions 1 and 2, such that the regret of $A$ on this problem satisfies $Reg_T^A \geq \frac{1}{2} D G \sqrt{T}$ .
Proof:
Consider an online linear optimization problem:
In each round t:
Learner picks a decision $w_t \in K \subseteq \mathbb{R}^d$
Let $K = \{ w \mid \Vert w \Vert_2 \leq \frac{D}{2} \}$ (Assumption 1)
Environment chooses $l_t \in \mathbb{R}^d$ with $\Vert l_t \Vert_2 = G$ (Assumption 2), $l_t \perp w_t$ , and $l_t \perp \sum_{i=1}^{t-1} l_i$ (such an $l_t$ exists whenever $d \geq 3$ , since the orthogonal complement of the span of two vectors is nontrivial).
Learner observes $l_t$ and suffers a loss $f_t(w_t)=w_t^T l_t$ and updates $w_t$ .
$$Reg^A_T = \sum_{t=1}^T w_t^T l_t - \min_{w \in K} \sum_{t=1}^T w^T l_t$$
Since $l_t \perp w_t$ , we have:
$$\sum_{t=1}^T w_t^T l_t = 0$$
Now we look at the second term:
$$
\begin{align*}
{-}\min_{w \in K} \sum_{t=1}^T w^T l_t & = {-}\min_{w \in K} w^T \sum_{t=1}^T l_t \\
& = \max_{w \in K} w^T (- \sum_{t=1}^T l_t) \\
& = \left \Vert \sum_{t=1}^T l_t \right \Vert_2 \frac{D}{2} && \text{maximum is achieved at $w^\ast = \frac{D}{2} \frac{- \sum_{t=1}^{T} l_t}{\Vert \sum_{t=1}^{T} l_t \Vert_2}$} \\
\end{align*}
$$
Thus,
$$
\begin{align*}
Reg_T^A & = \frac{D}{2} \left \Vert \sum_{t=1}^T l_t \right \Vert_2 \\
\end{align*}
$$
Next, consider $\left \Vert \sum_{t=1}^T l_t \right \Vert_2^2$ :
$$
\begin{align*}
\Vert \sum_{t=1}^T l_t \Vert_2^2 & = \Vert \sum_{t=1}^{T-1} l_t + l_T \Vert_2^2 \\
& = \left \Vert \sum_{t=1}^{T-1} l_t \right\Vert_2^2 + 2 l_T^T \left(\sum_{t=1}^{T-1} l_t\right) + \Vert l_T \Vert_2^2 \\
& = \left \Vert \sum_{t=1}^{T-1} l_t \right\Vert_2^2 + \Vert l_T \Vert_2^2 && \left(l_t \perp \sum_{i=1}^{t-1} l_i\right) \\
& \ldots \\
& = \sum_{t=1}^T \Vert l_t \Vert_2^2 && (\text{By induction}) \\
& = TG^2 && (\Vert l_t \Vert_2 = G) \\
\end{align*}
$$
We then have:
$$\sqrt{\left \Vert \sum_{t=1}^T l_t \right \Vert_2^2} = \sqrt{ \sum_{t=1}^T \Vert l_t \Vert_2^2} = \sqrt{T}G$$
and thus $Reg_T^A = \frac{D}{2} \sqrt{T}G$ .
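This construction is easy to check numerically. Below is a small sketch (mine, not from the lecture) in $d = 3$, the smallest dimension in which a vector of norm $G$ orthogonal to both $w_t$ and $\sum_{i=1}^{t-1} l_i$ always exists; the helper `orth_to` is hypothetical and picks such a vector from a null space computed via an SVD. Whatever the learner plays, the regret comes out to exactly $\frac{D}{2}G\sqrt{T}$.

```python
import numpy as np

def orth_to(vectors, G):
    """Return a vector of norm G orthogonal to every vector in `vectors`."""
    A = np.array(vectors)
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > 1e-10))
    return G * Vt[rank]     # null-space direction; exists since rank <= 2 < d

rng = np.random.default_rng(0)
D, G, T, d = 2.0, 1.0, 100, 3
S = np.zeros(d)             # running sum of the l_t
cum_loss = 0.0
for t in range(T):
    # The learner may play anything in K; the construction defeats every
    # strategy, so here it simply plays a random point on the boundary.
    w = rng.normal(size=d)
    w *= (D / 2) / np.linalg.norm(w)
    l = orth_to([w, S], G)  # ||l|| = G, l perp w_t, l perp sum of past l_i
    cum_loss += w @ l       # = 0 up to rounding, since l perp w_t
    S += l

regret = cum_loss + (D / 2) * np.linalg.norm(S)
```

By the orthogonality argument above, `np.linalg.norm(S)` equals $G\sqrt{T} = 10$ here, so `regret` equals $\frac{D}{2}G\sqrt{T} = 10$ up to floating-point error.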
Read section 4.2 of this paper for more details.
To get better bounds, we need more assumptions.
Upper bound of OGD for strongly convex functions
Assumption 3 : Strong convexity
Recall that if $f: K \rightarrow \mathbb{R}$ is $\lambda$-strongly convex ($\lambda$-SC), then $\forall x,y \in K$ ,
$$f(y) \geq f(x) + \nabla f(x)^T (y-x) + \frac{\lambda}{2} \Vert y-x\Vert_2^2$$
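As a quick sanity check (my own, not from the notes), the canonical example $f(w) = \frac{\lambda}{2}\Vert w \Vert_2^2$ satisfies this inequality with equality at every pair of points, which is easy to verify numerically:

```python
import numpy as np

lam = 2.0
f = lambda w: 0.5 * lam * np.sum(w ** 2)   # the canonical lambda-SC function
grad = lambda w: lam * w                   # its gradient

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lower = f(x) + grad(x) @ (y - x) + 0.5 * lam * np.sum((y - x) ** 2)
    # For this f the strong-convexity inequality holds with equality:
    # expanding the right-hand side gives (lam/2)||x + (y - x)||^2 = f(y).
    assert abs(f(y) - lower) < 1e-9
```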
Theorem :
Suppose Assumption 2 holds and all loss functions are $\lambda$-SC. Then OGD with $\eta_t = \frac{1}{\lambda t}$ achieves $Reg_T^{\text{OGD}} \leq \frac{G^2}{\lambda}(1+ \log T)$ .
Proof :
Step 1 : $\forall t \in [T]$ , find upper bound for $f_t(w_t) - f_t(w^\ast)$
$$
\begin{align*}
f_t(w_t) - f_t(w^\ast) & \leq \nabla f_t(w_t)^T (w_t - w^\ast) - \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 && (\lambda - SC)\\
& \leq \frac{\Vert w_t - w^\ast \Vert^2_2 - \Vert w_{t+1} - w^\ast \Vert^2_2}{2 \eta_t} + \frac{\eta_t}{2}G^2 - \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 \\
\end{align*}
$$
The last inequality follows from our previous result for convex functions.
Step 2: find $Reg^{\text{OGD}}_T$
$$
\begin{align*}
Reg_T^{\text{OGD}} & = \sum_{t=1}^T (f_t(w_t) - f_t(w^\ast)) \\
& = \sum_{t=1}^T \left( \frac{\Vert w_t - w^\ast \Vert^2_2 - \Vert w_{t+1} - w^\ast \Vert^2_2}{2 \eta_t} + \frac{\eta_t}{2}G^2 - \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 \right) \\
& = \sum_{t=2}^T \Vert w_t - w^\ast \Vert_2^2 \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}\right) + \Vert w_1 - w^\ast \Vert_2^2 \frac{1}{2\eta_1} - \Vert w_{T+1} - w^\ast \Vert_2^2 \frac{1}{2\eta_T} + G^2\sum_{t=1}^T \frac{\eta_t}{2} - \sum_{t=1}^T \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 \\
& = \sum_{t=2}^T \Vert w_t - w^\ast \Vert_2^2 \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}-\frac{\lambda}{2}\right) + \Vert w_1 - w^\ast \Vert_2^2 \frac{1}{2\eta_1} - \Vert w_{T+1} - w^\ast \Vert_2^2 \frac{1}{2\eta_T} + G^2\sum_{t=1}^T \frac{\eta_t}{2} -\frac{\lambda}{2}\Vert w_1 - w^\ast \Vert^2_2
\end{align*}
$$
Note that $\eta_t = 1/(\lambda t)$ . Thus, $\frac{1}{2\eta_t}-\frac{1}{2\eta_{t-1}}-\frac{\lambda}{2}=(\lambda t - \lambda (t-1) -\lambda)/2=0.$ Moreover, $\frac{1}{2\eta_1}-\frac{\lambda}{2}=0.$ Thus, we get
$$
\begin{align*}
Reg_T^{\text{OGD}} & = \sum_{t=1}^T (f_t(w_t) - f_t(w^\ast)) \\
& \leq G^2\sum_{t=1}^T\frac{\eta_t}{2}=\frac{G^2}{2\lambda}\sum_{t=1}^T\frac{1}{t}\leq \frac{G^2}{\lambda}(\log T+1).
\end{align*}
$$
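The logarithmic rate is easy to observe empirically. The sketch below (mine; the choice of losses is an assumption) runs OGD with $\eta_t = \frac{1}{\lambda t}$ on quadratics $f_t(w) = \frac{\lambda}{2}\Vert w - z_t \Vert_2^2$. For these losses the update $w_{t+1} = (1 - \frac{1}{t}) w_t + \frac{1}{t} z_t$ is a running average, so the iterates stay in the convex hull of the targets and no projection is needed, and the unconstrained minimizer $w^\ast$ is the mean of the $z_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, d, T = 1.0, 5, 1000
Z = rng.uniform(-1.0, 1.0, size=(T, d))    # targets z_t

def f(w, z):
    """lambda-strongly-convex loss f_t(w) = (lam/2) * ||w - z||^2."""
    return 0.5 * lam * np.sum((w - z) ** 2)

w = np.zeros(d)                            # w_1
G = 0.0                                    # observed gradient-norm bound
total = 0.0
for t in range(1, T + 1):
    z = Z[t - 1]
    total += f(w, z)
    g = lam * (w - z)                      # gradient of f_t at w_t
    G = max(G, np.linalg.norm(g))
    w = w - g / (lam * t)                  # eta_t = 1 / (lambda * t)

w_star = Z.mean(axis=0)                    # minimizer of the total loss
regret = total - sum(f(w_star, z) for z in Z)
bound = (G ** 2 / lam) * (1 + np.log(T))   # the theorem's guarantee
```

On this instance `regret` lands well under `bound`, and the final iterate coincides with $w^\ast$, since the updates compute exactly the running mean of the targets.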
Read section 3.3 of this book for more.