CS7545_Sp23_Lecture_16: Online Gradient Descent - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Guanghui Wang

Notes for Lecture 16: Online Gradient Descent

March 02, 2023

Scribes: Chaewon Park, Dunstan Becht

Announcements

  1. Updates on Final project
    • A list of 20-40 papers (topics) will be provided to choose from, but if you find something related that is not on the list, that is also fine
    • Pick 1-2 papers, summarize them, and do an analysis (e.g., how does the material from this class apply to modern research?)
    • Two (or maybe three) people can work together
    • No presentation
  2. Homework 3
    • Likely to be due before Spring break, with a grace period until the end of Spring break
    • No office hours over the break!
  3. Exam
    • Will cover topics in HW1~HW3
    • The exam stays within the scope of the homeworks and tests the same material, just phrased differently
    • As long as you understand HW solutions perfectly, you will be fine on the exam

Online Learning (cont'd)

Recap: Online Convex Optimization

Framework

For t = 1, 2, ..., T do:

  1. Learner picks a decision $w_t$ from a convex set $K \subseteq \mathbb{R}^d$. Simultaneously, the environment chooses a convex loss function $f_t: K \rightarrow \mathbb{R}$.
  2. Learner observes $f_t$, suffers loss $f_t(w_t)$, and updates its decision to $w_{t+1}$ (a minimal code sketch of this loop is given below).

End For
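As a concrete picture of the protocol, here is a minimal Python sketch of the online loop; the `learner` and `environment` objects and their method names are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch of the online convex optimization protocol.
# `learner` and `environment` are hypothetical objects with the
# interface assumed below; any OCO algorithm fits this loop.
def run_oco(learner, environment, T):
    total_loss = 0.0
    for t in range(1, T + 1):
        w_t = learner.predict()      # learner commits to a decision w_t in K
        f_t = environment.reveal(t)  # environment picks a convex loss f_t
        total_loss += f_t(w_t)       # learner suffers the loss f_t(w_t)
        learner.update(f_t)          # learner uses f_t to form w_{t+1}
    return total_loss
```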

Performance Measure: Regret

Definition: For an algorithm A, the regret is:

$$ Reg_T^A = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^Tf_t(w^*) $$

where $$w^* = \arg \min_{w\in K}\left( \sum_{t=1}^Tf_t(w) \right)$$
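As a toy illustration of this quantity, the following Python snippet computes the regret of a played sequence against the best fixed decision from a small finite candidate set (a simple stand-in for minimizing over all of $K$); the numbers are made up for illustration.

```python
import numpy as np

# losses[t][i] = loss f_t evaluated at candidate decision i (T = 3 rounds, 2 candidates).
losses = np.array([[0.9, 0.2],
                   [0.8, 0.3],
                   [0.1, 0.7]])
played = [0, 0, 1]  # index of the decision the learner actually played at each round

learner_loss = sum(losses[t, played[t]] for t in range(len(played)))  # 0.9 + 0.8 + 0.7 = 2.4
best_fixed = losses.sum(axis=0).min()  # best fixed candidate in hindsight: min(1.8, 1.2) = 1.2
regret = learner_loss - best_fixed     # regret ≈ 1.2
print(regret)
```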

Goal: We want to design an algorithm that gives:

$$ Reg_T^A = o(T) $$

for example: $O\left(\sqrt{T}\right)$ or $O(\log T)$, so that:

$$ \lim_{T \to \infty}\frac{1}{T}Reg_T^A = 0 $$
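For instance, an $O\left(\sqrt{T}\right)$ bound already gives a vanishing average regret:

$$ \frac{1}{T}Reg_T^A = O\left(\frac{\sqrt{T}}{T}\right) = O\left(\frac{1}{\sqrt{T}}\right) \rightarrow 0 \quad \text{as } T \rightarrow \infty $$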

More standard assumptions

  • Assumption 1: $\forall w, w' \in K, \Vert w- w'\Vert_2 \le D$, where $D$ is a constant (a bound on the diameter of $K$).
  • Assumption 2: $\forall t \in [T], \forall w \in K, \Vert \nabla f_t(w) \Vert_2 \le G$, where $G$ is another constant (a bound on the gradient norms). A concrete example is given below.
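For example (illustrative, not from the lecture): if $K$ is the Euclidean unit ball and each loss is linear, $f_t(w) = g_t^\top w$ with $\Vert g_t \Vert_2 \le 1$, then both assumptions hold with

$$ D = 2 \quad \text{(the diameter of the unit ball)}, \qquad G = 1 $$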

Online Gradient Descent (OGD)

Idea:

$$ \text{OGD}: \quad w_{t+1} \leftarrow w_t - \eta_t \nabla f_t(w_t) $$

But such an update could take $w_{t+1}$ outside the feasible set $K$ (so, in particular, Assumption 1 can no longer be applied to it). We use a projection operator to fix this.

Projection Operator

Definition: $\forall K \subseteq \mathbb{R}^d, \forall w \in \mathbb{R}^d$,

$$ \Pi_K(w) = \arg\min_{y\in K} \Vert y-w\Vert_2 $$

(i.e., $\Pi_K(w)$ is the point in $K$ that is closest to $w$.)

Note: If $w$ is already in $K$, $\Pi_K(w)$ is $w$ itself.
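As an illustration, here is a Python sketch of the projection for one simple choice of $K$, the Euclidean ball of radius $R$ centered at the origin; for a general convex set the projection is itself a convex optimization problem, so this closed form is specific to the ball.

```python
import numpy as np

def project_l2_ball(w, R=1.0):
    """Euclidean projection of w onto K = {x : ||x||_2 <= R}.

    A point already inside the ball is returned unchanged;
    otherwise it is rescaled onto the boundary of the ball."""
    norm = np.linalg.norm(w)
    if norm <= R:
        return w
    return (R / norm) * w
```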

Lemma 1:

Let $K \subseteq \mathbb{R}^d$ be a convex set and $w \in K$. Let $z \in \mathbb{R}^d$ and $z' = \Pi_K(z)$. Then the following inequality holds:

$$ \Vert z-w \Vert_2 \ge \Vert z' - w\Vert_2 $$

Proof: See the solution to Homework 1, Question 3(b).
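A quick one-dimensional sanity check (not from the lecture): take $K = [0, 1]$, $w = 0.5$, and $z = 2$, so that $z' = \Pi_K(z) = 1$; then

$$ \Vert z - w \Vert_2 = 1.5 \ge 0.5 = \Vert z' - w \Vert_2 $$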

OGD (real)

Base case: $w_1$ is any point in $K$

For t = 1, 2, ..., T do:

  1. $w_{t+1}' = w_t - \eta_t \nabla f_t(w_t)$
  2. $w_{t+1} = \Pi_K(w_{t+1}')$

End For
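Putting the gradient step and the projection together, here is a minimal Python sketch of OGD; the `grad` and `project` callbacks are assumed interfaces (a gradient oracle for $f_t$ and the Euclidean projection onto $K$), not something fixed by the lecture.

```python
import numpy as np

def ogd(grad, project, w1, T, eta):
    """Online Gradient Descent with projection.

    grad(t, w) -- gradient of f_t at w (the only feedback this sketch uses)
    project(w) -- Euclidean projection of w onto the convex set K
    w1         -- any starting point in K
    eta        -- constant step size eta_t = eta
    """
    w = np.asarray(w1, dtype=float)
    iterates = [w]
    for t in range(1, T + 1):
        g = np.asarray(grad(t, w))  # gradient of f_t at the current decision w_t
        w_prime = w - eta * g       # unconstrained gradient step (may leave K)
        w = project(w_prime)        # step 2: w_{t+1} = Pi_K(w'_{t+1})
        iterates.append(w)
    return iterates
```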

Theorem: Suppose Assumptions 1 and 2 hold. Then OGD with $\eta_t = \frac{D}{G\sqrt{T}}$ achieves the following regret bound:

$$ Reg_T^{OGD} \le GD\sqrt{T} $$

Proof:

Goal:

$$Reg_T^{OGD} = \sum_{t=1}^T \left( f_t(w_t) - f_t( w^* ) \right) \leq GD\sqrt{T}$$

Proof idea: Split this into two steps as follows:

  • Step 1: $\forall t \in [T], f_t(w_t) - f_t(w^*) \leq g(t)$

  • Step 2: $Reg_T^{OGD} \leq \sum_{t=1}^T g(t) \leq GD\sqrt{T}$

where $g$ is a function of $t$.

Step 1:

$$
\begin{align*}
f_t(w_t) - f_t( w^* )
&\leq \nabla f_t(w_t)^\top ( w_t - w^* ) && \text{(First-order condition for convexity)} \\
&= \frac{1}{\eta_t} (w_t - w_{t+1}')^\top ( w_t - w^* ) && (\text{Line 1 of OGD (real)}) \\
&= \frac{1}{2\eta_t} \left[ \Vert w_t - w_{t+1}' \Vert_2^2 + \Vert w_t - w^* \Vert_2^2 - \Vert w_{t+1}' - w^* \Vert_2^2 \right] && (\forall x, y \in \mathbb{R}^d,\ 2x^\top y = \Vert x \Vert_2^2 + \Vert y \Vert_2^2 - \Vert x-y \Vert_2^2) \\
&\leq \frac{1}{2\eta_t} \left[ \Vert w_t - w_{t+1}' \Vert_2^2 + \Vert w_t - w^* \Vert_2^2 - \Vert w_{t+1} - w^* \Vert_2^2 \right] && \text{(Lemma 1)} \\
&= \frac{1}{2\eta_t} \left[ \Vert \eta_t \nabla f_t(w_t) \Vert_2^2 + \Vert w_t - w^* \Vert_2^2 - \Vert w_{t+1} - w^* \Vert_2^2 \right] && (\text{Line 1 of OGD (real)}) \\
&= \frac{\eta_t}{2} \Vert \nabla f_t(w_t) \Vert_2^2 + \frac{\Vert w_t - w^* \Vert_2^2 - \Vert w_{t+1} - w^* \Vert_2^2}{2\eta_t} \\
&\leq \frac{\eta_t G^2}{2} + \frac{\Vert w_t - w^* \Vert_2^2 - \Vert w_{t+1} - w^* \Vert_2^2}{2\eta_t} && \text{(Assumption 2)}
\end{align*}
$$

Step 2: Note that we set $\eta_t$ to be time-invariant: let $\eta_t = \eta$ for all $t \in [T]$. Summing the bound from Step 1 over $t$, we have

$$
\begin{align*}
Reg_T^{OGD}
&\leq \sum_{t=1}^T \frac{\eta G^2}{2} + \sum_{t=1}^T \frac{\Vert w_t - w^* \Vert_2^2 - \Vert w_{t+1} - w^* \Vert_2^2}{2\eta} \\
&= \frac{T G^2 \eta}{2} + \frac{\Vert w_1 - w^* \Vert_2^2 - \Vert w_{T+1} - w^* \Vert_2^2}{2\eta} && \text{(Telescoping sum)} \\
&\leq \frac{T G^2 \eta}{2} + \frac{D^2}{2\eta} && \text{(Assumption 1, dropping the non-positive term)}
\end{align*}
$$

Finally, substitute $\eta = \frac{D}{G\sqrt{T}}$, which balances the two terms.
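Explicitly, with this choice of $\eta$ each term equals $\frac{GD\sqrt{T}}{2}$:

$$ Reg_T^{OGD} \leq \frac{T G^2}{2} \cdot \frac{D}{G\sqrt{T}} + \frac{D^2}{2} \cdot \frac{G\sqrt{T}}{D} = \frac{GD\sqrt{T}}{2} + \frac{GD\sqrt{T}}{2} = GD\sqrt{T} $$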

Note:

OGD with the time-varying step size $\eta_t = \frac{D}{G \sqrt{t}}$ also leads to an $O(DG\sqrt{T})$ regret bound. This version of the algorithm is "anytime", as it does not need to know $T$ in advance.
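In the OGD sketch above, this only changes the step size used at round $t$; a minimal illustration (the function name is hypothetical):

```python
import numpy as np

def anytime_eta(t, D, G):
    """Time-varying OGD step size eta_t = D / (G * sqrt(t));
    no knowledge of the horizon T is required."""
    return D / (G * np.sqrt(t))
```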
