CS7545_Sp24_Lecture_17 - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Guanghui Wang

Notes for Lecture 17: Online Learning

March 14, 2024

Scribes: Weilong Shen, Yuzhou Wang

Recap

Online Convex Optimization

For t = 1...T do:

  1. Learner picks $w_{t}$ from a convex set $W \subseteq \mathbb{R}^d$. Adversary picks a convex loss $f_t: W \to \mathbb{R}$.
  2. Learner observes $f_t$ and suffers $f_t(w_t)$.

End for

Regret

$$R_T(u) = \sum_{t=1}^Tf_t(w_t)-\sum_{t=1}^Tf_t(u), \forall u\in W$$

$$R_T(w^*)= \sum_{t=1}^Tf_t(w_t)-\sum_{t=1}^Tf_t(w^*), w^* =argmin_{w \in W}\sum_{t=1}^Tf_t(w)$$

The two forms of regret are equivalent, as shown in the last lecture.

Algorithm 1. Projected Online Gradient Descent (P-OGD)

$$\begin{cases} w_{t+1}^{'} = w_t - \eta \cdot \nabla f_t(w_t)\\ w_{t+1} = \Pi_W[w_{t+1}^{'}] \end{cases}$$

Theorem:

If $\eta = \dfrac{D}{G\sqrt{T}}$, then $R_{T}^{P-OGD} = O(DG\sqrt{T})$, where $D$ is an upper bound on the diameter of the decision set $W$ and $G$ is an upper bound on the norm of the gradients.
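As a quick illustration (a sketch, not part of the original notes), here is P-OGD run on the quadratic losses $f_t(w) = \frac{1}{2}\|w - z_t\|_2^2$ over the Euclidean ball $W = \{w : \|w\|_2 \le D\}$, with the step size from the theorem. The targets $z_t$ and all constants are illustrative choices.

```python
import numpy as np

def project_onto_ball(w, D):
    """Euclidean projection Pi_W onto the L2 ball W = {w : ||w||_2 <= D}."""
    norm = np.linalg.norm(w)
    return w if norm <= D else (D / norm) * w

rng = np.random.default_rng(0)
d, T, D = 5, 1000, 1.0
Z = rng.uniform(-1.0, 1.0, size=(T, d))   # adversary's choices z_1, ..., z_T (illustrative)
G = D + np.sqrt(d)                        # upper bound on ||grad f_t(w)|| = ||w - z_t|| over W
eta = D / (G * np.sqrt(T))                # step size from the theorem

w = np.zeros(d)                           # w_1 (any point of W)
alg_loss = 0.0
for t in range(T):
    alg_loss += 0.5 * np.linalg.norm(w - Z[t]) ** 2   # suffer f_t(w_t)
    grad = w - Z[t]                                   # grad f_t(w_t)
    w = project_onto_ball(w - eta * grad, D)          # P-OGD update

u = project_onto_ball(Z.mean(axis=0), D)              # best fixed comparator in W
comp_loss = 0.5 * np.sum(np.linalg.norm(Z - u, axis=1) ** 2)
print("regret:", alg_loss - comp_loss, "  D*G*sqrt(T):", D * G * np.sqrt(T))
```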

Algorithm 2. Follow-The-Leader (FTL)

$$w_{t+1} = argmin_{w \in W}\sum_{i=1}^{t}f_i(w)$$

Lemma: Be-The-Leader (BTL):

$$R_{T}^{FTL}\leq \sum_{t=1}^T[f_t(w_t)-f_t(w_{t+1})]$$

Good cases of the FTL algorithm include:

  1. $f_1 = f_2 = ... = f_T$
  2. $\sum_{t=1}^{T}\mathbb{1}[w_t\neq w_{t+1}]\leq O(k)$
  3. Online Quadratic Optimization (OQO): $f_t(w) = \frac{1}{2}||w-z_t||_{2}^{2}$, $R_T^{FTL}=O(\log T)$

In the worst case, however, the FTL algorithm suffers $R_T^{FTL}=\Theta(T)$.
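The third good case has a simple closed form: over $W = \mathbb{R}^d$ (or any $W$ large enough to contain the running means), the FTL iterate for $f_t(w)=\frac{1}{2}\|w-z_t\|_2^2$ is the running mean $w_{t+1} = \frac{1}{t}\sum_{i=1}^{t} z_i$. Below is a minimal Python sketch of this case; the targets $z_t$ are illustrative choices, not part of the lecture.

```python
import numpy as np

# FTL on Online Quadratic Optimization (good case 3): with
# f_t(w) = 0.5 * ||w - z_t||_2^2, the FTL iterate
#     w_{t+1} = argmin_w sum_{i <= t} f_i(w)
# is the running mean of the observed targets.

rng = np.random.default_rng(0)
d, T = 3, 1000
Z = rng.normal(size=(T, d))               # illustrative targets z_1, ..., z_T

regret = 0.0
running_sum = np.zeros(d)
w = np.zeros(d)                           # w_1 (arbitrary)
for t in range(T):
    regret += 0.5 * np.linalg.norm(w - Z[t]) ** 2     # f_t(w_t)
    running_sum += Z[t]
    w = running_sum / (t + 1)                         # FTL update: running mean

u = Z.mean(axis=0)                                    # best fixed point in hindsight
regret -= 0.5 * np.sum(np.linalg.norm(Z - u, axis=1) ** 2)
print("FTL regret on OQO:", regret)       # the notes state this is O(log T)
```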

This lecture

Algorithm 3. Follow the Regularized Leader (FTRL)

$$w_{t+1} = argmin_{w \in W}\sum_{i=1}^{t}f_i(w) + \eta\cdot R(w)$$

where $R(w)$ is a regularization term; in many cases $R(w) = \frac{1}{2}||w||_2^2$.

We would like to analyze the algorithm by bounding its regret. First we need to do some preparation.

Assume without loss of generality that $R(w)$ is 1-strongly convex w.r.t. some norm $||\cdot||$.

Recall that for any $\lambda>0$, we say a differentiable function $l: W \to \mathbb{R}$ is $\lambda$-strongly convex w.r.t. $||\cdot||$ if $$\forall w, w^{'} \in W, l(w^{'})\geq l(w) + \langle \nabla l(w), w^{'} - w \rangle + \dfrac{\lambda}{2}||w^{'} - w||^2$$

Observation: If $l$ is $\lambda$-strongly convex and $w = argmin_{w \in W}l(w)$, then:

$$\forall w^{'} \in W, l(w^{'})\geq l(w) + \dfrac{\lambda}{2}||w^{'} - w||^2$$
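One way to see this: since $w$ minimizes $l$ over the convex set $W$, the first-order optimality condition gives $\langle \nabla l(w), w^{'} - w \rangle \geq 0$ for all $w^{'} \in W$, and plugging this into the strong convexity inequality yields

$$l(w^{'})\geq l(w) + \langle \nabla l(w), w^{'} - w \rangle + \dfrac{\lambda}{2}||w^{'} - w||^2 \geq l(w) + \dfrac{\lambda}{2}||w^{'} - w||^2.$$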

Then we make two assumptions.

Assumption 1: Bounded Sets $$\forall w, w^{'} \in W, R(w) - R(w^{'})\leq D_R^2$$

Assumption 2: Bounded Gradients $$\forall w\in W, \forall t\geq 1, \left\|\nabla f_t(w)\right\|_{*}\leq G$$

where $||\cdot||$ and $||\cdot||_*$ are dual norms.

**Remark:** $D_R^2$ is a constant that depends on $R(w)$, and $G$ is also a constant.

Theorem:

Suppose Assumptions 1 and 2 hold. Then FTRL with $\eta = \dfrac{G\sqrt{T}}{D_R}$ achieves: $$R_{T}^{FTRL} = O\left(D_RG\sqrt{T}\right)$$

Proof: We first define a new OCO problem with $f_0(w) = \eta\cdot R(w)$ and with $f_1, f_2, ..., f_T$ unchanged. We call this new problem $P'$ and the original OCO problem $P$.

Apply FTL on $P'$: $$w_{t+1} = argmin_{w \in W} \sum_{i=0}^{t}f_i(w)$$

$$= argmin_{w \in W} \sum_{i=1}^{t}f_i(w) + \eta\cdot R(w),$$ which shows that applying FTL to $P'$ is equivalent to applying FTRL to $P$.

By the definition of regret we have:

$$R_{T}^{FTL, P^{'}} = \sum_{t=0}^{T}f_t(w_t)-\sum_{t=0}^{T}f_t(u)$$

$$= \sum_{t=1}^{T}f_t(w_t)-\sum_{t=1}^{T}f_t(u)+f_0(w_0)-f_0(u)$$

$$= R_{T}^{FTRL, P} + \eta\cdot R(w_0) - \eta\cdot R(u)$$

From the BTL Lemma we have: $$R_{T}^{FTL, P^{'}} \leq \sum_{t=0}^{T}[f_t(w_t)-f_t(w_{t+1})]$$

Therefore, $$R_{T}^{FTRL, P} \leq \sum_{t=0}^{T}[f_t(w_t)-f_t(w_{t+1})] - \eta\cdot [R(w_0)-R(u)]$$

The second term satisfies $-\eta\cdot [R(w_0)-R(u)] = \eta\cdot [R(u)-R(w_0)] \leq \eta\cdot D_R^2$ by Assumption 1.

Now for any $t\geq 1$, by convexity we have

$$\begin{align*} f_{t}(w_t)-f_{t}(w_{t+1}) &\leq \langle \nabla f_t(w_t), w_t - w_{t+1}\rangle \\\ &\leq \left\|\nabla f_t(w_t) \right\|_{*} \cdot \| w_t - w_{t+1}\| \\\ &\leq G \| w_t - w_{t+1}\|, \end{align*}$$ where the second inequality is Hölder's inequality for the dual norm pair and the last uses Assumption 2.

Now let $$F_t(w) = \sum_{i=1}^tf_i(w)+ \eta R(w).$$ Since each $f_i$ is convex and $R$ is 1-strongly convex, $F_t(w)$ is $\eta$-strongly convex. Note that by definition $w_{t+1} = argmin_{w \in W}F_{t}(w)$. By the previous observation, $\forall w' \in W$ we have

$$\begin{align*} F_t(w')-F_t(w_{t+1}) \geq \frac{\eta}{2}\|w'-w_{t+1}\|^{2} \end{align*}$$

Plugging in $w' = w_t$, we have

$$\begin{align*} F_t(w_t)-F_t(w_{t+1}) \geq \frac{\eta}{2}\|w_t-w_{t+1}\|^{2} \end{align*}$$

Similarly, $\forall w' \in W$ we have

$$\begin{align*} F_{t-1}(w')-F_{t-1}(w_{t}) \geq \frac{\eta}{2}\|w'-w_{t}\|^{2} \end{align*}$$

and plugging in $w' = w_{t+1}$, we have

$$\begin{align*} F_{t-1}(w_{t+1})-F_{t-1}(w_t) \geq \frac{\eta}{2}\|w_{t+1}-w_{t}\|^{2} \end{align*}$$

Summing up these two inequalities gives

$$\begin{align*} \eta \|w_{t+1}-w_{t}\|^{2} &\leq F_t(w_t)-F_t(w_{t+1})+F_{t-1}(w_{t+1})-F_{t-1}(w_t) \\\ &= \Big(F_{t}(w_t)- F_{t-1}(w_t)\Big)-\Big(F_{t}(w_{t+1})- F_{t-1}(w_{t+1})\Big) \\\ &= f_{t}(w_t)-f_{t}(w_{t+1}) \end{align*}$$

That is

$$\begin{align*} \|w_{t+1}-w_{t}\| \leq \sqrt{\frac{1}{\eta}\big(f_t(w_t)-f_t(w_{t+1})\big)} \end{align*}$$

Combining this with the inequality from convexity, we have

$$\begin{align*} f_t(w_t)-f_t(w_{t+1}) \leq G \|w_{t+1}-w_{t}\| \leq G\sqrt{\frac{1}{\eta}\big(f_t(w_t)-f_t(w_{t+1})\big)}, \end{align*}$$

which gives

$$\begin{align*} f_t(w_t)-f_t(w_{t+1}) \leq \frac{G^2}{\eta}. \end{align*}$$

Now, putting everything together,

$$\begin{align*} R_{T}^{FTRL, P} &\leq \sum_{t=0}^{T}[f_t(w_t)-f_t(w_{t+1})] - \eta \big(R(w_0)-R(u)\big) \\\ &\leq \sum_{t=1}^{T}[f_t(w_t)-f_t(w_{t+1})] + \big(f_0(w_0)-f_0(w_{1})\big) + \eta D_R^2 \\\ &= \sum_{t=1}^{T}[f_t(w_t)-f_t(w_{t+1})] + \eta\big(R(w_0)-R(w_{1})\big) + \eta D_R^2 \\\ &\leq T \cdot \frac{G^2}{\eta}+ 2\eta D_R^2. \end{align*}$$

Plugging in $\eta = \frac{G\sqrt{T}}{D_R}$ shows that $R_{T}^{FTRL, P} = O\Big(D_R G \sqrt{T}\Big)$, which completes the proof.
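For concreteness, the arithmetic behind this last step is

$$T\cdot\frac{G^2}{\eta} + 2\eta D_R^2 = \frac{TG^2 D_R}{G\sqrt{T}} + \frac{2G\sqrt{T}}{D_R}\cdot D_R^2 = D_R G\sqrt{T} + 2 D_R G\sqrt{T} = 3 D_R G\sqrt{T}.$$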

Remark:

In the FTRL algorithm, computing $w_{t+1}$ may not be easy, since it requires solving a potentially hard optimization problem at every round. To make the algorithm more efficient, we can approximate $f_t$ by its first-order expansion:

$$\begin{align*} \tilde{f}_{t}(w) = f_t(w_t)+\langle \nabla f_t(w_t), w-w_t\rangle. \end{align*}$$

Now, applying FTRL to the surrogate losses

$$\begin{align*} \{\tilde{f}_{1}, \cdots, \tilde{f}_{T}\}, \end{align*}$$

we obtain

$$\begin{align*} w_{t+1} &= argmin_{w \in W} \sum_{i=1}^t \tilde{f}_{i}(w)+ \eta R(w) \\\ &= argmin_{w \in W} \sum_{i=1}^t f_i(w_i)+\langle \nabla f_i(w_i), w-w_i\rangle + \eta R(w) \\\ &= argmin_{w \in W} \sum_{i=1}^t \langle \nabla f_i(w_i), w-w_i\rangle + \eta R(w) \\\ &= argmin_{w \in W} \left\langle \sum_{i=1}^t \nabla f_i(w_i), w \right \rangle + \eta R(w), \end{align*}$$

which is much easier to compute, since the first term is just a linear function of $w$. In the next lecture, we will show that we actually do not lose much by using this approximation.
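As a concrete illustration (a sketch, not code from the lecture): with $R(w) = \frac{1}{2}\|w\|_2^2$ and $W$ a Euclidean ball of radius $D$ centered at the origin, the last expression above has a closed-form solution. Writing $\Theta_t = \sum_{i=1}^{t}\nabla f_i(w_i)$, minimizing $\langle \Theta_t, w\rangle + \frac{\eta}{2}\|w\|_2^2$ over $W$ amounts to projecting $-\Theta_t/\eta$ onto the ball. The losses $f_t(w) = \frac{1}{2}\|w - z_t\|_2^2$ and all constants below are illustrative assumptions.

```python
import numpy as np

def project_onto_ball(w, D):
    """Euclidean projection onto W = {w : ||w||_2 <= D}."""
    norm = np.linalg.norm(w)
    return w if norm <= D else (D / norm) * w

rng = np.random.default_rng(1)
d, T, D = 5, 1000, 1.0
Z = rng.uniform(-1.0, 1.0, size=(T, d))   # targets for the illustrative losses
G = D + np.sqrt(d)                        # upper bound on ||grad f_t(w)||_2 over W
D_R = D / np.sqrt(2)                      # for R(w) = 0.5||w||^2 on this W, D_R^2 = D^2 / 2
eta = G * np.sqrt(T) / D_R                # regularization weight from the theorem

theta = np.zeros(d)                       # Theta_t: running sum of observed gradients
w = np.zeros(d)                           # w_1 = argmin_{w in W} R(w)
alg_loss = 0.0
for t in range(T):
    alg_loss += 0.5 * np.linalg.norm(w - Z[t]) ** 2   # suffer f_t(w_t)
    theta += w - Z[t]                                 # accumulate grad f_t(w_t)
    w = project_onto_ball(-theta / eta, D)            # linearized FTRL update (closed form)

u = project_onto_ball(Z.mean(axis=0), D)              # best fixed comparator in W
comp_loss = 0.5 * np.sum(np.linalg.norm(Z - u, axis=1) ** 2)
print("linearized FTRL regret:", alg_loss - comp_loss)
```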

Homework: Tuning Parameters

We are going to imagine we have some algorithm $\mathcal{A}$ with a performance bound that depends on some input values (which cannot be adjusted) and some tuning parameters (which can be optimized). We will use Greek letters ($\alpha, \eta, \zeta,$ etc.) for the tuning parameters and capital letters ($T, D, N$, etc.) for inputs. We would like the bound to be the tightest possible, up to multiplicative constants. For each of the following, tune the parameters to obtain the optimal bound. Using big-O notation is fine to hide constants, but you must not ignore the dependence on the input parameters. For example, assuming $M,T > 0$, imagine we have a performance guarantee of the form:

$$\text{Performance}(\mathcal{A}; M,T, \epsilon) \leq M \epsilon + \frac{T}{\epsilon},$$ and we know $\epsilon > 0$. Then by optimizing the above expression with respect to the free parameter we can set $\epsilon = \sqrt{\frac{T}{M}}$. With this value we obtain $\text{Performance}(\mathcal{A}; M,T, \epsilon) = O(\sqrt{MT})$.
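One way to verify this choice: by the AM-GM inequality, $M\epsilon + \frac{T}{\epsilon} \geq 2\sqrt{MT}$ for every $\epsilon > 0$, with equality exactly when $M\epsilon = \frac{T}{\epsilon}$, i.e. at $\epsilon = \sqrt{T/M}$, which gives

$$M\sqrt{\frac{T}{M}} + T\sqrt{\frac{M}{T}} = \sqrt{MT} + \sqrt{MT} = 2\sqrt{MT} = O(\sqrt{MT}).$$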

NOTE: We didn't have to make up this problem; we actually pulled all the bounds below from different papers!

  1. $\textnormal{Performance}(\mathcal{A}; T, \eta) \leq \max(T \eta, \eta^{-2})$
  2. $\textnormal{Performance}(\mathcal{A}; T, \eta) \leq \frac T \eta + \exp(\eta)$. (Note: you needn't obtain the optimal choice of $\eta$ here or the tightest possible bound, but try to tune in order to get a bound that is $o(T)$ -- i.e. the bound should grow strictly slower than linear in $T$.)
  3. $\textnormal{Performance}(\mathcal{A}; N, T, \eta, \epsilon) \leq \frac{ T \epsilon }{\eta} + \frac{N}{\epsilon} + T \eta$
  4. $\textnormal{Performance}(\mathcal{A}; N, T, \eta, \epsilon) \leq \frac{\log N}{\eta} + \frac{\eta T}{\epsilon^2}+\epsilon T.$
  5. $\textnormal{Performance}(\mathcal{A}; T, N, \eta) \leq \frac{\log N + \eta T}{1 - \exp(-\eta)}$