
CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Guanghui Wang

Notes for Lecture 16: Online Learning

March 12, 2024

Scribes: Eshani Chauk, Rajit Khanna

Online Learning Recap

Framework

For t = 1...T do:

  1. Learner picks $w_{t}$ from a convex set $W \subseteq \mathbb{R}^d$. Adversary picks a convex loss $f_t: W \to \mathbb{R}$.
  2. Learner observes $f_t$, suffers $f_t(w_t)$ and updates $w_{t}$.

End for

Assumptions

  1. Bounded Sets: $\forall \text{ } w, w' \in W, ||w-w'||_2 \leq D$
  2. Bounded Gradients: $\forall \text{ } w \in W, \forall \text{ } t \in \lbrace 1,...,T \rbrace, ||\nabla f_t(w)||_2 \leq G$

Goals and Algorithms

The goal of Online Convex Optimization (OCO) is to minimize the regret $R_T$. We compare the cumulative loss of our decisions with the loss of the best decision $w^*$ which minimizes the cumulative loss.

$$R_T = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(w^*)$$

$$ w^* = argmin_{w \in W} \sum_{t=1}^T f_t(w) $$

In the last lecture, we discussed the Projected Online Gradient Descent (P-OGD) algorithm to solve the problem of minimizing regret. The update rule for P-OGD is as follows:

$$ w'_{t+1} = w_t - \eta \nabla f_t(w_t)$$

$$ w_{t+1} = \Pi_W (w'_{t+1})$$

We also proved a bound on the P-OGD regret. Assume that the bounded-set and bounded-gradient assumptions hold. Then, P-OGD with $\eta = \frac{D}{G \sqrt{T}}$ satisfies

$$ R_T^{\text{P-OGD}} \leq DG \sqrt{T} $$
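
To make the update concrete, here is a minimal NumPy sketch of P-OGD (not code from the lecture). The unit-ball domain, the quadratic losses $f_t(w) = \frac{1}{2}\|w - z_t\|_2^2$, and the resulting constants $D$ and $G$ are illustrative assumptions chosen so that the bounded-set and bounded-gradient assumptions hold.

```python
import numpy as np

def project_onto_ball(w, radius):
    """Euclidean projection onto the ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def pogd(grad_fns, w0, eta, radius):
    """Projected Online Gradient Descent: gradient step, then project back onto W."""
    w = np.array(w0, dtype=float)
    iterates = []
    for grad in grad_fns:
        iterates.append(w)
        w = project_onto_ball(w - eta * grad(w), radius)  # w'_{t+1} = w_t - eta*grad, then Pi_W
    return iterates

# Toy run with quadratic losses f_t(w) = 0.5*||w - z_t||_2^2 and W the unit ball (all illustrative).
rng = np.random.default_rng(0)
T, d, radius = 1000, 5, 1.0
zs = [project_onto_ball(rng.normal(size=d), radius) for _ in range(T)]
grad_fns = [lambda w, z=z: w - z for z in zs]              # gradient of 0.5*||w - z||^2
D, G = 2 * radius, 2 * radius                              # diameter of W and gradient bound on W
eta = D / (G * np.sqrt(T))                                 # step size from the lecture
ws = pogd(grad_fns, np.zeros(d), eta, radius)              # w_1, ..., w_T

loss = lambda w, z: 0.5 * np.linalg.norm(w - z) ** 2
w_star = np.mean(zs, axis=0)                               # minimizer of the cumulative loss (inside the ball)
regret = sum(loss(w, z) for w, z in zip(ws, zs)) - sum(loss(w_star, z) for z in zs)
print(f"regret = {regret:.2f}, bound D*G*sqrt(T) = {D * G * np.sqrt(T):.2f}")
```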

Follow the Leader

Introduction

We introduce a new approach to defining the regret as a function of competitors. We compare our decisions with those of the competitor to calculate the regret:

$$ R_T(U) = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(U) $$

This definition of regret is equivalent to our original definition. If we can guarantee that the regret w.r.t. $w^*$ is sublinear w.r.t. $T$, then we can equivalently say that the regret of our algorithm is sublinear w.r.t. $T$ for any competitor $U$. Essentially, we are able to state the following:

$$ R_T(w^*) = o(T) \iff R_T(U) = o(T), \quad \forall \text{ } U \in W $$

Using the definition of regret, we can relate $R_T(w^*)$ and $R_T(U)$. By definition, $w^*$ minimizes the cumulative loss, so it maximizes the regret over all competitors. Thus, for any competitor $U \in W$, the cumulative loss of the competitor can only be equal to or worse than that of $w^*$:

$$ R_T(U) = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(U) \leq \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(w^*) = R_T(w^*) $$

Hence, if $R_T(w^*)$ is sublinear in $T$, then so is $R_T(U)$ for every $U \in W$.

FTL Algorithm

Now, we want to define a new algorithm that solves the problem of minimizing regret, but from a learning perspective.

  • We know that in an offline setting, we have the following sets:
    • Training set on which we perform Empirical Risk Minimization (ERM)
    • Testing set to which we hope the ERM can generalize well
  • We want to utilize this same framework in an online setting. We define the following sets with loss functions $f_t$:
    • Training set: $f_{1,...,t}$
    • Testing set: $f_{t+1}(\cdot)$

We want to pick $w_{t+1}$ to minimize the cumulative loss from rounds $1$ to $t$; this strategy is called the Follow The Leader (FTL) algorithm. Each round requires solving an optimization problem over all past losses, which is computationally expensive. Potential workarounds are to use first-order approximations or to transform efficient offline algorithms to work online, but these will be discussed in future lectures. FTL chooses $w_{t+1}$ as follows:

$$w_{t+1} = argmin_{w \in W} \sum_{i=1}^t f_i(w) $$
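
Below is a minimal sketch of one FTL round, assuming a generic numerical solver (`scipy.optimize.minimize`) stands in for the per-round minimization; in practice one would exploit the structure of the losses (as in the quadratic case later in these notes) rather than re-solving from scratch each round.

```python
import numpy as np
from scipy.optimize import minimize

def ftl_step(past_losses, w0, bounds=None):
    """Follow The Leader: return argmin_{w in W} of the cumulative loss so far.

    past_losses: list of callables f_i(w); bounds encodes a box-shaped domain W.
    Re-solving this optimization from scratch every round is what makes FTL expensive.
    """
    cumulative = lambda w: sum(f(w) for f in past_losses)
    return minimize(cumulative, w0, bounds=bounds).x

# Example: quadratic losses f_i(w) = 0.5*||w - z_i||_2^2; FTL returns (approximately) their mean.
zs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
fs = [lambda w, z=z: 0.5 * np.sum((w - z) ** 2) for z in zs]
print(ftl_step(fs, w0=np.zeros(2)))   # roughly [0.667, 0.667], the mean of the z_i
```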

Lemma: Be The Leader

How do we know that FTL is a good algorithm? To evaluate this, we introduce the Be The Leader (BTL) lemma for the regret of FTL. We have, $\forall \text{ } U \in W$:

$$ R_T^{FTL} = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(U) \leq \sum_{t=1}^T (f_t(w_t) - f_t(w_{t+1})) $$

where $f_t(w_t)$ is the loss at round $t$ when playing $w_t$, and $f_t(w_{t+1})$ is the loss at round $t$ when playing $w_{t+1}$.

The quantity $f_t(w_t) - f_t(w_{t+1})$ measures the stability of the FTL algorithm, and the bound above shows that the regret is controlled by this stability.

  • If the decisions change often, the algorithm is unstable, and the regret is hard to control.
  • If the decisions change little from round to round, the algorithm is stable, and the regret stays small (or even zero).

Proof of BTL Lemma

We will use induction for this proof. Our goal is to show, $\forall \text{ } U \in W$,

$$\sum_{t=1}^T f_t(w_{t+1}) \leq \sum_{t=1}^T f_t(U) \text{ which is equivalent to saying } \sum_{t=1}^T f_t(w_{t+1}) - \sum_{t=1}^T f_t(U) \leq 0$$

The lemma says that the "Be The Leader" strategy, which plays $w_{t+1}$ at round $t$, has regret at most $0$; but this is cheating, because choosing $w_{t+1}$ requires knowing $f_t$ before it is revealed. The lemma then tells us that if we can keep $w_t$ close to $w_{t+1}$, the regret stays small, whereas if $w_t$ and $w_{t+1}$ are far apart, the regret can be large.

Base case ($T=1$): we have $f_1(w_2) \leq f_1(U)$ for all $U \in W$. This holds by the definition of $w_2$, which is the minimizer of $f_1$ over all of $W$.

Inductive step: assume the statement holds for $T-1$ rounds. This is our inductive hypothesis:

$$\sum_{t=1}^{T-1} f_t(w_{t+1}) \leq \sum_{t=1}^{T-1} f_t(U)$$

We want to show that the statement also holds for $T$ rounds.

$$\text{Step 1: Add a fixed term to both sides}$$

$$\sum_{t=1}^{T-1} f_t(w_{t+1}) + f_T(w_{T+1}) \leq \sum_{t=1}^{T-1} f_t(U) + f_T(w_{T+1})$$

$$\text{Step 2: Merge the LHS into one summation}$$

$$\sum_{t=1}^{T} f_t(w_{t+1}) \leq \sum_{t=1}^{T-1} f_t(U) + f_T(w_{T+1})$$

$$\text{Step 3: The inequality above holds for every $U \in W$; since $w_{T+1} \in W$, instantiate it with $U = w_{T+1}$}$$

$$\sum_{t=1}^{T} f_t(w_{t+1}) \leq \sum_{t=1}^{T-1} f_t(w_{T+1}) + f_T(w_{T+1})$$

$$\text{Step 4: Merge the RHS into one summation}$$

$$\sum_{t=1}^{T} f_t(w_{t+1}) \leq \sum_{t=1}^{T} f_t(w_{T+1})$$

By our earlier definition of $w_{T+1}$, the decision chosen by FTL, $w_{T+1}$ is the minimizer of the cumulative loss over rounds $1$ to $T$. Thus, any competitor $U$ has cumulative loss at least as large as that of $w_{T+1}$. We state the final step of the proof using this property:

$$\sum_{t=1}^{T} f_t(w_{t+1}) \leq \sum_{t=1}^{T} f_t(U) \quad \forall \text{ } U \in W$$

Good cases for FTL

  1. $f_1 = f_2 = \dots = f_T \implies R_T^{FTL} = O(1)$
  2. Leader does not switch too often: $\sum_{t=1}^T \mathbb{1} (w_t \neq w_{t+1}) \leq k \implies R_T^{FTL} = O(k)$
  3. Online Quadratic Optimization: OCO + adversary picks quadratic loss function $f_t(w) = \frac{1}{2} ||w - z_t||_2^2$.

Theorem for Online Quadratic Optimization: Assume $||z_t||_2 \leq L$ for all $t$; then $R_T^{FTL} = O(L^2 \log T)$.

Proof: Recall that we define $w_{t+1} = argmin_{w \in W} \sum_{i=1}^t f_i(w)$. In this case, we have that $w_{t + 1} = argmin_{w \in \mathbb{R}^d} \frac{1}{2} \sum_{i=1}^t ||w - z_i||_2^2$.

We solve the optimization problem using the first derivative test and find that

$$w_{t + 1} = \frac{1}{t}\sum_{i=1}^t z_i$$

It follows from this definition that

$$w_t = \frac{1}{t-1}\sum_{i=1}^{t-1} z_i$$

Now we write $w_{t+1}$ in terms of $w_t$

$$w_{t+1} = \frac{1}{t}\sum_{i = 1}^{t - 1} z_i + \frac{1}{t}z_t$$

$$\implies w_{t+1} = \frac{t-1}{t}\frac{1}{t-1} \sum_{i=1}^{t-1} z_i + \frac{1}{t}z_t$$

$$\implies w_{t+1} = \frac{t-1}{t} w_t + \frac{1}{t}z_t$$

Therefore, we can say that

$$w_{t+1} - z_t = (1 - \frac{1}{t})(w_t - z_t)$$

We will use this result to bound the regret term $R_T^{FTL}$. From the BTL Lemma, we have that

$$R_T^{FTL} \leq \sum_{t=1}^T (f_t(w_t) - f_t(w_{t+1}))$$

We can apply the definition of $f_t$.

$$ \implies R_T^{FTL} \leq \frac{1}{2} \sum_{t=1}^T \left(||w_t - z_t||_2^2 - ||w_{t+1} - z_t||_2^2\right) $$

Let's substitute for $w_{t+1} - z_t$ using our previous result.

$$ \implies R_T^{FTL} \leq \frac{1}{2} \sum_{t=1}^T \left(||w_t - z_t||_2^2 - ||(1 - \frac{1}{t})(w_t - z_t)||_2^2\right) $$

$$ \implies R_T^{FTL} \leq \frac{1}{2} \sum_{t=1}^T \left(||w_t - z_t||_2^2 - (1 - \frac{1}{t})^2 ||w_t - z_t||_2^2\right) $$

$$ \implies R_T^{FTL} \leq \sum_{t=1}^T ||w_t - z_t||_2^2 \cdot \frac{1}{2}\left(1 - (1 - \frac{1}{t})^2\right) $$

Let's consider the two terms inside the sum separately.

Let $A = ||w_t - z_t||_2^2$. To bound this term, we rely on the elementary inequality $||a - b||_2^2 \leq 2||a||_2^2 + 2||b||_2^2$, the initial assumption that $||z_t||_2 \leq L$, and the fact that each $w_t$ is an average of $z_i$'s, so $||w_t||_2 \leq L$ by the triangle inequality:

$$ A = ||w_t - z_t||_2^2 \leq 2||w_t||_2^2 + 2||z_t||_2^2 \leq 2L^2 + 2L^2 = 4L^2 $$

Let $B = \frac{1}{2}(1 - (1 - \frac{1}{t})^2)$.

$$ B = \frac{1}{2}(1 - (1 - \frac{1}{t})^2) = \frac{1}{2}(1 - (1 - \frac{2}{t} + \frac{1}{t^2})) = \frac{1}{2}(\frac{2}{t} - \frac{1}{t^2}) \leq \frac{1}{t} $$

Now, combining our results, we see that

$$ R_T^{FTL} \leq \sum_{t=1}^T \frac{4L^2}{t} = 4L^2 \sum_{t=1}^T \frac{1}{t} \leq 4L^2 (\log T + 1) = O(L^2 \log T) $$
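
As a quick numerical sanity check of the theorem (with synthetic targets $z_t$ that are an assumption, not part of the lecture), the sketch below runs FTL using the running-mean update derived above and compares the measured regret to the $4L^2(\log T + 1)$ bound.

```python
import numpy as np

# Sanity check of the O(L^2 log T) bound on synthetic quadratic losses.
rng = np.random.default_rng(1)
T, d, L = 5000, 3, 1.0
zs = rng.normal(size=(T, d))
zs *= np.minimum(1.0, L / np.linalg.norm(zs, axis=1, keepdims=True))  # enforce ||z_t||_2 <= L

loss = lambda w, z: 0.5 * np.sum((w - z) ** 2)

w, ftl_loss = np.zeros(d), 0.0          # take w_1 = 0 (the lecture leaves w_1 unspecified)
for t, z in enumerate(zs, start=1):
    ftl_loss += loss(w, z)
    w = (1 - 1 / t) * w + z / t         # w_{t+1} = (1 - 1/t) w_t + z_t / t, i.e. the running mean

w_star = zs.mean(axis=0)                # offline minimizer of the cumulative quadratic loss
regret = ftl_loss - sum(loss(w_star, z) for z in zs)
print(f"regret = {regret:.2f}, bound 4*L^2*(log T + 1) = {4 * L**2 * (np.log(T) + 1):.2f}")
```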

Bad Cases for FTL

We do not achieve the same bound as we did for the P-OGD algorithm if the loss sequence makes the leader unstable. Consider the following example. Let $W = [-1, 1]$, $f_1(w) = \frac{1}{2}w$, and for all $t \in \lbrace 2, \dots, T \rbrace$,

$$f_t(w) = \begin{cases} -w & \text{t even} \\ w & \text{t odd} \end{cases}$$

Observe that

$$\sum_{i = 1}^t f_i(w) = \begin{cases} -\frac{1}{2}w & \text{t even} \\ \frac{1}{2}w & \text{t odd} \end{cases}$$

because $f_1(w) = \frac{1}{2}w$ and the remaining terms cancel in consecutive pairs: each even-round term $-w$ is cancelled by the following odd-round term $+w$.

Recall $w_{t + 1} = argmin_{w \in W} \sum_{i = 1}^t f_i(w)$. Therefore,

$$w_{t + 1} = \begin{cases} 1 & \text{t even} \\ -1 & \text{t odd} \end{cases}$$

This is because, on the domain $[-1, 1]$, $w = 1$ minimizes $-\frac{1}{2}w$ and $w = -1$ minimizes $\frac{1}{2}w$. We can also see that

$$f_{t+1}(w_{t + 1}) = \begin{cases} 1 & \text{t even} \\ 1 & \text{t odd} \end{cases}$$

by the definition of $f_t$.

Now we compute the regret. Every round from $t = 2$ onward incurs loss exactly $1$, so $\sum_{t=1}^T f_t(w_t) \approx T$, while $\min_{w \in W} \sum_{t=1}^T f_t(w) = -\frac{1}{2}$. Therefore, $R_T^{FTL} \approx T + \frac{1}{2}$, which is linear in $T$: no matter how many rounds we run, the algorithm does not converge to the best decision.
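
The following short simulation of this adversarial sequence makes the linear regret visible; the initial decision $w_1 = 0$ and the tie-breaking rule are assumptions that the notes leave unspecified.

```python
# Simulate FTL on W = [-1, 1] with f_t(w) = c_t * w, where c_1 = 1/2 and c_t alternates -1, +1, ...
T = 1000
coeffs = [0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, T + 1)]

w, ftl_loss, cum = 0.0, 0.0, 0.0
for c in coeffs:
    ftl_loss += c * w                 # suffer f_t(w_t)
    cum += c                          # coefficient of the cumulative loss (cum * w) so far
    w = -1.0 if cum > 0 else 1.0      # FTL: minimize cum * w over [-1, 1] -> an endpoint

best = min(sum(coeffs) * u for u in (-1.0, 1.0))   # best fixed decision (linear, so an endpoint)
print(f"FTL loss = {ftl_loss:.1f}, best fixed loss = {best:.1f}, regret = {ftl_loss - best:.1f}")
# Expected output: FTL loss = 999.0, best fixed loss = -0.5, i.e. regret growing linearly in T.
```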

Foreshadowing

In the next lecture, we will demonstrate a way to handle this bad case by "stabilizing" the objective that defines the leader. We will call this new algorithm Follow The Regularized Leader (FTRL): it chooses each next decision as $w_{t + 1} = argmin_{w \in W} \left( \sum_{i=1}^t f_i(w) + \eta R(w) \right)$, where $R(w)$ is the regularization term.
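
As a preview, here is a minimal FTRL sketch using the squared-Euclidean regularizer $R(w) = \frac{1}{2}\|w\|_2^2$; the choice of regularizer and the use of a generic solver are illustrative assumptions, and the next lecture develops FTRL properly.

```python
import numpy as np
from scipy.optimize import minimize

def ftrl_step(past_losses, eta, w0, bounds=None):
    """FTRL: w_{t+1} = argmin_{w in W} sum_{i<=t} f_i(w) + eta * R(w), here with R(w) = 0.5*||w||_2^2."""
    objective = lambda w: sum(f(w) for f in past_losses) + eta * 0.5 * np.dot(w, w)
    return minimize(objective, w0, bounds=bounds).x
```

On the alternating linear losses above, the regularization term penalizes large decisions, so for a sufficiently large $\eta$ the iterates stay strictly inside $[-1, 1]$ instead of jumping between the endpoints.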

Homework: Dynamic Regret

(Graded) In class, we learned how to obtain regret guarantees for convex functions using P-OGD and FTL. Note that in the regret, the competitor $w^*$ is fixed. In this problem, we consider a generalized notion of regret, the dynamic regret, which is defined as

$$ R_T:=\sum_{t=1}^T f_t (w_t)-\sum_{t=1}^T{f_t(w_t^*)} $$

where $w_t^*\in W$ is a sequence of competitors and $$\sum_{t=2}^T ||w_t^*-w_{t-1}^*||_2\leq P_T.$$ For convenience we assume $0\in W$. Can you get the regret guarantee in terms of $G,D,T$, and $P_T$? What is the optimal $\eta$ in this scenario?

Hints: 1) you can start from

$$ R_T \leq \sum_{t=1}^T \frac{\eta G^2}{2} + \sum_{t=1}^T \frac{||w_t - w_t^*||^2 - ||w_{t+1} - w_t^{*}||^2}{2\eta}, $$

expanding the squared norm, and consider how to use the definitions of $D$ and $P_T$ to finish the telescoping sum. 2) When tuning $\eta$ in the final step, consider the upper bound as a function of $\eta$, and then try to find the optimal configuration.
