CS7545_Sp23_Lecture_15: Online Learning

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Guanghui Wang

Notes for Lecture 15: Online Learning

February 28, 2023

Scribes: John Hill, Victor Coleman

This is the introductory lecture on Online Learning. We begin with a recap of our previous topic, statistical learning theory, then introduce online learning. We discuss several common examples and conclude by discussing Online Convex Optimization.

Online-Learning (OL)

Recap: Statistical Learning Theory

Goal: Minimize the generalization error:

$$ \min_{h \in \mathcal{H}} L(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}} [\ell (h(x), y)]$$

In the real world, we rarely know the true distribution that our data comes from. For this reason, we have to take a sample,

$$S = \left\{ (x_i , y_i) \right\}_{i=1}^m \sim \mathcal{D}^m$$

and minimize the empirical risk:

$$\widehat{L}(h) = \frac{1}{m} \sum_{i=1}^m \ell (h(x_i), y_i).$$

In class, we previously showed the following generalization bound for finite hypothesis classes.

Theorem:

With probability at least $1-\delta$, for all $h \in \mathcal{H}$,

$$L(h) - \widehat{L}(h) \leq \sqrt{\frac{\log \frac{1}{\delta} + \log |\mathcal{H}| }{2m}}$$
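As a quick sanity check on the rate (the numbers are chosen purely for illustration), with $|\mathcal{H}| = 1000$, $\delta = 0.05$, and $m = 10000$ samples, the bound gives

$$L(h) - \widehat{L}(h) \leq \sqrt{\frac{\log 20 + \log 1000}{20000}} \approx 0.022,$$

and since the dependence on $m$ is $O(1/\sqrt{m})$, quadrupling the sample size only halves the bound.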

Limitations of the Statistical Learning Theory Framework

  1. Relies on the iid assumption

    • The assumption that our training set is sampled iid from an unchanging underlying distribution is often not true in practice.

    • Counterexamples:

      • Distribution shift (e.g. recommendations - people's tastes change over time)
      • Time series data (The past influences the present)
      • Adversarial environments (e.g. spam and fraud detection - adversarial actors evolve their behavior to bypass these systems)
  2. Computationally inefficient (Offline)

    • Real world:
      • Data comes sequentially
    • As we get new data, we would like to update our model incrementally. In the offline learning framework, if we want to update our model we need to retrain it from scratch using all of our data.
  3. Memory inefficient

    • Since we need to retrain any new model from scratch, we are required to keep all of our previous data stored in memory.

Framework: Online Learning

For t = 1, 2, ..., T do:
    Learner picks a decision $w_t$ from a decision set $\mathcal{W}$
    Simultaneously, the environment chooses a loss function $f_t : \mathcal{W} \rightarrow \mathbb{R}$
    Learner then suffers loss $f_t(w_t)$ and updates its decision
End For

Goal: Minimize the cumulative loss

$$\min \sum_{t=1}^T f_t(w_t)$$
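A minimal sketch of this protocol in Python (the `learner` and `environment` objects and their methods are hypothetical placeholders, not interfaces defined in lecture):

```python
def online_learning(learner, environment, T):
    """Run the generic online learning protocol for T rounds.

    `learner` and `environment` are hypothetical objects exposing the
    methods used below; this is only an illustration of the protocol.
    """
    cumulative_loss = 0.0
    for t in range(T):
        w_t = learner.predict()              # learner picks a decision w_t from the set W
        f_t = environment.loss_function(t)   # environment simultaneously picks f_t : W -> R
        loss = f_t(w_t)                      # learner suffers the loss f_t(w_t)
        learner.update(f_t, w_t)             # ...and updates its decision using the feedback
        cumulative_loss += loss
    return cumulative_loss                   # the quantity we want to keep small
```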

Examples

  1. Online Linear Classification:

    • In each round $t$:
      • Output a classifier $w_t \in \mathbb{R}^d$
      • Observe $(x_t, y_t)$, $x_t \in \mathbb{R}^d$, $y_t \in \{ +1, -1 \}$
      • Suffer loss $\ell(w_t; (x_t, y_t))$ (e.g., square loss: $f_t(w_t) = (w_t^\top x_t - y_t)^2$)
    • Note that while the loss $\ell$ itself is the same across rounds, its parameterization (i.e., the observed pair $(x_t, y_t)$) changes. Consequently, the loss function for a particular round, $f_t$, changes over time since we observe new data in each time step.
  2. Prediction with Expert Advice:

    • We have a finite hypothesis set $\mathcal{H} = \{ h^{(1)}, h^{(2)}, ..., h^{(N)} \}$.
    • In each round $t$:
      • Pick $h_t \in \mathcal{H}$
      • Observe loss vector $p_t \in [0, 1]^N$
        • Notice that we only suffer the loss of the chosen expert, $f_t(h_t)$ (the entry of $p_t$ corresponding to $h_t$), but we get to see the loss values of all $N$ experts in each time step.
    • Think of the set $\mathcal{H}$ as a set of $N$ experts. During each time step we get to see the loss values for all of our experts.
    • The experts themselves do not change between rounds; instead, in each round we update the strategy (policy) used to select among them.
    • We can also maintain a distribution over our experts instead of picking just one.
    • Let $w_t \in \Delta^N$ (the probability simplex) be the distribution over experts. We update this distribution over time according to the losses we observe and sample our chosen expert $h_t \sim w_t$ in each round (see the sketch after these examples).
  3. Video Recommendation:

    • Let $\mathcal{M} = \{ 1,2, ..., k \}$ be a finite catalogue of videos.
    • In each round $t$:
      • We pick $i_t \in \mathcal{M}$
      • Suffer loss $f_t(i_t)$, where $f_t : \mathcal{M} \rightarrow \mathbb{R}$
    • In this example, we only get feedback for the video recommendation that we actually made. This differs from Example 2 (Prediction with Expert Advice), where we got feedback for every potential decision in each time step.
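To make Example 2 concrete, here is a minimal Python sketch of the expert-advice protocol, as referenced above. Maintaining a distribution $w_t$ on the simplex and sampling $h_t \sim w_t$ follows the description in the example; the multiplicative update below is only one illustrative way to shift probability mass toward low-loss experts, not an algorithm we have analyzed yet.

```python
import numpy as np

def experts_protocol(losses, eta=0.1, rng=None):
    """Prediction with expert advice under full-information feedback.

    losses: array of shape (T, N); losses[t] is the loss vector p_t in [0, 1]^N.
    eta:    step size for the illustrative multiplicative update below.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    T, N = losses.shape
    w = np.ones(N) / N                  # start from the uniform distribution on the simplex
    total_loss = 0.0
    for t in range(T):
        i_t = rng.choice(N, p=w)        # sample an expert h_t ~ w_t
        p_t = losses[t]                 # observe the FULL loss vector (full information)
        total_loss += p_t[i_t]          # ...but only suffer the chosen expert's loss
        w = w * np.exp(-eta * p_t)      # illustrative update: downweight high-loss experts
        w = w / w.sum()                 # renormalize back onto the simplex
    return total_loss
```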

Types of Online Learning

  1. Feedback model: Bandit / Full-information Online Learning

    • Bandit: We only observe the loss of the chosen decision (e.g., Video Recommendation).
    • Full-information: We also observe the losses of all other decisions (e.g., Prediction with Expert Advice); a small sketch contrasting the two feedback models appears after this list.
  2. Adversary: iid / Oblivious / Non-oblivious Adversary

    • iid: Adversarial distribution is fixed in advance and we sample data iid from this distribution.
    • Oblivious: Adversary can change their plan over time, but they do not have access to your past decisions.
    • Non-oblivious: Adversary can change their plan over time AND they have access to your past decisions.
  3. Specific Problems:

    • Online Convex Optimization
    • Prediction with Expert Advice
    • Multi-armed Bandits (MAB), Linear Bandits
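The feedback-model distinction above is simply a matter of what the environment reveals after each decision; a hypothetical interface might look like the following (function names are illustrative only):

```python
def full_information_feedback(loss_vector, chosen_index):
    """Full information: the learner observes the loss of every decision
    (as in Prediction with Expert Advice)."""
    return loss_vector                  # the entire loss vector is revealed

def bandit_feedback(loss_vector, chosen_index):
    """Bandit: the learner observes only the loss of the decision it made
    (as in Video Recommendation)."""
    return loss_vector[chosen_index]    # a single scalar is revealed
```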

Online Convex Optimization

  • Special class of full-information online learning, in which the following assumption is made:

$\quad$ Assumption: The decision set $\mathcal{W} \subseteq \mathbb{R}^d$ is a convex set, and $\forall t \in [T]$, $f_t : \mathcal{W} \rightarrow \mathbb{R}$ is a convex function.

(Recall) First Order Condition of Convexity:

$\quad$ For differentiable function $f: \mathcal{W} \rightarrow \mathbb{R}$, $f$ is convex if and only if

$$\forall x, y \in \mathcal{W}, ~ f(y) \geq f(x) + \nabla f(x)^\top (y-x).$$

The important takeaway from this theorem is that a convex function admits a linear lower bound (its tangent approximation) at every point in its domain.
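For example, for $f(w) = w^2$ (so $\nabla f(w) = 2w$), the condition reads

$$y^2 \geq x^2 + 2x(y - x) \iff (y-x)^2 \geq 0,$$

which indeed holds for all $x, y \in \mathbb{R}$.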

  • Why convexity?

    1. Many real-world problems are convex.
    2. Convexity defines a tractable case for optimization (we cannot solve optimization efficiently in general).
    3. We can use convexity as a tool for understanding why algorithms work. We often use assumptions like local convexity to understand the non-convex case.
  • Performance Measure: Regret

    • The regret of algorithm $\mathcal{A}$ after $T$ iterations is defined as: $$\text{Reg}_T ^{\mathcal{A}} := {\sum_{t=1}^T f_t(w_t)} - \min_{w \in \mathcal{W}} \sum_{t=1}^T f_t(w)$$
    • Regret is the difference between our cumulative loss and the cumulative loss of the best fixed decision in hindsight (computed directly in the sketch at the end of this section).
    • It is an extension of excess risk: $L(h) - \min_{h^* \in \mathcal{H}} L(h^*)$
    • It adds a benchmark to cumulative loss:
      • We want an algorithm, $\mathcal{A}$, such that our regret is sub-linear: $\text{Reg}_T^{\mathcal{A}} = o(T)$
      • e.g. $O(\sqrt{T})$, $O(\log T)$, etc.
    • With sub-linear regret we have: $$\textbf{Hannan Consistency:} \quad \lim_{T \rightarrow \infty} \frac{1}{T} \text{Reg}_{T}^{\mathcal{A}} = 0$$
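As a concrete illustration of the definition (not an algorithm from lecture), the sketch below computes the regret of an arbitrary played sequence for the one-dimensional quadratic losses $f_t(w) = (w - z_t)^2$, for which the best fixed decision in hindsight is simply the mean of the $z_t$:

```python
import numpy as np

def regret(decisions, targets):
    """Regret of a played sequence under the losses f_t(w) = (w - z_t)^2.

    decisions: the w_t actually played by some algorithm A.
    targets:   the z_t chosen by the environment.
    """
    decisions, targets = np.asarray(decisions), np.asarray(targets)
    cumulative_loss = np.sum((decisions - targets) ** 2)   # sum_t f_t(w_t)
    best_fixed = np.mean(targets)                           # argmin_w sum_t f_t(w)
    benchmark = np.sum((best_fixed - targets) ** 2)         # min_w sum_t f_t(w)
    return cumulative_loss - benchmark                      # Reg_T^A
```

Hannan consistency then asks that `regret(decisions, targets) / T` tend to $0$ as $T$ grows, which is exactly the sub-linear regret requirement above.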