CS7545_Sp23_Lecture_23: Boosting for Equilibrium - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Guanghui Wang

Notes for Lecture 23

April 11, 2023

Scribes: Shikun Liu, Trenton Wong

Nash Equilibrium

Consider a zero-sum game with matrix $M\in\mathbb{R}^{n\times m}$, and apply the online learning strategies from Lecture 22:

Version 1:

For $t=1,...,T:$

  1. Row player uses Algorithm 1 to choose $p_t\in\Delta_n$
  2. Column player uses Algorithm 2 to choose $q_t\in\Delta_m$
  3. Row player observes the loss vector $\ell_t=Mq_t$ and updates Algorithm 1 with it (e.g. via EWA)
  4. Column player observes the loss vector $h_t=-M^\top p_t$ and updates Algorithm 2 with it (e.g. via EWA)

Version 2:

For $t=1,...,T:$

  1. Row player uses Algorithm 1 to choose $p_t\in\Delta_n$
  2. Column player chooses $q_t=\arg\max_{q\in\Delta_m}p_t^\top Mq$
  3. Row player observes the loss vector $\ell_t=Mq_t$ and updates Algorithm 1 with it (e.g. via EWA)

As mentioned in the previous lecture, this second method guarantees that the average strategies $$\bar p_T=\frac{1}{T}\sum_{t=1}^T p_t \quad\text{and}\quad \bar q_T=\frac{1}{T}\sum_{t=1}^T q_t$$ form an $\epsilon$-approximate Nash equilibrium with $\epsilon=\frac{\text{Regret}_T}{T}$. With EWA, $\epsilon=\frac{\text{Regret}_T}{T}=O\left(\sqrt{\frac{\log n}{T}}\right)$.

Note: because $q_t$ is an argmax of the linear function $q\mapsto p_t^\top Mq$ over the simplex, it can always be taken to be a basis vector in $\{\vec{e}_1,\dots,\vec{e}_m\}$: placing all probability mass on the largest entry of $M^\top p_t$ maximizes the expression.
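Below is a minimal Python sketch of the Version 2 self-play loop (the row player runs EWA, the column player best-responds with an argmax). The learning rate $\eta=\sqrt{\log n/T}$ and the toy matching-pennies matrix are illustrative choices, not part of the notes; Version 1 would look the same with both players running EWA.

```python
import numpy as np

def ewa_vs_best_response(M, T, eta):
    """Version 2 self-play: the row player runs EWA to minimize p^T M q,
    the column player best-responds with an argmax (a basis vector)."""
    n, m = M.shape
    weights = np.ones(n)                 # unnormalized EWA weights for the row player
    p_bar, q_bar = np.zeros(n), np.zeros(m)
    for _ in range(T):
        p = weights / weights.sum()      # row player's mixed strategy p_t
        j = int(np.argmax(p @ M))        # best response q_t = e_j maximizing p_t^T M q
        q = np.zeros(m)
        q[j] = 1.0
        p_bar += p / T                   # accumulate the averages on the fly
        q_bar += q / T
        loss = M[:, j]                   # row player's loss vector l_t = M q_t
        weights = weights * np.exp(-eta * loss)   # exponential-weights update
    return p_bar, q_bar                  # averaged strategies: eps-approx. equilibrium

# Toy usage on matching pennies (game value 0, uniform equilibrium):
M = np.array([[1.0, -1.0], [-1.0, 1.0]])
T = 2000
p_bar, q_bar = ewa_vs_best_response(M, T, eta=np.sqrt(np.log(M.shape[0]) / T))
print(p_bar, q_bar)   # both approach (0.5, 0.5) as T grows
```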

WLA and SLA

Consider samples $x_i\in\mathcal{X}$, $y_i\in\{-1,1\}$, such that $S=\{(x_1,y_1),\dots,(x_n,y_n)\}$.

Weak Learning Assumption: For some $\gamma>0$ and a "weak" hypothesis class $\mathcal{H}$: for all $p\in\Delta_n$, there exists some $h\in\mathcal{H}$ such that $\Pr_{i\sim p}(h(x_i)=y_i)\geq\frac{1}{2}+\gamma$.

The probability can be rewritten as a weighted sum over the samples: each $i$ with $h(x_i)=y_i$ contributes $p_i$, and each $i$ with $h(x_i)\neq y_i$ contributes $0$. To get a similar quantity, multiply $h(x_i)$ and $y_i$: since both lie in $\{-1,1\}$, the product is $h(x_i)y_i=1$ when $h(x_i)=y_i$ and $h(x_i)y_i=-1$ otherwise. Weighting by $p_i$ and summing gives

$$\sum_{i\in[n]}p_ih(x_i)y_i$$

While the probability lies in $[0,1]$, this new sum lies in $[-1,1]$, and the two are linked by $\sum_{i\in[n]}p_ih(x_i)y_i=2\Pr_{i\sim p}(h(x_i)=y_i)-1$. Plugging the WLA into this identity yields an equivalent inequality:

$$\sum_{i\in[n]}p_ih(x_i)y_i\geq 2\gamma.$$
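As a quick sanity check of this equivalence, here is a short numerical verification; the random $\pm 1$ predictions, labels, and distribution are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
y = rng.choice([-1, 1], size=n)        # labels y_i in {-1, +1}
h = rng.choice([-1, 1], size=n)        # a hypothesis' predictions h(x_i)
p = rng.random(n); p /= p.sum()        # a distribution p in the simplex

weighted_accuracy = p[h == y].sum()    # Pr_{i ~ p}(h(x_i) = y_i)
correlation = np.sum(p * h * y)        # sum_i p_i h(x_i) y_i

# The two are related by correlation = 2 * weighted_accuracy - 1,
# so "accuracy >= 1/2 + gamma" is exactly "correlation >= 2*gamma".
print(np.isclose(correlation, 2 * weighted_accuracy - 1))  # True
```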

Now consider the Strong Learning Assumption: there exists a $q\in\Delta_{|\mathcal{H}|}$ such that for all $i\in[n]$, $\text{sign}(\sum_{h\in\mathcal{H}}q_hh(x_i))=y_i$. The same trick applies: since $y_i\in\{-1,1\}$, the sign condition holds exactly when $y_i\sum_{h\in\mathcal{H}}q_hh(x_i)>0$. This gives an inequality equivalent to the SLA:

$$\sum_{h\in\mathcal{H}}q_hh(x_i)y_i>0.$$

Now, let's play a game!

Consider $M\in \{-1, 1\}^{n\times |\mathcal{H}|}$, where we write $m = |\mathcal{H}|$ from this point on.

$$M_{ij} = h_j(x_i)y_i$$

This defines a zero-sum game: the row player chooses among the $n$ examples (or a distribution over them) and the column player chooses among the $m$ hypotheses (or a mixture of them), with payoff $M_{ij}=+1$ if $h_j$ classifies $x_i$ correctly and $-1$ otherwise. With this matrix, we can first restate the WLA and SLA using pure strategies:

$$ \begin{align*} &WLA: \forall p\in \Delta_n,\ \exists j\in [m],\ p^\top Me_j\geq 2\gamma,\\ &SLA: \exists q \in \Delta_{m},\ \forall i \in [n],\ e_i^\top Mq > 0. \end{align*} $$
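To make the construction concrete, here is a sketch of how one might build $M$ from a tiny dataset and a finite hypothesis class and evaluate the pure-strategy quantities above; the 1-D data and the threshold "stumps" are hypothetical placeholders, not from the lecture.

```python
import numpy as np

# Hypothetical tiny dataset (1-D features) with labels in {-1, +1}.
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([-1, -1, 1, 1, 1])

# A small finite "weak" hypothesis class: threshold classifiers (stumps).
thresholds = [-1.5, 0.0, 1.0]
hypotheses = [lambda z, t=t: np.where(z > t, 1, -1) for t in thresholds]

# Game matrix M with M[i, j] = h_j(x_i) * y_i, entries in {-1, +1}.
M = np.column_stack([h(x) * y for h in hypotheses])
n, m = M.shape

# Pure-strategy WLA check: for a given p, is some column j worth >= 2*gamma?
p = np.ones(n) / n
print((p @ M).max())        # max_j p^T M e_j

# Pure-strategy SLA check: for a given mixture q, is e_i^T M q > 0 for every i?
q = np.ones(m) / m
print((M @ q).min())        # min_i e_i^T M q
```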

Then, if we think from the min-max perspective, we can further transform the inequalities as follows:

$$ \begin{align*} &WLA: \min_{p\in\Delta_n} \max_{j\in[m]} p^\top Me_j \geq 2\gamma,\\ &SLA: \max_{q\in \Delta_{|\mathcal{H}|}}\min_{i\in [n]} e_i^\top Mq > 0. \end{align*} $$

The quantified forms above are equivalent to these min-max forms:

  • For the WLA: fix $p$ and consider the inner maximization over $j$. If some $j$ achieves $p^\top Me_j\geq 2\gamma$, then $\max_{j\in[m]} p^\top Me_j\geq 2\gamma$; since this holds for every $p$, the outer minimum over $p$ is also $\geq 2\gamma$.
  • For the SLA: the same logic applies, with $\min$ corresponding to $\forall$ and $\max$ corresponding to $\exists$.

Finally, the last step is to replace $e_j$ and $e_i$ with $q$ and $p$; the values are unchanged, because a linear function over the simplex attains its maximum (and minimum) at a vertex:

$$ \begin{align*} &WLA: \min_{p\in\Delta_n} \max_{q\in \Delta_m} p^\top Mq \geq 2\gamma,\\ &SLA: \max_{q\in \Delta_m}\min_{p\in \Delta_n} p^\top Mq > 0. \end{align*} $$

By the minimax theorem, the two values are equal, so WLA($\gamma$) implies SLA: $$\max_{q\in \Delta_m}\min_{p\in \Delta_n} p^\top Mq=\min_{p\in\Delta_n} \max_{q\in \Delta_m} p^\top Mq \geq 2\gamma > 0.$$
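The equality of the two values can also be checked numerically by solving each side as a linear program. The sketch below is an illustration added here (using SciPy's `linprog`), not anything from the lecture; it computes both values for a toy matrix.

```python
import numpy as np
from scipy.optimize import linprog

def game_value_minmax(M):
    """Row player's value min_{p in simplex} max_j p^T M e_j, as an LP:
    minimize v subject to (M^T p)_j <= v for all j, p in the simplex."""
    n, m = M.shape
    c = np.r_[np.zeros(n), 1.0]                    # variables are (p, v); minimize v
    A_ub = np.c_[M.T, -np.ones(m)]                 # M^T p - v <= 0, one row per column j
    b_ub = np.zeros(m)
    A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)   # sum_i p_i = 1
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]      # p >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun

# By the minimax theorem the max-min value equals the min-max value; the
# max-min value of M is minus the min-max value of -M^T (the other player's game).
M = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(game_value_minmax(M), -game_value_minmax(-M.T))   # both ~0 for matching pennies
```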

Connection to AdaBoost: AdaBoost is (almost) running the EWA-vs-argmax game-playing scheme above to find an $\epsilon$-approximate equilibrium of this game.

Simplified version of Adaboost:

Init $p_1 = [\frac{1}{n}, \cdots \frac{1}{n}]$

For $t = 1, \cdots, T:$

  1. Choose $h_t\in \mathcal{H}$ s.t. $$\sum_{i=1}^n p_t(i)h_t(x_i)y_i \geq 2\gamma$$
  2. Set $p_{t+1} \in \Delta_n$ as follows: $p_{t+1}(i) \propto p_t(i)\exp(-\eta h_t(x_i)y_i)$

Output: Ensemble of $h_1, \cdots, h_T$

The idea of the AdaBoost algorithm is to pick, at each round, any hypothesis beating the $2\gamma$ threshold (a "weak learner"), which is not necessarily the argmax solution. The exponential update then downweights the correctly predicted samples and upweights the misclassified ones.

The final prediction is a majority vote over $h_1,\dots,h_T$, which corresponds to playing the average strategy $\bar q_T$:

$$f_{h_1, \cdots, h_T}(x) = \text{sign}\left(\sum_{t=1}^T h_t(x)\right)$$
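Here is a minimal runnable sketch of the simplified AdaBoost above. The 1-D dataset, the stump hypothesis class, and the choice $\eta=0.5$ are illustrative assumptions; for simplicity, the weak-learner step just takes the stump with the highest weighted correlation, which certainly meets the $2\gamma$ threshold whenever the WLA holds.

```python
import numpy as np

def simplified_adaboost(x, y, hypotheses, T, eta):
    """Simplified AdaBoost from the notes: maintain a distribution p_t over
    the n samples, pick a weak hypothesis with weighted correlation >= 2*gamma
    (here: the best one, for simplicity), then exponentially reweight."""
    n = len(x)
    p = np.ones(n) / n                          # p_1 = uniform distribution
    chosen = []
    for _ in range(T):
        # Weighted correlations sum_i p_t(i) h(x_i) y_i for every h in the class.
        corrs = [np.sum(p * h(x) * y) for h in hypotheses]
        h_t = hypotheses[int(np.argmax(corrs))]
        chosen.append(h_t)
        # Downweight samples h_t gets right, upweight the ones it gets wrong.
        p = p * np.exp(-eta * h_t(x) * y)
        p = p / p.sum()                         # renormalize to stay in the simplex
    return chosen

def predict(chosen, x):
    """Final prediction: majority vote sign(sum_t h_t(x))."""
    return np.sign(sum(h(x) for h in chosen))

# Usage with hypothetical 1-D data and threshold stumps:
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([-1, -1, 1, 1, 1])
stumps = [lambda z, t=t: np.where(z > t, 1, -1) for t in [-1.5, 0.0, 1.0]]
ensemble = simplified_adaboost(x, y, stumps, T=20, eta=0.5)
print(np.mean(predict(ensemble, x) == y))       # training accuracy of the vote
```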

How do we set $T$ so that AdaBoost returns a good answer?

Suppose $$\epsilon = c\sqrt{\frac{\log n}{T}}< 2\gamma,$$

then, solving for $T$, it suffices to take $$T = \Omega\left(\frac{\log n}{\gamma^2}\right).$$

How do we interpret zero training error in terms of the game?

For every distribution the row player could place on the data points, we wish the hypothesis side to respond with a model that still does well; conversely, each player wants an optimal response to the other's actions. Concretely, when $\epsilon<2\gamma$, the average strategy $\bar q_T$ satisfies $e_i^\top M\bar q_T\geq 2\gamma-\epsilon>0$ for every $i$, so the weighted majority of weak learners is correct on every training sample, which is exactly zero training error.
