CS7545_Sp23_Lecture_21: Minimax Theorem - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Guanghui Wang

Notes for Lecture 21

March 16, 2023

Scribes: Aditya Vikram, Darryl Cherian Jacob

Miscellaneous

Announcements

  • Vote in the Piazza poll on topics for the last "extra-curricular" leg of the course

Exam problem that got canceled:

Problem was to show that for two function classes $\mathcal{G}$ and $\mathcal{H}$, $\hat{\Re}_m(\mathcal{G}\cup\mathcal{H}) \leq \hat{\Re}_m(\mathcal{G}) + \hat{\Re}_m(\mathcal{H})$.

The original proof idea was the following (but it has a bug):

$$\hat{\Re}_m(\mathcal{G}\cup\mathcal{H}) = \mathbb{E} _\sigma \left[ \sup _{f \in \mathcal{G}\cup\mathcal{H}} \frac{1}{m}\sum _{i=1}^m \sigma_i f(x_i) \right] \leq \mathbb{E} _\sigma \left[ \sup _{g \in \mathcal{G}, h \in \mathcal{H}} \frac{1}{m}\sum _{i=1}^m \sigma_i \left(g(x_i) + h(x_i)\right) \right],$$

but the inequality $\sup_{x\in A\cup B}\lbrace x \rbrace \leq \sup_{a\in A} \lbrace a\rbrace + \sup_{b\in B} \lbrace b\rbrace$ used in the last step works only when both $\sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i} \sigma_i g(x_i)$ and $\sup_{h \in \mathcal{H}} \frac{1}{m}\sum_{i} \sigma_i h(x_i)$ are non-negative. The problem can be fixed by adding the assumption that the zero function belongs to both $\mathcal{G}$ and $\mathcal{H}$, which guarantees that both sups are non-negative.
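A sketch of the repaired argument under this assumption: if $0 \in \mathcal{G}$ and $0 \in \mathcal{H}$, then both suprema are at least $0$ for every realization of $\sigma$, so for any $f \in \mathcal{G}\cup\mathcal{H}$ (say $f \in \mathcal{G}$),

$$\frac{1}{m}\sum _{i=1}^m \sigma_i f(x_i) \;\leq\; \sup _{g \in \mathcal{G}} \frac{1}{m}\sum _{i=1}^m \sigma_i g(x_i) \;\leq\; \sup _{g \in \mathcal{G}} \frac{1}{m}\sum _{i=1}^m \sigma_i g(x_i) + \sup _{h \in \mathcal{H}} \frac{1}{m}\sum _{i=1}^m \sigma_i h(x_i),$$

and symmetrically when $f \in \mathcal{H}$. Taking the supremum over $f$ and then the expectation over $\sigma$ yields $\hat{\Re}_m(\mathcal{G}\cup\mathcal{H}) \leq \hat{\Re}_m(\mathcal{G}) + \hat{\Re}_m(\mathcal{H})$.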

Von Neumann's Minimax Theorem (continued)

Theorem (von Neumann's Minimax Theorem)

For any matrix $M\in \mathbb{R}^{n\times m}$:

$$\min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q = \max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q,$$

where $\Delta_n$ denotes the probability simplex over $n$ elements, i.e., the set of probability vectors in $\mathbb{R}^n$.

This is one version of the minimax theorem, stated here for probability simplices; more general versions relax the constraints on the domains as well as the convexity requirements (see Sion's minimax theorem).

How this fits into Rock-Paper-Scissors (RPS) from last lecture: $p$ and $q$ are the probability distributions chosen by Players $P_1$ and $P_2$, and $M$ is the game matrix as defined in the last lecture. Then $p^\top M q$ is the loss for $P_1$ and the gain for $P_2$. More analogies from the game are noted in the proof.
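As a concrete sanity check (not from the lecture), both sides of the theorem can be computed as linear programs. The sketch below assumes the usual RPS convention where $M_{ij}$ is $P_1$'s loss when $P_1$ plays action $i$ and $P_2$ plays action $j$:

```python
# Numerical check of the minimax theorem for an assumed RPS loss matrix.
# Both sides reduce to linear programs because a linear function over a
# simplex is optimized at a vertex.
import numpy as np
from scipy.optimize import linprog

M = np.array([[ 0.0,  1.0, -1.0],   # rock     vs (rock, paper, scissors)
              [-1.0,  0.0,  1.0],   # paper
              [ 1.0, -1.0,  0.0]])  # scissors
n, m = M.shape

# min_p max_q p^T M q: minimize v s.t. (M^T p)_j <= v for all j, p in Delta_n.
minmax = linprog(
    c=np.r_[np.zeros(n), 1.0],                       # objective: v
    A_ub=np.c_[M.T, -np.ones(m)], b_ub=np.zeros(m),  # M^T p - v*1 <= 0
    A_eq=np.r_[np.ones(n), 0.0].reshape(1, -1), b_eq=[1.0],  # sum(p) = 1
    bounds=[(0, None)] * n + [(None, None)],
)

# max_q min_p p^T M q: maximize w s.t. (M q)_i >= w for all i, q in Delta_m.
maxmin = linprog(
    c=np.r_[np.zeros(m), -1.0],                      # objective: -w
    A_ub=np.c_[-M, np.ones(n)], b_ub=np.zeros(n),    # w*1 - M q <= 0
    A_eq=np.r_[np.ones(m), 0.0].reshape(1, -1), b_eq=[1.0],  # sum(q) = 1
    bounds=[(0, None)] * m + [(None, None)],
)

print("min-max value:", minmax.fun)    # expected: 0
print("max-min value:", -maxmin.fun)   # expected: 0
```

Both programs return $0$, attained by the uniform distributions, matching the intuition that playing uniformly at random is optimal in RPS.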

Proof

There are two statements we need to prove for the equality:

$$\min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q \leq \max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q; \tag{1}$$

$$\min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q \geq \max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q. \tag{2}$$

(2) is called Weak Duality and is the easier statement to show:

Let $$\hat{p} = \arg\min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q \quad\text{and}\quad \hat{q} = \arg\max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q.$$ In RPS, $\hat{p}$ is the best distribution $P_1$ could choose if they had to commit first and the other player could then respond optimally. Similarly, $\hat{q}$ is the best distribution $P_2$ could choose if they had to commit first. Now we have:

$$ \begin{align*} \min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q &= \max_{q \in \Delta_m} \hat{p}^\top M q && (\text{by def. of }\hat{p}; \text{ compactness ensures the min is achieved})\\ &\geq \hat{p}^\top M \hat{q} && (\because \hat{q} \text{ is one of the elements in the domain of the max})\\ &\geq \min_{p \in \Delta_n} p^\top M \hat{q} && (\text{same reason as above, for the min})\\ &= \max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q. && (\text{by def. of }\hat{q})\\ \end{align*} $$

Note: $\hat{p}$ and $\hat{q}$ as defined above form a Nash equilibrium pair, meaning that $\forall p \in \Delta_n, q \in \Delta_m$:

$$\underbrace{\hat{p}^\top M q} _{P_2 \text{ can only lose if they change } q} \quad\leq\quad \underbrace{\hat{p}^\top M \hat{q}} _{\text{Equilibrium game value}} \quad\leq\quad \underbrace{p^\top M \hat{q}} _{P_1 \text{ can only lose if they change } p}.$$

Now to prove (1), we will simulate an online learning setting and use regret bounds to show the result. Assume WLOG that $M \in [-1, 1]^{n \times m}$ (the matrix can always be rescaled to satisfy this). Imagine repeated play for $T$ rounds:

For $t$ in $1,\cdots, T$:

  • $P_1$ commits to $p_t \in \Delta_n$ and $P_2$ commits to $q_t \in \Delta_m$
  • $P_1$ receives a loss vector $l_t = Mq_t$
  • $P_2$ receives a loss vector $h_t = -M^\top p_t$ (the minus sign turns $P_2$'s gain into a loss, and we keep column-vector notation for loss vectors)

Both players update their probability distributions using an online learning algorithm $\mathcal{A}$ with a sublinear regret bound, such as EWA. The only property we need is that the regret of $\mathcal{A}$ is sublinear in $T$. We look at the average loss of $P_1$:

$$V_T = \frac{1}{T} \sum_{t=1}^T p_t^\top M q_t = \frac{1}{T} \sum_{t=1}^T p_t^\top l_t = \frac{1}{T} \min_{p \in \Delta_n} \sum_{t=1}^T p^\top l_t + \frac{1}{T}\text{Regret}_T^\mathcal{A},$$

where the last equality comes from the definition of regret:

$$\text{Regret}_T^\mathcal{A} = \sum _{t=1}^T p_t^\top l_t - \min _{p \in \Delta_n} \sum _{t=1}^T p^\top l_t.$$

Define $\epsilon_T^p := \frac{1}{T}\text{Regret}_T^\mathcal{A}$. Under our assumption of a sublinear-regret algorithm, $\epsilon_T^p \rightarrow 0$ as $T \rightarrow \infty$.
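For example, if $\mathcal{A}$ is EWA with a suitably tuned step size (and losses bounded in $[-1,1]$, as assumed above), the standard regret bound gives

$$\text{Regret}_T^{\text{EWA}} = O\left(\sqrt{T \log n}\right), \qquad\text{so}\qquad \epsilon_T^p = O\left(\sqrt{\frac{\log n}{T}}\right).$$

Using the definition of $\epsilon_T^p$, the equation above for $V_T$ becomes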

$$ \begin{align*} V_T &= \frac{1}{T} \min_{p \in \Delta_n} \sum_{t=1}^T p^\top M q_t + \epsilon_T^p\quad(\text{Substituting }l_t = Mq_t)\\ &= \min_{p \in \Delta_n} p^\top M \left(\frac{1}{T} \sum_{t=1}^T q_t\right) + \epsilon_T^p\\ &\leq \max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q + \epsilon_T^p. \end{align*} $$

The last inequality holds because

$$\overline{q}_T = \frac{1}{T} \sum _{t=1}^T q_t \in \Delta_m, $$

so the max over all $q\in \Delta_m$ can only be larger. Similarly, we can write the corresponding inequalities for $P_2$; because $P_2$'s loss carries a negative sign, all the inequalities are reversed, and we get

$$V_T = \frac{1}{T} \sum_{t=1}^T p_t^\top M q_t \geq \cdots \geq \max_{q \in \Delta_m} \left(\frac{1}{T} \sum_{t=1}^T p_t\right)^\top M q - \epsilon_T^q \geq \min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q - \epsilon_T^q.$$

Combining these two bounds, we get

$$\min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q - \epsilon_T^q \quad\leq\quad \frac{1}{T} \sum_{t=1}^T p_t^\top M q_t \quad\leq\quad\max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q + \epsilon_T^p.$$

As $T \rightarrow \infty$, both $\epsilon_T^p$ and $\epsilon_T^q$ vanish, so the two bounds squeeze together and we obtain $\min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q \leq \max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q$, which is exactly (1). This completes the proof, and the average loss is sandwiched ever closer to the equilibrium value.

It has been shown that the sequences $p_t$ and $q_t$ may oscillate and fail to converge to the Nash equilibrium; the average iterates, however, turn out to be approximate equilibria.

The argument above in fact gives a constructive algorithm for generating approximate equilibria via the average iterates $\overline{p}_T$ and $\overline{q}_T$, because $\forall p \in \Delta_n, q \in \Delta_m$, we have:

$$\overline{p}_T^\top M q - \epsilon_T^q \quad\leq\quad \frac{1}{T} \sum _{t=1}^T p_t^\top M q_t \quad\leq\quad p^\top M \overline{q}_T + \epsilon_T^p.$$
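As a minimal sketch of this constructive procedure (the matrix, horizon, and step size below are illustrative choices, not from the lecture), one can run both players with exponential-weights updates and inspect the average iterates:

```python
# Rough simulation of the repeated play above: both players run EWA on the
# loss vectors l_t = M q_t and h_t = -M^T p_t, and we check that the average
# iterates form an approximate equilibrium.
import numpy as np

rng = np.random.default_rng(0)
M = rng.uniform(-1.0, 1.0, size=(4, 5))   # any matrix with entries in [-1, 1]
n, m = M.shape
T = 20000
eta = np.sqrt(np.log(max(n, m)) / T)      # illustrative EWA step size

cum_loss_p = np.zeros(n)                  # cumulative losses seen by P_1
cum_loss_q = np.zeros(m)                  # cumulative losses seen by P_2
p_bar, q_bar = np.zeros(n), np.zeros(m)   # running sums of the iterates

for t in range(T):
    # EWA distributions (shifted by the min for numerical stability)
    p = np.exp(-eta * (cum_loss_p - cum_loss_p.min())); p /= p.sum()
    q = np.exp(-eta * (cum_loss_q - cum_loss_q.min())); q /= q.sum()
    p_bar += p
    q_bar += q
    cum_loss_p += M @ q        # l_t = M q_t
    cum_loss_q += -M.T @ p     # h_t = -M^T p_t

p_bar /= T
q_bar /= T

# Exploitability of the average iterates: how much either player could gain
# by best-responding; it shrinks roughly like the epsilon_T terms above.
gap = (M.T @ p_bar).max() - (M @ q_bar).min()
print("value at averages:", p_bar @ M @ q_bar)
print("best-response gap:", gap)
```

The best-response gap upper-bounds how much either player could improve by deviating from $(\overline{p}_T, \overline{q}_T)$, and it vanishes as $T$ grows, which is exactly the approximate-equilibrium guarantee above.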
