Lecture_20 - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Jacob Abernethy

Notes for Lecture 20

March 16, 2023

Scribes: Sriram Yenamandra, Bo Yuan


  • Students who were not able to act as scribes might be asked to review and edit the lecture notes utilising tools such as ChatGPT.
  • In case you require extra office hours, please select your preferred time slot through the Piazza poll.
  • An exam guide with practice problems will be released next week.
  • The exam is scheduled for March 30th (Thursday) and will be held in-class. There will be a preparation session on March 28th (Tuesday).


Original Experts Problem

Suppose you have $n$ friends (referred to as "experts") who make predictions on each day $t$ where $t \in \{1, 2, \dots\}$. On the $t^{th}$ day, the $i^{th}$ expert makes a prediction ${z_i}^t \in \{0, 1\}$ and you make a prediction $\hat{y}^t \in \{0, 1\}$.

After you and the experts have made your predictions on day $t$, the environment reveals the correct outcome for that day, denoted by $y^t \in \{0, 1\}$. Performance is measured by the number of errors you make over time: $|\{t: \hat{y}^t \neq y^t\}|$.

Additionally, assume that one of the experts is always correct (referred to as the "perfect expert assumption"). Without prior knowledge of who the perfect expert is, what approach would you adopt to minimize the number of errors in your predictions?

We discuss the "Halving algorithm", which guarantees that the number of prediction errors is at most $\log_2{n}$.

Halving algorithm

The Halving algorithm starts by assuming all experts to be reliable and makes predictions based on a majority vote among these experts. On each day, the algorithm removes the experts who made incorrect predictions from the list of reliable experts and only considers the remaining experts for future predictions.

Let $C_t$ denote the set of experts that are considered reliable on day $t$.

  1. Initially, we consider all experts to be reliable: $C_1 = [n]$.
  2. On day $t$, predict by majority vote over the current set of reliable experts: $\hat{y}^t = \mathrm{round}\left(\frac{1}{|C_t|} \sum_{i\in C_t} {z_i}^t \right)$.
  3. Stop considering the experts who predicted incorrectly on day $t$: $C_{t + 1} \leftarrow C_{t} \setminus \{i: {z_i}^t \neq y^t\}$.
  4. Repeat steps 2 and 3 on each subsequent day.
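
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `halving`, the input layout (`expert_predictions[t][i]` for expert $i$ on day $t$), and the choice to break ties in favour of predicting 1 are all assumptions made for the example.

```python
import math


def halving(expert_predictions, outcomes):
    """Run the Halving algorithm on 0/1 predictions.

    expert_predictions[t][i] is expert i's prediction on day t (hypothetical layout),
    outcomes[t] is the true label on day t.
    Returns (predictions, mistakes).
    """
    n = len(expert_predictions[0])
    reliable = set(range(n))                 # C_1 = [n]
    predictions, mistakes = [], 0
    for z, y in zip(expert_predictions, outcomes):
        # Majority vote over the reliable experts (assumption: ties predict 1).
        votes_for_one = sum(z[i] for i in reliable)
        y_hat = 1 if 2 * votes_for_one >= len(reliable) else 0
        predictions.append(y_hat)
        mistakes += int(y_hat != y)
        # Drop every reliable expert that was wrong today.
        reliable = {i for i in reliable if z[i] == y}
    return predictions, mistakes


# Four experts; expert 2 (0-indexed) is the perfect expert.
expert_predictions = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
outcomes = [0, 0, 1]
predictions, mistakes = halving(expert_predictions, outcomes)
assert mistakes <= math.log2(4)
```

In this toy run the algorithm errs on the first two days, which matches the $\log_2 4 = 2$ worst case, and is correct afterwards because only the perfect expert remains.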

The above algorithm makes at most $\log_2{n}$ prediction errors. More precisely, under the perfect expert assumption the following two claims hold:


  1. $|C_t| \geq 1$ for every day $t$.
  2. $|\{t: \hat{y}^t \neq y^t \}| \leq \log_2{n}$.


Proof:

  1. This follows from the perfect expert assumption: at least one expert predicts correctly on every day, so that expert is never removed.
  2. Consider a day $t$ on which the algorithm mispredicts: $\hat{y}^t \neq y^t$. Since $\hat{y}^t$ is the majority vote, at least half of the experts in $C_t$ mispredicted, i.e. $|\{i \in C_t : z_i^t \neq y^t\}| \geq \frac{|C_t|}{2}$. All of these experts are dropped, so $|C_{t + 1}| \leq |C_{t}| - \frac{|C_{t}|}{2} = \frac{|C_{t}|}{2}$. On days without an error the set never grows, so after $k$ errors at most $\frac{n}{2^k}$ experts remain. By (1), at least one expert always remains, hence $1 \leq \frac{n}{2^k}$, i.e. $k \leq \log_2{n}$.

Game Theory

Von Neumann's Minimax Theorem

Zero-sum game

A two-player zero-sum game is specified by a matrix $M \in \mathbb{R}^{n \times m}$. Here, we have two players competing in a game by taking actions. The first player can select from a set of $n$ actions and the second player can select from a set of $m$ actions.

For $i \in [n]$ and $j \in [m]$, the entry $M_{ij}$ is the loss to player 1, and equally the reward to player 2, when player 1 plays $i$ and player 2 plays $j$.

Example: Rock-paper-scissors

Consider the example of the classic "Rock-Paper-Scissors" game. Here both players can select from a set of 3 actions {rock ($R$), paper ($P$), scissors ($S$)} in each round. This game would be specified using the following matrix:

$$\begin{bmatrix}0 & 1 & -1\\ -1 & 0 & 1\\ 1 & -1 & 0 \end{bmatrix}$$

Here, both the rows and the columns are ordered as $P \rightarrow S \rightarrow R$. For example, the entry in row 1, column 2 is $1$: player 1 plays paper, player 2 plays scissors, and player 1 loses.

Mixed strategy profile

A mixed strategy profile is a pair of distributions $\textbf{p}\in\Delta_n$ and $\textbf{q}\in \Delta_m$.

The expected loss to player 1 given $\textbf{p}$, $\textbf{q}$ is $\textbf{p}^T M \textbf{q}$ (or $\sum_{i, j} p_i q_j M_{ij}$).
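
As a small illustration of the expected-loss formula, the sketch below evaluates $\textbf{p}^T M \textbf{q}$ for the rock-paper-scissors matrix given earlier. The helper name `expected_loss` and the particular strategies are assumptions chosen for the example.

```python
def expected_loss(p, M, q):
    """Expected loss to player 1: sum over i, j of p_i * M_ij * q_j."""
    return sum(
        p[i] * M[i][j] * q[j]
        for i in range(len(p))
        for j in range(len(q))
    )


# Rock-paper-scissors, rows and columns ordered as P, S, R.
M = [
    [0, 1, -1],
    [-1, 0, 1],
    [1, -1, 0],
]

uniform = [1 / 3, 1 / 3, 1 / 3]
always_paper = [1, 0, 0]
always_scissors = [0, 1, 0]

# Both players randomise uniformly: wins and losses cancel out.
loss_uniform = expected_loss(uniform, M, uniform)
# Paper against scissors: player 1 loses with certainty, so the loss is M[0][1].
loss_paper_vs_scissors = expected_loss(always_paper, M, always_scissors)
```

Pure strategies are the special case where $\textbf{p}$ or $\textbf{q}$ puts all of its mass on one action, in which case the formula picks out a single entry, row, or column of $M$.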

Minimax Theorem

For any matrix $M \in \mathbb{R}^{n \times m}$, one has $\min_{\textbf{p} \in \Delta_n}\max_{\textbf{q} \in \Delta_m}\textbf{p}^T M \textbf{q} = \max_{\textbf{q} \in \Delta_m}\min_{\textbf{p} \in \Delta_n}\textbf{p}^T M \textbf{q}$.


  1. The inequality $\min_{\textbf{p} \in \Delta_n}\max_{\textbf{q} \in \Delta_m}\textbf{p}^T M \textbf{q} \geq \max_{\textbf{q} \in \Delta_m}\min_{\textbf{p} \in \Delta_n}\textbf{p}^T M \textbf{q}$ holds for every $M$ and is straightforward to show. It is known as weak duality.
  2. The reverse inequality can be proved using the guarantee of the Hedge algorithm. More details will be provided later.
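
For completeness, here is one standard way to see the weak duality direction (a short sketch, not necessarily the argument given in lecture). Fix arbitrary $\textbf{p}' \in \Delta_n$ and $\textbf{q}' \in \Delta_m$. Then

$$\min_{\textbf{p} \in \Delta_n} \textbf{p}^T M \textbf{q}' \;\leq\; \textbf{p}'^T M \textbf{q}' \;\leq\; \max_{\textbf{q} \in \Delta_m} \textbf{p}'^T M \textbf{q}.$$

The left-hand side does not depend on $\textbf{p}'$ and the right-hand side does not depend on $\textbf{q}'$, so the outer inequality holds for every pair $(\textbf{p}', \textbf{q}')$. Taking the maximum over $\textbf{q}'$ on the left and the minimum over $\textbf{p}'$ on the right gives

$$\max_{\textbf{q} \in \Delta_m}\min_{\textbf{p} \in \Delta_n}\textbf{p}^T M \textbf{q} \;\leq\; \min_{\textbf{p} \in \Delta_n}\max_{\textbf{q} \in \Delta_m}\textbf{p}^T M \textbf{q}.$$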

Hedge Setting

There are $n$ actions.

For $t = 1,\dots,T$ do

  1. The algorithm chooses $p^t \in \Delta_n$
  2. Nature chooses $\ell^t \in [0,1]^n$, where $\ell^t_i$ is the loss of action $i$ at $t$
  3. The algorithm suffers $\langle p^t,\ell^t \rangle $

End for

The regret is $$Regret_T = \sum_{t=1}^T \langle p^t,\ell^t \rangle -\min_{i \in [n]}\sum_{t=1}^T \ell^t_i.$$

Hedge Algorithm

Fix a learning rate $\eta > 0$. At round $t$, $$w_i^t = \exp\left(-\eta \sum_{s=1}^{t-1} \ell_i^s\right)$$ and $$p^t = \frac{w^t}{\sum_i w^t_i}.$$ Equivalently, $w_i^1 = 1$ and $w_i^t = w_i^{t-1}\exp(-\eta \ell_i^{t-1})$ for $t \geq 2$.
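
The update rule can be sketched as follows. This is a minimal illustration under assumptions: the function name `hedge`, the list-of-lists input format, and the toy loss sequence are hypothetical, and the whole loss sequence is passed in up front only for convenience (the distribution at round $t$ still depends only on losses from rounds before $t$).

```python
import math


def hedge(losses, eta):
    """Run Hedge on a sequence of loss vectors in [0, 1]^n.

    losses[t][i] is the loss of action i at round t (hypothetical layout).
    Returns (distributions, total_loss, regret).
    """
    n = len(losses[0])
    cumulative = [0.0] * n                    # sum of past losses per action
    distributions, total_loss = [], 0.0
    for loss in losses:
        # w_i^t = exp(-eta * cumulative past loss of action i)
        weights = [math.exp(-eta * c) for c in cumulative]
        total = sum(weights)
        p = [w / total for w in weights]      # p^t, a point in the simplex
        distributions.append(p)
        # The algorithm suffers <p^t, l^t>, then observes l^t.
        total_loss += sum(pi * li for pi, li in zip(p, loss))
        cumulative = [c + li for c, li in zip(cumulative, loss)]
    best = min(cumulative)                    # loss of the best fixed action
    return distributions, total_loss, total_loss - best


# Toy sequence with three actions; action 1 is best in hindsight.
losses = [
    [1.0, 0.0, 0.5],
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.5],
]
distributions, total_loss, regret = hedge(losses, eta=0.5)
```

The first distribution is uniform because all cumulative losses start at zero; afterwards, probability mass shifts towards actions with smaller cumulative loss at a rate controlled by `eta`.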


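The Hedge algorithm guarantees that its cumulative loss satisfies $$\sum_{t=1}^T \langle p^t,\ell^t \rangle \leq \frac{\eta L_{\star}^T + \log n}{1-e^{-\eta}}, \quad \text{equivalently} \quad Regret_T \leq \frac{\eta L_{\star}^T + \log n}{1-e^{-\eta}} - L_{\star}^T,$$ where $$L_{\star}^T = \min_{i \in [n]}\sum_{t=1}^T \ell^t_i.$$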


The proof is the same as the proof for EWA, except that the step using Jensen's inequality is not needed: the loss $\langle p^t,\ell^t \rangle$ is already linear in $p^t$.

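Since every loss lies in $[0,1]$, we have $L_{\star}^T \leq T$. Using this, $\eta$ can be tuned as a function of $T$ and $n$ so that the guarantee above yields regret of order $\sqrt{T \log n}$, and hence $\frac{Regret_T}{T} \rightarrow 0$ as $T \rightarrow \infty$.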
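
To make the tuning concrete, the sketch below evaluates the guarantee numerically, treating $\frac{\eta L_{\star}^T + \log n}{1-e^{-\eta}}$ as a bound on Hedge's cumulative loss, so that subtracting $L_{\star}^T$ bounds the regret. The learning rate $\eta = \log\left(1 + \sqrt{2 \log n / T}\right)$ is an assumption: it is one standard tuning from the literature, not one specified in these notes. The function names are hypothetical.

```python
import math


def hedge_loss_bound(eta, best_loss, n):
    """Evaluate (eta * L_star + log n) / (1 - exp(-eta))."""
    return (eta * best_loss + math.log(n)) / (1.0 - math.exp(-eta))


def tuned_eta(T, n):
    """Assumed tuning (not from the lecture): log(1 + sqrt(2 log n / T))."""
    return math.log(1.0 + math.sqrt(2.0 * math.log(n) / T))


n = 10
horizons = (100, 10_000, 1_000_000)
average_regret_bounds = []
for T in horizons:
    eta = tuned_eta(T, n)
    # The regret bound grows with L_star, so L_star = T is the worst case.
    worst_case_best_loss = T
    bound = hedge_loss_bound(eta, worst_case_best_loss, n)
    average_regret_bounds.append((bound - worst_case_best_loss) / T)

for T, avg in zip(horizons, average_regret_bounds):
    print(f"T = {T:>9}: average regret bound = {avg:.5f}")
```

With this choice the per-round regret bound shrinks roughly like $\sqrt{2 \log n / T}$, which is the sense in which the average regret vanishes.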
