CS 7545: Machine Learning Theory -- Spring 2023
Instructor: Jacob Abernethy
Notes for Lecture 20
March 16, 2023
Scribes: Sriram Yenamandra, Bo Yuan
Announcements
- Students who were unable to act as scribes may be asked to review and edit the lecture notes, using tools such as ChatGPT.
- In case you require extra office hours, please select your preferred time slot through the Piazza poll.
- An exam guide with practice problems will be released next week.
- The exam is scheduled for March 30th (Thursday) and will be held in-class. There will be a preparation session on March 28th (Tuesday).
Lecture
Original Experts Problem
Suppose you have $n$ experts. On each day $t$, every expert $i$ makes a prediction $z_i^t \in \{0, 1\}$, and you must then make your own prediction $\hat{y}^t \in \{0, 1\}$.
After the experts and you make predictions on each day $t$, the true outcome $y^t \in \{0, 1\}$ is revealed.
Additionally, assume that one of the experts is always correct (referred to as the "perfect expert assumption"). Without prior knowledge of who the perfect expert is, what approach would you adopt to minimize the number of errors in your predictions?
We discuss the "Halving algorithm", which guarantees that the number of errors made in the predictions is at most $\log_2{n}$.
Halving algorithm
The Halving algorithm starts by assuming all experts to be reliable and makes predictions based on a majority vote among these experts. On each day, the algorithm removes the experts who made incorrect predictions from the list of reliable experts and only considers the remaining experts for future predictions.
Let $C_t \subseteq [n]$ denote the set of experts still considered reliable at the start of day $t$. The algorithm proceeds as follows:
1. Initially, we consider all experts to be reliable: $C_1 = [n]$.
2. On day $t$, we make predictions using a majority vote over the current set of reliable experts: $\hat{y}^t = \mathrm{round}\left(\frac{1}{|C_t|} \sum_{i\in C_t} z_i^t \right)$.
3. Stop considering the experts who made incorrect predictions on the current day: $C_{t + 1} \leftarrow C_{t} \setminus \{i: z_i^t \neq y^t\}$.
4. Repeat steps 2 and 3.
The above algorithm results in at most $\log_2{n}$ mistakes.
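Below is a minimal Python sketch of the Halving algorithm as described above; the function name `halving` and the arrays `z` (expert predictions) and `y` (true outcomes) are hypothetical names used only for illustration.

```python
import numpy as np

def halving(z, y):
    """Halving algorithm under the perfect expert assumption.

    z: (T, n) array of expert predictions in {0, 1}
    y: (T,)   array of true outcomes in {0, 1}
    Returns the number of mistakes made, which should be at most log2(n).
    """
    T, n = z.shape
    reliable = np.ones(n, dtype=bool)            # C_1 = [n]
    mistakes = 0
    for t in range(T):
        # Predict by majority vote over the currently reliable experts.
        y_hat = 1 if z[t, reliable].mean() >= 0.5 else 0
        mistakes += int(y_hat != y[t])
        # C_{t+1} = C_t \ {i : z_i^t != y^t}: drop experts that erred today.
        reliable &= (z[t] == y[t])
    return mistakes
```

For example, with $n = 8$ experts, one of which is always correct, the returned count never exceeds $\log_2 8 = 3$.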
Claim:
1. For every day $t$, $|C_t| \geq 1$.
2. $|\{t: \hat{y}^t \neq y^t \}| \leq \log_2{n}$.
Proof:
1. This follows from the perfect expert assumption: there is at least one expert that makes the correct prediction on every day, so that expert is never removed from $C_t$.
2. Consider a day $t$ on which the algorithm mispredicts: $\hat{y}^t \neq y^t$. This implies that on day $t$ the majority of the experts in $C_t$ mispredict, i.e. $|\{i \in C_t: z_i^t \neq y^t\}| \geq \frac{|C_t|}{2}$. The algorithm therefore drops at least $\frac{|C_t|}{2}$ experts: $|C_{t + 1}| \leq |C_{t}| - \frac{|C_{t}|}{2} = \frac{|C_{t}|}{2}$. Thus, every time the algorithm makes an error, the number of reliable experts at least halves. But by (1), the number of experts can never drop below 1, so the algorithm makes an error at most $\log_2{n}$ times.
Game Theory
Von Neumann's Minimax Theorem
Zero-sum game
A two-player zero-sum game is specified by a matrix $M \in \mathbb{R}^{n \times m}$.
For actions $i \in [n]$ of player 1 (the row player) and $j \in [m]$ of player 2 (the column player), the entry $M_{ij}$ is the loss suffered by player 1 and the gain received by player 2; since the two payoffs sum to zero, the game is called zero-sum.
Example: Rock-paper-scissors
Consider the example of the classic "Rock-Paper-Scissors" game. Here both players can select from a set of 3 actions {rock ($R$), paper ($P$), scissors ($S$)}. Writing a loss of $1$ for player 1 when they lose, $-1$ when they win, and $0$ for a tie, the game matrix is

$$M = \begin{pmatrix} 0 & 1 & -1 \\ -1 & 0 & 1 \\ 1 & -1 & 0 \end{pmatrix}.$$

Here, the rows and columns are ordered using the following order of actions: rock, paper, scissors.
Mixed strategy profile
A mixed strategy profile is a pair of distributions $(\textbf{p}, \textbf{q})$ with $\textbf{p} \in \Delta_n$ and $\textbf{q} \in \Delta_m$, where $\textbf{p}$ is player 1's distribution over rows and $\textbf{q}$ is player 2's distribution over columns.
The expected loss to player 1 given the mixed strategy profile $(\textbf{p}, \textbf{q})$ is $\textbf{p}^T M \textbf{q} = \sum_{i=1}^{n} \sum_{j=1}^{m} p_i q_j M_{ij}$.
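As a quick sanity check of this definition, here is a small Python sketch (with hypothetical variable names `M`, `p`, `q`) that computes the expected loss $\textbf{p}^T M \textbf{q}$ for the rock-paper-scissors matrix above.

```python
import numpy as np

# Loss matrix for player 1 in rock-paper-scissors
# (rows/columns ordered rock, paper, scissors).
M = np.array([[ 0,  1, -1],
              [-1,  0,  1],
              [ 1, -1,  0]])

p = np.array([1/3, 1/3, 1/3])   # player 1's mixed strategy (uniform)
q = np.array([1/2, 1/4, 1/4])   # player 2's mixed strategy

expected_loss = p @ M @ q       # p^T M q
print(expected_loss)            # 0.0: the uniform p makes every entry of p^T M zero
```

Because the uniform strategy makes every entry of $\textbf{p}^T M$ equal to zero, the expected loss is $0$ no matter which $\textbf{q}$ player 2 uses.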
Minimax Theorem
For any matrix $M \in \mathbb{R}^{n \times m}$,

$$\min_{\textbf{p} \in \Delta_n}\max_{\textbf{q} \in \Delta_m}\textbf{p}^T M \textbf{q} = \max_{\textbf{q} \in \Delta_m}\min_{\textbf{p} \in \Delta_n}\textbf{p}^T M \textbf{q}.$$
Proof:
- It is straightforward to show that $\min_{\textbf{p} \in \Delta_n}\max_{\textbf{q} \in \Delta_m}\textbf{p}^T M \textbf{q} \geq \max_{\textbf{q} \in \Delta_m}\min_{\textbf{p} \in \Delta_n}\textbf{p}^T M \textbf{q}$. This direction is known as weak duality (a short argument is sketched after this list).
- The opposite direction can be proved using the Hedge algorithm's guarantees. More details will be provided later.
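For completeness, here is a sketch of the standard argument for the weak duality direction. For any fixed $\textbf{p} \in \Delta_n$ and $\textbf{q} \in \Delta_m$,

$$\max_{\textbf{q}' \in \Delta_m} \textbf{p}^T M \textbf{q}' \;\geq\; \textbf{p}^T M \textbf{q} \;\geq\; \min_{\textbf{p}' \in \Delta_n} (\textbf{p}')^T M \textbf{q}.$$

The left-hand side depends only on $\textbf{p}$ and the right-hand side only on $\textbf{q}$, so taking the minimum over $\textbf{p}$ on the left and the maximum over $\textbf{q}$ on the right preserves the inequality and yields $\min_{\textbf{p}}\max_{\textbf{q}'} \textbf{p}^T M \textbf{q}' \geq \max_{\textbf{q}}\min_{\textbf{p}'} (\textbf{p}')^T M \textbf{q}$.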
Hedge Setting
There are $n$ actions, and the game proceeds over $T$ rounds.
For $t = 1, \dots, T$:
- The algorithm chooses $p^t \in \Delta_n$.
- Nature chooses $\ell^t \in [0,1]^n$, where $\ell^t_i$ is the loss of action $i$ at round $t$.
- The algorithm suffers loss $\langle p^t,\ell^t \rangle$.
End for
The regret is

$$\text{Regret}_T = \sum_{t=1}^{T} \langle p^t, \ell^t \rangle - \min_{i \in [n]} \sum_{t=1}^{T} \ell^t_i.$$
Hedge Algorithm
At round $t$, the Hedge algorithm with learning rate $\eta > 0$ plays

$$p^t_i = \frac{\exp\left(-\eta \sum_{s=1}^{t-1} \ell^s_i\right)}{\sum_{j=1}^{n} \exp\left(-\eta \sum_{s=1}^{t-1} \ell^s_j\right)}, \qquad i \in [n],$$

i.e., each action is weighted exponentially according to its cumulative loss so far.
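As an illustration, here is a minimal Python sketch of the Hedge update in the setting above; the function `hedge` and the array `losses` are hypothetical names used only for this example.

```python
import numpy as np

def hedge(losses, eta):
    """Run Hedge on a (T, n) array of losses in [0, 1] with learning rate eta.

    Returns the algorithm's total loss and its regret against the best action.
    """
    T, n = losses.shape
    cum_loss = np.zeros(n)                  # sum_{s<t} ell_i^s for each action i
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * cum_loss)         # unnormalized exponential weights
        p = w / w.sum()                     # p^t_i proportional to exp(-eta * cumulative loss)
        total += p @ losses[t]              # suffer <p^t, ell^t>
        cum_loss += losses[t]
    regret = total - cum_loss.min()         # compare to best fixed action in hindsight
    return total, regret
```

With `eta = np.sqrt(8 * np.log(n) / T)`, the returned regret should stay on the order of $\sqrt{(T \ln n)/2}$, matching the theorem below.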
Theorem
The Hedge algorithm guarantees that

$$\text{Regret}_T \leq \frac{\ln n}{\eta} + \frac{\eta T}{8}.$$

Choosing $\eta = \sqrt{\frac{8 \ln n}{T}}$ gives $\text{Regret}_T \leq \sqrt{\frac{T \ln n}{2}}$.
Proof:
The proof is essentially the same as the proof for EWA, except for the step that uses Jensen's inequality: in the Hedge setting the algorithm's loss is exactly $\langle p^t, \ell^t \rangle$, so that step is not needed.
By definition,