Multi Armed Bandits
- Details in writings
- Upper Confidence Bound (UCB): "Optimism in the face of uncertainty". Provides best-in-class regret for the stochastic setting.
- Bayesian UCB: UCB assumes no prior on the reward distribution and hence must rely on the Hoeffding inequality to compute its confidence intervals. However, if we have prior knowledge about the reward distribution (say, that it is Gaussian), we can compute much tighter confidence intervals.
- Thompson Sampling (TS): also known as Posterior Sampling or Probability Matching. Competes with UCB in performance (a sketch of UCB1, Bayesian UCB, and Thompson sampling follows this list).
- EXP3: Exponential weights for exploration and exploitation. Provides best-in-class regret for the adversarial setting, and is therefore randomized (a deterministic strategy could be exploited by an adversary); see the sketch after this list.
- Dynamic Thompson Sampling
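
A minimal sketch of UCB1, a crude Bayesian UCB, and Thompson sampling on a toy Bernoulli bandit, just to make the bullets above concrete. The `pull` callback, the arm means, the horizon, the Beta(1, 1) priors, and the `c=2.0` multiplier are all illustrative assumptions, not part of these notes:

```python
import numpy as np

def ucb1(pull, n_arms, horizon, rng):
    """UCB1: empirical mean + Hoeffding-style exploration bonus (no prior).
    rng is unused here; kept only for a uniform call signature."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for t in range(horizon):
        if t < n_arms:
            arm = t                                   # play every arm once first
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return sums.sum()

def bayesian_ucb(pull, n_arms, horizon, rng, c=2.0):
    """Crude Bayesian UCB: Beta(1, 1) priors on Bernoulli arms, scoring each arm
    by posterior mean + c * posterior std (a Gaussian stand-in for the exact
    posterior quantile)."""
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)
    total = 0.0
    for _ in range(horizon):
        mean = alpha / (alpha + beta)
        std = np.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))
        arm = int(np.argmax(mean + c * std))
        r = pull(arm)
        alpha[arm] += r
        beta[arm] += 1 - r
        total += r
    return total

def thompson(pull, n_arms, horizon, rng):
    """Thompson sampling: sample a mean for every arm from its Beta posterior
    and play the argmax (probability matching)."""
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)
    total = 0.0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(alpha, beta)))
        r = pull(arm)
        alpha[arm] += r
        beta[arm] += 1 - r
        total += r
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = np.array([0.3, 0.5, 0.7])                 # hypothetical arm means
    pull = lambda a: float(rng.random() < means[a])   # Bernoulli reward
    for algo in (ucb1, bayesian_ucb, thompson):
        print(algo.__name__, "total reward:", algo(pull, len(means), 5000, rng))
```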
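
And, in the same toy setting, a sketch of EXP3. It assumes rewards lie in [0, 1], mixes the exponential weights with a uniform distribution so every arm keeps being explored, and feeds back importance-weighted reward estimates; the `gamma` value below is an arbitrary choice, not a tuned constant:

```python
import numpy as np

def exp3(pull, n_arms, horizon, gamma, rng):
    """EXP3: exponential weights with importance-weighted reward estimates."""
    weights = np.ones(n_arms)
    total = 0.0
    for _ in range(horizon):
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_arms
        arm = int(rng.choice(n_arms, p=probs))
        r = pull(arm)                        # reward assumed to be in [0, 1]
        total += r
        estimate = r / probs[arm]            # unbiased importance-weighted estimate
        weights[arm] *= np.exp(gamma * estimate / n_arms)
        weights /= weights.max()             # renormalize to avoid overflow
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    means = np.array([0.3, 0.5, 0.7])                 # hypothetical arm means
    pull = lambda a: float(rng.random() < means[a])
    print("EXP3 total reward:", exp3(pull, len(means), 5000, gamma=0.07, rng=rng))
```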
- Poor Man's Strategy: Assign a separate bandit to each context. The worst-case regret is $R_n = \sqrt{n |A| |C|}$, where $A$ is the action space and $C$ is the space of contexts. This scales badly with the number of contexts, so we need to assume more structure; a per-context sketch follows below.
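
A sketch of the poor man's strategy: one independent UCB1 learner per context, so no information is shared across contexts and regret simply adds up over them. The class and method names are made up for illustration:

```python
import math
from collections import defaultdict

class PerContextUCB1:
    """Poor man's contextual bandit: an independent UCB1 learner per context."""
    def __init__(self, n_arms):
        self.n_arms = n_arms
        # per-context pull counts and reward sums, created lazily on first use
        self.counts = defaultdict(lambda: [0] * n_arms)
        self.sums = defaultdict(lambda: [0.0] * n_arms)

    def select(self, context):
        counts, sums = self.counts[context], self.sums[context]
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm                      # try every arm once first
        t = sum(counts)                         # pulls seen for this context
        scores = [sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
                  for a in range(self.n_arms)]
        return max(range(self.n_arms), key=lambda a: scores[a])

    def update(self, context, arm, reward):
        self.counts[context][arm] += 1
        self.sums[context][arm] += reward
```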
- Linear Reward Model: Assume the reward is a linear function of the features $F(c, k)$, where $F$ is the feature map, $c$ is a context, and $k$ is an action. Csabas-slides (a LinUCB-style sketch follows below).
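
A LinUCB-style sketch for the linear reward model: keep a ridge-regression estimate of the unknown weight vector and act optimistically by adding an ellipsoidal confidence bonus. This is one standard algorithm for this model, not necessarily the one in Csabas-slides; the class name, the `alpha`/`reg` parameters, and the toy data in the usage example are all assumptions:

```python
import numpy as np

class LinUCB:
    """Linear reward model: expected reward of action k in context c is
    assumed to be theta^T F(c, k); estimate theta by ridge regression and
    add an ellipsoidal (optimism) bonus when selecting actions."""
    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha                 # width of the confidence bonus
        self.A = reg * np.eye(dim)         # regularized Gram matrix of played features
        self.b = np.zeros(dim)             # sum of reward-weighted played features

    def select(self, features):
        """features: array of shape (n_actions, dim); row k is F(c, k)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        means = features @ theta
        bonus = self.alpha * np.sqrt(np.einsum("ij,jk,ik->i", features, A_inv, features))
        return int(np.argmax(means + bonus))

    def update(self, feature, reward):
        """feature: the vector F(c, k) of the action actually played."""
        self.A += np.outer(feature, feature)
        self.b += reward * feature

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    learner = LinUCB(dim=4)
    true_theta = np.array([0.5, -0.2, 0.1, 0.3])      # hypothetical true weights
    for _ in range(1000):
        feats = rng.normal(size=(5, 4))               # F(c, k) for 5 candidate actions
        k = learner.select(feats)
        reward = feats[k] @ true_theta + 0.1 * rng.normal()
        learner.update(feats[k], reward)
    print("estimated theta:", np.linalg.inv(learner.A) @ learner.b)
```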
- The Multi-fidelity Multi-armed Bandit: MFMAB