Multi-Armed Bandits

  • Details in writings

Vanilla

  1. Upper Confidence Bound (UCB): "optimism in the face of uncertainty". Provides best-in-class regret in the stochastic setting (see the sketch after this list).
  2. Bayesian UCB : UCB assumes no prior on the reward distribution and hence has to depend on the Hoeffding inequality to compute the confidence interval. However, if we have some prior knowledge about the reward distribution (say, that it is Gaussian), then we can compute much tighter confidence intervals.
  3. Thompson Sampling (TS): also known as Posterior Sampling or Probability Matching. Competes with UCB in performance (see the sketch after this list).
  4. EXP3 : Exponential weights for exploration and exploitation. Provides best-in-class regret for the adversarial setting (hence is randomized).
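
A minimal sketch comparing UCB1 and Bernoulli Thompson Sampling, as referenced in items 1 and 3 above. The arm means, the horizon, and the Beta(1, 1) prior are illustrative assumptions, not from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)


def ucb1(means, horizon):
    """UCB1: pull the arm with the highest empirical mean plus a Hoeffding-style bonus."""
    n_arms = len(means)
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:                                # pull each arm once to initialize
            arm = t - 1
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)    # confidence width from Hoeffding
            arm = int(np.argmax(sums / counts + bonus))
        reward = rng.random() < means[arm]             # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        regret += max(means) - means[arm]
    return regret


def thompson(means, horizon):
    """Thompson Sampling with a Beta(1, 1) prior on each Bernoulli arm."""
    n_arms = len(means)
    alpha = np.ones(n_arms)   # 1 + successes
    beta = np.ones(n_arms)    # 1 + failures
    regret = 0.0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(alpha, beta)))    # sample each posterior, act greedily
        reward = rng.random() < means[arm]
        alpha[arm] += reward
        beta[arm] += 1 - reward
        regret += max(means) - means[arm]
    return regret


if __name__ == "__main__":
    means = [0.3, 0.5, 0.7]   # made-up arm means for illustration
    print("UCB1 regret:    ", ucb1(means, 10_000))
    print("Thompson regret:", thompson(means, 10_000))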

Dynamic

  1. Dynamic Thompson Sampling
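
A rough sketch of the capped posterior update that, to my understanding, characterizes Dynamic Thompson Sampling: once an arm's pseudo-counts reach a cap C, each update rescales them so their sum stays at C, which exponentially discounts old observations and lets the posterior track a drifting reward probability. The cap value, the reward_fn interface, and the demo are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def dynamic_thompson(reward_fn, n_arms, horizon, cap=100.0):
    """Thompson Sampling with a cap C on each arm's alpha + beta (Dynamic TS sketch)."""
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)
    for t in range(horizon):
        arm = int(np.argmax(rng.beta(alpha, beta)))   # sample posteriors, act greedily
        reward = reward_fn(arm, t)                    # Bernoulli reward in {0, 1}
        if alpha[arm] + beta[arm] < cap:
            alpha[arm] += reward                      # ordinary Beta update below the cap
            beta[arm] += 1 - reward
        else:
            scale = cap / (cap + 1)                   # rescale so alpha + beta stays at cap
            alpha[arm] = (alpha[arm] + reward) * scale
            beta[arm] = (beta[arm] + 1 - reward) * scale
    return alpha, beta


if __name__ == "__main__":
    # Illustrative non-stationary bandit: arm 0 improves over time, arm 1 decays.
    def reward_fn(arm, t):
        p = [0.3 + 0.4 * t / 10_000, 0.7 - 0.4 * t / 10_000][arm]
        return int(rng.random() < p)

    print(dynamic_thompson(reward_fn, n_arms=2, horizon=10_000))
```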

Contextual

Finitely Many Contexts

  • Poor Man's Strategy: Assign a separate bandit to each context. Worst-case regret is R_n = O(sqrt(n|A||C|)), where A is the action space and C is the space of contexts.

But this scales poorly with the number of contexts; we need to assume more structure.

  • Linear Reward Model: Assume that the reward is a linear function of features F(c, k), where F is the feature map, c is a context, and k is an action. See Csabas-slides. A LinUCB-style sketch follows below.
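
A LinUCB-style sketch of the linear reward model above, assuming a single shared parameter vector theta with E[reward | c, k] = theta^T F(c, k). The confidence width alpha, the ridge regularizer, and the class interface are illustrative assumptions, not from the slides.

```python
import numpy as np


class LinUCB:
    """LinUCB with a shared parameter vector: ridge regression plus a confidence bonus."""

    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha                  # width of the confidence bonus
        self.A = reg * np.eye(dim)          # regularized sum of x x^T
        self.b = np.zeros(dim)              # sum of reward * x

    def select(self, features):
        """features: array of shape (n_arms, dim), one row per feature vector F(c, k)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b              # ridge estimate of the unknown theta
        # Predicted reward plus ellipsoidal confidence bonus sqrt(x^T A^{-1} x) per arm.
        scores = features @ theta + self.alpha * np.sqrt(
            np.einsum("ij,jk,ik->i", features, A_inv, features)
        )
        return int(np.argmax(scores))

    def update(self, x, reward):
        """Update the statistics with the chosen arm's feature vector and observed reward."""
        self.A += np.outer(x, x)
        self.b += reward * x
```

The point of the linear assumption is that regret now scales with the feature dimension rather than with |A||C|, so sharing one theta across contexts and arms (instead of one bandit per context) is what buys the improvement.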

Multi-Fidelity

  1. The Multi-Fidelity Multi-Armed Bandit (MFMAB)