Multi Armed Bandits
- Details in writings
- Upper Confidence Bound (UCB): "Optimism in the face of uncertainty". Provides best-in-class regret for the stochastic setting.
- Bayesian UCB: UCB assumes no prior on the reward distribution and hence must rely on the Hoeffding inequality to compute its confidence intervals. However, if we have prior knowledge about the reward distribution (say, that it is Gaussian), we can compute much tighter confidence intervals.
- Thompson Sampling (TS): also known as Posterior Sampling or Probability Matching. Competes with UCB in performance (a sketch of UCB1, Bayesian UCB, and Thompson sampling follows this list).
- EXP3: Exponential weights for exploration and exploitation. Provides best-in-class regret for the adversarial setting, and is therefore randomized (a deterministic strategy could be exploited by an adversary); see the sketch after this list.
- Dynamic Thompson Sampling
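
A minimal sketch of UCB1, a crude Bayesian UCB, and Thompson sampling on a toy Bernoulli bandit, just to make the bullets above concrete. The `pull` callback, the arm means, the horizon, the Beta(1, 1) priors, and the `c=2.0` multiplier are all illustrative assumptions, not part of these notes:

```python
import numpy as np

def ucb1(pull, n_arms, horizon, rng):
    """UCB1: empirical mean + Hoeffding-style exploration bonus (no prior).
    rng is unused here; kept only for a uniform call signature."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for t in range(horizon):
        if t < n_arms:
            arm = t                                   # play every arm once first
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return sums.sum()

def bayesian_ucb(pull, n_arms, horizon, rng, c=2.0):
    """Crude Bayesian UCB: Beta(1, 1) priors on Bernoulli arms, scoring each arm
    by posterior mean + c * posterior std (a Gaussian stand-in for the exact
    posterior quantile)."""
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)
    total = 0.0
    for _ in range(horizon):
        mean = alpha / (alpha + beta)
        std = np.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))
        arm = int(np.argmax(mean + c * std))
        r = pull(arm)
        alpha[arm] += r
        beta[arm] += 1 - r
        total += r
    return total

def thompson(pull, n_arms, horizon, rng):
    """Thompson sampling: sample a mean for every arm from its Beta posterior
    and play the argmax (probability matching)."""
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)
    total = 0.0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(alpha, beta)))
        r = pull(arm)
        alpha[arm] += r
        beta[arm] += 1 - r
        total += r
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = np.array([0.3, 0.5, 0.7])                 # hypothetical arm means
    pull = lambda a: float(rng.random() < means[a])   # Bernoulli reward
    for algo in (ucb1, bayesian_ucb, thompson):
        print(algo.__name__, "total reward:", algo(pull, len(means), 5000, rng))
```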
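
And, in the same toy setting, a sketch of EXP3. It assumes rewards lie in [0, 1], mixes the exponential weights with a uniform distribution so every arm keeps being explored, and feeds back importance-weighted reward estimates; the `gamma` value below is an arbitrary choice, not a tuned constant:

```python
import numpy as np

def exp3(pull, n_arms, horizon, gamma, rng):
    """EXP3: exponential weights with importance-weighted reward estimates."""
    weights = np.ones(n_arms)
    total = 0.0
    for _ in range(horizon):
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_arms
        arm = int(rng.choice(n_arms, p=probs))
        r = pull(arm)                        # reward assumed to be in [0, 1]
        total += r
        estimate = r / probs[arm]            # unbiased importance-weighted estimate
        weights[arm] *= np.exp(gamma * estimate / n_arms)
        weights /= weights.max()             # renormalize to avoid overflow
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    means = np.array([0.3, 0.5, 0.7])                 # hypothetical arm means
    pull = lambda a: float(rng.random() < means[a])
    print("EXP3 total reward:", exp3(pull, len(means), 5000, gamma=0.07, rng=rng))
```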
- Poor Man's Strategy: Assign a separate bandit to each context. The worst-case regret is $R_n = \sqrt{n |A| |C|}$, where $A$ is the action space and $C$ is the space of contexts. This scales badly with the number of contexts, so we need to assume more structure; a per-context sketch follows below.
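
A sketch of the poor man's strategy: one independent UCB1 learner per context, so no information is shared across contexts and regret simply adds up over them. The class and method names are made up for illustration:

```python
import math
from collections import defaultdict

class PerContextUCB1:
    """Poor man's contextual bandit: an independent UCB1 learner per context."""
    def __init__(self, n_arms):
        self.n_arms = n_arms
        # per-context pull counts and reward sums, created lazily on first use
        self.counts = defaultdict(lambda: [0] * n_arms)
        self.sums = defaultdict(lambda: [0.0] * n_arms)

    def select(self, context):
        counts, sums = self.counts[context], self.sums[context]
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm                      # try every arm once first
        t = sum(counts)                         # pulls seen for this context
        scores = [sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
                  for a in range(self.n_arms)]
        return max(range(self.n_arms), key=lambda a: scores[a])

    def update(self, context, arm, reward):
        self.counts[context][arm] += 1
        self.sums[context][arm] += reward
```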
- Linear Reward Model: Assume the reward is a linear function of the features $F(c, k)$, where $F$ is the feature map, $c$ is a context, and $k$ is an action. Csabas-slides (a LinUCB-style sketch follows below).
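
A LinUCB-style sketch for the linear reward model: keep a ridge-regression estimate of the unknown weight vector and act optimistically by adding an ellipsoidal confidence bonus. This is one standard algorithm for this model, not necessarily the one in Csabas-slides; the class name, the `alpha`/`reg` parameters, and the toy data in the usage example are all assumptions:

```python
import numpy as np

class LinUCB:
    """Linear reward model: expected reward of action k in context c is
    assumed to be theta^T F(c, k); estimate theta by ridge regression and
    add an ellipsoidal (optimism) bonus when selecting actions."""
    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha                 # width of the confidence bonus
        self.A = reg * np.eye(dim)         # regularized Gram matrix of played features
        self.b = np.zeros(dim)             # sum of reward-weighted played features

    def select(self, features):
        """features: array of shape (n_actions, dim); row k is F(c, k)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        means = features @ theta
        bonus = self.alpha * np.sqrt(np.einsum("ij,jk,ik->i", features, A_inv, features))
        return int(np.argmax(means + bonus))

    def update(self, feature, reward):
        """feature: the vector F(c, k) of the action actually played."""
        self.A += np.outer(feature, feature)
        self.b += reward * feature

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    learner = LinUCB(dim=4)
    true_theta = np.array([0.5, -0.2, 0.1, 0.3])      # hypothetical true weights
    for _ in range(1000):
        feats = rng.normal(size=(5, 4))               # F(c, k) for 5 candidate actions
        k = learner.select(feats)
        reward = feats[k] @ true_theta + 0.1 * rng.normal()
        learner.update(feats[k], reward)
    print("estimated theta:", np.linalg.inv(learner.A) @ learner.b)
```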
- The Multi-fidelity Multi-armed Bandit: MFMAB