Multi Representation Search

The MHA* paper used heuristics to induce behaviour that is better characterized in terms of actions.

Say you begin with two action spaces, A1(x, y, theta) and A2(a1, a2, a3). Additionally, you have a set of macro actions M = {outstretch-hands, rotate-in-place, pull-hands-close}. Suppose A1 and A2 are being used with equal probability and the robot gets stuck at a narrow door. Our algorithm should be able to detect this and start another queue whose action space contains pull-hands-close to get through the door (a toy sketch of this idea follows).
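A minimal sketch of this queue-spawning behaviour, assuming a toy 2D grid: one OPEN list per action space, and when the base queue's best f-value stops improving (the robot is "stuck" at a wall standing in for the narrow door), a second queue whose action space also contains a wall-crossing macro is spawned. The grid, costs, stagnation test, and all names here are illustrative assumptions, not the MHA* algorithm itself.

```python
import heapq

GOAL = (6, 0)
WALL_X = 3  # a wall the base actions cannot cross (stands in for the door)

def passable(s):
    return s[0] != WALL_X

def h(s):
    # Manhattan distance to the goal.
    return abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1])

def a1_moves(s):
    # Base action space (stand-in for A1): 4-connected unit steps.
    x, y = s
    steps = [((x + dx, y + dy), 1.0)
             for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
    return [(s2, c) for s2, c in steps if passable(s2)]

def macro_a1_moves(s):
    # Action space of the spawned queue: A1 plus a macro (stand-in for
    # pull-hands-close) that crosses the wall at a higher cost.
    x, y = s
    macro = [((x + 3, y), 2.5)] if passable((x + 3, y)) else []
    return a1_moves(s) + macro

class Queue:
    def __init__(self, succ_fn):
        self.succ = succ_fn
        self.open = []           # min-heap of (f, g, state)
        self.best_f = float('inf')
        self.stalled = 0         # expansions since best_f last improved

def search(start, stagnation_limit=20):
    g_val = {start: 0.0}
    base = Queue(a1_moves)
    heapq.heappush(base.open, (h(start), 0.0, start))
    queues, macro_spawned = [base], False
    while any(q.open for q in queues):
        for q in list(queues):   # round-robin over the queues
            if not q.open:
                continue
            f, g, s = heapq.heappop(q.open)
            if s == GOAL:
                return g
            if g > g_val.get(s, float('inf')):
                continue         # stale heap entry
            # Hypothetical stagnation test: best f stopped improving.
            if f < q.best_f:
                q.best_f, q.stalled = f, 0
            else:
                q.stalled += 1
            if q.stalled > stagnation_limit and not macro_spawned:
                mq = Queue(macro_a1_moves)
                heapq.heappush(mq.open, (f, g, s))  # seed from the stuck state
                queues.append(mq)
                macro_spawned = True
            for s2, cost in q.succ(s):
                if g + cost < g_val.get(s2, float('inf')):
                    g_val[s2] = g + cost
                    heapq.heappush(q.open, (g + cost + h(s2), g + cost, s2))
    return None

print(search((0, 0)))  # base queue stalls at the wall; the macro queue finishes
```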

Ways to formalize an exploration-exploitation problem:

  1. Multi-armed Bandit (MAB)
  2. Markov Decision Process (MDP)
  3. Partially Observable MDP (POMDP)

MAB

Setting: Choose one of k actions. Each selection results in some reward sampled from a fixed probability distribution associated with that action.

Goal: Maximize cumulative reward over some horizon; equivalently, minimize regret relative to always playing the best arm.

When to use

  • There is no notion of a state. The choice of action is independent of the current state.
  • The agent's actions don't change the environment.
  • The reward probability distribution is fixed.
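As a concrete instance of this setting, here is a minimal sketch of UCB1, a standard MAB algorithm (not something specified in these notes); the Bernoulli arms and horizon are illustrative assumptions.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # pulls per arm
    sums = [0.0] * k      # total reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1     # play each arm once first
        else:
            # pick the arm maximizing empirical mean + exploration bonus
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < arm_means[a] else 0.0  # Bernoulli reward
        counts[a] += 1
        sums[a] += r
        total += r
    return total

print(ucb1([0.2, 0.5, 0.8], horizon=10_000))  # mostly pulls the 0.8 arm
```

The exploration bonus shrinks as an arm is pulled more, so play concentrates on the empirically best arm while every arm is still sampled occasionally.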

Dynamic MAB

This makes the model more general by allowing the underlying reward distributions to change stochastically and gradually over time.
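One textbook way to track such drifting distributions is epsilon-greedy with a constant step size, so old rewards are exponentially forgotten (the constant step-size update for nonstationary bandits, as in Sutton and Barto). The drift model and all parameters below are illustrative assumptions.

```python
import random

def drifting_bandit(horizon=10_000, k=3, eps=0.1, alpha=0.1, seed=0):
    rng = random.Random(seed)
    means = [rng.random() for _ in range(k)]   # true (hidden) arm means
    q = [0.0] * k                              # tracked value estimates
    total = 0.0
    for _ in range(horizon):
        # epsilon-greedy arm choice
        if rng.random() < eps:
            a = rng.randrange(k)
        else:
            a = max(range(k), key=q.__getitem__)
        r = 1.0 if rng.random() < means[a] else 0.0
        q[a] += alpha * (r - q[a])             # constant-step-size update
        total += r
        # arms drift gradually: a small random walk, clipped to [0, 1]
        means = [min(1.0, max(0.0, m + rng.gauss(0, 0.01))) for m in means]
    return total

print(drifting_bandit())
```

A constant alpha (rather than the 1/n average used in the stationary case) keeps the estimates responsive when the arm means move.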
