Multi Representation Search
The MHA* paper used heuristics to induce behaviour that would be better characterized by actions.
Say you begin with two action spaces, `A1(x, y, theta)` and `A2(a1, a2, a3)`. Additionally, you have a set of macro actions `M = {outstretch-hands, rotate-in-place, pull-hands-close}`. Suppose A1 and A2 are being used with equal probability and the robot gets stuck at a narrow door. Our algorithm should be able to detect this and start another queue whose action space contains `pull-hands-close` to get through the door.
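A minimal sketch of this idea in Python, assuming hypothetical `Queue`, `expand`, and stall-detection helpers (the expansion itself is a simulated stand-in, not the actual MHA* machinery): when a queue stops improving its best cost-to-go for too many expansions, a new queue whose action space includes `pull-hands-close` is spawned.

```python
import random

class Queue:
    """One search queue with its own action space (illustrative only)."""
    def __init__(self, name, action_space):
        self.name = name
        self.action_space = action_space
        self.best_cost_to_go = 100.0   # proxy for search progress
        self.stale_expansions = 0      # expansions since the last improvement

    def expand(self):
        # Placeholder expansion: progress stalls at the "narrow door" unless
        # the action space contains the pull-hands-close macro action.
        blocked = (self.best_cost_to_go <= 20.0
                   and "pull-hands-close" not in self.action_space)
        if not blocked and random.random() < 0.5:
            self.best_cost_to_go -= 1.0
            self.stale_expansions = 0
        else:
            self.stale_expansions += 1

    def is_stuck(self, patience=50):
        return self.stale_expansions >= patience

def plan(max_expansions=2000):
    queues = [Queue("A1", ["x", "y", "theta"]),
              Queue("A2", ["a1", "a2", "a3"])]
    for _ in range(max_expansions):
        q = random.choice(queues)      # A1 and A2 used with equal probability
        q.expand()
        if q.is_stuck() and len(queues) == 2:
            # Detected that the search is stuck (e.g. at the narrow door):
            # start a queue whose action space contains pull-hands-close.
            queues.append(Queue("A3", q.action_space + ["pull-hands-close"]))
    return queues

queues = plan()
```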
Ways to formalize an exploration-exploitation problem:
- Multi-armed Bandit (MAB)
- Markov Decision Process (MDP)
- Partially Observable MDP (POMDP)
Setting: Choose one of k actions. Each selection results in some reward sampled from a fixed probability distribution associated with that action.
Goal: Maximize cumulative reward over some horizon (equivalently, minimize regret).
When to use
- There is no notion of a state. The choice of action is independent of the current state.
- The agent's actions don't change the environment.
- The reward probability distribution is fixed.
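As a concrete illustration of this setting, here is a small epsilon-greedy bandit sketch; the arm means and the Gaussian reward model are made-up examples, not taken from any particular source.

```python
import random

def epsilon_greedy_bandit(arm_means, horizon=10_000, epsilon=0.1):
    """k fixed reward distributions; the agent is stateless and only picks arms."""
    k = len(arm_means)
    counts = [0] * k          # number of pulls per arm
    estimates = [0.0] * k     # sample-average reward estimate per arm
    total_reward = 0.0
    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.randrange(k)                        # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = random.gauss(arm_means[arm], 1.0)           # fixed distribution per arm
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return total_reward, estimates

total, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.9])
```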
A non-stationary variant makes the model a bit more general by allowing the underlying reward probability distributions to change stochastically and gradually over time.
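One standard way to handle such drifting reward distributions (not necessarily the one intended here) is to replace the sample-average estimate with a constant step-size update, so recent rewards carry more weight than old ones:

```python
import random

def drifting_bandit(k=3, horizon=10_000, epsilon=0.1, alpha=0.1, drift=0.01):
    # Arm means perform a small random walk, so the reward distributions drift.
    true_means = [random.gauss(0.0, 1.0) for _ in range(k)]
    estimates = [0.0] * k
    total_reward = 0.0
    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.randrange(k)                        # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = random.gauss(true_means[arm], 1.0)
        # Constant step-size = exponential recency-weighted average,
        # which tracks a moving mean better than a plain sample average.
        estimates[arm] += alpha * (reward - estimates[arm])
        total_reward += reward
        true_means = [m + random.gauss(0.0, drift) for m in true_means]
    return total_reward

print(drifting_bandit())
```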