Monte Carlo Tree Search
Why MCTS
Monte Carlo Tree Search (MCTS) is an anytime algorithm for solving Markov decision process (MDP) problems. How close it gets to optimal depends on the size of the problem and the time budget given. In this Pacman game the time budget is 1 second per move. To make the most of that budget and explore the upper bound of performance, we implemented an MCTS agent.
Background
MDPs can be represented as trees (or graphs), called ExpectiMax trees. MCTS is the process of building such a tree and finding its most valuable branch.
(Figure source: https://www.aaai.org/Papers/AIIDE/2008/AIIDE08-036.pdf)
A vanilla MCTS procedure iterates the following four steps until it terminates (a minimal sketch follows the list):
- Select: choose a single leaf node in the tree to assess, according to the statistics stored, in a way that balances exploitation and exploration.
- Expand: expand this node by applying one available action (as defined by the MDP) from the node.
- Simulate: from one of the new nodes, perform a complete random simulation of the MDP to a terminating state. This assumes the search tree is finite, but versions for infinitely large trees exist in which we simply run for some time and then estimate the outcome.
- Backpropagate: finally, the value of the node is backpropagated to the root node, updating the value of each ancestor node on the way using the expected value.
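A minimal sketch of this loop, assuming a generic `mdp` interface with `actions`, `step`, `reward`, and `is_terminal` (none of these names come from the project, and this variant expands all children of a node at once):

```python
import math
import random
import time

class Node:
    """Illustrative search-tree node (not the project's actual class)."""
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action
        self.children = []
        self.visits = 0
        self.value = 0.0  # running mean of simulation rewards

def ucb1(child, c=1.414):
    """Balance exploitation (mean value) and exploration (visit counts)."""
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    return child.value + c * math.sqrt(math.log(child.parent.visits) / child.visits)

def mcts(root, mdp, budget=1.0):
    start = time.time()
    while time.time() - start < budget:  # anytime: stop when the budget runs out
        # 1. Select: descend to a leaf, choosing children by UCB1.
        node = root
        while node.children:
            node = max(node.children, key=ucb1)
        # 2. Expand: add a child for each available action.
        if not mdp.is_terminal(node.state):
            for a in mdp.actions(node.state):
                node.children.append(Node(mdp.step(node.state, a), node, a))
            node = random.choice(node.children)
        # 3. Simulate: random playout to a terminating state.
        state = node.state
        while not mdp.is_terminal(state):
            state = mdp.step(state, random.choice(mdp.actions(state)))
        reward = mdp.reward(state)
        # 4. Backpropagate: update running means up to the root.
        while node is not None:
            node.visits += 1
            node.value += (reward - node.value) / node.visits
            node = node.parent
    # Play the action of the most visited child.
    return max(root.children, key=lambda c: c.visits).action
```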
Implementations
Embedded reinforcement learning
https://gitlab.eng.unimelb.edu.au/zichunz/comp90054-pacman/tree/mct
Reward: we directly use the Q-value from Approx Q as the reward, so the simulation step is in effect replaced by an Approx Q evaluation. In this implementation, expansion and simulation also consider all visible agents (within a Manhattan distance of 5). Enemy agents are evaluated with the same function during simulation, but their value is negated. During backpropagation we follow the Flat Monte Carlo approach: compute the average reward of each child of the root and pick the one with the maximum average as the best action. The aim is to benefit from the weight updates of the reinforcement learning, but since the whole system is complex, any change to its structure may affect overall performance.
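A rough sketch of this flat Monte Carlo idea, with the random simulation replaced by an Approx Q evaluation. The `approx_q_value` function and the `legal_actions`/`apply` state interface are illustrative assumptions, not the repository's actual code:

```python
import random

def evaluate(state, agent, approx_q_value, is_enemy=False):
    """Score a state with the Approx Q function; enemy agents contribute negatively."""
    best_q = max(approx_q_value(state, agent, a) for a in state.legal_actions(agent))
    return -best_q if is_enemy else best_q

def flat_mc_action(root_state, my_agent, visible_enemies, approx_q_value, n_samples=50):
    """Flat Monte Carlo: sample each root action, average the Approx Q evaluations,
    and play the action with the highest average (no deep tree is built)."""
    totals = {a: 0.0 for a in root_state.legal_actions(my_agent)}
    for action in totals:
        for _ in range(n_samples):
            # Apply our action, then let every visible enemy (< 5 Manhattan
            # distance away) respond; enemies use the same evaluation, negated.
            state = root_state.apply(my_agent, action)
            reward = evaluate(state, my_agent, approx_q_value)
            for enemy in visible_enemies:
                state = state.apply(enemy, random.choice(state.legal_actions(enemy)))
                reward += evaluate(state, enemy, approx_q_value, is_enemy=True)
            totals[action] += reward
    # Every action gets n_samples samples, so comparing sums equals comparing averages.
    return max(totals, key=lambda a: totals[a])
```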
Naive (discarded)
This is the simplest MCTS, with random selection and random simulation. It performs terribly even with a 10-second budget. The reward is assigned per cell.
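For illustration, the discarded rollout amounted to something like the sketch below; the `grid.legal_moves` interface and the per-cell `cell_reward` dictionary are assumptions:

```python
import random

def random_rollout(position, grid, cell_reward, max_steps=100):
    """Purely random playout: wander the maze and sum the per-cell rewards visited."""
    total = 0.0
    for _ in range(max_steps):
        moves = grid.legal_moves(position)  # hypothetical maze interface
        if not moves:
            break
        position = random.choice(moves)
        total += cell_reward.get(position, 0.0)
    return total
```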
Embedded policies from Value Iteration (discarded)
Given policies to drive the simulation, the agent is at least able to walk out of its spawn area :). The main issue is complexity: computing a value iteration solution is slow, and a single simulation may call value iteration several times, so we end up expanding only a dozen or so nodes within the 1-second budget. Trusting the resulting action is risky because of its high variance. To expand more nodes per unit of time, we ignore the other agents' movements, so value iteration only needs to be re-run when the rewards on the map change, which can only be caused by our own agent.
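A sketch of that caching idea, assuming a dict-based deterministic grid MDP; the `value_iteration` helper and the reward-map cache key are illustrative, not the project's code:

```python
def value_iteration(reward_map, transitions, gamma=0.9, iters=100):
    """Minimal value iteration where transitions[state][action] -> next_state
    (deterministic grid moves). Every state is assumed to have at least one
    action, e.g. 'Stop'."""
    values = {s: 0.0 for s in transitions}
    for _ in range(iters):
        values = {
            s: reward_map.get(s, 0.0)
               + gamma * max(values[nxt] for nxt in transitions[s].values())
            for s in transitions
        }
    # Greedy policy: in each state, move toward the highest-valued neighbour.
    return {s: max(transitions[s], key=lambda a: values[transitions[s][a]])
            for s in transitions}

_policy_cache = {}

def cached_policy(reward_map, transitions):
    """Re-run value iteration only when the reward map changes; the walls are
    static and only our own agent changes the rewards, so the cache stays valid."""
    key = frozenset(reward_map.items())  # hashable snapshot of the reward map
    if key not in _policy_cache:
        _policy_cache[key] = value_iteration(reward_map, transitions)
    return _policy_cache[key]
```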
Performance
Demo
Red: MCTS
Blue: Baseline