Approximate Q Learning
Theory
“Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state. Function approximation may speed up learning in finite problems.”[2]
Deep Q-learning was tried first, because it is the technique used in AlphaGo and the trained weights would be more accurate. However, it is hard to implement the neural network, especially when combining it with Q-learning, so we switched to approximate Q-learning. Approximate Q-learning approximates the exact Q-function, so we need neither a large Q-table (which speeds up training) nor a complicated ANN.
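Concretely, instead of storing one table entry per (state, action) pair, the Q-function is a weighted sum of hand-crafted features (this is the standard linear form; the actual feature set is specific to our agent):

$$Q(s,a) \approx \sum_i w_i \, f_i(s,a)$$

After each transition $(s, a, r, s')$ the weights are updated by

$$w_i \leftarrow w_i + \alpha \left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right] f_i(s,a)$$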
Unlike MCTS, approximate Q-learning (AQ) does not roll out to a terminal reward and backpropagate it; it bootstraps from the estimated value of the next state. When rewards are sparse, the AQ agent can therefore struggle to find a useful reward signal.
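A minimal Python sketch of that update (the feature extractor, the `legalActions` helper, and the learning parameters are assumptions, not the exact code in this branch):

```python
def getQValue(weights, featureExtractor, state, action):
    """Q(s, a) = sum_i w_i * f_i(s, a): a weighted sum of features, no Q-table."""
    features = featureExtractor(state, action)  # dict of feature name -> value
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def update(weights, featureExtractor, legalActions, state, action, nextState, reward,
           alpha=0.2, gamma=0.9):
    """One approximate Q-learning step: bootstrap from the next state's value."""
    # Bootstrapped TD target: immediate reward plus discounted best next Q-value,
    # instead of a reward backpropagated from a full rollout as in MCTS.
    nextValue = max(
        (getQValue(weights, featureExtractor, nextState, a) for a in legalActions(nextState)),
        default=0.0,
    )
    correction = (reward + gamma * nextValue) - getQValue(weights, featureExtractor, state, action)
    for name, value in featureExtractor(state, action).items():
        weights[name] = weights.get(name, 0.0) + alpha * correction * value
```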
Evolution and Experiments
The process of training the approximate Q-learning agent was not smooth.
The Pacman agents stopped eating the opponent's food at a certain stage after about 100 rounds of training. This was caused by the reward setting: the rewards were too sparse (only the game score was counted, i.e. the number of food pellets successfully returned). In theory this should not be a problem, since most commonly used RL algorithms can handle sparse rewards. In practice, however, sparse rewards mean the value estimates take much longer to converge and their variance stays high, which shows up as the Pacman agents "refusing to eat". We therefore decided to add more frequent, smaller rewards to speed up learning.
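A hedged sketch of what "more frequent, smaller rewards" looks like (the helper callables and the bonus magnitudes are illustrative assumptions, not the exact shaping used by our agent):

```python
def shapedReward(prevState, state, scoreDelta, numCarrying, distanceToNearestFood):
    """Add small, frequent bonuses on top of the sparse game score."""
    reward = scoreDelta  # sparse signal: food successfully returned home
    # `numCarrying` and `distanceToNearestFood` are hypothetical helpers that
    # read the game state; the real agent computes these from its observation.
    if numCarrying(state) > numCarrying(prevState):
        reward += 0.1    # small bonus for picking up a pellet
    if distanceToNearestFood(state) < distanceToNearestFood(prevState):
        reward += 0.01   # tiny bonus for moving toward food
    return reward
```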
After fixing this problem, the performance of the AQ algorithm was still not very good. Possible reasons:
- Our RL model is linear, so dependence (correlation) between features can hurt performance (see the sketch after this list).
- The reward setting is not ideal.
- The amount of training was insufficient.
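A tiny, self-contained illustration of the first point (synthetic data, not our Pacman features): when two features are nearly copies of each other, the feature Gram matrix becomes ill-conditioned, so many weight vectors fit almost equally well and the learned weights drift noisily during training.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + 0.01 * rng.normal(size=1000)      # almost a duplicate of x1

independent = np.column_stack([x1, rng.normal(size=1000)])
correlated = np.column_stack([x1, x2])

# Large condition number => the weights are poorly determined, so online
# updates wander even when the predictions barely change.
print(np.linalg.cond(independent.T @ independent))  # small: well-conditioned
print(np.linalg.cond(correlated.T @ correlated))    # huge: ill-conditioned
```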
Performance
Overall, the performance is not as good as MCTS.
https://gitlab.eng.unimelb.edu.au/zichunz/comp90054-pacman/tree/approxq
Demo
Red: Approx Q
Blue: Baseline