
Overkill currently contains two AI systems:

  • Macro-action selection. Using reinforcement learning to choose the next macro-action in the game. Macro-actions include almost all building-related actions and attack commands (roughly 25 actions in total):

    • build battle units (Zergling, Hydralisk, Mutalisk, Lurker, Scourge, Ultralisk).
    • build tech buildings.
    • expand the base.
    • trigger attack commands (all-in attack, Mutalisk attack).
    • build defensive buildings.
    • research key upgrades.
  • Opening strategy selection. Using UCB1 to select the best opening strategy at the beginning of the game according to previous match results (a sketch of the UCB1 rule follows this list).
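
UCB1 balances exploiting the opening with the best observed win rate against exploring openings that have been tried less often against the current opponent. Below is a minimal Python sketch of this rule; the stats structure and opening names are illustrative, not Overkill's actual data format.

```python
import math

def select_opening(stats):
    """UCB1 over opening strategies (minimal sketch).

    stats maps an opening name to (wins, plays) recorded from
    previous matches against the current opponent.
    """
    total_plays = sum(plays for _, plays in stats.values())
    best_opening, best_score = None, float("-inf")
    for opening, (wins, plays) in stats.items():
        if plays == 0:
            return opening  # try every opening at least once
        # empirical win rate + exploration bonus
        score = wins / plays + math.sqrt(2.0 * math.log(total_plays) / plays)
        if score > best_score:
            best_opening, best_score = opening, score
    return best_opening

# example: an unplayed opening is tried before the UCB scores are compared
stats = {"nine-pool": (4, 6), "ten-hatchery": (2, 5), "twelve-hatchery": (0, 0)}
print(select_opening(stats))  # -> "twelve-hatchery"
```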

Reinforcement learning

Due to limited computing resources (only one computer to train the model), the lack of a training framework, and the varied/unexpected strategies used by different bots, the main goal of this year's RL model is to adapt quickly to each opponent's strategy and improve the win rate in a relatively short time.

model

The current RL model uses DDQN and prioritized experience replay with a self-implemented neural network and many tuned components, which include:

  • using TD(n) instead of TD(0) as the backup. Although it is not theoretically correct to use TD(n) in an off-policy setting, if n is small the TD(n) return is still roughly reasonable and gives a less biased estimate than TD(0) (see the sketch after this list).
  • RMSprop.
  • experience pool size.
  • regularization hyper-parameters.
  • gradient clipping.
  • early stopping (still need to find a good implementation).
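
For reference, here is a sketch of how the TD(n) return combines with the double-DQN backup: the n intermediate rewards are discounted and summed, the online network picks the greedy macro-action at the state n steps ahead, and the target network evaluates it. The q_online/q_target interfaces below are placeholders, not the actual self-implemented network API.

```python
import numpy as np

def n_step_ddqn_target(rewards, next_state, done, gamma, q_online, q_target):
    """n-step double-DQN backup target (sketch).

    rewards:    the n rewards r_t ... r_{t+n-1}
    next_state: the state s_{t+n} reached after those n steps
    done:       True if the match ended within the n steps
    q_online / q_target: callables returning a Q-value vector over
                         the macro-actions (placeholder interfaces)
    """
    # discounted sum of the n intermediate rewards
    target = sum(gamma ** i * r for i, r in enumerate(rewards))
    if not done:
        # double DQN: the online net selects the action, the target net evaluates it
        best_action = int(np.argmax(q_online(next_state)))
        target += gamma ** len(rewards) * q_target(next_state)[best_action]
    return target
```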

Besides, below are some thoughts and comparisons against other RL methods:

  • policy-based methods:
    • stochastic policy gradient (Monte Carlo REINFORCE with an experience pool). I implemented this method first; although it did well against the built-in computer AI, it could not beat the opponent bot. The main reason may be that SPG is an on-policy method while using an experience pool requires an off-policy method, so doing things this way is not theoretically correct. Besides, the variance caused by the MC backup is still very high, even with a baseline introduced to reduce it.
    • A3C. Probably the most suitable method: SPG + TD(n), both avoiding the experience pool so that recent samples are used well and using TD(n) to get a good balance between bias and variance. But A3C needs multiple parallel environments, and I don't have enough computing resources, so I currently cannot use it.
    • DDPG. It might do better than SPG, but I can't figure out how to map the actor network's discrete output (n units) to the Q-network's input (one unit) while still keeping back-propagation working.
  • value-based methods:
    • a series of optimizations based on DQN, including DDQN, prioritized experience replay, and the dueling network. Because the number of training instances generated by one match is relatively small (roughly 300, since macro-action selection is done only once every two seconds), a deeper network is harder to train well in a short time, and the dueling network would add two more layers, so it is not used currently.

experiment

Below is the win-rate graph of Overkill against 2015-AIIDE-ualbertabot (Random) over 180 matches.

All matches use a fixed opening strategy (nine-pool) on the same map (Python). The exploration rate decays linearly from 100% to 10% over the first 50 matches and stays fixed at 10% in the later matches.
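
That schedule can be written as a small helper; a minimal sketch, with illustrative parameter names:

```python
def explore_rate(match_index, start=1.0, end=0.1, decay_matches=50):
    """Linear exploration-rate decay: 100% -> 10% over the first
    50 matches, then fixed at 10% (sketch of the schedule above)."""
    if match_index >= decay_matches:
        return end
    return start + (end - start) * match_index / decay_matches

# e.g. match 0 -> 1.0, match 25 -> 0.55, match 50 and later -> 0.1
```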

It can be seen from the replays that Overkill learns to build multiple Sunken Colonies to defend against ualbertabot's early rush, then upgrades its tech to produce Lurkers or Mutalisks to effectively counter Zealots/Zerglings/Marines, and expands its bases to get enough resources.

The win rate over the latest 30 matches is 100%, so training stops early (note that this criterion may not work universally; the win rate against some opponents cannot reach 100%).
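
The early-stop check described above amounts to a sliding window over recent match results; a minimal sketch with illustrative names:

```python
def should_early_stop(results, window=30, threshold=1.0):
    """Stop training once the win rate over the last `window` matches
    reaches `threshold` (1.0 = 100%). As noted above, this criterion
    does not hold against every opponent."""
    if len(results) < window:
        return False
    recent = results[-window:]  # results: 1 = win, 0 = loss
    return sum(recent) / window >= threshold
```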

problems

  • There are lots of ideas/parameters to try, but RL training is really time-consuming. For example, on my computer I can only play roughly 300-400 matches a day, which is just enough to verify one idea's effect, so the overall pace of experiments is very slow.
  • It is still very hard to learn a long and precise macro-action sequence. For example, when playing against Ximp with the ten-hatchery opening, the only way to win is to produce Mutalisks as soon as possible, and the time left for mistaken actions is very limited. But the macro-action sequence from the ten-hatchery opening to the Mutalisk attack command is very long (about 5 minutes of play), and it is hard for the bot to get the reward.
  • Apart from the macro part, micro-management also plays a key, or even more important, role in winning the game, but the action space and complexity of micro-management are even more challenging than the macro-action problem.

Any suggestions or advice are appreciated :)