2016 wiki - sijiaxu/Overkill GitHub Wiki

Overkill currently contains two main AI systems:

  • Building unit selection: a reinforcement learning (RL) model chooses the most appropriate unit to build (currently only three unit types to choose from: Zerglings, Mutalisks, Hydralisks) according to the current game status, and the match information is saved to continually improve the RL model itself.
  • Opening strategy selection: UCB1 selects the best opening strategy at the beginning of the game, according to the results of previous matches against this opponent (see the sketch after this list).
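
As a rough illustration of the opening selection, here is a minimal UCB1 sketch in C++; the OpeningRecord struct and function names are hypothetical, not Overkill's actual data structures:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical per-opening record against one opponent; Overkill's real bookkeeping may differ.
struct OpeningRecord {
    std::string name;
    int plays = 0;   // matches in which this opening was used against the opponent
    int wins  = 0;   // matches won with this opening
};

// Standard UCB1: pick the opening maximizing mean win rate plus an exploration bonus.
std::size_t selectOpeningUCB1(const std::vector<OpeningRecord>& records) {
    int totalPlays = 0;
    for (const auto& r : records) totalPlays += r.plays;

    std::size_t best = 0;
    double bestScore = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < records.size(); ++i) {
        const auto& r = records[i];
        if (r.plays == 0) return i;  // try every opening at least once
        double mean  = static_cast<double>(r.wins) / r.plays;
        double bonus = std::sqrt(2.0 * std::log(static_cast<double>(totalPlays)) / r.plays);
        double score = mean + bonus;
        if (score > bestScore) { bestScore = score; best = i; }
    }
    return best;
}
```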

Reinforcement learning

The main rationale for using RL to predict the unit is that the most appropriate unit to build in a given game state is fixed, despite the randomness of the bot's micromanagement and army tactics/positioning. This mapping can therefore be learned by reinforcement learning.

The RL code is mainly in StrategyManager.cpp.

model

The RL model does Q-value prediction with linear function approximation. After the opening stage of each match, whenever the current production queue is empty (the minimum interval between model calls is 10 seconds), Overkill uses the RL model to take the action with the highest Q-value as the next unit type to build, and saves the feature values of each model call as an instance.

After the match ends, Overkill uses the match result, discounted per instance, as the backup value and performs several rounds of SGD to improve the model. I have tried using the one-step Q-value prediction as the backup value for each instance, but the result was worse.

The exploration rate is increased up to 0.9 during the training phase, and is set to 0.9 in official matches.
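
A minimal sketch of this decision loop, assuming a linear Q-model over sparse binary features; all names below are illustrative rather than the actual code in StrategyManager.cpp, and the wiki's "exploration rate" of 0.9 is interpreted here as the probability of taking the greedy (highest-Q) action, which is an assumption:

```cpp
#include <cstddef>
#include <random>
#include <unordered_map>
#include <vector>

// Illustrative types; Overkill's real implementation differs.
enum class UnitChoice { Zergling, Mutalisk, Hydralisk };

struct Instance {
    std::vector<int> activeFeatures;  // indices of binary features that are 1
    double target = 0.0;              // filled in after the match ends
};

// Linear Q-value: with binary features, the dot product reduces to a sum of weights.
double qValue(const std::unordered_map<int, double>& weights,
              const std::vector<int>& activeFeatures) {
    double q = 0.0;
    for (int f : activeFeatures)
        if (auto it = weights.find(f); it != weights.end()) q += it->second;
    return q;
}

// Choose the build action: with probability greedyRate take the highest-Q action,
// otherwise pick a random one. featuresPerAction[a] holds the state-action features
// for action a, indexed in the same order as UnitChoice.
UnitChoice chooseAction(const std::unordered_map<int, double>& weights,
                        const std::vector<std::vector<int>>& featuresPerAction,
                        double greedyRate, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) >= greedyRate) {
        std::uniform_int_distribution<std::size_t> pick(0, featuresPerAction.size() - 1);
        return static_cast<UnitChoice>(pick(rng));
    }
    std::size_t best = 0;
    for (std::size_t a = 1; a < featuresPerAction.size(); ++a)
        if (qValue(weights, featuresPerAction[a]) > qValue(weights, featuresPerAction[best]))
            best = a;
    return static_cast<UnitChoice>(best);
}

// After the match: back up the final result (e.g. +1 win / -1 loss) to every saved
// instance, discounted from the last decision backwards (assumed direction of discounting).
void assignTargets(std::vector<Instance>& instances, double matchResult, double discount) {
    double value = matchResult;
    for (auto it = instances.rbegin(); it != instances.rend(); ++it) {
        it->target = value;
        value *= discount;
    }
}
```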

feature

All features are binary. Binary features have several advantages in a linear model:

  • Discretizing a continuous feature into several binary features effectively increases the dimension of the original feature space, which may help the linear model fit better (it removes some of the non-linearity in the original data space); see the example after this list.
  • No normalization is needed.
  • Training converges faster.
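
For example, a continuous quantity such as the mineral count could be bucketized into one-hot binary features; the thresholds and function below are purely illustrative:

```cpp
#include <cstddef>
#include <vector>

// Discretize a continuous value into one-hot binary features by thresholds.
// The bucket boundaries are made-up examples, not Overkill's actual ones.
std::vector<int> bucketize(double value, const std::vector<double>& thresholds) {
    std::vector<int> bits(thresholds.size() + 1, 0);
    std::size_t bucket = 0;
    while (bucket < thresholds.size() && value >= thresholds[bucket]) ++bucket;
    bits[bucket] = 1;  // exactly one bit set, so the linear model learns one weight per bucket
    return bits;
}

// Usage: bucketize(minerals, {200, 500, 1000, 2000}) yields 5 binary features.
```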

There are three types of features in the current Overkill:

  • State features: all current game-status information (units, economy, tech).
  • State combined features: our battle-unit features combined with the enemy's battle-unit features.
  • State-action combined features: action features combined with the state combined features, and action features combined with some tech features.

The total number of distinct features is about 4000, and each instance has roughly 100 active features, so an instance's feature vector is sparse.
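
One possible way to realize the combined features above is to give every feature cross its own index in the weight vector, e.g. by hashing the concatenated names; this is only an assumed indexing scheme, not necessarily how Overkill builds its ~4000-feature dictionary:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Map a named binary feature to an index; crosses get their own distinct index,
// so the linear model can learn a separate weight per combination.
// Hashing into a fixed table is an assumption for this sketch; an explicit
// feature dictionary of ~4000 entries would work just as well.
constexpr std::size_t kFeatureSpace = 1 << 16;

std::size_t featureIndex(const std::string& name) {
    return std::hash<std::string>{}(name) % kFeatureSpace;
}

// Build the sparse instance: singleton state features plus state x action crosses.
std::vector<std::size_t> buildFeatures(const std::vector<std::string>& stateFeatures,
                                       const std::string& action) {
    std::vector<std::size_t> active;
    for (const auto& s : stateFeatures) {
        active.push_back(featureIndex(s));                 // state feature
        active.push_back(featureIndex(s + "&" + action));  // state-action combined feature
    }
    active.push_back(featureIndex("action=" + action));    // action feature
    return active;
}
```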

learning method

The model is trained with SGD using a mini-batch size of 10, with several modifications:

  • An experience pool stores the latest 3000 instances, and instances are randomly sampled from the pool for SGD. The detailed reasoning, benefits, and hyper-parameter settings can be found in the paper Human-level control through deep reinforcement learning.
  • Per-coordinate learning rates are used for each feature, similar to the idea in FTRL-Proximal. The main consideration is that decreasing the learning rate for all features at once is not a good solution, because most features do not appear in the current SGD round. Both modifications are sketched after this list.
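
Both modifications can be sketched roughly as follows: mini-batches are sampled uniformly from a replay pool, and each weight keeps its own accumulated-gradient statistic, so only features that actually appear have their effective learning rate decayed (an AdaGrad-style per-coordinate rate in the spirit of FTRL-Proximal; the exact update rule Overkill uses may differ):

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <unordered_map>
#include <vector>

struct ReplaySample {
    std::vector<std::size_t> activeFeatures;  // indices of binary features that are 1
    double target;                            // discounted match result
};

struct LinearQModel {
    std::unordered_map<std::size_t, double> w;       // weights
    std::unordered_map<std::size_t, double> gradSq;  // per-coordinate sum of squared gradients
    double alpha = 0.1;                               // base learning rate (illustrative value)

    double predict(const std::vector<std::size_t>& feats) const {
        double q = 0.0;
        for (auto f : feats)
            if (auto it = w.find(f); it != w.end()) q += it->second;
        return q;
    }

    // One SGD step on a mini-batch drawn uniformly from the replay pool.
    void sgdStep(const std::vector<ReplaySample>& pool, std::size_t batchSize, std::mt19937& rng) {
        std::uniform_int_distribution<std::size_t> pick(0, pool.size() - 1);
        for (std::size_t b = 0; b < batchSize; ++b) {
            const ReplaySample& s = pool[pick(rng)];
            double err = predict(s.activeFeatures) - s.target;  // squared-loss gradient is err * x
            for (auto f : s.activeFeatures) {                   // x is 1 only for active features
                gradSq[f] += err * err;
                // Per-coordinate rate: only features that actually appear get decayed.
                w[f] -= alpha / (std::sqrt(gradSq[f]) + 1e-6) * err;
            }
        }
    }
};

// Keeping the pool capped at the latest 3000 instances:
// if (pool.size() >= 3000) pool.erase(pool.begin());
// pool.push_back(newSample);
```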

experiment result

Below are some statistics against IceBot (2015 AIIDE version) over 600 matches on 10 maps (the 2015 AIIDE competition maps) during the training phase.

The exploration rate is set to 0.5 at first, increased by 0.1 every 50 matches, and eventually stops at 0.9.

The win rate is calculated over every 50 matches.

The average loss is calculated over every 500 rounds of SGD.

The average Q-value is calculated over every 500 instances.

According to these statistics, the RL model appears to be converging.

remaining problems

The main remaining problem is that the model's convergence is sometimes unstable: e.g., the model converges smoothly for the first 200 matches, but then diverges for the next 50-100 matches.

Decreasing the learning-rate hyper-parameter can alleviate this problem, but it still occurs.

It may be caused by some randomness during the game (e.g., the same battle units at different map positions, or using different tactics, may yield different battle results) and by some programming bugs.