Hyperparameter Optimization - werner-duvaud/muzero-general GitHub Wiki
MuZero General has an asynchronous hyperparameter search method. It uses Nevergrad under the hood to search the hyperparameter space.
By default, it optimizes the learning rate and the discount factor. You can edit this parametrization in the __main__ of muzero.py. More details about the parametrization are available here.
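As a rough illustration, a Nevergrad parametrization over these two hyperparameters could look like the sketch below (the bounds, names and chosen optimizer are illustrative assumptions, not the exact values used in muzero.py):

```python
import nevergrad as ng

# Hypothetical search space: learning rate on a log scale, discount as a scalar.
parametrization = ng.p.Dict(
    lr_init=ng.p.Log(lower=1e-4, upper=1e-1),
    discount=ng.p.Scalar(lower=0.95, upper=0.9999),
)

# The optimizer proposes candidates, we evaluate them (e.g. by training for a
# while and returning a score to minimize), and report the result back.
optimizer = ng.optimizers.OnePlusOne(parametrization=parametrization, budget=20)
# candidate = optimizer.ask()
# loss = evaluate(candidate.value)  # hypothetical evaluation function
# optimizer.tell(candidate, loss)
```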
This page is dedicated to documenting hyperparameters and gathering advice for tuning them. Feel free to add your own advice. You can also discuss it on the Discord server.
Here are some references for AlphaZero hyperparameters which are quite similar to those of MuZero:
- Lessons from AlphaZero (part 3): Parameter Tweaking
- Lessons From Alpha Zero (part 6) — Hyperparameter Tuning
Table of contents
- seed
- max_num_gpus
- Game
- Evaluate
- Self-Play
- Network
- Training
- Replay Buffer
- Adjust the self play / training ratio to avoid over/underfitting
- visit_softmax_temperature_fn
seed
Seed for numpy, torch and the game.
max_num_gpus
Set the maximum number of GPUs to use. It is usually faster to use a single GPU (set it to 1) if it has enough memory. None will use every available GPU.
Game
observation_shape
Dimensions of the game observation, must be 3D (channel, height, width). For a 1D array, please reshape it to (1, 1, length of array).
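For example, a minimal sketch of reshaping a 1D observation into this format (the values are made up):

```python
import numpy as np

# Hypothetical 1D observation with 4 values (e.g. a CartPole-like state).
observation = np.array([0.01, -0.02, 0.03, 0.04])

# Reshape to the (channel, height, width) format expected by MuZero General.
observation = observation.reshape(1, 1, -1)  # shape: (1, 1, 4)
# The matching config entry would then be: observation_shape = (1, 1, 4)
```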
action_space
Fixed list of all possible actions. You should only edit the length.
players
List of players. You should only edit the length.
stacked_observations
Number of previous observations and previous actions to add to the current observation.
Evaluate
muzero_player
Turn on which MuZero begins to play (0: MuZero plays first, 1: MuZero plays second).
opponent
Hard-coded agent that MuZero faces to assess its progress in multiplayer games. It does not influence training. None, "random" or "expert" if implemented in the Game class.
Self-Play
num_actors
Number of simultaneous threads self-playing to feed the replay buffer.
selfplay_on_gpu
True / False. Use the GPU for self-play.
max_moves
Maximum number of moves if the game is not finished before.
num_simulations
Number of future moves self-simulated (MCTS simulations per move).
discount
Chronological discount of the reward. Should be set to 1 for board games with a single reward at the end of the game.
temperature_threshold
Number of moves before dropping the temperature given by visit_softmax_temperature_fn to 0 (i.e. selecting the best action). If None, visit_softmax_temperature_fn is used at every move.
Root prior exploration noise
root_dirichlet_alpha
root_exploration_fraction
UCB formula
pb_c_base
pb_c_init
Network
network
Select the type of network to use: "resnet" / "fullyconnected"
support_size
Value and reward are scaled (with an almost-sqrt transform) and encoded on a vector of size 2 * support_size + 1, covering the range -support_size to support_size.
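A minimal sketch of this encoding, assuming the invertible scaling h(x) = sign(x)(sqrt(|x| + 1) - 1) + eps * x from the MuZero paper and a projection onto the two nearest bins (the repository's exact implementation may differ):

```python
import torch

def scalar_to_support(x, support_size, eps=0.001):
    # Scale the scalar (almost a sqrt), then clamp it to the supported range.
    x = torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + eps * x
    x = torch.clamp(x, -support_size, support_size)
    floor = x.floor()
    prob_upper = x - floor  # weight assigned to the upper neighbouring bin
    logits = torch.zeros(x.shape[0], 2 * support_size + 1)
    logits.scatter_(1, (floor + support_size).long().unsqueeze(-1),
                    (1 - prob_upper).unsqueeze(-1))
    upper = (floor + support_size + 1).clamp(max=2 * support_size).long().unsqueeze(-1)
    logits.scatter_add_(1, upper, prob_upper.unsqueeze(-1))
    return logits  # one vector of size 2 * support_size + 1 per scalar

logits = scalar_to_support(torch.tensor([3.7]), support_size=10)  # shape (1, 21)
```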
Residual Network
downsample
Downsample observations before representation network (See paper appendix Network Architecture).
blocks
Number of blocks in the ResNet.
channels
Number of channels in the ResNet.
reduced_channels_reward
Number of channels in reward head.
reduced_channels_value
Number of channels in value head.
reduced_channels_policy
Number of channels in policy head.
resnet_fc_reward_layers
Define the hidden layers in the reward head of the dynamics network.
resnet_fc_value_layers
Define the hidden layers in the value head of the prediction network.
resnet_fc_policy_layers
Define the hidden layers in the policy head of the prediction network.
Fully Connected Network
encoding_size
fc_representation_layers
Define the hidden layers in the representation network.
fc_dynamics_layers
Define the hidden layers in the dynamics network.
fc_reward_layers
Define the hidden layers in the reward network.
fc_value_layers
Define the hidden layers in the value network.
fc_policy_layers
Define the hidden layers in the policy network.
Training
results_path
Path to store the model weights and TensorBoard logs.
training_steps
Total number of training steps (i.e. weight updates, one per batch).
batch_size
Number of parts of games to train on at each training step.
checkpoint_interval
Number of training steps before using the model for self-playing.
value_loss_weight
Scale the value loss to avoid overfitting of the value function, paper recommends 0.25 (See paper appendix Reanalyze).
train_on_gpu
True / False. Use the GPU for training.
optimizer
"Adam" or "SGD". Paper uses SGD.
weight_decay
Coefficient of the L2 weights regularization.
momentum
Used only if optimizer is SGD.
Exponential learning rate schedule
lr_init
Initial learning rate.
lr_decay_rate
Set it to 1 to use a constant learning rate.
lr_decay_steps
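Together, these three parameters typically define a schedule of the form sketched below (the usual exponential decay formula; check trainer.py for the exact code used):

```python
def learning_rate(training_step, lr_init, lr_decay_rate, lr_decay_steps):
    # Exponential decay: starts at lr_init and is multiplied by lr_decay_rate
    # every lr_decay_steps training steps.
    return lr_init * lr_decay_rate ** (training_step / lr_decay_steps)

# With lr_decay_rate = 1, the learning rate stays constant at lr_init.
```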
Replay Buffer
window_size
Number of self-play games to keep in the replay buffer.
num_unroll_steps
Number of game moves to keep for every batch element.
td_steps
Number of steps in the future to take into account for calculating the target value. Should be equal to max_moves for board games with a single reward at the end of the game.
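As an illustration, a hedged sketch of an n-step TD target built from td_steps and discount (indexing conventions and variable names are illustrative, not the repository's exact code):

```python
def n_step_target(rewards, root_values, t, td_steps, discount):
    # Bootstrap from the search value td_steps moves into the future (if available),
    # then add the discounted rewards observed along the way.
    bootstrap = t + td_steps
    value = root_values[bootstrap] * discount ** td_steps if bootstrap < len(root_values) else 0
    for i, reward in enumerate(rewards[t:bootstrap]):
        value += reward * discount ** i
    return value
```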
PER
Prioritized Replay (See paper appendix Training). Select in priority the elements in the replay buffer which are unexpected for the network.
use_max_priority
If False, use the n-step TD error as initial priority. Better for large replay buffer.
PER_alpha
How much prioritization is used, 0 corresponding to the uniform case, paper suggests 1.
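A small sketch of the effect of PER_alpha on sampling probabilities (the priorities below are made up):

```python
import numpy as np

priorities = np.array([0.5, 2.0, 8.0])  # hypothetical priorities of three buffer elements

def sampling_probabilities(priorities, alpha):
    return priorities ** alpha / np.sum(priorities ** alpha)

print(sampling_probabilities(priorities, alpha=0.0))  # uniform: [0.333, 0.333, 0.333]
print(sampling_probabilities(priorities, alpha=1.0))  # proportional: [0.048, 0.190, 0.762]
```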
Reanalyze (See paper appendix Reanalyse)
use_last_model_value
Use the last model to provide a fresher, stable n-step value (See paper appendix Reanalyze).
reanalyse_on_gpu
True / False. Use the GPU for reanalyze.
Adjust the self play / training ratio to avoid over/underfitting
self_play_delay
Number of seconds to wait after each played game.
training_delay
Number of seconds to wait after each training step.
ratio
Desired ratio of training steps per self-played step. Equivalent to a synchronous version; training can take much longer. Set it to None to disable it.
visit_softmax_temperature_fn
Parameter altering the visit count distribution to ensure that the action selection becomes greedier as training progresses. The smaller the temperature, the more likely it is that the best action (i.e. the one with the highest visit count) will be chosen.
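For illustration, a minimal sketch of how the temperature reshapes the visit-count distribution before an action is sampled (the visit counts below are made up):

```python
import numpy as np

def select_action(visit_counts, temperature):
    # temperature == 0 means greedy selection of the most visited action.
    if temperature == 0:
        return int(np.argmax(visit_counts))
    distribution = np.array(visit_counts, dtype=float) ** (1 / temperature)
    distribution /= distribution.sum()
    return int(np.random.choice(len(visit_counts), p=distribution))

visit_counts = [2, 5, 43]          # hypothetical MCTS visit counts
select_action(visit_counts, 1.0)   # samples roughly proportionally to visit counts
select_action(visit_counts, 0.1)   # almost always picks the most visited action
```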