Hyperparameter Optimization - werner-duvaud/muzero-general GitHub Wiki
MuZero General has an asynchronous hyperparameter search method. It uses Nevergrad under the hood to search the hyperparameter space.
By default, it optimizes the learning rate and the discount factor. You can edit this parametrization in the __main__ of muzero.py. More details about the parametrization are available here.
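As a rough illustration, a Nevergrad parametrization over these two hyperparameters could look like the sketch below (the bounds, names and chosen optimizer are illustrative assumptions, not the exact values used in muzero.py):

```python
import nevergrad as ng

# Hypothetical search space: learning rate on a log scale, discount as a scalar.
parametrization = ng.p.Dict(
    lr_init=ng.p.Log(lower=1e-4, upper=1e-1),
    discount=ng.p.Scalar(lower=0.95, upper=0.9999),
)

# The optimizer proposes candidates, we evaluate them (e.g. by training for a
# while and returning a score to minimize), and report the result back.
optimizer = ng.optimizers.OnePlusOne(parametrization=parametrization, budget=20)
# candidate = optimizer.ask()
# loss = evaluate(candidate.value)  # hypothetical evaluation function
# optimizer.tell(candidate, loss)
```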
This page is dedicated to documenting hyperparameters and gathering advice for tuning them. Feel free to add your own advice. You can also discuss it on the Discord server.
Here are some references for AlphaZero hyperparameters which are quite similar to those of MuZero:
- Lessons from AlphaZero (part 3): Parameter Tweaking
- Lessons From Alpha Zero (part 6) — Hyperparameter Tuning
Table of contents
- seed
- max_num_gpus
- Game
- Evaluate
- Self-Play
- Network
- Training
- Replay Buffer
- Adjust the self play / training ratio to avoid over/underfitting
- visit_softmax_temperature_fn
seed
Seed for numpy, torch and the game.
max_num_gpus
Set the maximum number of GPUs to use. It is usually faster to use a single GPU (set it to 1) if it has enough memory. None will use every available GPU.
Game
observation_shape
Dimensions of the game observation, must be 3D (channel, height, width). For a 1D array, please reshape it to (1, 1, length of array).
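For example, a minimal sketch of reshaping a 1D observation into this format (the values are made up):

```python
import numpy as np

# Hypothetical 1D observation with 4 values (e.g. a CartPole-like state).
observation = np.array([0.01, -0.02, 0.03, 0.04])

# Reshape to the (channel, height, width) format expected by MuZero General.
observation = observation.reshape(1, 1, -1)  # shape: (1, 1, 4)
# The matching config entry would then be: observation_shape = (1, 1, 4)
```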
action_space
Fixed list of all possible actions. You should only edit the length.
players
List of players. You should only edit the length.
stacked_observations
Number of previous observations and previous actions to add to the current observation.
Evaluate
muzero_player
Turn on which MuZero begins to play (0: MuZero plays first, 1: MuZero plays second).
opponent
Hard-coded agent that MuZero faces to assess its progress in multiplayer games. It does not influence training. None, "random" or "expert" if implemented in the Game class.
Self-Play
num_actors
Number of simultaneous threads self-playing to feed the replay buffer.
selfplay_on_gpu
True / False. Use the GPU for self-play.
max_moves
Maximum number of moves if the game is not finished before.
num_simulations
Number of future moves self-simulated (MCTS simulations per move).
discount
Chronological discount of the reward. Should be set to 1 for board games with a single reward at the end of the game.
temperature_threshold
Number of moves before dropping the temperature given by visit_softmax_temperature_fn to 0 (i.e. selecting the best action). If None, visit_softmax_temperature_fn is used at every move.
Root prior exploration noise
root_dirichlet_alpha
root_exploration_fraction
UCB formula
pb_c_base
pb_c_init
Network
network
Select the type of network to use: "resnet" / "fullyconnected"
support_size
Value and reward are scaled (with an almost-sqrt transform) and encoded on a vector of size 2 * support_size + 1, covering the range -support_size to support_size.
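A minimal sketch of this encoding, assuming the invertible scaling h(x) = sign(x)(sqrt(|x| + 1) - 1) + eps * x from the MuZero paper and a projection onto the two nearest bins (the repository's exact implementation may differ):

```python
import torch

def scalar_to_support(x, support_size, eps=0.001):
    # Scale the scalar (almost a sqrt), then clamp it to the supported range.
    x = torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + eps * x
    x = torch.clamp(x, -support_size, support_size)
    floor = x.floor()
    prob_upper = x - floor  # weight assigned to the upper neighbouring bin
    logits = torch.zeros(x.shape[0], 2 * support_size + 1)
    logits.scatter_(1, (floor + support_size).long().unsqueeze(-1),
                    (1 - prob_upper).unsqueeze(-1))
    upper = (floor + support_size + 1).clamp(max=2 * support_size).long().unsqueeze(-1)
    logits.scatter_add_(1, upper, prob_upper.unsqueeze(-1))
    return logits  # one vector of size 2 * support_size + 1 per scalar

logits = scalar_to_support(torch.tensor([3.7]), support_size=10)  # shape (1, 21)
```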
Residual Network
downsample
Downsample observations before representation network (See paper appendix Network Architecture).
blocks
Number of blocks in the ResNet.
channels
Number of channels in the ResNet.
reduced_channels_reward
Number of channels in reward head.
reduced_channels_value
Number of channels in value head.
reduced_channels_policy
Number of channels in policy head.
resnet_fc_reward_layers
Define the hidden layers in the reward head of the dynamics network.
resnet_fc_value_layers
Define the hidden layers in the value head of the prediction network.
resnet_fc_policy_layers
Define the hidden layers in the policy head of the prediction network.
Fully Connected Network
encoding_size
fc_representation_layers
Define the hidden layers in the representation network.
fc_dynamics_layers
Define the hidden layers in the dynamics network.
fc_reward_layers
Define the hidden layers in the reward network.
fc_value_layers
Define the hidden layers in the value network.
fc_policy_layers
Define the hidden layers in the policy network.
Training
results_path
Path to store the model weights and TensorBoard logs.
training_steps
Total number of training steps (i.e. weight updates, one per batch).
batch_size
Number of parts of games to train on at each training step.
checkpoint_interval
Number of training steps before using the model for self-playing.
value_loss_weight
Scale the value loss to avoid overfitting of the value function, paper recommends 0.25 (See paper appendix Reanalyze).
train_on_gpu
True / False. Use the GPU for training.
optimizer
"Adam" or "SGD". Paper uses SGD.
weight_decay
Coefficient of the L2 weights regularization.
momentum
Used only if optimizer is SGD.
Exponential learning rate schedule
lr_init
Initial learning rate.
lr_decay_rate
Set it to 1 to use a constant learning rate.
lr_decay_steps
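Together, these three parameters typically define a schedule of the form sketched below (the usual exponential decay formula; check trainer.py for the exact code used):

```python
def learning_rate(training_step, lr_init, lr_decay_rate, lr_decay_steps):
    # Exponential decay: starts at lr_init and is multiplied by lr_decay_rate
    # every lr_decay_steps training steps.
    return lr_init * lr_decay_rate ** (training_step / lr_decay_steps)

# With lr_decay_rate = 1, the learning rate stays constant at lr_init.
```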
Replay Buffer
window_size
Number of self-play games to keep in the replay buffer.
num_unroll_steps
Number of game moves to keep for every batch element.
td_steps
Number of steps in the future to take into account for calculating the target value. Should be equal to max_moves for board games with a single reward at the end of the game.
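As an illustration, a hedged sketch of an n-step TD target built from td_steps and discount (indexing conventions and variable names are illustrative, not the repository's exact code):

```python
def n_step_target(rewards, root_values, t, td_steps, discount):
    # Bootstrap from the search value td_steps moves into the future (if available),
    # then add the discounted rewards observed along the way.
    bootstrap = t + td_steps
    value = root_values[bootstrap] * discount ** td_steps if bootstrap < len(root_values) else 0
    for i, reward in enumerate(rewards[t:bootstrap]):
        value += reward * discount ** i
    return value
```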
PER
Prioritized Replay (See paper appendix Training). Select in priority the elements in the replay buffer which are unexpected for the network.
use_max_priority
If False, use the n-step TD error as initial priority. Better for large replay buffer.
PER_alpha
How much prioritization is used, 0 corresponding to the uniform case, paper suggests 1.
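A small sketch of the effect of PER_alpha on sampling probabilities (the priorities below are made up):

```python
import numpy as np

priorities = np.array([0.5, 2.0, 8.0])  # hypothetical priorities of three buffer elements

def sampling_probabilities(priorities, alpha):
    return priorities ** alpha / np.sum(priorities ** alpha)

print(sampling_probabilities(priorities, alpha=0.0))  # uniform: [0.333, 0.333, 0.333]
print(sampling_probabilities(priorities, alpha=1.0))  # proportional: [0.048, 0.190, 0.762]
```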
Reanalyze (See paper appendix Reanalyse)
use_last_model_value
Use the last model to provide a fresher, stable n-step value (See paper appendix Reanalyze).
reanalyse_on_gpu
True / False. Use the GPU for reanalyze.
Adjust the self play / training ratio to avoid over/underfitting
self_play_delay
Number of seconds to wait after each played game.
training_delay
Number of seconds to wait after each training step.
ratio
Desired ratio of training steps per self-played step. Equivalent to a synchronous version; training can take much longer. Set it to None to disable it.
visit_softmax_temperature_fn
Parameter altering the visit count distribution to ensure that the action selection becomes greedier as training progresses. The smaller the temperature, the more likely it is that the best action (i.e. the one with the highest visit count) will be chosen.
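For illustration, a minimal sketch of how the temperature reshapes the visit-count distribution before an action is sampled (the visit counts below are made up):

```python
import numpy as np

def select_action(visit_counts, temperature):
    # temperature == 0 means greedy selection of the most visited action.
    if temperature == 0:
        return int(np.argmax(visit_counts))
    distribution = np.array(visit_counts, dtype=float) ** (1 / temperature)
    distribution /= distribution.sum()
    return int(np.random.choice(len(visit_counts), p=distribution))

visit_counts = [2, 5, 43]          # hypothetical MCTS visit counts
select_action(visit_counts, 1.0)   # samples roughly proportionally to visit counts
select_action(visit_counts, 0.1)   # almost always picks the most visited action
```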