
Hyperparameters

This page documents the hyperparameters used in the offline policy evaluation algorithms implemented in this repository. These include settings for DICE-based methods, importance sampling estimators, and value function solvers.

General Notation

  • Let $\gamma \in (0, 1]$ be the discount factor.
  • Let $\pi$ denote the evaluation policy.
  • Let $\pi^{\mathcal D}$ denote the behavior policy.
  • Let $w_{\pi / \mathcal D}^\gamma$ denote the stationary distribution correction ratio.
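
With this notation fixed, the correction ratio has the standard form used throughout the DICE literature; it is stated here only as a reference (for $\gamma = 1$, $d_\pi^\gamma$ is read as the stationary distribution of $\pi$):

$$
w_{\pi / \mathcal D}^\gamma(s, a) = \frac{d_\pi^\gamma(s, a)}{d^{\mathcal D}(s, a)},
\qquad
d_\pi^\gamma(s, a) = (1 - \gamma) \sum_{t = 0}^\infty \gamma^t \Pr(s_t = s, a_t = a \mid \pi),
$$

so the policy value can be estimated from the offline dataset as $\hat\rho(\pi) = \mathbb E_{(s, a, r) \sim \mathcal D} \left[ w_{\pi / \mathcal D}^\gamma(s, a) \, r \right]$.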

Optimization Hyperparameters

For DICE methods (e.g., DualDICE, GenDICE), the following settings apply; a training-loop sketch follows the list:

  • learning_rate: Learning rate used by the optimizer (e.g., Adam).
  • batch_size: Number of samples per gradient update.
  • num_iterations: Total number of training iterations.
  • gradient_clip: Optional max-norm for gradient clipping.
  • optimizer: Optimizer type (usually adam).
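
As a usage sketch, assuming a TensorFlow/Keras setup: the loop below shows where each of these settings enters training. The network, batch sampler, and objective are placeholders, not this repository's actual DICE loss.

```python
import tensorflow as tf

# Illustrative values; see the experiment scripts for actual defaults.
learning_rate = 1e-4
batch_size = 512
num_iterations = 1_000
gradient_clip = 1.0  # max global norm; set to None to disable clipping

# Placeholder network standing in for a DICE primal/dual estimator.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

def sample_batch():
    # Placeholder for drawing (s, a) features from the offline dataset.
    return tf.random.normal((batch_size, 8))

for _ in range(num_iterations):
    x = sample_batch()
    with tf.GradientTape() as tape:
        w = model(x)                         # correction-ratio estimates
        loss = tf.reduce_mean(tf.square(w))  # placeholder objective
    grads = tape.gradient(loss, model.trainable_variables)
    if gradient_clip is not None:
        grads, _ = tf.clip_by_global_norm(grads, gradient_clip)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```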

Regularization Parameters (a loss-construction sketch follows the list):

  • entropy_coeff: Coefficient for entropy regularization (if applicable).
  • l2_weight: Weight for L2 regularization.
  • dual_reg_coeff: Coefficient used for dual variable regularization.
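
A minimal sketch of how these coefficients might enter the objective; the exact form of each term is algorithm-specific, and `regularized_loss` is a hypothetical helper:

```python
import tensorflow as tf

# Illustrative coefficients.
l2_weight = 1e-4
dual_reg_coeff = 1.0

def regularized_loss(primal_loss, dual_variables, model):
    # L2 penalty on all network weights.
    l2_term = tf.add_n([tf.nn.l2_loss(v) for v in model.trainable_variables])
    # Penalty on the dual variable(s); in GenDICE-style methods this keeps
    # the normalization multiplier well-behaved.
    dual_term = tf.reduce_sum(tf.square(dual_variables))
    # An entropy term scaled by entropy_coeff would be added analogously
    # where the algorithm defines one.
    return primal_loss + l2_weight * l2_term + dual_reg_coeff * dual_term
```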

Network Architecture

For Neural Estimators (a network-builder sketch follows the list):

  • hidden_sizes: List specifying the number of hidden units per layer (e.g., [256, 256]).
  • activation: Activation function used in hidden layers (e.g., relu, tanh).
  • use_layer_norm: Whether to apply layer normalization.
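
A sketch of turning these settings into a network, assuming `tf.keras`; `build_mlp` is a hypothetical helper, not a function of this repository:

```python
import tensorflow as tf

def build_mlp(hidden_sizes, activation, use_layer_norm, output_size=1):
    layers = []
    for size in hidden_sizes:
        layers.append(tf.keras.layers.Dense(size))
        if use_layer_norm:
            # Normalize pre-activations; placing the norm before the
            # nonlinearity is one common choice, not the only one.
            layers.append(tf.keras.layers.LayerNormalization())
        layers.append(tf.keras.layers.Activation(activation))
    layers.append(tf.keras.layers.Dense(output_size))
    return tf.keras.Sequential(layers)

network = build_mlp(hidden_sizes=[256, 256], activation="relu", use_layer_norm=True)
```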

Evaluation

Estimation Settings (a weight-handling sketch follows the list):

  • num_eval_episodes: Number of episodes used for on-policy evaluation.
  • normalize_weights: Whether to normalize importance weights in estimators.
  • clip_weights: Whether to clip importance weights; if enabled, also specifies the clipping threshold.
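
The usual order of operations is to clip first and then self-normalize; the helper below is hypothetical and only illustrates that:

```python
import numpy as np

def weighted_estimate(weights, rewards, normalize_weights=True, clip_threshold=100.0):
    """Combine per-transition correction ratios and rewards into one estimate."""
    w = np.asarray(weights, dtype=np.float64)
    r = np.asarray(rewards, dtype=np.float64)
    if clip_threshold is not None:
        w = np.clip(w, 0.0, clip_threshold)  # cap extreme ratios first
    if normalize_weights:
        w = w / w.mean()                     # self-normalized estimator
    return float(np.mean(w * r))
```

Self-normalization introduces a small bias but typically reduces variance, which is why it is often enabled by default.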

Dataset and Logging

  • seed: Random seed for reproducibility.
  • log_interval: How frequently to log metrics during training.
  • eval_interval: How frequently to evaluate the current model.
  • save_model: Whether to save model checkpoints during training.
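
A sketch of how these settings are typically used in a training script; `train_step` and `evaluate` are placeholders:

```python
import random
import numpy as np
import tensorflow as tf

seed = 0
num_iterations = 10_000
log_interval = 1_000
eval_interval = 5_000
save_model = True

# Seed every RNG the pipeline touches.
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)

def train_step():  # placeholder for one gradient update
    return float(np.random.rand())

def evaluate():    # placeholder for computing the current OPE estimate
    return float(np.random.rand())

for step in range(1, num_iterations + 1):
    loss = train_step()
    if step % log_interval == 0:
        print(f"step {step}: loss = {loss:.4f}")
    if step % eval_interval == 0:
        print(f"step {step}: estimate = {evaluate():.4f}")
        # When save_model is enabled, a checkpoint would be written here,
        # e.g. via model.save_weights(...).
```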

Notes

  • Specific hyperparameters may vary depending on the algorithm or benchmark.
  • See experiment scripts in the repository for actual default values and overrides.