API 2.1.3. NeuralGradientDice - Reinforcement-Learning-TU-Vienna/dice_rl_TU_Vienna GitHub Wiki

`NeuralGradientDice`

NeuralGradientDice estimates the policy value $\rho^\pi$ by first approximating the stationary distribution correction $w_{\pi/D}: S \times A \to \mathbb{R}_{\geq 0}$. Unlike its tabular counterpart, it is a gradient-based estimator: it parameterizes the function arguments of the dual GradientDICE objective with neural networks and performs stochastic gradient descent-ascent on logged data. It inherits from NeuralGenDice and overrides only the method get_loss.
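To illustrate what gradient descent-ascent means, the following toy sketch runs simultaneous descent in $w$ and ascent in $v$ on a hypothetical scalar saddle objective $J(w, v) = v (w - 1) - \frac{1}{2} v^2$. It is not the estimator's training loop. The objective, step size, starting point, and function name are assumptions, chosen only so that the saddle point $(w, v) = (1, 0)$ is easy to verify.

```python
def gradient_descent_ascent(w=3.0, v=0.0, step_size=0.1, n_steps=500):
    """Simultaneous GDA on the toy objective J(w, v) = v * (w - 1) - 0.5 * v**2."""
    for _ in range(n_steps):
        grad_w = v               # dJ/dw
        grad_v = (w - 1.0) - v   # dJ/dv
        # Descend in w (the minimizing argument), ascend in v (the maximizing one).
        w, v = w - step_size * grad_w, v + step_size * grad_v
    return w, v


w_final, v_final = gradient_descent_ascent()
```

The quadratic penalty in $v$ makes the toy objective strongly concave in $v$, which is why plain simultaneous updates settle at the saddle point here.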

Just like NeuralGenDice, NeuralGradientDice supports both the discounted and the undiscounted case, i.e., $0 < \gamma \leq 1$. In the undiscounted case ($\gamma = 1$), however, one must choose a positive norm regularization coefficient $\lambda > 0$.
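The constraint on $\gamma$ and $\lambda$ can be written as a small check. The helper below is hypothetical and not part of the library; it only encodes the conditions stated above, together with $\lambda \geq 0$ in general.

```python
def is_valid_configuration(gamma, lamda):
    """Return True if (gamma, lamda) satisfies the constraints stated above."""
    if not 0.0 < gamma <= 1.0:
        return False  # gamma must lie in (0, 1]
    if lamda < 0.0:
        return False  # the regularization coefficient is never negative
    if gamma == 1.0 and lamda == 0.0:
        return False  # the undiscounted case needs lamda > 0
    return True
```

For example, `is_valid_configuration(1.0, 0.0)` is rejected, while `is_valid_configuration(0.99, 0.0)` is accepted.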

🧮 Mathematical Formulation

Fenchel-Rockafellar duality is applied to the primal GradientDICE objective from TabularGradientDice to yield the dual GradientDICE objective:

$$ J(v, w, u) \doteq E_{ s_0 \sim d_0, ~ (s, a) \sim d^D, ~ s' \sim T(s, a) } \left [ L(v, w; s_0, s, a, s') + N(w, u; s, a) - \frac{1}{2} v(s, a)^2 \right ]. $$

The dual objective in GradientDICE is the same as in GenDICE, except for a slight modification in the last term of the loss function: $\frac{1}{2} v(s, a)^2$ instead of $\frac{1}{4} v(s, a)^2 w(s, a)$.
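The quadratic term $\frac{1}{2} v(s, a)^2$ can be read as coming from the Fenchel conjugate of a squared residual. As a simplified sketch for a single scalar residual $x$ (an illustration only; the actual objective applies this under the expectation over $d^D$):

$$ \frac{1}{2} x^2 = \max_{v \in \mathbb{R}} \left ( x v - \frac{1}{2} v^2 \right ), \quad \text{attained at } v^* = x. $$

Setting the derivative $x - v$ to zero gives $v^* = x$, and substituting back yields $x^2 - \frac{1}{2} x^2 = \frac{1}{2} x^2$.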

The loss term $L(v, w; s_0, s, a, s')$ is the same as in GenDICE:

$$ L(v, w; s_0, s, a, s') \doteq (1 - \gamma) E_{ a_0 \sim \pi(s_0) } [ v(s_0, a_0) ] + w(s, a) \left ( \gamma E_{a' \sim \pi(s')} [ v(s', a') ] - v(s, a) \right ). $$
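As a minimal sketch, the loss term for a single sample can be computed as follows. It assumes a discrete action space, with the values $v(s_0, \cdot)$ and $v(s', \cdot)$ and the policy probabilities $\pi(\cdot \mid s_0)$ and $\pi(\cdot \mid s')$ given as plain lists. The function names and numbers are hypothetical.

```python
def expected_value(values, probabilities):
    """E_{a ~ pi(s)}[v(s, a)] for one state, from per-action values and probabilities."""
    return sum(p * x for p, x in zip(probabilities, values))


def loss_term(gamma, v_init_all, pi_init, v, v_next_all, pi_next, w):
    """L(v, w; s_0, s, a, s') for a single sample."""
    v_init = expected_value(v_init_all, pi_init)  # E_{a_0 ~ pi(s_0)}[v(s_0, a_0)]
    v_next = expected_value(v_next_all, pi_next)  # E_{a' ~ pi(s')}[v(s', a')]
    return (1.0 - gamma) * v_init + w * (gamma * v_next - v)


# Two actions, made-up numbers purely for illustration.
example = loss_term(
    gamma=0.5,
    v_init_all=[1.0, 3.0], pi_init=[0.5, 0.5],
    v=1.0,
    v_next_all=[2.0, 4.0], pi_next=[0.25, 0.75],
    w=2.0,
)
```

Here the two expectations are $2.0$ and $3.5$, so the term evaluates to $0.5 \cdot 2.0 + 2.0 \cdot (0.5 \cdot 3.5 - 1.0) = 2.5$.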

The norm regularization term $N(w, u; s, a)$ and its coefficient $\lambda$ are likewise the same as in GenDICE:

$$ N(w, u; s, a) \doteq \lambda \left ( u ( w(s, a) - 1 ) - \frac{1}{2} u^2 \right ), \quad \lambda \geq 0. $$
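For fixed $w$, the batch mean of $N$ is a concave quadratic in the scalar $u$. It is maximized at $u^* = \bar{w} - 1$, where $\bar{w}$ is the mean weight, and its value there is $\frac{\lambda}{2} (\bar{w} - 1)^2$. The term therefore penalizes weights whose mean deviates from 1. The sketch below checks this on made-up numbers; the function names are hypothetical.

```python
def norm_regularizer(lamda, u, w):
    """N(w, u; s, a) for a single sample, with scalar dual variable u."""
    return lamda * (u * (w - 1.0) - 0.5 * u ** 2)


def mean_norm_regularizer(lamda, u, w_batch):
    """Batch mean of N over the sampled weights w(s, a)."""
    return sum(norm_regularizer(lamda, u, w) for w in w_batch) / len(w_batch)


w_batch = [0.5, 1.0, 3.0]              # made-up weights
w_mean = sum(w_batch) / len(w_batch)   # 1.5
u_star = w_mean - 1.0                  # maximizer of the batch mean over u
value_at_u_star = mean_norm_regularizer(2.0, u_star, w_batch)
```

With $\lambda = 2$ and $\bar{w} = 1.5$, the maximum is $\frac{2}{2} \cdot 0.5^2 = 0.25$.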

For further details, refer to the original paper: GradientDICE: Efficient Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

🏗️ Constructor

```python
def __init__(
        self,
        gamma, lamda, seed, batch_size,
        learning_rate, hidden_dimensions,
        obs_min, obs_max, n_act, obs_shape,
        dataset, preprocess_obs=None, preprocess_act=None, preprocess_rew=None,
        dir=None, get_recordings=None, other_hyperparameters=None, save_interval=100):
```

Args:

All the arguments of NeuralGenDice are inherited.

💵 Loss

```python
def get_loss(self, v_init, v, v_next, w):
```

Overrides the base class get_loss to compute the dual GradientDICE objective.
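The following framework-free sketch shows how a batch estimate of the dual GradientDICE objective could be assembled from these arguments. It is not the library's implementation. It assumes that `v_init` and `v_next` already hold the expectations under $\pi$, that `u`, `gamma`, and `lamda` are passed explicitly rather than read from the estimator, and that a single scalar is returned. The function name is hypothetical.

```python
def gradient_dice_objective(v_init, v, v_next, w, u, gamma, lamda):
    """Batch estimate of the dual GradientDICE objective.

    v_init, v, v_next, and w are equal-length sequences holding, per sample,
    E_{a_0 ~ pi(s_0)}[v(s_0, a_0)], v(s, a), E_{a' ~ pi(s')}[v(s', a')], and w(s, a).
    u is a scalar.
    """
    total = 0.0
    for v0_i, v_i, v1_i, w_i in zip(v_init, v, v_next, w):
        loss_term = (1.0 - gamma) * v0_i + w_i * (gamma * v1_i - v_i)
        norm_term = lamda * (u * (w_i - 1.0) - 0.5 * u ** 2)
        # GradientDICE-specific quadratic term: -1/2 v(s, a)^2.
        total += loss_term + norm_term - 0.5 * v_i ** 2
    return total / len(v)


objective = gradient_dice_objective(
    v_init=[2.0], v=[1.0], v_next=[3.5], w=[2.0],
    u=0.5, gamma=0.5, lamda=2.0,
)
```

In the saddle-point formulation, $w$ is the minimizing argument and $v$ and $u$ are the maximizing ones, so a descent-ascent step moves the parameters of $w$ against the gradient of this quantity and those of $v$ and $u$ along it.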

🧪 Example

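```python
# `some_module` is a placeholder: import from wherever NeuralGradientDice
# is defined in your installation.
from some_module import NeuralGradientDice

# obs_min, obs_max, and df (the logged dataset) must be defined beforehand.
estimator = NeuralGradientDice(
    gamma=0.99,
    lamda=0.5,
    seed=0,
    batch_size=64,
    learning_rate=1e-3,
    hidden_dimensions=(64, 64),
    obs_min=obs_min,
    obs_max=obs_max,
    n_act=4,
    obs_shape=(8,),
    dataset=df,
    dir="./logs",
)

# Run the estimator for 10,000 steps.
estimator.evaluate_loop(n_steps=10_000)

# Estimate of the policy value.
rho_hat = estimator.solve_pv(weighted=True)
```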
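This page does not say what `weighted=True` does. One plausible reading, stated here as an assumption, is that the policy value is a self-normalized average of the logged rewards, $\hat{\rho}^\pi = \sum_i w_i r_i / \sum_i w_i$, instead of the plain average $\frac{1}{n} \sum_i w_i r_i$. The sketch below contrasts the two on made-up numbers; the function name is hypothetical.

```python
def policy_value(w_batch, r_batch, weighted):
    """Estimate rho^pi from correction weights w(s, a) and logged rewards r."""
    weighted_rewards = sum(w * r for w, r in zip(w_batch, r_batch))
    # Self-normalize by the sum of weights, or divide by the sample count.
    normalizer = sum(w_batch) if weighted else len(w_batch)
    return weighted_rewards / normalizer


w_batch = [0.5, 1.0, 2.5]  # made-up correction weights
r_batch = [1.0, 0.0, 2.0]  # made-up rewards
rho_weighted = policy_value(w_batch, r_batch, weighted=True)
rho_unweighted = policy_value(w_batch, r_batch, weighted=False)
```

The two estimates coincide when the weights average exactly to 1, and differ otherwise.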