NeuralGenDice
`NeuralGenDice` estimates the policy value $\rho^\pi$ by first approximating the stationary distribution correction $w_{\pi/D}: S \times A \to R_{\geq 0}$. It is a gradient-based estimator that parameterizes the function arguments of the dual GenDICE objective with neural networks and performs stochastic gradient descent-ascent based on logged data. It inherits from `NeuralDice`, but overrides all the necessary base methods.

Unlike `NeuralDualDice`, `NeuralGenDice` supports both the discounted and the undiscounted case, i.e., $0 < \gamma \leq 1$. However, for the latter, one must choose a positive norm regularization coefficient $\lambda > 0$.
🧮 Mathematical Formulation
Fenchel-Rockafellar duality is applied to the primal GenDICE objective:
$$ J(w) \doteq D_\phi\left( D^D w ~ \| ~ \mathcal B^\pi_T D^D w \right) + \frac{\lambda}{2} \left ( E_{ (s, a) \sim d^D } [ w(s, a) ] - 1 \right )^2, \quad \phi(x) = (x - 1)^2. $$
Note that:
- $D_\phi$ is the f-divergence based on $\phi$ (spelled out below),
- $D^D$ is the diagonal matrix of $d^D$, the dataset distribution,
- $\mathcal B^\pi_T$ is the backward Bellman operator,
- $\lambda$ is a Lagrange multiplier that penalizes deviations of the expectation of the stationary distribution candidate $w$ under the dataset distribution $d^D$ from $1$.
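Concretely, $D_\phi$ denotes the standard $f$-divergence generated by $\phi$; for the quadratic choice above it is the Pearson $\chi^2$-divergence:

$$ D_\phi(p \,\|\, q) \doteq E_{x \sim q} \left[ \phi\!\left( \frac{p(x)}{q(x)} \right) \right] = E_{x \sim q} \left[ \left( \frac{p(x)}{q(x)} - 1 \right)^2 \right]. $$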
This yields the dual GenDICE objective:
$$ J(v, w, u) \doteq E_{ s_0 \sim d_0, ~ (s, a) \sim d^D, ~ s' \sim T(s, a) } \left[ L(v, w; s_0, s, a, s') + N(w, u; s, a) - \frac{1}{4} v(s, a)^2 w(s, a) \right ]. $$
The loss term $L(v, w; s_0, s, a, s')$ is the same as in DualDICE:
$$ L(v, w; s_0, s, a, s') \doteq (1 - \gamma) E_{ a_0 \sim \pi(s_0) } [ v(s_0, a_0) ] + w(s, a) \left ( \gamma E_{a' \sim \pi(s')} [ v(s', a') ] - v(s, a) \right ). $$
The new norm regularization term $N(w, u; s, a)$ and its coefficient $\lambda$ are:
$$ N(w, u; s, a) \doteq \lambda \left ( u ( w(s, a) - 1 ) - \frac{1}{2} u^2 \right ), \quad \lambda \geq 0. $$
The norm variable $u$ is an additional parameter learned by the model.
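This dualizes the quadratic penalty of the primal objective: using $\max_u \left( u x - \tfrac{1}{2} u^2 \right) = \tfrac{1}{2} x^2$ (attained at $u = x$), maximizing the expected norm term over $u$ recovers exactly the primal penalty,

$$ \max_u ~ E_{ (s, a) \sim d^D } \left[ N(w, u; s, a) \right] = \frac{\lambda}{2} \left ( E_{ (s, a) \sim d^D } [ w(s, a) ] - 1 \right )^2. $$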
The optimization problem alternates between maximizing $J(v, w, u)$ with respect to $v$ and $u$ and minimizing it with respect to $w$. The stationary distribution correction $w_{\pi / D}$ is given directly by the optimized $w^\ast$.
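The policy value itself is then recovered by re-weighting rewards under the dataset distribution, with the expectation replaced in practice by an average over the logged transitions (here $r$ denotes the logged reward, which is not part of the notation above):

$$ \hat \rho^\pi = E_{ (s, a) \sim d^D } \left[ w_{\pi/D}(s, a) \, r(s, a) \right] \approx \frac{1}{n} \sum_{i=1}^{n} w_{\pi/D}(s_i, a_i) \, r_i. $$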
For further details, refer to
- the Bellman Equations wiki page,
- the original paper: GenDICE: Generalized Offline Estimation of Stationary Values
🏗️ Constructor
def __init__(
self,
gamma, lamda, seed, batch_size,
learning_rate, hidden_dimensions,
obs_min, obs_max, n_act, obs_shape,
dataset, preprocess_obs=None, preprocess_act=None, preprocess_rew=None,
dir=None, get_recordings=None, other_hyperparameters=None, save_interval=100):
Args:
- All the arguments of `NeuralDice` are inherited.
- `lamda` (float): Norm regularization coefficient $\lambda$. $\lambda$ is passed as `lamda` in the Python code due to the reserved keyword `lambda` in Python.
💵 Loss
def get_loss(self, v_init, v, v_next, w):
Overrides the base class `get_loss` to compute the dual GenDICE objective.
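As a rough illustration, here is a minimal, hypothetical sketch of the batched dual objective in TensorFlow. It assumes `v_init`, `v`, `v_next`, and `w` are per-sample tensors holding $E_{a_0 \sim \pi(s_0)}[v(s_0, a_0)]$, $v(s, a)$, $E_{a' \sim \pi(s')}[v(s', a')]$, and $w(s, a)$, and that `u`, `gamma`, and `lamda` are supplied separately; the function name and argument handling are assumptions, not the library's actual implementation.

```python
import tensorflow as tf

def gen_dice_loss_sketch(v_init, v, v_next, w, u, gamma, lamda):
    """Monte-Carlo estimate of the dual GenDICE objective J(v, w, u) for one batch."""
    # L(v, w; s0, s, a, s'): DualDICE-style term.
    loss_l = (1 - gamma) * v_init + w * (gamma * v_next - v)

    # N(w, u; s, a): norm regularization with scalar norm variable u and coefficient lamda.
    loss_n = lamda * (u * (w - 1.0) - 0.5 * tf.square(u))

    # Remaining dual term: -1/4 * v(s, a)^2 * w(s, a).
    loss_q = -0.25 * tf.square(v) * w

    return tf.reduce_mean(loss_l + loss_n + loss_q)
```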
⚙️ Utility
def set_up_networks(self):
In addition to the base method, a `tf.Variable` and a corresponding `SGD` optimizer are set up for the norm variable $u$.
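A minimal sketch of this additional set-up, with variable and optimizer names chosen for illustration only:

```python
import tensorflow as tf

# Scalar norm variable u, trained alongside the v- and w-networks.
u = tf.Variable(0.0, dtype=tf.float32, name="norm_variable_u")

# Dedicated SGD optimizer for u.
u_optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)
```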
🧪 Example
from some_module import NeuralGenDice  # replace some_module with the actual import path

# obs_min, obs_max (observation bounds) and df (the logged dataset) are assumed to be defined.
estimator = NeuralGenDice(
    gamma=0.99,
    lamda=0.5,
    seed=0,
    batch_size=64,
    learning_rate=1e-3,
    hidden_dimensions=(64, 64),
    obs_min=obs_min,
    obs_max=obs_max,
    n_act=4,
    obs_shape=(8,),
    dataset=df,
    dir="./logs"
)
estimator.evaluate_loop(n_steps=10_000)
rho_hat = estimator.solve_pv(weighted=True)