# NeuralGenDice

`NeuralGenDice` estimates the policy value just like `NeuralDice`, but overrides all the necessary base methods.
Unlike `NeuralDualDice`, `NeuralGenDice` supports both the discounted and the undiscounted case, i.e., $0 < \gamma \leq 1$.

Fenchel-Rockafellar duality is applied to the primal GenDICE objective

$$
\min_{w \geq 0} \; D_\phi \left( B^\pi_T D^D w \,\middle\|\, D^D w \right) + \frac{\lambda}{2} \left( \mathbb{E}_{d^D}[w] - 1 \right)^2.
$$

Note that:
- $D_\phi$ is the f-divergence based on $\phi$,
- $D^D$ is the diagonal matrix of $d^D$, the dataset distribution,
- $B^\pi_T$ is the backward Bellman operator (one common form is given below),
- $\lambda$ is a Lagrange multiplier that penalizes deviations of the expectation of the stationary distribution candidate $w$ under the dataset distribution $d^D$ from $1$.
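For orientation, one common form of the backward Bellman operator applied to a state-action distribution $d$ is

$$
(B^\pi_T d)(s', a') = (1 - \gamma) \, \mu_0(s') \, \pi(a' \mid s') + \gamma \sum_{s, a} \pi(a' \mid s') \, P(s' \mid s, a) \, d(s, a),
$$

where $\mu_0$ is the initial state distribution and $P$ the transition kernel; the exact convention used in this repository is given on the Bellman Equations wiki page.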
This yields the dual GenDICE objective

$$
\min_w \max_{v, u} \; J(w, v, u)
= (1 - \gamma) \, \mathbb{E}_{s_0 \sim \mu_0, \, a_0 \sim \pi} [v(s_0, a_0)]
+ \gamma \, \mathbb{E}_{(s, a, s') \sim d^D, \, a' \sim \pi} [w(s, a) \, v(s', a')]
- \mathbb{E}_{(s, a) \sim d^D} [w(s, a) \, \phi^*(v(s, a))]
+ \lambda \left( \mathbb{E}_{(s, a) \sim d^D} [w(s, a) \, u] - u - \frac{u^2}{2} \right),
$$

where $\phi^*$ is the convex conjugate of $\phi$.

The loss term $J(w, v, u)$ consists only of expectations under the dataset distribution $d^D$ and the initial distribution, so it can be estimated from samples. The new norm regularization term $\lambda \left( \mathbb{E}_{d^D}[w \, u] - u - \frac{u^2}{2} \right)$ is the Fenchel dual of the quadratic penalty, via the identity $\frac{\lambda}{2} \left( \mathbb{E}_{d^D}[w] - 1 \right)^2 = \max_u \lambda \left( u \left( \mathbb{E}_{d^D}[w] - 1 \right) - \frac{u^2}{2} \right)$. The norm variable $u$ is a scalar that is trained jointly with the networks for $v$ and $w$ (see `set_up_networks` below).

The optimization problem alternates between maximizing $J$ with respect to $v$ and $u$ and minimizing it with respect to $w$.
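As a rough illustration (not the repository's actual training loop), a self-contained sketch of one such alternating step in TensorFlow, with hypothetical network, optimizer, and function names:

```python
import tensorflow as tf

# Hypothetical stand-ins for the value network v, the weight network w,
# and the scalar norm variable u.
v_net = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"), tf.keras.layers.Dense(1)])
w_net = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"), tf.keras.layers.Dense(1)])
u = tf.Variable(0.0)

opt_max = tf.keras.optimizers.SGD(1e-3)  # ascends J in (v, u)
opt_min = tf.keras.optimizers.SGD(1e-3)  # descends J in w

def train_step(compute_J, batch):
    """One alternating update: maximize J w.r.t. (v, u), minimize J w.r.t. w."""
    with tf.GradientTape(persistent=True) as tape:
        J = compute_J(v_net, w_net, u, batch)  # scalar dual objective
        neg_J = -J
    # Gradient ascent on (v, u) is implemented as descent on -J.
    max_vars = v_net.trainable_variables + [u]
    opt_max.apply_gradients(zip(tape.gradient(neg_J, max_vars), max_vars))
    # Gradient descent on w.
    min_vars = w_net.trainable_variables
    opt_min.apply_gradients(zip(tape.gradient(J, min_vars), min_vars))
    del tape
    return J
```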
For further details, refer to

- the Bellman Equations wiki page,
- the original paper: GenDICE: Generalized Offline Estimation of Stationary Values (Zhang et al., ICLR 2020).
```python
def __init__(
        self,
        gamma, lamda, seed, batch_size,
        learning_rate, hidden_dimensions,
        obs_min, obs_max, n_act, obs_shape,
        dataset, preprocess_obs=None, preprocess_act=None, preprocess_rew=None,
        dir=None, get_recordings=None, other_hyperparameters=None, save_interval=100):
```
Args:

- All the arguments of `NeuralDice` are inherited.
- `lamda` (float): Norm regularization coefficient $\lambda$. $\lambda$ is passed as `lamda` in the Python code due to the reserved keyword `lambda` in Python.
```python
def get_loss(self, v_init, v, v_next, w):
```

Overrides the base class `get_loss` to compute the dual GenDICE objective.
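For intuition, a minimal sketch of what such a loss computation could look like under the $\chi^2$-divergence $\phi(x) = (x - 1)^2$ used by GenDICE, whose convex conjugate is $\phi^*(y) = y + y^2 / 4$; the attribute names `self.gamma`, `self.lamda`, and `self.u` are assumptions for illustration, not necessarily those of the actual implementation:

```python
import tensorflow as tf

def get_loss(self, v_init, v, v_next, w):
    # Dual GenDICE objective J(w, v, u), sketched for the chi^2-divergence,
    # whose convex conjugate is phi*(y) = y + y^2 / 4.
    bellman = (1 - self.gamma) * v_init + self.gamma * w * v_next
    conjugate = w * (v + v ** 2 / 4)
    norm_reg = self.lamda * (w * self.u - self.u - self.u ** 2 / 2)
    # J is maximized w.r.t. v and u and minimized w.r.t. w.
    return tf.reduce_mean(bellman - conjugate + norm_reg)
```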
```python
def set_up_networks(self):
```

In addition to the base method, a `tf.Variable` and a corresponding `SGD` optimizer are set up for the norm variable $u$.
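Sketched, under the same assumed attribute names as above:

```python
import tensorflow as tf

def set_up_networks(self):
    super().set_up_networks()  # value and weight networks from NeuralDice
    # Scalar norm variable u with its own plain-SGD optimizer
    # (attribute names are assumptions for illustration).
    self.u = tf.Variable(0.0, dtype=tf.float32, name="u")
    self.u_optimizer = tf.keras.optimizers.SGD(learning_rate=self.learning_rate)
```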
Example usage:

```python
from some_module import NeuralGenDice  # placeholder import path

# obs_min, obs_max, and the dataset df are assumed to be defined beforehand.
estimator = NeuralGenDice(
    gamma=0.99,
    lamda=0.5,
    seed=0,
    batch_size=64,
    learning_rate=1e-3,
    hidden_dimensions=(64, 64),
    obs_min=obs_min,
    obs_max=obs_max,
    n_act=4,
    obs_shape=(8,),
    dataset=df,
    dir="./logs",
)

# Train the networks, then estimate the policy value.
estimator.evaluate_loop(n_steps=10_000)
rho_hat = estimator.solve_pv(weighted=True)
```