# TabularVafe
`TabularVafe` estimates the policy value $\rho^\pi$ like `ModelBasedTabularOffPE`, but overrides all the necessary base methods.
In order to obtain the state-action value function, `TabularVafe` approximates the forward Bellman equations and solves them directly.
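In the tabular case, the forward Bellman equations form a linear system in $\hat Q^\pi$ and can therefore be solved in closed form. The sketch below only illustrates this idea; `solve_forward_bellman`, `r_hat`, and `P_pi_hat` are hypothetical names, not part of the library's API.

```python
import numpy as np

def solve_forward_bellman(r_hat, P_pi_hat, gamma):
    # r_hat:    (n_obs * n_act,) estimated expected reward per state-action pair.
    # P_pi_hat: (n_obs * n_act, n_obs * n_act) estimated state-action transition
    #           matrix under the target policy, i.e.
    #           P_pi_hat[(s, a), (s', a')] ~ P_hat(s' | s, a) * pi(a' | s').
    n = len(r_hat)

    # Forward Bellman equations: Q = r + gamma * P_pi @ Q,
    # rearranged into the linear system (I - gamma * P_pi) Q = r.
    Q_hat = np.linalg.solve(np.eye(n) - gamma * P_pi_hat, r_hat)
    return Q_hat
```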
Dual to `TabularVafe`, `TabularDice` uses the backward Bellman equations and the stationary distribution correction $w_{\pi / D}$ to estimate the policy value $\rho^\pi$.
For further details, refer to the Bellman Equations wiki page.
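Roughly speaking, and up to the normalization convention used for $\rho^\pi$, the two approaches read off the policy value in complementary ways:

$$
\rho^\pi = (1 - \gamma) \, \mathbb{E}_{s_0 \sim d_0, \, a_0 \sim \pi} \left[ Q^\pi(s_0, a_0) \right] = \mathbb{E}_{(s, a, r) \sim d^D} \left[ w_{\pi / D}(s, a) \, r \right],
$$

where `TabularVafe` uses the first identity (via $\hat Q^\pi$) and `TabularDice` the second (via $w_{\pi / D}$).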
```python
def solve_vaf(self, gamma, **kwargs):
```

Solves for the state-action value function.

**Args:**
- `gamma` (float): Discount factor $\gamma$.
- `projected` (bool, kwargs): Whether to solve the version of the Bellman equations projected onto the support of the dataset distribution $d^D$.

**Returns:**
- `Q_hat` (np.ndarray): Approximated state-action value function $\hat Q^\pi$.
- `info` (dict): Additional metadata from the solver.
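As an example of calling this step directly, assuming a `TabularVafe` instance `vafe` like the one constructed in the example at the bottom of this page:

```python
gamma = 0.99

# Solve the (projected) forward Bellman equations for the state-action value function.
Q_hat, info = vafe.solve_vaf(gamma, projected=True)
```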
```python
def solve_pv(self, gamma, Q_hat, info):
```

Estimates the policy value.

**Args:**
- `gamma` (float): Discount factor $\gamma$.
- `Q_hat` (np.ndarray): Approximated state-action value function $\hat Q^\pi$.
- `info` (dict): Additional metadata from the solver.

**Returns:**
- `rho_hat` (float): Estimated policy value $\hat \rho^\pi$.
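Continuing the hypothetical snippet above, the estimated state-action value function can then be turned into a policy value estimate:

```python
# Read off the scalar policy value estimate from Q_hat.
rho_hat = vafe.solve_pv(gamma, Q_hat, info)
```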
```python
def solve(self, gamma, **kwargs):
```

Runs the full evaluation pipeline: solves for the state-action value function via `solve_vaf` and computes the policy value via `solve_pv`.
```python
vafe = TabularVafe(dataset, n_obs=5, n_act=3)

gamma = 0.99
rho_hat, info = vafe.solve(gamma, projected=True)
Q_hat = info["Q_hat"]
```