API 1.1.1. OnPE
The `OnPE` class implements a simple on-policy evaluation estimator for tabular trajectories. It estimates the expected return of a policy with Monte Carlo returns, i.e. by averaging the discounted returns observed in logged data that was collected under the same policy being evaluated.
Recall that the return of a trajectory $i$ of length $H_i$ is the discounted sum of its rewards, $R_i = \sum_{t=0}^{H_i - 1} \gamma^t R_{i, t}$.
When performing on-policy evaluation, we sample a dataset of trajectories under the target policy $\pi$ itself and estimate its value by averaging the empirical returns, $\hat \rho^\pi \approx \frac{1}{m} \sum_{i=1}^{m} R_i$.
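As a minimal sketch (not the library's implementation), this Monte Carlo estimate can be computed directly from such a dataset. The helper name `monte_carlo_value` is hypothetical, and the sketch ignores the scaling and weighting options that `solve` exposes below.

import numpy as np
import pandas as pd

def monte_carlo_value(dataset: pd.DataFrame, gamma: float) -> float:
    # Hypothetical helper: average the discounted return of each
    # logged trajectory, identified by its "id" column.
    returns = []
    for _, traj in dataset.sort_values(["id", "t"]).groupby("id"):
        rewards = traj["rew"].to_numpy()
        discounts = gamma ** np.arange(len(rewards))
        returns.append(float(np.sum(discounts * rewards)))
    return float(np.mean(returns))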
def __init__(self, dataset):
Args:
- `dataset` (pd.DataFrame): Dataset with columns:
  - `id` (int): Episode identifier.
  - `t` (int): Time step index $t$.
  - `rew` (float): Reward $R(s, a, s')$.
@property
def dataset(self):
Returns the current dataset.
@OnPE.dataset.setter
def dataset(self, dataset):
Sets and sorts the dataset by trajectory and timestep, and precomputes rewards (`r`), timesteps (`t`), the number of trajectories (`m`), and the trajectory lengths (`H`).
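For illustration, the kind of per-trajectory bookkeeping the setter performs can be sketched with plain pandas as below; the variable names follow the description above, but the actual internals of the class may differ.

import pandas as pd

# Illustration only: the per-trajectory quantities the setter caches.
dataset = pd.DataFrame({
    "id":  [1, 1, 1, 0, 0],   # deliberately unsorted; the setter sorts
    "t":   [0, 1, 2, 1, 0],
    "rew": [0.5, 0.0, 3.0, 2.0, 1.0],
})
dataset = dataset.sort_values(["id", "t"])

r = dataset["rew"].to_numpy()                  # rewards
t = dataset["t"].to_numpy()                    # timesteps
m = dataset["id"].nunique()                    # number of trajectories
H = dataset.groupby("id").size().to_numpy()    # trajectory lengths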
def solve(self, gamma, **kwargs):
Estimates the policy value from the dataset using discounted rewards.
Args:
- `gamma` (float): Discount factor $\gamma$.
- `scale` (bool, kwargs): Whether to scale the returns by `1 - gamma` if `gamma < 1`, or by the inverse trajectory lengths `1 / H` if `gamma == 1`.
- `W` (float or np.ndarray, kwargs): Optional weights for the returns $R_i$.
- `m` (int or float, kwargs): Normalization factor $m$ used for averaging.
Returns:
- `rho_hat` (float): Estimated policy value $\hat \rho^\pi$.
- `info` (dict): Empty dictionary (reserved for compatibility).
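In terms of the returns $R_i$ defined at the top of the page, the keyword arguments above combine into an estimate of roughly the following form. This is a reconstruction from the argument descriptions, not a formula quoted from the implementation:

$$\hat \rho^\pi = \frac{c}{m} \sum_{i=1}^{m} W_i \, R_i, \qquad c = \begin{cases} 1 - \gamma & \text{if scaling and } \gamma < 1, \\ 1 & \text{otherwise,} \end{cases}$$

where $W_i = 1$ if no weights are passed, $m$ defaults to the number of trajectories, and for $\gamma = 1$ with `scale=True` each return is divided by its trajectory length $H_i$ instead.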
import pandas as pd

# OnPE is assumed to have been imported from the dice_rl_TU_Vienna package.

# Example dataset with trajectory IDs, timesteps, and rewards
dataset = pd.DataFrame({
    "id":  [0, 0, 1, 1, 1],
    "t":   [0, 1, 0, 1, 2],
    "rew": [1.0, 2.0, 0.5, 0.0, 3.0],
})

estimator = OnPE(dataset)
gamma = 0.99
rho_hat, info = estimator.solve(gamma)
print(f"Estimated policy value: {rho_hat}")