# OnPE (On-Policy Evaluation Estimator)
The `OnPE` class implements a simple on-policy evaluation estimator for tabular trajectories. It estimates the expected return of a policy via Monte Carlo: the discounted returns of trajectories logged under that same policy are averaged to obtain the policy value.
## Mathematical Background
Recall that the return of a trajectory $\tau$ is defined as
$$
R^\gamma(\tau) \doteq
\begin{cases}
(1 - \gamma) \sum_{t=0}^{H-1} \gamma^t r_t, & 0 < \gamma < 1, \\
\frac{1}{H} \sum_{t=0}^{H-1} r_t, & \gamma = 1.
\end{cases}
$$
When performing on-policy evaluation, we sample a dataset of trajectories $( \tau_i )_{i=1}^m \sim \pi$. Then we consider the on-policy estimator
$$ \hat \rho^\pi \doteq \frac{1}{m} \sum_{i=1}^m R_i, \quad R_i \doteq R^\gamma(\tau_i). $$
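For intuition, this estimator can be written out directly in a few lines of NumPy. The sketch below is illustrative only and does not use the `OnPE` class; the function name and toy trajectories are made up for this example.

```python
import numpy as np

def monte_carlo_value(trajectories, gamma):
    # trajectories: list of per-trajectory reward sequences [r_0, ..., r_{H-1}]
    returns = []
    for rewards in trajectories:
        rewards = np.asarray(rewards, dtype=float)
        H = len(rewards)
        if gamma < 1:
            # R^gamma(tau) = (1 - gamma) * sum_t gamma^t r_t
            discounts = gamma ** np.arange(H)
            returns.append((1 - gamma) * np.sum(discounts * rewards))
        else:
            # gamma == 1: average reward over the horizon, (1/H) * sum_t r_t
            returns.append(np.mean(rewards))
    # rho_hat = (1/m) * sum_i R_i
    return float(np.mean(returns))

# Two toy trajectories, gamma = 0.9
print(monte_carlo_value([[1.0, 2.0], [0.5, 0.0, 3.0]], gamma=0.9))
```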
## Constructor
```python
def __init__(self, dataset):
```
Args:
- `dataset` (pd.DataFrame): Dataset with columns:
  - `id` (int): Episode identifier.
  - `t` (int): Time step index $t$.
  - `rew` (float): Reward $R(s, a, s')$.
## Properties
```python
@property
def dataset(self):
```

Returns the current dataset.

```python
@OnPE.dataset.setter
def dataset(self, dataset):
```

Sets and sorts the dataset by trajectory and timestep, and precomputes rewards (`r`), timesteps (`t`), the number of trajectories (`m`), and the trajectory lengths (`H`).
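A rough sketch of what this precomputation could look like with pandas is shown below. The variable names follow the property description above, but this is only an assumption about the setter's behavior, not the repository's actual implementation.

```python
import pandas as pd

def precompute(dataset: pd.DataFrame):
    # Sort by trajectory id and timestep, as described for the setter.
    dataset = dataset.sort_values(["id", "t"]).reset_index(drop=True)

    r = dataset["rew"].to_numpy()          # rewards, flattened over all trajectories
    t = dataset["t"].to_numpy()            # per-sample timestep indices
    H = dataset.groupby("id").size()       # trajectory lengths H_i
    m = dataset["id"].nunique()            # number of trajectories m

    return dataset, r, t, m, H
```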
## Solve
```python
def solve(self, gamma, **kwargs):
```
Estimates the policy value from the dataset using discounted rewards.
Args:
- `gamma` (float): Discount factor $\gamma$.
- `scale` (bool, kwargs): Whether to scale the returns by `(1 - gamma)` if `gamma < 1`, or by the inverse trajectory lengths `1/H` if `gamma == 1`.
- `W` (float or np.ndarray, kwargs): Optional weights for the returns $R_i$.
- `m` (int or float, kwargs): Normalization factor $m$ used for averaging.
Returns:
- `rho_hat` (float): Estimated policy value $\hat \rho^\pi$.
- `info` (dict): Empty dictionary (reserved for compatibility).
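The estimate itself is a (possibly weighted) average of per-trajectory returns. The following sketch illustrates how the `gamma`, `scale`, `W`, and `m` arguments could interact; it is not the repository's implementation, and the helper name is made up.

```python
import numpy as np
import pandas as pd

def solve_sketch(dataset, gamma, scale=True, W=1.0, m=None):
    # Per-trajectory return R_i, discounted for gamma < 1, averaged for gamma == 1.
    def trajectory_return(group):
        rew = group["rew"].to_numpy()
        H = len(rew)
        if gamma < 1:
            R = np.sum(gamma ** np.arange(H) * rew)
            return (1 - gamma) * R if scale else R
        return np.mean(rew) if scale else np.sum(rew)

    R = dataset.sort_values(["id", "t"]).groupby("id").apply(trajectory_return).to_numpy()
    m = len(R) if m is None else m
    rho_hat = float(np.sum(W * R) / m)  # weighted average of the returns R_i
    return rho_hat, {}
```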
## Example Usage
```python
import pandas as pd

# OnPE must also be imported from the dice_rl_TU_Vienna package;
# the exact module path depends on your installation.

# Example dataset with trajectory IDs, timesteps, and rewards
dataset = pd.DataFrame({
    "id":  [0, 0, 1, 1, 1],
    "t":   [0, 1, 0, 1, 2],
    "rew": [1.0, 2.0, 0.5, 0.0, 3.0],
})

estimator = OnPE(dataset)

gamma = 0.99
rho_hat, info = estimator.solve(gamma)

print(f"Estimated policy value: {rho_hat}")
```