OnPE (On-Policy Evaluation Estimator)

The OnPE class implements a simple on-policy evaluation estimator for tabular trajectories. It estimates the expected return of a policy by plain Monte Carlo: the discounted returns observed in the logged data are averaged, where the data must have been collected under the same policy that is being evaluated.

🧮 Mathematical Background

Recall that the return of a trajectory $\tau$ is defined as

$$ R^\gamma(\tau) \doteq \begin{cases} (1 - \gamma) \sum_{t=0}^{H-1} \gamma^t r_t, & 0 < \gamma < 1, \\ \frac{1}{H} \sum_{t=0}^{H-1} r_t, & \gamma = 1. \end{cases} $$

When performing on-policy evaluation, we sample a dataset of trajectories $( \tau_i )_{i=1}^m \sim \pi$. Then we consider the on-policy estimator

$$ \hat \rho^\pi \doteq \frac{1}{m} \sum_{i=1}^m R_i, \quad R_i \doteq R^\gamma(\tau_i). $$
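
To make the formulas concrete, the following sketch computes $R^\gamma(\tau)$ and $\hat \rho^\pi$ directly from per-trajectory reward lists. The helper discounted_return is purely illustrative and not part of the package.

import numpy as np

def discounted_return(rewards, gamma):
    # R^gamma(tau): scaled discounted sum for gamma < 1, average reward for gamma = 1
    rewards = np.asarray(rewards, dtype=float)
    if gamma < 1:
        return (1 - gamma) * np.sum(gamma ** np.arange(len(rewards)) * rewards)
    return np.mean(rewards)

# rho_hat: plain average of the per-trajectory returns
trajectories = [[1.0, 2.0], [0.5, 0.0, 3.0]]
rho_hat = np.mean([discounted_return(r, gamma=0.99) for r in trajectories])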

๐Ÿ—๏ธ Constructor

def __init__(self, dataset):

Args:

  • dataset (pd.DataFrame): Dataset with the following columns (see the sketch after this list):
    • id (int): Episode identifier.
    • t (int): Time step index $t$.
    • rew (float): Reward $r_t = R(s_t, a_t, s_{t+1})$.
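
As a sketch of the expected layout, such a DataFrame can be assembled from per-episode reward lists; the reward values below are made up for illustration.

import pandas as pd

episodes = [[1.0, 0.0], [0.5, 2.0, 1.5]]   # per-episode reward lists (illustrative)
rows = [
    {"id": i, "t": t, "rew": r}
    for i, rewards in enumerate(episodes)
    for t, r in enumerate(rewards)
]
dataset = pd.DataFrame(rows)               # columns: id, t, rew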

📦 Properties

@property
def dataset(self):

Returns the current dataset.

@OnPE.dataset.setter
def dataset(self, dataset):

Sets the dataset, sorts it by episode id and timestep, and precomputes the rewards (r), timesteps (t), number of trajectories (m), and trajectory lengths (H); a sketch of this precomputation is given below.
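
One plausible way this precomputation could look, assuming the DataFrame columns described above (this is only a sketch of the documented behaviour, not the actual implementation, and precompute is a hypothetical name):

import numpy as np

def precompute(dataset):
    # Sort by episode id and timestep, as the setter does
    dataset = dataset.sort_values(["id", "t"])
    r = dataset["rew"].to_numpy()                 # rewards, episode by episode
    t = dataset["t"].to_numpy()                   # timestep within each episode
    H = dataset.groupby("id").size().to_numpy()   # trajectory lengths
    m = len(H)                                    # number of trajectories
    return r, t, m, H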

🚀 Solve

def solve(self, gamma, **kwargs):

Estimates the policy value from the dataset using (discounted) Monte Carlo returns; a sketch of the computation follows the return values below.

Args:

  • gamma (float): Discount factor $\gamma$.
  • scale (bool, kwargs): Whether to scale the returns by (1 - gamma) if gamma < 1, or by inverse trajectory lengths 1/H if gamma == 1.
  • W (float or np.ndarray, kwargs): Optional weights for returns $R_i$.
  • m (int or float, kwargs): Normalization factor $m$ used for averaging.

Returns:

  • rho_hat (float): Estimated policy value $\hat \rho^\pi$.
  • info (dict): Empty dictionary (reserved for compatibility).
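
Putting the arguments together, the estimate is a (possibly weighted) average of the per-trajectory returns. The sketch below follows the documented arguments; it is an illustration rather than the class's actual code, and solve_sketch is a hypothetical name.

import numpy as np

def solve_sketch(r, H, gamma, scale=True, W=1.0, m=None):
    # r: flat reward array, H: per-trajectory lengths (as precomputed by the setter)
    returns = []
    for rewards in np.split(r, np.cumsum(H)[:-1]):
        if gamma < 1:
            R = np.sum(gamma ** np.arange(len(rewards)) * rewards)
            R *= (1 - gamma) if scale else 1.0
        else:
            R = np.sum(rewards) / (len(rewards) if scale else 1.0)
        returns.append(R)
    m = len(returns) if m is None else m        # normalization factor for the average
    rho_hat = float(np.sum(W * np.asarray(returns)) / m)
    return rho_hat, {}                          # info dict kept empty for compatibility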

🧪 Example Usage

import pandas as pd

# OnPE is assumed to be imported from the dice_rl_TU_Vienna package;
# the exact module path depends on the package layout.

# Example dataset with trajectory IDs, timesteps, and rewards
dataset = pd.DataFrame({
    "id":  [0, 0, 1, 1, 1],
    "t":   [0, 1, 0, 1, 2],
    "rew": [1.0, 2.0, 0.5, 0.0, 3.0],
})

estimator = OnPE(dataset)
gamma = 0.99

rho_hat, info = estimator.solve(gamma)
print(f"Estimated policy value: {rho_hat}")