OnPE (On-Policy Evaluation Estimator)

The OnPE class implements a simple on-policy evaluation estimator for tabular trajectories. It estimates the expected return of a policy by plain Monte Carlo: the discounted returns observed in the logged data are averaged, where the data must have been collected under the same policy that is being evaluated.

🧮 Mathematical Background

Recall that the return of a trajectory $\tau$ is defined as

$$ R^\gamma(\tau) \doteq \begin{cases} (1 - \gamma) \sum_{t=0}^{H-1} \gamma^t r_t, & 0 < \gamma < 1, \\ \frac{1}{H} \sum_{t=0}^{H-1} r_t, & \gamma = 1. \end{cases} $$

When performing on-policy evaluation, we sample a dataset of trajectories $( \tau_i )_{i=1}^m \sim \pi$. Then we consider the on-policy estimator

$$ \hat \rho^\pi \doteq \frac{1}{m} \sum_{i=1}^m R_i, \quad R_i \doteq R^\gamma(\tau_i). $$
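
To make the formulas concrete, the following sketch computes $R^\gamma(\tau)$ and $\hat \rho^\pi$ directly from per-trajectory reward lists. The helper discounted_return is purely illustrative and not part of the package.

import numpy as np

def discounted_return(rewards, gamma):
    # R^gamma(tau): scaled discounted sum for gamma < 1, average reward for gamma = 1
    rewards = np.asarray(rewards, dtype=float)
    if gamma < 1:
        return (1 - gamma) * np.sum(gamma ** np.arange(len(rewards)) * rewards)
    return np.mean(rewards)

# rho_hat: plain average of the per-trajectory returns
trajectories = [[1.0, 2.0], [0.5, 0.0, 3.0]]
rho_hat = np.mean([discounted_return(r, gamma=0.99) for r in trajectories])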

๐Ÿ—๏ธ Constructor

def __init__(self, dataset):

Args:

  • dataset (pd.DataFrame): Dataset with the following columns (see the sketch after this list):
    • id (int): Episode identifier.
    • t (int): Time step index $t$.
    • rew (float): Reward $r_t = R(s_t, a_t, s_{t+1})$.
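
As a sketch of the expected layout, such a DataFrame can be assembled from per-episode reward lists; the reward values below are made up for illustration.

import pandas as pd

episodes = [[1.0, 0.0], [0.5, 2.0, 1.5]]   # per-episode reward lists (illustrative)
rows = [
    {"id": i, "t": t, "rew": r}
    for i, rewards in enumerate(episodes)
    for t, r in enumerate(rewards)
]
dataset = pd.DataFrame(rows)               # columns: id, t, rew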

📦 Properties

@property
def dataset(self):

Returns the current dataset.

@OnPE.dataset.setter
def dataset(self, dataset):

Sets the dataset, sorts it by episode id and timestep, and precomputes the rewards (r), timesteps (t), number of trajectories (m), and trajectory lengths (H); a sketch of this precomputation is given below.
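
One plausible way this precomputation could look, assuming the DataFrame columns described above (this is only a sketch of the documented behaviour, not the actual implementation, and precompute is a hypothetical name):

import numpy as np

def precompute(dataset):
    # Sort by episode id and timestep, as the setter does
    dataset = dataset.sort_values(["id", "t"])
    r = dataset["rew"].to_numpy()                 # rewards, episode by episode
    t = dataset["t"].to_numpy()                   # timestep within each episode
    H = dataset.groupby("id").size().to_numpy()   # trajectory lengths
    m = len(H)                                    # number of trajectories
    return r, t, m, H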

🚀 Solve

def solve(self, gamma, **kwargs):

Estimates the policy value from the dataset using (discounted) Monte Carlo returns; a sketch of the computation follows the return values below.

Args:

  • gamma (float): Discount factor $\gamma$.
  • scale (bool, kwargs): Whether to scale the returns by (1 - gamma) if gamma < 1, or by inverse trajectory lengths 1/H if gamma == 1.
  • W (float or np.ndarray, kwargs): Optional weights for returns $R_i$.
  • m (int or float, kwargs): Normalization factor $m$ used for averaging.

Returns:

  • rho_hat (float): Estimated policy value $\hat \rho^\pi$.
  • info (dict): Empty dictionary (reserved for compatibility).
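
Putting the arguments together, the estimate is a (possibly weighted) average of the per-trajectory returns. The sketch below follows the documented arguments; it is an illustration rather than the class's actual code, and solve_sketch is a hypothetical name.

import numpy as np

def solve_sketch(r, H, gamma, scale=True, W=1.0, m=None):
    # r: flat reward array, H: per-trajectory lengths (as precomputed by the setter)
    returns = []
    for rewards in np.split(r, np.cumsum(H)[:-1]):
        if gamma < 1:
            R = np.sum(gamma ** np.arange(len(rewards)) * rewards)
            R *= (1 - gamma) if scale else 1.0
        else:
            R = np.sum(rewards) / (len(rewards) if scale else 1.0)
        returns.append(R)
    m = len(returns) if m is None else m        # normalization factor for the average
    rho_hat = float(np.sum(W * np.asarray(returns)) / m)
    return rho_hat, {}                          # info dict kept empty for compatibility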

🧪 Example Usage

import pandas as pd

# OnPE is assumed to be imported from the dice_rl_TU_Vienna package;
# the exact module path depends on the package layout.

# Example dataset with trajectory IDs, timesteps, and rewards
dataset = pd.DataFrame({
    "id":  [0, 0, 1, 1, 1],
    "t":   [0, 1, 0, 1, 2],
    "rew": [1.0, 2.0, 0.5, 0.0, 3.0],
})

estimator = OnPE(dataset)
gamma = 0.99

rho_hat, info = estimator.solve(gamma)
print(f"Estimated policy value: {rho_hat}")