API 1.1.2. IS

IS (Importance Sampling Estimator)

The IS class extends the OnPE (On-Policy Evaluation) estimator by incorporating importance sampling. It performs off-policy evaluation by reweighting returns observed under a behavior policy with importance ratios. These ratios include both per-step and per-trajectory importance weights derived from the target and behavior policy probabilities, allowing the expected return under the target policy to be estimated.

🧮 Mathematical Background

Recall the mathematical background for On-Policy Evaluation and that the policy ratio product of a trajectory $\tau$ is defined as

$$ W_{\pi / \pi^D}(\tau) \doteq \prod_{t=0}^{H-1} \frac{ \pi(a_t \mid s_t) }{ \pi^D(a_t \mid s_t) }. $$

For classical off-policy evaluation, we sample trajectories $( \tau_i )_{i=1}^m \sim \pi^D$ and define the simple and weighted importance sampling estimators, respectively, as

$$ \hat \rho^\pi_S \doteq \frac{1}{m} \sum_{i=1}^m R_i W_i, \quad \hat \rho^\pi_W \doteq \frac{1}{M} \sum_{i=1}^m R_i W_i, \quad M \doteq \sum_{i=1}^m W_i, \quad W_i \doteq W_{\pi / \pi^D}(\tau_i). $$
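
For concreteness, a minimal NumPy sketch of the two estimators above; the returns and weights are made-up numbers, not output of the library:

import numpy as np

R = np.array([1.0, 0.5, 0.8])  # discounted returns R_i of the sampled trajectories
W = np.array([2.0, 0.1, 1.3])  # policy ratio products W_i = W_{pi / pi^D}(tau_i)

m = len(R)                                    # number of trajectories
rho_hat_simple = np.sum(R * W) / m            # simple IS: (1/m) * sum_i R_i W_i
rho_hat_weighted = np.sum(R * W) / np.sum(W)  # weighted IS: (1/M) * sum_i R_i W_i, M = sum_i W_i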

๐Ÿ—๏ธ Constructor

def __init__(self, dataset):

Args:

  • dataset (pd.DataFrame): Dataset with the following columns (see the example after this list):
    • id (int): Episode identifier.
    • t (int): Time step index $t$.
    • act (int): Action $a$.
    • rew (float): Reward $R(s, a, s')$.
    • probs_evaluation or probs (NDArray[float]): Action probabilities under the target policy at the current state $\pi(\cdot \mid s)$.
    • probs_behavior (NDArray[float]): Action probabilities under the behavior policy at the current state $\pi^D(\cdot \mid s)$.
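
A minimal sketch of a dataset in this format, assuming a two-action environment with a single two-step episode; all values are illustrative only:

import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    "id":  [0, 0],
    "t":   [0, 1],
    "act": [1, 0],
    "rew": [0.0, 1.0],
    "probs_evaluation": [np.array([0.2, 0.8]), np.array([0.6, 0.4])],  # pi(. | s_t)
    "probs_behavior":   [np.array([0.5, 0.5]), np.array([0.5, 0.5])],  # pi^D(. | s_t)
})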

📦 Properties

@property
def dataset(self):

Returns the current dataset.

@OnPE.dataset.setter
def dataset(self, dataset):

In addition to the setter inherited from OnPE, it preprocesses the dataset by computing the following quantities (a sketch follows the list):

  • q: Per-step importance weights $q_t = \frac{\pi(a_t | s_t)}{\pi^D(a_t | s_t)}$
  • W: Per-trajectory importance weights $W = \prod_{t=0}^{H-1} q_t$
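
A sketch of this preprocessing under the column layout above (not the library's actual implementation; add_importance_weights is a hypothetical helper):

import numpy as np

def add_importance_weights(dataset):
    # Prefer "probs_evaluation" if present, otherwise fall back to "probs".
    probs_target = dataset.get("probs_evaluation", dataset.get("probs"))
    acts = dataset["act"].to_numpy()
    # q_t = pi(a_t | s_t) / pi^D(a_t | s_t): ratio of the probabilities of the taken action.
    p_target = np.array([p[a] for p, a in zip(probs_target, acts)])
    p_behavior = np.array([p[a] for p, a in zip(dataset["probs_behavior"], acts)])
    dataset["q"] = p_target / p_behavior
    # W = prod_t q_t per episode, broadcast back to every step of that episode.
    dataset["W"] = dataset.groupby("id")["q"].transform("prod")
    return dataset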

🚀 Solve

def solve(self, gamma, **kwargs):

Estimates the policy value using importance weighting. This is done by setting W and m in the kwargs and calling OnPE.solve; a conceptual sketch follows the lists below.

Args:

  • gamma (float): Discount factor $\gamma$.
  • weighted (bool, kwargs): Whether to normalize by the sum of weights np.sum(W) instead of the number of trajectories m.

Returns:

  • rho_hat (float): Estimated policy value $\hat \rho^\pi$.
  • info (dict): Reserved for future diagnostics.
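
Conceptually, the estimate reduces to the formulas from the mathematical background; a hedged sketch of the two normalization modes (is_estimate is a hypothetical stand-alone helper, not part of the API):

import numpy as np

def is_estimate(returns, W, weighted=False):
    returns = np.asarray(returns, dtype=float)  # discounted returns R_i, one per trajectory
    W = np.asarray(W, dtype=float)              # per-trajectory importance weights W_i
    normalizer = np.sum(W) if weighted else len(returns)  # M = sum_i W_i, or m
    return float(np.sum(returns * W) / normalizer)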

🧪 Example

estimator = IS(dataset)
gamma = 0.99

rho_hat, info = estimator.solve(gamma, weighted=True)
print(f"Estimated policy value: {rho_hat}")

This performs standard IS evaluation using per-trajectory weights, optionally normalized by total weight when weighted=True.
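
The weighted (self-normalized) variant generally trades a small bias for lower variance, which tends to be preferable when the importance weights vary strongly across trajectories.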