API 1.1.2. IS

IS (Importance Sampling Estimator)

The IS class extends the OnPE (On-Policy Evaluation) estimator by incorporating importance sampling. It performs off-policy evaluation by reweighting returns observed under a behavior policy with importance ratios. These ratios include both per-step and per-trajectory importance weights derived from the target and behavior policy probabilities, allowing the expected return under the target policy to be estimated.

🧮 Mathematical Background

Recall the mathematical background for On-Policy Evaluation and that the policy ratio product of a trajectory $\tau$ is defined as

$$ W_{\pi / \pi^D}(\tau) \doteq \prod_{t=0}^{H-1} \frac{ \pi(a_t \mid s_t) }{ \pi^D(a_t \mid s_t) }. $$

For classical off-policy evaluation, we sample trajectories $( \tau_i )_{i=1}^m \sim \pi^D$ and define the simple and weighted importance sampling estimators, respectively, as

$$ \hat \rho^\pi_S \doteq \frac{1}{m} \sum_{i=1}^m R_i W_i, \quad \hat \rho^\pi_W \doteq \frac{1}{M} \sum_{i=1}^m R_i W_i, \quad M \doteq \sum_{i=1}^m W_i, \quad W_i \doteq W_{\pi / \pi^D}(\tau_i). $$
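
For concreteness, a minimal NumPy sketch of the two estimators above; the returns and weights are made-up numbers, not output of the library:

import numpy as np

R = np.array([1.0, 0.5, 0.8])  # discounted returns R_i of the sampled trajectories
W = np.array([2.0, 0.1, 1.3])  # policy ratio products W_i = W_{pi / pi^D}(tau_i)

m = len(R)                                    # number of trajectories
rho_hat_simple = np.sum(R * W) / m            # simple IS: (1/m) * sum_i R_i W_i
rho_hat_weighted = np.sum(R * W) / np.sum(W)  # weighted IS: (1/M) * sum_i R_i W_i, M = sum_i W_i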

๐Ÿ—๏ธ Constructor

def __init__(self, dataset):

Args:

  • dataset (pd.DataFrame): Dataset with the following columns (see the example after this list):
    • id (int): Episode identifier.
    • t (int): Time step index $t$.
    • act (int): Action $a$.
    • rew (float): Reward $R(s, a, s')$.
    • probs_evaluation or probs (NDArray[float]): Action probabilities under the target policy at the current state $\pi(\cdot \mid s)$.
    • probs_behavior (NDArray[float]): Action probabilities under the behavior policy at the current state $\pi^D(\cdot \mid s)$.
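
A minimal sketch of a dataset in this format, assuming a two-action environment with a single two-step episode; all values are illustrative only:

import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    "id":  [0, 0],
    "t":   [0, 1],
    "act": [1, 0],
    "rew": [0.0, 1.0],
    "probs_evaluation": [np.array([0.2, 0.8]), np.array([0.6, 0.4])],  # pi(. | s_t)
    "probs_behavior":   [np.array([0.5, 0.5]), np.array([0.5, 0.5])],  # pi^D(. | s_t)
})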

📦 Properties

@property
def dataset(self):

Returns the current dataset.

@OnPE.dataset.setter
def dataset(self, dataset):

In addition to the setter inherited from OnPE, it preprocesses the dataset by computing the following quantities (a sketch follows the list):

  • q: Per-step importance weights $q_t = \frac{\pi(a_t | s_t)}{\pi^D(a_t | s_t)}$
  • W: Per-trajectory importance weights $W = \prod_{t=0}^{H-1} q_t$
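
A sketch of this preprocessing under the column layout above (not the library's actual implementation; add_importance_weights is a hypothetical helper):

import numpy as np

def add_importance_weights(dataset):
    # Prefer "probs_evaluation" if present, otherwise fall back to "probs".
    probs_target = dataset.get("probs_evaluation", dataset.get("probs"))
    acts = dataset["act"].to_numpy()
    # q_t = pi(a_t | s_t) / pi^D(a_t | s_t): ratio of the probabilities of the taken action.
    p_target = np.array([p[a] for p, a in zip(probs_target, acts)])
    p_behavior = np.array([p[a] for p, a in zip(dataset["probs_behavior"], acts)])
    dataset["q"] = p_target / p_behavior
    # W = prod_t q_t per episode, broadcast back to every step of that episode.
    dataset["W"] = dataset.groupby("id")["q"].transform("prod")
    return dataset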

🚀 Solve

def solve(self, gamma, **kwargs):

Estimates the policy value using importance weighting. This is done by setting W and m in the kwargs and calling OnPE.solve; a conceptual sketch follows the lists below.

Args:

  • gamma (float): Discount factor $\gamma$.
  • weighted (bool, kwargs): Whether to normalize by the sum of weights np.sum(W) instead of the number of trajectories m.

Returns:

  • rho_hat (float): Estimated policy value $\hat \rho^\pi$.
  • info (dict): Reserved for future diagnostics.
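
Conceptually, the estimate reduces to the formulas from the mathematical background; a hedged sketch of the two normalization modes (is_estimate is a hypothetical stand-alone helper, not part of the API):

import numpy as np

def is_estimate(returns, W, weighted=False):
    returns = np.asarray(returns, dtype=float)  # discounted returns R_i, one per trajectory
    W = np.asarray(W, dtype=float)              # per-trajectory importance weights W_i
    normalizer = np.sum(W) if weighted else len(returns)  # M = sum_i W_i, or m
    return float(np.sum(returns * W) / normalizer)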

🧪 Example

estimator = IS(dataset)
gamma = 0.99

rho_hat, info = estimator.solve(gamma, weighted=True)
print(f"Estimated policy value: {rho_hat}")

This performs standard IS evaluation using per-trajectory weights, optionally normalized by total weight when weighted=True.
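
The weighted (self-normalized) variant generally trades a small bias for lower variance, which tends to be preferable when the importance weights vary strongly across trajectories.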