# IS (Importance Sampling Estimator)
The `IS` class extends the `OnPE` (On-Policy Evaluation) estimator by incorporating importance sampling. It performs off-policy evaluation by reweighting returns observed under a behavior policy using importance ratios. These include both per-step and per-trajectory importance weights derived from the target and behavior policy probabilities, allowing for the estimation of expected returns under the target policy.
## 🧮 Mathematical Background
Recall the mathematical background for On-Policy Evaluation and that the policy ratio product of a trajectory $\tau$ is defined as
$$ W_{\pi / \pi^D}(\tau) \doteq \prod_{t=0}^{H-1} \frac{ \pi(a_t \mid s_t) }{ \pi^D(a_t \mid s_t) }. $$
For classical off-policy evaluation, we sample trajectories $( \tau_i )_{i=1}^m \sim \pi^D$ with returns $R_i$ and define the simple and weighted importance sampling estimators, respectively, as

$$ \hat \rho^\pi_S \doteq \frac{1}{m} \sum_{i=1}^m R_i W_i, \quad \hat \rho^\pi_W \doteq \frac{1}{M} \sum_{i=1}^m R_i W_i, \quad M \doteq \sum_{i=1}^m W_i, \quad W_i \doteq W_{\pi / \pi^D}(\tau_i). $$
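To make the two estimators concrete, here is a minimal NumPy sketch; the returns and weights below are purely illustrative and not taken from the library:

```python
import numpy as np

# Hypothetical per-trajectory returns R_i and importance weights W_i
# collected under the behavior policy (m = 4 trajectories).
R = np.array([1.0, 0.5, 2.0, 1.5])
W = np.array([0.8, 1.2, 0.4, 1.6])

m = len(R)
M = np.sum(W)

rho_hat_simple   = np.sum(R * W) / m  # simple IS estimate
rho_hat_weighted = np.sum(R * W) / M  # weighted (self-normalized) IS estimate

print(rho_hat_simple, rho_hat_weighted)
```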
## 🏗️ Constructor
```python
def __init__(self, dataset):
```
Args:

- `dataset` (pd.DataFrame): Dataset with columns:
  - `id` (int): Episode identifier.
  - `t` (int): Time step index $t$.
  - `act` (int): Action $a$.
  - `rew` (float): Reward $R(s, a, s')$.
  - `probs_evaluation` or `probs` (NDArray[float]): Action probabilities under the target policy at the current state $\pi(\cdot \mid s)$.
  - `probs_behavior` (NDArray[float]): Action probabilities under the behavior policy at the current state $\pi^D(\cdot \mid s)$.
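For illustration, a dataset in this layout might be constructed as follows; the environment, the two-action probability vectors, and all values are hypothetical, and `IS` is assumed to be importable from the package:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: a single episode (id=0) of length two in a
# two-action environment. Column names follow the constructor description.
dataset = pd.DataFrame({
    "id":  [0, 0],
    "t":   [0, 1],
    "act": [1, 0],
    "rew": [0.0, 1.0],
    # target-policy action probabilities pi(. | s) at each visited state
    "probs_evaluation": [np.array([0.2, 0.8]), np.array([0.5, 0.5])],
    # behavior-policy action probabilities pi^D(. | s) at each visited state
    "probs_behavior":   [np.array([0.5, 0.5]), np.array([0.9, 0.1])],
})

estimator = IS(dataset)
```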
## 📦 Properties
```python
@property
def dataset(self):
```
Returns the current dataset.
```python
@OnPE.dataset.setter
def dataset(self, dataset):
```
In addition to the setter from `OnPE`, it preprocesses the dataset by computing:

- `q`: Per-step importance weights $q_t = \frac{\pi(a_t \mid s_t)}{\pi^D(a_t \mid s_t)}$
- `W`: Per-trajectory importance weights $W = \prod_{t=0}^{H-1} q_t$
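A hedged pandas sketch of this preprocessing, shown here only to illustrate the two quantities (it is not the library's actual implementation, and the helper name is hypothetical):

```python
import numpy as np

def add_importance_weights(dataset):
    # Per-step ratio q_t = pi(a_t | s_t) / pi^D(a_t | s_t) for the logged action.
    probs_eval = dataset.get("probs_evaluation", dataset.get("probs"))
    pi_a  = np.array([p[a] for p, a in zip(probs_eval, dataset["act"])])
    piD_a = np.array([p[a] for p, a in zip(dataset["probs_behavior"], dataset["act"])])
    dataset["q"] = pi_a / piD_a

    # Per-trajectory weight W: product of the per-step ratios within each episode.
    dataset["W"] = dataset.groupby("id")["q"].transform("prod")
    return dataset
```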
## 🚀 Solve
```python
def solve(self, gamma, **kwargs):
```
Estimates the policy value using importance weighting. This is done by setting `W` and `m` in the `kwargs` and calling `OnPE.solve`.
Args:

- `gamma` (float): Discount factor $\gamma$.
- `weighted` (bool, kwargs): Whether to normalize by the sum of weights `np.sum(W)` instead of the number of trajectories `m`.
Returns:

- `rho_hat` (float): Estimated policy value $\hat \rho^\pi$.
- `info` (dict): Reserved for future diagnostics.
## 🧪 Example
```python
estimator = IS(dataset)

gamma = 0.99
rho_hat, info = estimator.solve(gamma, weighted=True)

print(f"Estimated policy value: {rho_hat}")
```
This performs standard IS evaluation using per-trajectory weights; with `weighted=True`, the estimate is normalized by the total weight $M = \sum_{i=1}^m W_i$ rather than by the number of trajectories $m$.
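Based on the description of the `weighted` argument above, passing `weighted=False` should correspond to the simple estimator $\hat \rho^\pi_S$, which averages over the $m$ trajectories instead:

```python
# Simple IS estimate: normalizes by the number of trajectories m.
rho_hat_simple, info = estimator.solve(gamma, weighted=False)
```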