Background: Estimators

Estimators

When performing on-policy evaluation, we sample a dataset of trajectories $( \tau_i )_{i=1}^m \sim \pi$. Then we consider the on-policy estimator

  • $\hat \rho^{\pi, \gamma}_\text{OnPE}$,

$$ \frac{1}{m} \sum_{i=1}^m R_i, \quad \text{where} \quad R_i \doteq R^\gamma(\tau_i). $$
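The on-policy estimator is simply the sample mean of the per-trajectory returns. A minimal NumPy sketch, assuming the returns $R_i = R^\gamma(\tau_i)$ have already been computed and collected into an array (the name `returns` is ours, not part of the repository's API):

```python
import numpy as np

def on_policy_estimate(returns: np.ndarray) -> float:
    # rho_hat_OnPE = (1/m) * sum_i R_i, with R_i = R^gamma(tau_i)
    return float(np.mean(returns))
```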

For classical off-policy evaluation, we sample trajectories $( \tau_i )_{i=1}^m \sim \pi^{\mathcal D}$ and define

  • the simple importance sampling estimator $\hat \rho^{\pi, \gamma}_\text{SIS}$ and
  • the weighted importance sampling estimator $\hat \rho^{\pi, \gamma}_\text{WIS}$, respectively,

$$ \frac{1}{m} \sum_{i=1}^m R_i W_i, \quad \frac{1}{ \sum_{i=1}^m W_i } \sum_{i=1}^m R_i W_i, \quad \text{where} \quad W_i \doteq W_{\pi / \pi^{\mathcal D}}(\tau_i). $$
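As a sketch of both importance sampling estimators, assuming the per-trajectory returns $R_i$ and importance weights $W_i = W_{\pi / \pi^{\mathcal D}}(\tau_i)$ are given as arrays (the names `returns` and `weights` are our own):

```python
import numpy as np

def simple_is_estimate(returns: np.ndarray, weights: np.ndarray) -> float:
    # rho_hat_SIS = (1/m) * sum_i R_i * W_i
    return float(np.mean(returns * weights))

def weighted_is_estimate(returns: np.ndarray, weights: np.ndarray) -> float:
    # rho_hat_WIS = (sum_i R_i * W_i) / (sum_i W_i)
    return float(np.sum(returns * weights) / np.sum(weights))
```

The weighted variant self-normalizes by the sum of the weights, which typically trades a small bias for lower variance than the simple variant.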

On the other hand, for distribution correction estimation (DICE), we first approximate the stationary distribution correction $w^\gamma_{\pi / \mathcal D}$ by some $\hat w^\gamma_{\pi / \mathcal D}$ and then, using the $n$ samples $(s_i, a_i, r_i)$ of the dataset $\mathcal D$, consider

  • the simple DICE estimator $\hat \rho^{\pi, \gamma}_\text{s}$ and
  • the weighted DICE estimator $\hat \rho^{\pi, \gamma}_\text{w}$, respectively,

$$ \frac{1}{n} \sum_{i=1}^n r_i w_i, \quad \frac{1}{ \sum_{i=1}^n w_i } \sum_{i=1}^n r_i w_i, \quad \text{where} \quad w_i \doteq \hat w^\gamma_{\pi / \mathcal D}(s_i, a_i). $$
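Analogously, given the per-sample rewards $r_i$ and the estimated corrections $w_i = \hat w^\gamma_{\pi / \mathcal D}(s_i, a_i)$, the two DICE estimators are a weighted mean and a self-normalized weighted mean. A sketch with assumed array names `rewards` and `corrections`:

```python
import numpy as np

def simple_dice_estimate(rewards: np.ndarray, corrections: np.ndarray) -> float:
    # rho_hat_s = (1/n) * sum_i r_i * w_i
    return float(np.mean(rewards * corrections))

def weighted_dice_estimate(rewards: np.ndarray, corrections: np.ndarray) -> float:
    # rho_hat_w = (sum_i r_i * w_i) / (sum_i w_i)
    return float(np.sum(rewards * corrections) / np.sum(corrections))
```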

Beyond the Law of Large Numbers, the consistency of the weighted estimators requires, respectively,

$$ \mathbb E_{ \tau \sim \pi^{\mathcal D} } [ W_{\pi / \pi^{\mathcal D}}(\tau) ] = 1 \quad \text{and} \quad \mathbb E_{ (s, a) \sim d^{\mathcal D} } [ w^\gamma_{\pi / \mathcal D}(s, a) ] = 1. $$
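In practice, the second condition suggests a simple sanity check on the estimated corrections: their empirical mean over the dataset should be close to $1$. A sketch under the same assumed array name (the tolerance is arbitrary and ours):

```python
import numpy as np

def check_normalization(corrections: np.ndarray, tol: float = 0.1) -> bool:
    # The condition E_{(s,a) ~ d^D}[ w^gamma_{pi/D}(s, a) ] = 1 should hold
    # approximately for the empirical mean of the estimated corrections.
    return bool(abs(np.mean(corrections) - 1.0) <= tol)
```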