
Assumptions

Assumption (MDP ergodicity). A finite MDP is called ergodic under a policy if the Markov chain it induces on state-action pairs satisfies the following conditions (a numerical check is sketched after the list):

  1. Irreducibility: Starting from any state-action pair, it is possible to reach any other state-action pair within a finite number of steps with non-zero probability.
  2. Aperiodicity: For every state-action pair, the greatest common divisor of all possible return times to this state-action pair is one.
  3. Positive Recurrence: The expected return time to any state-action pair is finite.
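
For a tabular MDP, these conditions can be checked numerically on the state-action chain induced by the policy. The following is a minimal sketch, assuming arrays `P` of shape `(S, A, S)` with entries $P(s' \mid s, a)$ and `pi` of shape `(S, A)` with entries $\pi(a \mid s)$; these names and shapes are illustrative and not part of the repository's API. It relies on the fact that a finite chain is irreducible and aperiodic exactly when its transition matrix is primitive, and that for a finite irreducible chain positive recurrence holds automatically.

```python
import numpy as np

def induced_state_action_chain(P, pi):
    """Build the Markov chain on state-action pairs induced by policy pi.

    P:  array of shape (S, A, S), P[s, a, s'] = transition probability.
    pi: array of shape (S, A),    pi[s, a]    = action probability.
    Returns P_pi of shape (S*A, S*A) with
    P_pi[(s, a), (s', a')] = P[s, a, s'] * pi[s', a'].
    """
    S, A, _ = P.shape
    return np.einsum("sat,tb->satb", P, pi).reshape(S * A, S * A)

def is_ergodic(P_pi, tol=1e-12):
    """Check irreducibility and aperiodicity of a finite chain.

    A finite stochastic matrix is irreducible and aperiodic (primitive)
    iff some power has all entries strictly positive; by Wielandt's bound
    it suffices to check the power (n - 1)**2 + 1.  For a finite chain,
    irreducibility already implies positive recurrence.
    """
    n = P_pi.shape[0]
    power = np.linalg.matrix_power(P_pi, (n - 1) ** 2 + 1)
    return bool(np.all(power > tol))
```

Wielandt's bound $(n - 1)^2 + 1$ turns the primitivity test into a single matrix power; for large state-action spaces a strongly-connected-components test plus a period computation would be cheaper, but this keeps the sketch short.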

Assumption (behavior policy coverage). The evaluation policy $\pi$ is absolutely continuous with respect to the behavior policy $\pi^{\mathcal D}$, i.e.,

$$ \pi(a \mid s) > 0 \implies \pi^{\mathcal D}(a \mid s) > 0 \quad \text{for all} \quad s \in S ~ \text{and} ~ a \in A. $$
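
Since this is a pure support condition, for finite $S$ and $A$ it can be verified directly from the two policy tables. A minimal sketch, assuming both policies are given as `(S, A)` arrays (hypothetical names, not the repository's API):

```python
import numpy as np

def has_behavior_coverage(pi_eval, pi_behavior, tol=0.0):
    """Check pi(a|s) > 0  =>  pi_D(a|s) > 0 for every state-action pair.

    pi_eval, pi_behavior: arrays of shape (S, A) with action probabilities.
    Returns False if any (s, a) is chosen by the evaluation policy but
    never by the behavior policy.
    """
    violating = (pi_eval > tol) & (pi_behavior <= tol)
    return not np.any(violating)
```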

Assumption (dataset coverage). The stationary distribution $d^{\pi, \gamma}$ of the evaluation policy $\pi$ is absolutely continuous with respect to the dataset distribution $d^{\mathcal D}$, i.e.,

$$ d^{\pi, \gamma}(s, a) > 0 \implies d^{\mathcal D}(s, a) > 0 \quad \text{for all} \quad (s, a) \in S \times A. $$
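
For a tabular problem with $\gamma < 1$, $d^{\pi, \gamma}$ can be obtained by solving the linear fixed-point equation $d = (1 - \gamma)\, d_0 + \gamma\, P_\pi^\top d$, where $P_\pi$ is the state-action transition matrix of the evaluation policy (as in the ergodicity sketch above) and $d_0$ the initial state-action distribution; the support condition can then be checked against an estimate of $d^{\mathcal D}$, e.g. empirical visitation frequencies of the dataset. A minimal sketch under these assumptions (array names and shapes are illustrative, not the repository's API):

```python
import numpy as np

def discounted_occupancy(P_pi, d0, gamma):
    """Solve d = (1 - gamma) * d0 + gamma * P_pi.T @ d for gamma < 1.

    P_pi: (S*A, S*A) state-action transition matrix of the evaluation policy.
    d0:   (S*A,) initial state-action distribution, e.g. mu0(s) * pi(a|s).
    For gamma = 1 one would instead compute the stationary distribution
    of P_pi via an eigenvector solve.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, (1 - gamma) * d0)

def has_dataset_coverage(d_pi, d_data, tol=0.0):
    """Check d^{pi,gamma}(s, a) > 0  =>  d^D(s, a) > 0 for all (s, a).

    d_data can be, for instance, the empirical state-action frequencies
    observed in the offline dataset.
    """
    violating = (d_pi > tol) & (d_data <= tol)
    return not np.any(violating)
```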