# Assumptions
Assumption (MDP ergodicity). A finite MDP is called ergodic under a policy if the Markov chain it induces on state-action pairs satisfies the following conditions (a verification sketch is given after the list):
- Irreducibility: Starting from any state-action pair, it is possible to reach any other state-action pair within a finite number of steps with non-zero probability.
- Aperiodicity: For every state-action pair, the greatest common divisor of all return times to that pair is one.
- Positive Recurrence: The expected return time to any state-action pair is finite.
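For a finite MDP, all three conditions hold exactly when the induced state-action chain is primitive, i.e., some power of its transition matrix is strictly positive (positive recurrence is automatic for finite irreducible chains). Below is a minimal sketch of this check for small tabular problems, assuming `P[s, a, s']` and `pi[s, a]` are numpy arrays; the function names are illustrative and not part of the repository.

```python
import numpy as np

def induced_chain(P, pi):
    # P[s, a, s'] : MDP transition probabilities, shape (S, A, S)
    # pi[s, a]    : policy probabilities, shape (S, A)
    # Markov chain on state-action pairs:
    # M[(s, a), (s', a')] = P(s' | s, a) * pi(a' | s')
    S, A, _ = P.shape
    return np.einsum("sat,tb->satb", P, pi).reshape(S * A, S * A)

def is_ergodic(P, pi):
    # A finite chain is irreducible and aperiodic iff some power of its
    # transition matrix is strictly positive (primitivity); Wielandt's bound
    # k = n^2 - 2n + 2 suffices for the check.  Positive recurrence then
    # follows automatically for a finite irreducible chain.
    M = induced_chain(P, pi)
    n = M.shape[0]
    reach = (M > 0).astype(int)          # boolean reachability, avoids underflow
    step = reach.copy()
    for _ in range(n * n - 2 * n + 2 - 1):
        step = ((step @ reach) > 0).astype(int)
    return bool(step.all())
```

The boolean-reachability loop is intentionally simple (O(n^5) in the number of state-action pairs), so it is only meant as a diagnostic for small tabular MDPs.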
Assumption (behavior policy coverage). The evaluation policy $\pi$ is absolutely continuous with respect to the behavior policy $\pi^{\mathcal D}$, i.e.,
$$ \pi(a \mid s) > 0 \implies \pi^{\mathcal D}(a \mid s) > 0 \quad \text{for all} \quad s \in S ~ \text{and} ~ a \in A. $$
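For tabular policies this support condition can be checked directly. The sketch below assumes both policies are given as `(S, A)` numpy arrays; the function name and signature are illustrative, not repository API.

```python
import numpy as np

def check_policy_coverage(pi_eval, pi_behavior):
    # pi_eval[s, a], pi_behavior[s, a]: tabular policies, shape (S, A)
    # A violation is a pair (s, a) with pi(a|s) > 0 but pi_D(a|s) = 0.
    violations = (pi_eval > 0) & (pi_behavior == 0)
    return not violations.any(), np.argwhere(violations)
```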
Assumption (dataset coverage). The (discounted) stationary distribution $d^{\pi, \gamma}$ of the evaluation policy $\pi$ is absolutely continuous with respect to the dataset distribution $d^{\mathcal D}$, i.e.,
$$ d^{\pi, \gamma}(s, a) > 0 \implies d^{\mathcal D}(s, a) > 0 \quad \text{for all} \quad (s, a) \in S \times A. $$
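In the off-policy setting $d^{\pi, \gamma}$ is generally unknown, but in tabular or simulated settings where it can be computed, the support condition can be checked against the empirical dataset distribution. A minimal sketch, assuming `d_pi` is given as an `(S, A)` array and the dataset is an iterable of `(s, a)` index pairs; names are illustrative only.

```python
import numpy as np

def check_dataset_coverage(d_pi, dataset_sa, num_states, num_actions):
    # d_pi[s, a]   : (discounted) stationary distribution of the evaluation policy
    # dataset_sa   : iterable of (s, a) pairs sampled from the dataset D
    # Empirical d^D from visitation counts.
    counts = np.zeros((num_states, num_actions))
    for s, a in dataset_sa:
        counts[s, a] += 1
    d_D = counts / counts.sum()
    # A violation is a pair (s, a) with d^{pi,gamma}(s, a) > 0 but d^D(s, a) = 0.
    violations = (d_pi > 0) & (d_D == 0)
    return not violations.any(), np.argwhere(violations)
```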