Background: Bellman Equations - Reinforcement-Learning-TU-Vienna/dice_rl_TU_Vienna GitHub Wiki

Bellman Equations

Computing $Q^{\pi, \gamma}$, $d^{\pi, \gamma}$, or $w^\gamma_{\pi / \mathcal D}$ directly from their definitions can be cumbersome. Instead, one uses the Bellman equations stated in the Lemma and Corollary below. To this end, we define the expected Bellman operator $\mathcal P^\pi$ and its adjoint $\mathcal P^\pi_T$ as

$$ \mathcal P^\pi Q(s, a) \doteq \int_{S \times A} Q(s^\prime, a^\prime) T^\pi(s^\prime, a^\prime \mid s, a) ~ \mathrm d s^\prime ~ \mathrm d a^\prime = \mathbb E_{ ( s^\prime, a^\prime ) \sim T^\pi(s, a) } [ Q(s^\prime, a^\prime) ], $$

$$ \mathcal P^\pi_T d(s^\prime, a^\prime) \doteq \int_{S \times A} d(s, a) T^\pi(s^\prime, a^\prime \mid s, a) ~ \mathrm d s ~ \mathrm d a, $$

where $Q, d: S \times A \to \mathbb R$ and $(s, a), (s^\prime, a^\prime) \in S \times A$.
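For a finite $S \times A$, the integrals above reduce to matrix-vector products, which makes the adjoint relation easy to check numerically. The following is a minimal sketch under that assumption; the matrix `P`, the size `n`, and the helper names `forward_op` and `backward_op` are hypothetical and not part of the library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6  # hypothetical number of state-action pairs, |S x A|

# Row-stochastic matrix with P[i, j] = T^pi(j | i),
# where i indexes (s, a) and j indexes (s', a').
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

def forward_op(Q):
    """(P^pi Q)(s, a): expectation of Q over the next pair (s', a')."""
    return P @ Q

def backward_op(d):
    """(P^pi_T d)(s', a'): mass of d pushed one step forward in time."""
    return P.T @ d

Q = rng.normal(size=n)  # arbitrary function on S x A
d = rng.random(n)
d /= d.sum()            # arbitrary distribution on S x A

# Adjointness: <d, P^pi Q> equals <P^pi_T d, Q>.
lhs = d @ forward_op(Q)
rhs = backward_op(d) @ Q
print(np.isclose(lhs, rhs))
```

In this tabular setting the adjoint is simply the transpose, and `backward_op` maps distributions to distributions because the rows of `P` sum to one.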

Lemma (Bellman equations). For $0 < \gamma \leq 1$, the pair $\rho^{\pi, \gamma}$ and $Q^{\pi, \gamma}$ is a solution to the forward Bellman equations

$$ Q = \mathcal B^{\pi, \gamma, \rho} Q \doteq r - \rho 1_{\gamma = 1} + \gamma \mathcal P^\pi Q, \quad \text{where} \quad \rho \in \mathbb R \quad \text{and} \quad Q: S \times A \to \mathbb R. $$

For $0 < \gamma \leq 1$, $d^{\pi, \gamma}$ is a solution to the backward Bellman equations

$$ d = \mathcal B^{\pi, \gamma}_T d \doteq (1 - \gamma) d^\pi_0 + \gamma \mathcal P^\pi_T d, \quad \text{where} \quad d: S \times A \to \mathbb R. $$

For $0 < \gamma < 1$, the solutions $\rho^{\pi, \gamma}$, $Q^{\pi, \gamma}$, and $d^{\pi, \gamma}$ are unique. For $\gamma = 1$, if $| S \times A | < \infty$ and Assumption (MDP ergodicity) holds, uniqueness of $d^{\pi, 1}$ is guaranteed by adding the normalization and non-negativity constraints

$$ \int_{S \times A} d(s, a) ~ \mathrm d s ~ \mathrm d a = 1 \quad \text{and} \quad d \geq 0. $$
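To illustrate the Lemma in the tabular case, the sketch below solves both Bellman equations as linear systems. It assumes a small, randomly generated chain with strictly positive transition probabilities (which makes it ergodic); all variable names are hypothetical. For $\gamma = 1$, the backward equation alone is singular, so the normalization constraint is appended as an extra row. The value `rho1` uses the fact that multiplying the forward equation by a stationary $d$ forces $\rho = \langle d, r \rangle$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5                          # hypothetical |S x A|
P = rng.random((n, n)) + 0.1   # strictly positive entries => ergodic chain
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)              # reward per state-action pair
d0 = rng.random(n)
d0 /= d0.sum()                 # initial distribution d^pi_0
I = np.eye(n)

# Discounted case, 0 < gamma < 1: both systems are non-singular.
gamma = 0.9
Q = np.linalg.solve(I - gamma * P, r)                    # Q = r + gamma P Q
d = np.linalg.solve(I - gamma * P.T, (1 - gamma) * d0)   # d = (1-gamma) d0 + gamma P^T d

forward_residual = np.max(np.abs(Q - (r + gamma * P @ Q)))
backward_residual = np.max(np.abs(d - ((1 - gamma) * d0 + gamma * P.T @ d)))

# Undiscounted case, gamma = 1: d = P^T d determines d only up to scale,
# so the normalization sum(d) = 1 is appended as an additional equation.
A = np.vstack([I - P.T, np.ones((1, n))])
b = np.concatenate([np.zeros(n), [1.0]])
d1, *_ = np.linalg.lstsq(A, b, rcond=None)

# Forward equation for gamma = 1: Q = r - rho + P Q, solvable only for rho = <d1, r>.
rho1 = d1 @ r
Q1, *_ = np.linalg.lstsq(I - P, r - rho1, rcond=None)    # one solution; unique up to a constant
avg_residual = np.max(np.abs(Q1 - (r - rho1 + P @ Q1)))

print(forward_residual, backward_residual, avg_residual)
```

In the discounted case the two solutions are linked by $(1 - \gamma) \langle d^\pi_0, Q \rangle = \langle d, r \rangle$, which can be checked with `(1 - gamma) * d0 @ Q` and `d @ r`.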

Corollary. Let $D^{\mathcal D}$ be the operator that multiplies a function by $d^{\mathcal D}$. Then $w^\gamma_{\pi / \mathcal D}$ is a solution to

$$ D^{\mathcal D} w = \mathcal B^{\pi, \gamma}_T D^{\mathcal D} w = (1 - \gamma) d^\pi_0 + \gamma \mathcal P^\pi_T D^{\mathcal D} w, \quad \text{where} \quad w: S \times A \to \mathbb R. $$

For $0 < \gamma < 1$, the solution $w^\gamma_{\pi / \mathcal D}$ is unique. For $\gamma = 1$, if $| S \times A | < \infty$ and Assumption (MDP ergodicity) holds, uniqueness of $w^1_{\pi / \mathcal D}$ on $\text{supp}(d^{\mathcal D})$ is guaranteed by adding the normalization and non-negativity constraints

$$ \mathbb E_{ (s, a) \sim d^{\mathcal D} } [ w(s, a) ] = 1 \quad \text{and} \quad w \geq 0. $$
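The Corollary can be checked in the same tabular setting. The sketch below assumes a hypothetical dataset distribution `d_D` with full support, so that $D^{\mathcal D}$ is an invertible diagonal matrix, and solves the equation for $w$ directly. By the uniqueness statement of the Lemma, $D^{\mathcal D} w$ must then coincide with $d^{\pi, \gamma}$, so $w$ is the pointwise ratio of the two distributions. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5                          # hypothetical |S x A|
P = rng.random((n, n)) + 0.1   # transition matrix under the target policy
P /= P.sum(axis=1, keepdims=True)
d0 = rng.random(n)
d0 /= d0.sum()                 # initial distribution d^pi_0
gamma = 0.9
I = np.eye(n)

# Hypothetical dataset distribution d^D with full support.
d_D = rng.random(n) + 0.1
d_D /= d_D.sum()
D = np.diag(d_D)               # operator that multiplies a function by d^D

# Solve D w = (1 - gamma) d0 + gamma P^T D w for w.
w = np.linalg.solve((I - gamma * P.T) @ D, (1 - gamma) * d0)

# Reference: discounted stationary distribution from the backward Bellman equation.
d_pi = np.linalg.solve(I - gamma * P.T, (1 - gamma) * d0)

residual = np.max(np.abs(D @ w - ((1 - gamma) * d0 + gamma * P.T @ (D @ w))))
mean_w = d_D @ w               # E_{(s, a) ~ d^D}[w(s, a)]

print(residual, mean_w, np.allclose(w, d_pi / d_D))
```

For $0 < \gamma < 1$ the normalization and non-negativity constraints hold automatically, as `mean_w` and the sign of `w` show; they only need to be imposed explicitly when $\gamma = 1$.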