
Background

For a set $X$, let $\Delta(X)$ be the set of probability distributions on $X$. We consider a Markov Decision Process (MDP) $(S, A, R, T, d_0, \gamma)$.

  • $S$ and $A$ are the set of states and actions, respectively.

  • $R: S \times A \times S \to \Delta(\mathbb R)$ is the reward function, assigning each transition $(s, a, s^\prime)$ a reward distribution. We often treat $R(s, a, s^\prime)$ as a random variable and write $r(s, a) \doteq \mathbb E_{ s^\prime \sim T(s, a) } [ R(s, a, s^\prime) ]$ for the immediate reward function. We assume that the reward is bounded almost surely.

  • The transition probability distributions are given via $T: S \times A \to \Delta(S)$ and we write $T(s^\prime \mid s, a)$ for the probability of transitioning into state $s^\prime \in S$ when choosing action $a \in A$ in state $s \in S$.

  • The initial state distribution, i.e., the distribution of the state in which the process starts, is $d_0 \in \Delta(S)$.

  • The discount factor is $\gamma \in (0, 1]$; we allow the MDP to be discounted ($\gamma < 1$) or, in certain cases, undiscounted ($\gamma = 1$).
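
In the tabular case, these components can be represented by plain arrays. The following is a minimal sketch; the container `TabularMDP` and the toy numbers are purely illustrative and not part of the library.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularMDP:
    """Illustrative container for a finite MDP (not part of dice_rl_TU_Vienna)."""
    r: np.ndarray      # immediate rewards r(s, a), shape (|S|, |A|)
    T: np.ndarray      # transition kernel T(s' | s, a), shape (|S|, |A|, |S|)
    d0: np.ndarray     # initial state distribution d_0, shape (|S|,)
    gamma: float       # discount factor in (0, 1]

# a tiny two-state, two-action example
mdp = TabularMDP(
    r=np.array([[0.0, 1.0], [1.0, 0.0]]),
    T=np.array([[[0.9, 0.1], [0.1, 0.9]],
                [[0.5, 0.5], [0.2, 0.8]]]),
    d0=np.array([1.0, 0.0]),
    gamma=0.99,
)
```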

Let $\pi: S \to \Delta(A)$ be an evaluation policy, where $\pi(a \mid s)$ is the probability of choosing action $a \in A$ in state $s \in S$, and let $\pi^{\mathcal D}: S \to \Delta(A)$ be a behavior policy. Consider the dataset distribution $d^{\mathcal D} \in \Delta(S \times A)$ and the dataset, consisting of samples of experience,

$$ \mathcal D = \Big ( \left ( s_{0, i}, s_i, a_i, r_i, s_i^\prime \right ) \Big )_{i=1}^n. $$

These samples are not necessarily derived from generating trajectories; they are instead drawn independently according to

$$ s_0 \sim d_0, \ (s, a) \sim d^{\mathcal D}, \ s^\prime \sim T(s, a), \ r \sim R(s, a, s^\prime). $$
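
A rough sketch of this sampling scheme in the tabular case follows; the function `sample_dataset`, the array conventions, and the Gaussian reward noise (which, for simplicity, ignores the dependence on $s^\prime$) are assumptions for illustration, not the library's data pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n, d0, dD, T, r, sigma=0.0):
    """Draw n i.i.d. experience tuples (s0, s, a, rew, s') as described above."""
    nS, nA = dD.shape
    data = []
    for _ in range(n):
        s0 = rng.choice(nS, p=d0)                      # s_0 ~ d_0
        sa = rng.choice(nS * nA, p=dD.ravel())         # (s, a) ~ d^D
        s, a = divmod(sa, nA)
        s_next = rng.choice(nS, p=T[s, a])             # s' ~ T(s, a)
        rew = r[s, a] + sigma * rng.standard_normal()  # r ~ R(s, a, s'), here noisy around r(s, a)
        data.append((s0, s, a, rew, s_next))
    return data
```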

For convenience, we define, for $(s_0, a_0), (s, a), (s^\prime, a^\prime) \in S \times A$,

$$ d^\pi_0(s_0, a_0) \doteq \pi(a_0 \mid s_0) d_0(s_0) \quad \text{and} \quad T^\pi(s^\prime, a^\prime \mid s, a) \doteq \pi(a^\prime \mid s^\prime) T(s^\prime \mid s, a). $$

We write $\tau \sim \pi$ for sampling a trajectory $\tau = (s_0, a_0, r_0, \dots, s_H)$ as, for $t = 0, \dots, H-1$,

$$ s_0 \sim d_0, ~ a_t \sim \pi(s_t), ~ s_{t+1} \sim T(s_t, a_t), ~ r_t \sim R(s_t, a_t, s_{t+1}). $$
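
A minimal rollout sketch of this generative process, again in the tabular setting with deterministic immediate rewards; `rollout` and the array shapes are illustrative, not the library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(pi, d0, T, r, H):
    """Sample a trajectory tau = (s_0, a_0, r_0, ..., s_H) of horizon H under pi."""
    nS, nA = pi.shape
    s = rng.choice(nS, p=d0)                    # s_0 ~ d_0
    transitions = []
    for _ in range(H):
        a = rng.choice(nA, p=pi[s])             # a_t ~ pi(. | s_t)
        s_next = rng.choice(nS, p=T[s, a])      # s_{t+1} ~ T(. | s_t, a_t)
        transitions.append((s, a, r[s, a]))     # deterministic reward for simplicity
        s = s_next
    return transitions, s                       # (s_t, a_t, r_t) triples and the final state s_H
```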

We define

  • the policy value (PV) $\rho^{\pi, \gamma}$,
  • the state-action value function (VF) $Q^{\pi, \gamma}$,
  • the stationary distribution (SD) $d^{\pi, \gamma}$,

$$ \rho^{\pi, \gamma} \doteq \lim_{H \to \infty} \frac{1}{ \sum_{t=0}^H \gamma^t } \mathbb E_{\tau \sim \pi} \left [ \sum_{t=0}^H \gamma^t r_t \right ], $$

$$ Q^{\pi, \gamma}(s, a) \doteq \mathbb E_{\tau \sim \pi} \left [ \sum_{t=0}^\infty \gamma^t ( r_t - \rho^{\pi, \gamma} 1_{\gamma = 1} ) ~ : ~ s_0 = s, ~ a_0 = a \right ], $$

$$ d^{\pi, \gamma}(s, a) \doteq \lim_{H \to \infty} \frac{1}{ \sum_{t=0}^H \gamma^t } \mathbb E_{\tau \sim \pi} \left [ \sum_{t=0}^H \gamma^t 1_{ s_t = s, ~ a_t = a } \right ]. $$
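
For a tabular MDP with $0 < \gamma < 1$, all three quantities have closed forms in terms of the transition matrix $T^\pi$; the following sketch uses the array conventions of the illustrative examples above (the function name and shapes are assumptions, not the library's API).

```python
import numpy as np

def tabular_pv_vf_sd(pi, T, r, d0, gamma):
    """Closed-form Q^{pi,gamma}, d^{pi,gamma}, rho^{pi,gamma} for 0 < gamma < 1."""
    nS, nA = pi.shape
    # T^pi[(s, a), (s', a')] = pi(a' | s') T(s' | s, a), flattened to (|S||A|, |S||A|)
    T_pi = np.einsum("sap,pb->sapb", T, pi).reshape(nS * nA, nS * nA)
    r_flat = r.reshape(nS * nA)
    d0_pi = (d0[:, None] * pi).reshape(nS * nA)        # d_0^pi(s, a) = pi(a | s) d_0(s)

    # Bellman equation Q = r + gamma T^pi Q  =>  Q = (I - gamma T^pi)^{-1} r
    Q = np.linalg.solve(np.eye(nS * nA) - gamma * T_pi, r_flat)

    # discounted occupancy d = (1 - gamma) (I - gamma (T^pi)^T)^{-1} d_0^pi
    d = (1 - gamma) * np.linalg.solve(np.eye(nS * nA) - gamma * T_pi.T, d0_pi)

    rho = d @ r_flat                                   # rho = E_{d^{pi,gamma}}[r(s, a)]
    return Q.reshape(nS, nA), d.reshape(nS, nA), rho
```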

The policy value can be rewritten as the expected episode return,

$$ \rho^{\pi, \gamma} = \lim_{H \to \infty} \mathbb E_{\tau \sim \pi}[ R^\gamma(\tau) ], \quad \text{where} \quad R^\gamma(\tau) \doteq (1 - \gamma) \sum_{t=0}^{H-1} \gamma^t r_t \quad \text{for} ~ 0 < \gamma < 1, \quad \text{and} \quad R^\gamma(\tau) \doteq \frac{1}{H} \sum_{t=0}^{H-1} r_t \quad \text{for} ~ \gamma = 1. $$
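
A direct translation of $R^\gamma(\tau)$, assuming the rewards $r_0, \dots, r_{H-1}$ of a finite trajectory are given as an array (the function name is illustrative).

```python
import numpy as np

def episode_return(rewards, gamma):
    """R^gamma(tau) for rewards r_0, ..., r_{H-1} of a finite-horizon trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    H = len(rewards)
    if gamma < 1.0:
        return (1.0 - gamma) * np.sum(gamma ** np.arange(H) * rewards)
    return np.mean(rewards)    # gamma = 1: average reward
```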

Under Assumption (behavior policy coverage), we can define the policy ratio product $W_{\pi / \pi^{\mathcal D}}(\tau)$ as the quotient of the probabilities of the trajectory $\tau$ occurring under $\pi$ and under $\pi^{\mathcal D}$, respectively. Applying importance sampling to the trajectory distribution, we get

$$ \rho^{\pi, \gamma} = \lim_{H \to \infty} \mathbb E_{\tau \sim \pi^{\mathcal D}}[ R^\gamma(\tau) W_{\pi / \pi^{\mathcal D}}(\tau) ], \quad \text{where} \quad W_{\pi / \pi^{\mathcal D}}(\tau) \doteq \prod_{t=0}^{H-1} \frac{ \pi(a_t \mid s_t) }{ \pi^{\mathcal D}(a_t \mid s_t) }. $$
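
A sketch of the resulting simple importance-sampling estimate over a batch of behavior trajectories; it assumes each trajectory is a list of $(s, a, r)$ triples and that `pi` and `piD` are tabular policy arrays of shape $(|S|, |A|)$ (illustrative, not the library's interface).

```python
import numpy as np

def sis_estimate(trajectories, pi, piD, gamma):
    """Simple importance-sampling estimate of rho^{pi,gamma} from trajectories under piD."""
    estimates = []
    for tau in trajectories:
        states, actions, rewards = map(np.asarray, zip(*tau))
        # policy ratio product W_{pi/piD}(tau)
        W = np.prod(pi[states, actions] / piD[states, actions])
        # episode return R^gamma(tau)
        H = len(rewards)
        if gamma < 1.0:
            R = (1.0 - gamma) * np.sum(gamma ** np.arange(H) * rewards)
        else:
            R = np.mean(rewards)
        estimates.append(R * W)
    return np.mean(estimates)
```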

It is well known that we can obtain the policy value $\rho^{\pi, \gamma}$ from either $Q^{\pi, \gamma}$ or $d^{\pi, \gamma}$.

Lemma. For $0 < \gamma \leq 1$, it holds that

$$ \rho^{\pi, \gamma} = (1 - \gamma) \mathbb E_{ (s_0, a_0) \sim d^\pi_0 } [ Q^{\pi, \gamma}(s_0, a_0) ] + \rho^{\pi, \gamma} 1_{\gamma = 1} = \mathbb E_{ (s, a) \sim d^{\pi, \gamma} } [ r(s, a) ]. $$
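
For $0 < \gamma < 1$, the first equality can be seen by unrolling the definition of $Q^{\pi, \gamma}$ and using $\sum_{t=0}^\infty \gamma^t = 1 / (1 - \gamma)$:

$$ (1 - \gamma) \, \mathbb E_{ (s_0, a_0) \sim d^\pi_0 } [ Q^{\pi, \gamma}(s_0, a_0) ] = (1 - \gamma) \, \mathbb E_{\tau \sim \pi} \left [ \sum_{t=0}^\infty \gamma^t r_t \right ] = \lim_{H \to \infty} \frac{1}{ \sum_{t=0}^H \gamma^t } \mathbb E_{\tau \sim \pi} \left [ \sum_{t=0}^H \gamma^t r_t \right ] = \rho^{\pi, \gamma}. $$

The second equality follows by exchanging the expectation over $\tau$ with the sum and grouping the terms of the return by the visited pair $(s_t, a_t)$, which is exactly what $d^{\pi, \gamma}$ records.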

Under Assumption (dataset coverage), we can define the stationary distribution correction $w^\gamma_{\pi / \mathcal D}$ as the quotient of $d^{\pi, \gamma}$ and $d^{\mathcal D}$. Applying importance sampling to $d^{\pi, \gamma}$, we get

$$ \rho^{\pi, \gamma} = \mathbb E_{ (s, a) \sim d^{\mathcal D} } [ r(s, a) w^\gamma_{\pi / \mathcal D}(s, a) ], \quad \text{where} \quad w^\gamma_{\pi / \mathcal D}(s, a) \doteq \frac{ d^{\pi, \gamma}(s, a) }{ d^{\mathcal D}(s, a) }. $$
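
In the tabular case, where $d^{\pi, \gamma}$ and $d^{\mathcal D}$ are available as arrays of shape $(|S|, |A|)$, the correction and the resulting identity read as follows; the small constant guarding the division is an added assumption.

```python
import numpy as np

def sd_correction(d_pi, d_D, eps=1e-12):
    """Stationary distribution correction w = d^{pi,gamma} / d^D (tabular case)."""
    return d_pi / np.maximum(d_D, eps)     # eps guards against division by zero

def pv_from_correction(r, d_D, w):
    """rho^{pi,gamma} = E_{(s,a) ~ d^D}[ r(s, a) w(s, a) ]."""
    return np.sum(d_D * r * w)
```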

We can define the simple (unweighted) Law of Large Numbers estimators

  • the on-policy estimator $\hat \rho^{\pi, \gamma}_\text{OnPE}$,
  • the simple importance-sampling estimator $\hat \rho^{\pi, \gamma}_\text{SIS}$,
  • the simple DICE estimator $\hat \rho^{\pi, \gamma}_\text{s}$.

When dividing by the sum of the weights assigned to each sample, rather than by the number of samples, we obtain the weighted estimators (both variants are sketched after the following list)

  • the weighted importance-sampling estimator $\hat \rho^{\pi, \gamma}_\text{WIS}$,
  • the weighted DICE estimator $\hat \rho^{\pi, \gamma}_\text{w}$.
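
A sketch contrasting the two normalizations for the DICE estimators, given the per-sample rewards $r_i$ and estimated corrections $\hat w(s_i, a_i)$ of a dataset (function and variable names are illustrative).

```python
import numpy as np

def dice_estimates(rewards, w_hat):
    """Simple and weighted (self-normalized) DICE estimates from dataset samples."""
    rewards = np.asarray(rewards, dtype=float)
    w_hat = np.asarray(w_hat, dtype=float)
    rho_simple = np.mean(rewards * w_hat)                    # divide by the number of samples n
    rho_weighted = np.sum(rewards * w_hat) / np.sum(w_hat)   # divide by the sum of the weights
    return rho_simple, rho_weighted
```

The weighted (self-normalized) form is typically biased but often has lower variance than the simple form.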

Get detailed information on: