Value iteration
tags:
- 🌱
- AI
- ComputerScience
- MDP
date: 10-Feb-2023
Value iteration makes use of the Bellman equation to create a Bellman update step (written out after the list below).
- Note that value iteration finds the $a \in A(s)$ that maximises the value of the state in the current iteration
- The information used to update the current iteration makes use of the values (experience) from the previous iteration
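Written out, the Bellman update applied at each iteration is (using the standard MDP notation $P(s' \mid s,a)$ for transition probabilities, $R$ for rewards and $\gamma$ for the discount factor, which are not restated in these notes):

$$V_{i+1}(s) \leftarrow \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma\, V_i(s')\big]$$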
```
Set threshold to some delta
Repeat
    // Outer loop: a while loop that runs until convergence.
    // Each pass of this loop increases the iteration count by one.
    For each s in S
        // Inner loop: update V(s) for every state in the current iteration.
        // In the first iteration most of the values might still be negative/arbitrary.
        Perform Bellman update
Until max over s of |V'(s) - V(s)| is less than the threshold
```
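A minimal Python/NumPy sketch of this loop, assuming a tabular MDP given as a transition array `P[s, a, s']` and an expected-reward array `R[s, a]` (these names and the array layout are illustrative, not from the notes); it keeps the old and the new value arrays separately:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, threshold=1e-6):
    """Sketch of value iteration for a tabular MDP.

    P: transition probabilities, shape (S, A, S), P[s, a, s'] = P(s' | s, a)
    R: expected rewards, shape (S, A)
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)            # v_0 chosen arbitrarily (all zeros here)
    while True:                       # outer loop: repeat until convergence
        # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
        Q = R + gamma * (P @ V)       # shape (S, A)
        V_new = Q.max(axis=1)         # Bellman update for every state s
        if np.max(np.abs(V_new - V)) < threshold:   # convergence check
            return V_new, Q.argmax(axis=1)          # values and greedy policy
        V = V_new
```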
The algorithm is guaranteed to converge due to the property of contraction. A contraction is a function with a single fixed point, where each application of the function moves the values closer to that fixed point.
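In symbols, the Bellman update operator $B$ is a contraction in the max norm with factor equal to the discount $\gamma < 1$ (a standard result, stated here for context):

$$\lVert BV - BV' \rVert_\infty \le \gamma\, \lVert V - V' \rVert_\infty$$

so repeated application pulls any value estimate towards the unique fixed point $V^*$.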
- $v_0$ is chosen arbitrarily, with terminal states given value 0
- Array of state values can be stored in two ways (a sketch of the in-place variant follows this list):
    - 2 arrays, one for the old values $v_i(s)$ and another for $v_{i+1}(s)$
    - Single array with updates done in-place
- When the number of possible actions is huge, the computation of $\displaystyle \max_{a\in A}{Q(s,a)}$ at each stage can be costly
    - Policy iteration could be more desirable instead
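As a rough sketch of the single-array option (reusing the illustrative `P`, `R` layout from the earlier snippet), the in-place variant overwrites `V[s]` as soon as it is recomputed, so later states in the same sweep already see the updated values:

```python
import numpy as np

def value_iteration_in_place(P, R, gamma=0.9, threshold=1e-6):
    """Single-array variant: V is updated in place during each sweep."""
    n_states, _, _ = P.shape
    V = np.zeros(n_states)
    while True:
        max_change = 0.0
        for s in range(n_states):
            # max over a of R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
            v_new = np.max(R[s] + gamma * (P[s] @ V))
            max_change = max(max_change, abs(v_new - V[s]))
            V[s] = v_new              # overwrite immediately (in-place update)
        if max_change < threshold:    # largest change this sweep below threshold
            return V
```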
Links: