# Q-Learning Convergence Analysis

## 1. Mathematical Proof of Convergence

Q-learning is guaranteed to converge to the optimal policy under these conditions:

### Required Conditions (All Met in This Implementation)
- **Finite States and Actions**
  - States: 25 (5×5 grid)
  - Actions: 4 (up, down, left, right)
  - ✓ Both are finite
- **All State-Action Pairs Visited Infinitely Often**
  - The epsilon-greedy policy ensures this: `exploration_rate = max(0.01, exploration_rate - 0.001)` (see the sketch just after this list)
  - The exploration rate never drops below a 1% chance of taking a random action
  - ✓ Ensures all pairs keep being visited
- **Rewards are Bounded**
  - Rewards are either 0 or 1
  - ✓ Clearly bounded
- **Markov Property**
  - The next state depends only on the current state and action
  - No hidden dependencies
  - ✓ The grid world is fully observable

(Strictly, the classical convergence proof also assumes a learning rate that decays over time, i.e. ∑αₜ = ∞ and ∑αₜ² < ∞; with the fixed α = 0.1 used here, the Q-values settle into a small neighborhood of the optimal values in practice.)
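A minimal sketch of how condition 2 is typically enforced, for reference. Only the decay line `max(0.01, exploration_rate - 0.001)` comes from the implementation quoted above; the helper names (`choose_action`, `decay_exploration`) and the 25×4 NumPy `q_table` layout are assumptions for illustration.

```python
import random

import numpy as np

def choose_action(q_table, state, exploration_rate, n_actions=4):
    """Epsilon-greedy selection: explore with probability exploration_rate,
    otherwise exploit the current Q-values for this state."""
    if random.random() < exploration_rate:
        return random.randrange(n_actions)      # random exploration
    return int(np.argmax(q_table[state]))       # greedy exploitation

def decay_exploration(exploration_rate):
    """Linear decay with a 1% floor, matching the schedule quoted above."""
    return max(0.01, exploration_rate - 0.001)
```

Because the floor is 0.01 rather than 0, every state-action pair retains a nonzero probability of being tried for as long as training runs, which is what the "visited infinitely often" condition relies on.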
## 2. Practical Convergence Mechanism

### Value Propagation Example

Let's trace how values propagate backward from the goal:
- **Initial State**
  - Goal (state 24): reward = 1
  - States adjacent to the goal (23, 19): Q-values = 0
  - All other states: Q-values = 0
- **After First Goal Discovery**

  ```python
  # At state 23 (next to the goal), taking action Right
  new_q = 0 + 0.1 * (1 + 0.9 * 0 - 0)
  # new_q = 0.1
  ```
- **Value Propagation**

  ```python
  # Two steps from the goal (state 22)
  # After several visits:
  new_q = 0 + 0.1 * (0 + 0.9 * 0.1 - 0)
  # new_q ≈ 0.009
  ```
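To make the trace concrete, here is a self-contained sketch that replays those two updates with the standard tabular Q-learning rule. The `q_update` helper, the 25×4 `q_table`, and the treatment of the goal as a terminal state are assumptions for illustration; the arithmetic matches the numbers above.

```python
import numpy as np

# Hypothetical 25-state, 4-action table; action index 3 = Right, as in the trace.
q_table = np.zeros((25, 4))
alpha, gamma = 0.1, 0.9

def q_update(state, action, reward, next_state, terminal=False):
    """One tabular Q-learning update: Q <- Q + alpha * (target - Q)."""
    target = reward + (0.0 if terminal else gamma * q_table[next_state].max())
    q_table[state, action] += alpha * (target - q_table[state, action])

# First goal discovery: 23 -> 24 earns reward 1 and ends the episode.
q_update(23, 3, reward=1, next_state=24, terminal=True)
print(q_table[23, 3])   # 0.1

# On a later episode, the value starts leaking back to state 22.
q_update(22, 3, reward=0, next_state=23)
print(q_table[22, 3])   # ≈ 0.009
```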
### Convergence Pattern

The Q-values stabilize in this pattern:

- Actions that lead into the goal → highest values
- States near the goal → high values
- Distant states → lower but stable values
Example Q-value progression for state 23, action Right (each visit moves the value 10% of the way toward its target of 1, since α = 0.1):

0 → 0.1 → 0.19 → 0.271 → 0.344 → 0.409 → ... → 1.0
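The progression can be reproduced with a short loop. This is an illustrative sketch assuming the agent repeatedly takes the 23 → 24 transition with reward 1 and a terminal goal, so the update target is constant at 1.

```python
# Repeated updates of Q(23, Right) after the transition 23 -> 24
# (reward 1, terminal goal), with alpha = 0.1.
q = 0.0
for visit in range(1, 7):
    q += 0.1 * (1.0 - q)      # target is 1 because the goal is terminal
    print(f"visit {visit}: Q = {q:.3f}")
# visit 1: Q = 0.100
# visit 2: Q = 0.190
# visit 3: Q = 0.271
# ...
```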
## 3. Bellman Equation Satisfaction

The algorithm has converged when every Q-value satisfies the Bellman optimality equation:

Q(s,a) = R(s,a) + γ * max_{a'} Q(s',a')
For our implementation:

```python
# At convergence (example numbers):
state = 22       # Two steps from the goal
action = 3       # Right
reward = 0       # Step reward
next_state = 23  # One step from the goal, where max Q has converged to 1.0
gamma = 0.9

# Bellman equation satisfied:
# Q[22, 3] = reward + gamma * max(Q[23, :])
#      0.9 =      0 +   0.9 * 1.0
```
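The same fixed-point property can be checked numerically over the whole table rather than for a single state. The sketch below is an assumption-heavy illustration: `transitions` is a hypothetical list of the grid's deterministic `(state, action, reward, next_state, terminal)` tuples, and `q_table` is the 25×4 array used in the earlier sketches; neither name appears in the original implementation.

```python
import numpy as np

def bellman_residual(q_table, transitions, gamma=0.9):
    """Largest violation of the Bellman optimality equation over all
    known (state, action, reward, next_state, terminal) transitions."""
    worst = 0.0
    for state, action, reward, next_state, terminal in transitions:
        target = reward + (0.0 if terminal else gamma * q_table[next_state].max())
        worst = max(worst, abs(q_table[state, action] - target))
    return worst

# Example usage (hypothetical transition list for the 5x5 grid):
# transitions = [(22, 3, 0, 23, False), (23, 3, 1, 24, True), ...]
# print(bellman_residual(q_table, transitions))   # ~0 once learning has converged
```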
## 4. Practical Verification

To verify convergence in practice, look for three signs:

- Q-value changes between updates become minimal
- The greedy policy (the argmax action in each state) stops changing (see the sketch at the end of this section)
- Average episode rewards stabilize
Example convergence check:

```python
import numpy as np

old_q = q_table.copy()

# ... run one more round of Q-learning updates ...

max_change = np.max(np.abs(q_table - old_q))
if max_change < 0.001:
    print("Converged!")
```