# Q-Learning Convergence Analysis

## 1. Mathematical Proof of Convergence

Q-learning is guaranteed to converge to the optimal policy under these conditions:

### Required Conditions (All Met in This Implementation)
- **Finite States and Actions**
  - States: 25 (5×5 grid)
  - Actions: 4 (up, down, left, right)
  - ✓ Both are finite
- **All State-Action Pairs Visited Infinitely Often**
  - The epsilon-greedy policy ensures this: `exploration_rate = max(0.01, exploration_rate - 0.001)` (see the sketch just after this list)
  - The exploration rate never drops below a 1% chance of taking a random action
  - ✓ Ensures all pairs keep being visited
- **Rewards are Bounded**
  - Rewards are either 0 or 1
  - ✓ Clearly bounded
- **Markov Property**
  - The next state depends only on the current state and action
  - No hidden dependencies
  - ✓ The grid world is fully observable

(Strictly, the classical convergence proof also assumes a learning rate that decays over time, i.e. ∑αₜ = ∞ and ∑αₜ² < ∞; with the fixed α = 0.1 used here, the Q-values settle into a small neighborhood of the optimal values in practice.)
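A minimal sketch of how condition 2 is typically enforced, for reference. Only the decay line `max(0.01, exploration_rate - 0.001)` comes from the implementation quoted above; the helper names (`choose_action`, `decay_exploration`) and the 25×4 NumPy `q_table` layout are assumptions for illustration.

```python
import random

import numpy as np

def choose_action(q_table, state, exploration_rate, n_actions=4):
    """Epsilon-greedy selection: explore with probability exploration_rate,
    otherwise exploit the current Q-values for this state."""
    if random.random() < exploration_rate:
        return random.randrange(n_actions)      # random exploration
    return int(np.argmax(q_table[state]))       # greedy exploitation

def decay_exploration(exploration_rate):
    """Linear decay with a 1% floor, matching the schedule quoted above."""
    return max(0.01, exploration_rate - 0.001)
```

Because the floor is 0.01 rather than 0, every state-action pair retains a nonzero probability of being tried for as long as training runs, which is what the "visited infinitely often" condition relies on.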
## 2. Practical Convergence Mechanism

### Value Propagation Example

Let's trace how values propagate backward from the goal:
- **Initial State**
  - Goal (state 24): reward = 1
  - States adjacent to the goal (23, 19): Q-values = 0
  - All other states: Q-values = 0
- **After First Goal Discovery**

  ```python
  # At state 23 (next to the goal), taking action Right
  new_q = 0 + 0.1 * (1 + 0.9 * 0 - 0)
  # new_q = 0.1
  ```
- **Value Propagation**

  ```python
  # Two steps from the goal (state 22)
  # After several visits:
  new_q = 0 + 0.1 * (0 + 0.9 * 0.1 - 0)
  # new_q ≈ 0.009
  ```
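To make the trace concrete, here is a self-contained sketch that replays those two updates with the standard tabular Q-learning rule. The `q_update` helper, the 25×4 `q_table`, and the treatment of the goal as a terminal state are assumptions for illustration; the arithmetic matches the numbers above.

```python
import numpy as np

# Hypothetical 25-state, 4-action table; action index 3 = Right, as in the trace.
q_table = np.zeros((25, 4))
alpha, gamma = 0.1, 0.9

def q_update(state, action, reward, next_state, terminal=False):
    """One tabular Q-learning update: Q <- Q + alpha * (target - Q)."""
    target = reward + (0.0 if terminal else gamma * q_table[next_state].max())
    q_table[state, action] += alpha * (target - q_table[state, action])

# First goal discovery: 23 -> 24 earns reward 1 and ends the episode.
q_update(23, 3, reward=1, next_state=24, terminal=True)
print(q_table[23, 3])   # 0.1

# On a later episode, the value starts leaking back to state 22.
q_update(22, 3, reward=0, next_state=23)
print(q_table[22, 3])   # ≈ 0.009
```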
### Convergence Pattern

The Q-values stabilize in this pattern:

- Actions that lead into the goal → highest values
- States near the goal → high values
- Distant states → lower but stable values
Example Q-value progression for state 23, action Right (each visit moves the value 10% of the way toward its target of 1, since α = 0.1):

0 → 0.1 → 0.19 → 0.271 → 0.344 → 0.409 → ... → 1.0
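The progression can be reproduced with a short loop. This is an illustrative sketch assuming the agent repeatedly takes the 23 → 24 transition with reward 1 and a terminal goal, so the update target is constant at 1.

```python
# Repeated updates of Q(23, Right) after the transition 23 -> 24
# (reward 1, terminal goal), with alpha = 0.1.
q = 0.0
for visit in range(1, 7):
    q += 0.1 * (1.0 - q)      # target is 1 because the goal is terminal
    print(f"visit {visit}: Q = {q:.3f}")
# visit 1: Q = 0.100
# visit 2: Q = 0.190
# visit 3: Q = 0.271
# ...
```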
## 3. Bellman Equation Satisfaction

The algorithm has converged when every Q-value satisfies the Bellman optimality equation:

Q(s,a) = R(s,a) + γ * max_{a'} Q(s',a')
For our implementation:

```python
# At convergence (example numbers):
state = 22       # Two steps from the goal
action = 3       # Right
reward = 0       # Step reward
next_state = 23  # One step from the goal, where max Q has converged to 1.0
gamma = 0.9

# Bellman equation satisfied:
# Q[22, 3] = reward + gamma * max(Q[23, :])
#      0.9 =      0 +   0.9 * 1.0
```
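The same fixed-point property can be checked numerically over the whole table rather than for a single state. The sketch below is an assumption-heavy illustration: `transitions` is a hypothetical list of the grid's deterministic `(state, action, reward, next_state, terminal)` tuples, and `q_table` is the 25×4 array used in the earlier sketches; neither name appears in the original implementation.

```python
import numpy as np

def bellman_residual(q_table, transitions, gamma=0.9):
    """Largest violation of the Bellman optimality equation over all
    known (state, action, reward, next_state, terminal) transitions."""
    worst = 0.0
    for state, action, reward, next_state, terminal in transitions:
        target = reward + (0.0 if terminal else gamma * q_table[next_state].max())
        worst = max(worst, abs(q_table[state, action] - target))
    return worst

# Example usage (hypothetical transition list for the 5x5 grid):
# transitions = [(22, 3, 0, 23, False), (23, 3, 1, 24, True), ...]
# print(bellman_residual(q_table, transitions))   # ~0 once learning has converged
```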
## 4. Practical Verification

To verify convergence in practice, look for three signs:

- Q-value changes between updates become minimal
- The greedy policy (the argmax action in each state) stops changing (see the sketch at the end of this section)
- Average episode rewards stabilize
Example convergence check:

```python
import numpy as np

old_q = q_table.copy()

# ... run one more round of Q-learning updates ...

max_change = np.max(np.abs(q_table - old_q))
if max_change < 0.001:
    print("Converged!")
```