Q-Learning Convergence Analysis

1. Mathematical Proof of Convergence

Q-learning converges to the optimal Q-values, and hence to the optimal policy, with probability 1 when the following conditions hold (the classical proof also assumes a learning rate that decays appropriately across visits):

Required Conditions (All Met in This Implementation)

  1. Finite States and Actions

    • States: 25 (5×5 grid)
    • Actions: 4 (up, down, left, right)
    • ✓ Both are finite
  2. All State-Action Pairs Visited Infinitely Often

    • Epsilon-greedy policy ensures this:
    exploration_rate = max(0.01, exploration_rate - 0.001)
    
    • The exploration rate never drops below a 1% chance of choosing a random action
    • ✓ Ensures every state-action pair keeps being visited (a minimal epsilon-greedy sketch follows this list)
  3. Rewards are Bounded

    • Rewards are either 0 or 1
    • ✓ Clearly bounded
  4. Markov Property

    • Next state depends only on current state and action
    • No hidden dependencies
    • ✓ Grid world is fully observable
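
A minimal epsilon-greedy sketch tying conditions 1 and 2 together: the table shape, the decay step, and the 1% floor come from the numbers above, while the starting exploration rate and all names (choose_action, q_table, etc.) are illustrative assumptions rather than the actual implementation.

import numpy as np

n_states, n_actions = 25, 4                # 5x5 grid, 4 moves
q_table = np.zeros((n_states, n_actions))  # finite table of Q-values
exploration_rate = 1.0                     # assumed starting value

def choose_action(state):
    """Epsilon-greedy: random action with probability exploration_rate."""
    if np.random.random() < exploration_rate:
        return np.random.randint(n_actions)    # explore
    return int(np.argmax(q_table[state]))      # exploit

# Decayed after each episode, but never below the 1% exploration floor
exploration_rate = max(0.01, exploration_rate - 0.001)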

2. Practical Convergence Mechanism

Value Propagation Example

Let's trace how values propagate backward from the goal; a runnable sketch of the same propagation follows this trace:

  1. Initial State

    Goal (state 24): reaching it yields reward = 1
    States adjacent to the goal (23 and 19): Q-values = 0
    All other states: Q-values = 0
    
  2. After First Goal Discovery

    # At state 23 (next to the goal), taking action Right into the goal
    # Update: Q + alpha * (reward + gamma * max Q(next) - Q), with alpha = 0.1
    new_q = 0 + 0.1 * (1 + 0.9 * 0 - 0)
    # new_q = 0.1
    
  3. Value Propagation

    # Two steps from the goal (state 22), moving toward state 23,
    # once Q[23, Right] has reached 0.1:
    new_q = 0 + 0.1 * (0 + 0.9 * 0.1 - 0)
    # new_q = 0.009
    

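To make the trace concrete, here is a small self-contained sketch of tabular Q-learning on the 5×5 grid, assuming a deterministic grid, reward 1 on the transition into the terminal goal state 24, α = 0.1, and γ = 0.9; the layout and hyperparameters match the numbers above, but the environment code and names are illustrative, not the actual implementation.

import numpy as np

GRID = 5
N_STATES, N_ACTIONS = GRID * GRID, 4   # actions: 0=up, 1=down, 2=left, 3=right
GOAL = 24
ALPHA, GAMMA = 0.1, 0.9

def step(state, action):
    """Deterministic move; reward 1 and episode end on reaching the goal."""
    row, col = divmod(state, GRID)
    if action == 0:   row = max(row - 1, 0)
    elif action == 1: row = min(row + 1, GRID - 1)
    elif action == 2: col = max(col - 1, 0)
    else:             col = min(col + 1, GRID - 1)
    next_state = row * GRID + col
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

q_table = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(0)

for episode in range(5000):
    state, done = 0, False                 # start in the corner opposite the goal
    while not done:
        action = rng.integers(N_ACTIONS)   # pure exploration is enough for this demo
        next_state, reward, done = step(state, action)
        target = reward + (0.0 if done else GAMMA * np.max(q_table[next_state]))
        q_table[state, action] += ALPHA * (target - q_table[state, action])
        state = next_state

print(q_table[23, 3])   # approaches 1.0 (one step from the goal)
print(q_table[22, 3])   # approaches 0.9 (two steps from the goal)
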
Convergence Pattern

The Q-values stabilize in this pattern:

  1. Actions that lead into the goal → highest values
  2. States near the goal → high values
  3. Distant states → lower but stable values (reduced by roughly a factor of γ per extra step)

Example Q-value progression for state 23, action Right:

0 → 0.1 → 0.19 → 0.271 → 0.344 → 0.409 → ... → 1.0
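
Because every visit to this pair sees reward 1 and a terminal next state, the progression is just the update Q ← Q + 0.1·(1 − Q) applied repeatedly, so it approaches 1.0; a few lines reproduce it (a sketch, not the actual training loop):

q = 0.0
for n in range(60):
    q += 0.1 * (1 + 0.9 * 0 - q)   # reward 1, terminal next state has value 0
    if n < 5:
        print(round(q, 5))         # 0.1, 0.19, 0.271, 0.3439, 0.40951
print(round(q, 5))                 # ≈ 0.9982 after 60 updates, approaching 1.0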

3. Bellman Equation Satisfaction

The algorithm converges when it satisfies the Bellman equation:

Q(s,a) = R(s,a) + γ * max[Q(s',a')]

For our implementation:

# At convergence (example numbers):
state = 23       # One step from the goal
action = 3       # Right
reward = 1       # Received on the transition into the goal
next_state = 24  # Goal (terminal, so its Q-values stay 0)
gamma = 0.9

# Bellman equation satisfied:
Q[23,3] = 1 + 0.9 * max(Q[24,:])
1.0 = 1 + 0.9 * 0.0
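
A numerical way to test this is to measure the Bellman residual over all non-terminal state-action pairs. The sketch below continues the propagation example above and reuses its assumed step() function, q_table, and constants; at convergence the residual should be close to zero.

# Largest violation of Q(s,a) = R(s,a) + gamma * max Q(s',·)
residual = 0.0
for s in range(N_STATES):
    if s == GOAL:
        continue                      # terminal state: no outgoing transitions to check
    for a in range(N_ACTIONS):
        s2, r, done = step(s, a)
        target = r + (0.0 if done else GAMMA * np.max(q_table[s2]))
        residual = max(residual, abs(q_table[s, a] - target))
print(f"max Bellman residual: {residual:.4f}")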

4. Practical Verification

To verify convergence:

  1. Q-value changes between updates become minimal (checked below)
  2. The greedy policy stops changing between checkpoints (see the policy-stability sketch at the end of this section)
  3. Average episode rewards stabilize

Example convergence check:

import numpy as np

# Snapshot the table, train further, then measure the largest single change
old_q = q_table.copy()
# ... run one more training episode, updating q_table in place ...
max_change = np.max(np.abs(q_table - old_q))
if max_change < 0.001:
    print("Converged!")
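
For the second signal, a stable policy, one option is to compare the greedy action in every state between training checkpoints. A minimal sketch, again assuming a NumPy q_table with one row per state (names are illustrative):

import numpy as np

# Greedy policy before and after another block of training
old_policy = np.argmax(q_table, axis=1)
# ... run more training episodes, updating q_table in place ...
new_policy = np.argmax(q_table, axis=1)

if np.array_equal(old_policy, new_policy):
    print("Greedy policy is stable across checkpoints")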