
rl_gridworld.py

GridWorld Q-Learning Analysis

1. Environment Structure

The code implements a 5×5 grid world where:

  • States are numbered 0-24 (5×5 = 25 states total)
  • State calculation: state = row * size + column
  • Example states:
    • Top-left (0,0) = 0 * 5 + 0 = 0
    • Top-right (0,4) = 0 * 5 + 4 = 4
    • Bottom-left (4,0) = 4 * 5 + 0 = 20
    • Goal (bottom-right) (4,4) = 4 * 5 + 4 = 24
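
The row/column arithmetic above can be wrapped in two small helpers. This is a minimal sketch; the names to_state and to_coords are illustrative and may differ from what rl_gridworld.py actually uses:

# Assumed 5x5 grid; helper names are illustrative, not from the script.
SIZE = 5

def to_state(row, col):
    # Flatten (row, col) into a single state index.
    return row * SIZE + col

def to_coords(state):
    # Recover (row, col) from a flat state index.
    return divmod(state, SIZE)

assert to_state(0, 0) == 0    # top-left
assert to_state(0, 4) == 4    # top-right
assert to_state(4, 0) == 20   # bottom-left
assert to_state(4, 4) == 24   # goal (bottom-right)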

2. Q-Learning Process

Initial Setup

# Q-table starts as a 25×4 matrix of zeros
# 25 states × 4 actions (up, down, left, right)
q_table = np.zeros((25, 4))

# Parameters
learning_rate = 0.1       # alpha (α): step size for Q-value updates
discount_factor = 0.9     # gamma (γ): weight given to future rewards
exploration_rate = 1.0    # initial epsilon (ε): decays after each episode
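
For the examples that follow, it helps to keep an environment step function in mind. The sketch below is an assumption consistent with this write-up: actions are numbered 0=up, 1=down, 2=left, 3=right, moves that would leave the grid keep the agent in place, and the reward is taken to be 1 at the goal and 0 elsewhere (the exact reward values in rl_gridworld.py may differ):

SIZE = 5
GOAL = SIZE * SIZE - 1  # state 24, the bottom-right corner

def step(state, action):
    # Hypothetical transition function; the action numbering and the
    # 0/1 reward scheme are assumptions, not taken from rl_gridworld.py.
    row, col = divmod(state, SIZE)
    if action == 0:                      # up
        row = max(row - 1, 0)
    elif action == 1:                    # down
        row = min(row + 1, SIZE - 1)
    elif action == 2:                    # left
        col = max(col - 1, 0)
    else:                                # right
        col = min(col + 1, SIZE - 1)
    next_state = row * SIZE + col
    reward = 1.0 if next_state == GOAL else 0.0
    done = next_state == GOAL
    return next_state, reward, done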

Example Learning Update

Let's walk through one learning step:

  1. Current state: (1,1) = state 6
  2. Action taken: Right (action 3)
  3. Next state: (1,2) = state 7
  4. Reward received: 0 (not at goal)

Q-value update calculation:

# Old Q-value
current_q = q_table[6, 3]  # Let's say it's 0.5

# Best future value from next state
next_max_q = max(q_table[7, :])  # Let's say it's 1.0

# Q-learning formula
new_q = current_q + learning_rate * (reward + discount_factor * next_max_q - current_q)
new_q = 0.5 + 0.1 * (0 + 0.9 * 1.0 - 0.5)
new_q = 0.5 + 0.1 * (0 + 0.9 - 0.5)
new_q = 0.5 + 0.1 * 0.4
new_q = 0.54
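
In code, the same update might look like the following. The variable names (state, action, next_state, reward) are assumed to mirror the walkthrough, and q_table, learning_rate, and discount_factor come from the setup above:

import numpy as np

# One Q-learning update for the transition described above.
current_q = q_table[state, action]
next_max_q = np.max(q_table[next_state, :])   # best value reachable from the next state
q_table[state, action] = current_q + learning_rate * (
    reward + discount_factor * next_max_q - current_q
)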

Action Selection Example

With exploration_rate = 0.3:

  1. Generate random number: rand() = 0.4
  2. Since 0.4 > 0.3, exploit (choose best action)
  3. For state 6, look at q_table[6]: [0.2, 0.54, 0.1, 0.3]
  4. Choose action 1 (down) as it has highest Q-value
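
Putting that decision rule into a function gives an epsilon-greedy selector along these lines; this is a sketch, and choose_action is an illustrative name rather than the script's actual function:

import numpy as np

def choose_action(q_table, state, exploration_rate, n_actions=4):
    # Explore with probability epsilon, otherwise exploit the best known action.
    if np.random.rand() < exploration_rate:
        return np.random.randint(n_actions)    # explore: uniform random action
    return int(np.argmax(q_table[state, :]))   # exploit: highest Q-value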

Exploration Rate Decay

After each episode:

new_rate = max(0.01, current_rate - 0.001)
# Example: 0.5 → 0.499 → 0.498 → ...
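
Iterating that rule shows how gradual the decay is; the snippet below simply prints the next few values, starting from the 0.5 used in the example:

exploration_rate = 0.5
for episode in range(3):
    exploration_rate = max(0.01, exploration_rate - 0.001)
    print(round(exploration_rate, 3))   # 0.499, 0.498, 0.497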

3. Complete Episode Example

Starting position: (0,0) = state 0

  1. Step 1:

    • State: 0
    • Action: Down (1)
    • New state: 5
    • Reward: 0
    • Q-update: q_table[0,1] updated
  2. Step 2:

    • State: 5
    • Action: Right (3)
    • New state: 6
    • Reward: 0
    • Q-update: q_table[5,3] updated
  3. ... continues until reaching goal (state 24) or max steps

The Q-table gradually fills with values estimating the expected discounted future reward for each state-action pair. Higher values indicate better actions for reaching the goal.
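
For completeness, here is a hedged sketch of how the whole training loop could be assembled from the pieces above. It reuses the assumed step and choose_action helpers from the earlier snippets, and the episode and step counts are illustrative rather than taken from rl_gridworld.py:

import numpy as np

SIZE = 5
N_STATES, N_ACTIONS = SIZE * SIZE, 4
q_table = np.zeros((N_STATES, N_ACTIONS))

learning_rate = 0.1
discount_factor = 0.9
exploration_rate = 1.0
n_episodes = 1000        # illustrative value
max_steps = 100          # illustrative value

for episode in range(n_episodes):
    state = 0                                # start at top-left (0,0)
    for _ in range(max_steps):
        action = choose_action(q_table, state, exploration_rate, N_ACTIONS)
        next_state, reward, done = step(state, action)

        # Q-learning update, exactly as in the worked example.
        best_next = np.max(q_table[next_state, :])
        q_table[state, action] += learning_rate * (
            reward + discount_factor * best_next - q_table[state, action]
        )

        state = next_state
        if done:                             # reached the goal (state 24)
            break

    # Decay exploration once per episode, never below 0.01.
    exploration_rate = max(0.01, exploration_rate - 0.001)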