# GridWorld Q-Learning Analysis

## 1. Environment Structure
The code implements a 5×5 grid world where:

- States are numbered 0-24 (5 × 5 = 25 states in total)
- State calculation: `state = row * size + column` (see the sketch after this list)
- Example states:
  - Top-left (0,0) = 0 * 5 + 0 = 0
  - Top-right (0,4) = 0 * 5 + 4 = 4
  - Bottom-left (4,0) = 4 * 5 + 0 = 20
  - Goal, bottom-right (4,4) = 4 * 5 + 4 = 24
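A minimal sketch of this indexing, using the 5×5 grid above (the helper names `to_state` and `to_coords` are illustrative, not taken from the original script):

```python
size = 5  # 5×5 grid

def to_state(row, column):
    """Flatten (row, column) grid coordinates into a single state index."""
    return row * size + column

def to_coords(state):
    """Recover (row, column) from a flattened state index."""
    return divmod(state, size)

print(to_state(0, 4))  # 4  (top-right)
print(to_state(4, 0))  # 20 (bottom-left)
print(to_coords(24))   # (4, 4), the goal
```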
## 2. Q-Learning Process

### Initial Setup
```python
import numpy as np

# Q-table starts as a 25×4 matrix of zeros:
# 25 states × 4 actions (up, down, left, right)
q_table = np.zeros((25, 4))

# Hyperparameters
learning_rate = 0.1     # α
discount_factor = 0.9   # γ
exploration_rate = 1.0  # ε (initial value)
```
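The environment itself is not shown in this walkthrough, but given the state encoding and the per-step reward of 0 used in the example below, a minimal step function could look roughly like the sketch here. The action ordering (0 = up, 1 = down, 2 = left, 3 = right), the boundary clamping, and the goal reward of 1 are assumptions made for illustration:

```python
def step(state, action, size=5):
    """Apply an action, clamp to the grid, and return (next_state, reward, done).

    Assumed action coding: 0 = up, 1 = down, 2 = left, 3 = right.
    Assumed rewards: 0 per move, 1 when the goal (bottom-right) is reached.
    """
    row, col = divmod(state, size)
    if action == 0:
        row = max(row - 1, 0)
    elif action == 1:
        row = min(row + 1, size - 1)
    elif action == 2:
        col = max(col - 1, 0)
    elif action == 3:
        col = min(col + 1, size - 1)

    next_state = row * size + col
    done = next_state == size * size - 1   # state 24 is the goal
    reward = 1.0 if done else 0.0
    return next_state, reward, done
```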
### Example Learning Update
Let's walk through one learning step:
- Current state: (1,1) = state 6
- Action taken: Right (action 3)
- Next state: (1,2) = state 7
- Reward received: 0 (not at goal)
Q-value update calculation:
```python
# Old Q-value
current_q = q_table[6, 3]        # let's say it's 0.5

# Best achievable value from the next state
next_max_q = max(q_table[7, :])  # let's say it's 1.0

# Q-learning formula
new_q = current_q + learning_rate * (reward + discount_factor * next_max_q - current_q)
#     = 0.5 + 0.1 * (0 + 0.9 * 1.0 - 0.5)
#     = 0.5 + 0.1 * 0.4
#     = 0.54
```
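Wrapped into a small self-contained helper (the function name `q_update` is just for illustration), the same numbers reproduce the 0.54 result:

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new Q-value."""
    best_next = np.max(q_table[next_state])
    q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])
    return q_table[state, action]

q = np.zeros((25, 4))
q[6, 3] = 0.5   # old Q-value from the example
q[7, 3] = 1.0   # so the best value reachable from state 7 is 1.0

print(round(q_update(q, state=6, action=3, reward=0.0, next_state=7), 4))  # 0.54
```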
### Action Selection Example
With exploration_rate = 0.3:

- Generate a random number: rand() = 0.4
- Since 0.4 > 0.3, exploit (choose the best-known action)
- For state 6, look at q_table[6]: [0.2, 0.1, 0.3, 0.54] (including the value updated above)
- Choose action 3 (right), since it has the highest Q-value
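The same decision rule written as a small ε-greedy helper (a sketch; the function name and the random-number source are illustrative):

```python
import numpy as np

def choose_action(q_table, state, exploration_rate, n_actions=4):
    """ε-greedy: explore with probability ε, otherwise exploit the best-known action."""
    if np.random.rand() < exploration_rate:
        return np.random.randint(n_actions)   # explore: pick a random action
    return int(np.argmax(q_table[state]))     # exploit: pick the highest Q-value

# With exploration_rate = 0.3 and a random draw of 0.4, the exploit branch runs,
# and argmax over [0.2, 0.1, 0.3, 0.54] returns action 3 (right).
```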
### Exploration Rate Decay
After each episode, the exploration rate decays linearly, with a floor of 0.01:

```python
exploration_rate = max(0.01, exploration_rate - 0.001)
# Example: 0.5 → 0.499 → 0.498 → ...
```

Starting from ε = 1.0, the rate reaches the 0.01 floor after 990 episodes.
## 3. Complete Episode Example
Starting position: (0,0) = state 0

Step 1:
- State: 0
- Action: Down (1)
- New state: 5
- Reward: 0
- Q-update: q_table[0, 1] updated

Step 2:
- State: 5
- Action: Right (3)
- New state: 6
- Reward: 0
- Q-update: q_table[5, 3] updated

... and so on until the agent reaches the goal (state 24) or hits the maximum number of steps.
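Putting the pieces together, a minimal end-to-end training loop might look like the sketch below. It reuses the hypothetical `step` and `choose_action` helpers from earlier rather than the exact code in rl_gridworld.py, and the episode count and step cap are illustrative assumptions:

```python
import numpy as np

n_states, n_actions = 25, 4
q_table = np.zeros((n_states, n_actions))
learning_rate, discount_factor = 0.1, 0.9
exploration_rate, min_rate, decay = 1.0, 0.01, 0.001
max_steps = 100                                  # assumed per-episode step cap

for episode in range(1000):                      # assumed number of episodes
    state = 0                                    # start at the top-left corner
    for _ in range(max_steps):
        action = choose_action(q_table, state, exploration_rate)
        next_state, reward, done = step(state, action)

        # Tabular Q-learning update
        best_next = np.max(q_table[next_state])
        q_table[state, action] += learning_rate * (
            reward + discount_factor * best_next - q_table[state, action]
        )

        state = next_state
        if done:                                 # reached the goal, state 24
            break

    # Decay exploration after each episode, never below the floor
    exploration_rate = max(min_rate, exploration_rate - decay)
```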
The Q-table gradually fills with values representing expected future rewards for each state-action pair. Higher values indicate better actions for reaching the goal.