# GridWorld Q-Learning Analysis

## 1. Environment Structure
The code implements a 5×5 grid world where:

- States are numbered 0-24 (5 × 5 = 25 states in total)
- State calculation: `state = row * size + column` (see the sketch after this list)
- Example states:
  - Top-left (0,0) = 0 * 5 + 0 = 0
  - Top-right (0,4) = 0 * 5 + 4 = 4
  - Bottom-left (4,0) = 4 * 5 + 0 = 20
  - Goal, bottom-right (4,4) = 4 * 5 + 4 = 24
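A minimal sketch of this indexing, using the 5×5 grid above (the helper names `to_state` and `to_coords` are illustrative, not taken from the original script):

```python
size = 5  # 5×5 grid

def to_state(row, column):
    """Flatten (row, column) grid coordinates into a single state index."""
    return row * size + column

def to_coords(state):
    """Recover (row, column) from a flattened state index."""
    return divmod(state, size)

print(to_state(0, 4))  # 4  (top-right)
print(to_state(4, 0))  # 20 (bottom-left)
print(to_coords(24))   # (4, 4), the goal
```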
## 2. Q-Learning Process

### Initial Setup
```python
import numpy as np

# Q-table starts as a 25×4 matrix of zeros:
# 25 states × 4 actions (up, down, left, right)
q_table = np.zeros((25, 4))

# Hyperparameters
learning_rate = 0.1     # α
discount_factor = 0.9   # γ
exploration_rate = 1.0  # ε (initial value)
```
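The environment itself is not shown in this walkthrough, but given the state encoding and the per-step reward of 0 used in the example below, a minimal step function could look roughly like the sketch here. The action ordering (0 = up, 1 = down, 2 = left, 3 = right), the boundary clamping, and the goal reward of 1 are assumptions made for illustration:

```python
def step(state, action, size=5):
    """Apply an action, clamp to the grid, and return (next_state, reward, done).

    Assumed action coding: 0 = up, 1 = down, 2 = left, 3 = right.
    Assumed rewards: 0 per move, 1 when the goal (bottom-right) is reached.
    """
    row, col = divmod(state, size)
    if action == 0:
        row = max(row - 1, 0)
    elif action == 1:
        row = min(row + 1, size - 1)
    elif action == 2:
        col = max(col - 1, 0)
    elif action == 3:
        col = min(col + 1, size - 1)

    next_state = row * size + col
    done = next_state == size * size - 1   # state 24 is the goal
    reward = 1.0 if done else 0.0
    return next_state, reward, done
```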
### Example Learning Update
Let's walk through one learning step:
- Current state: (1,1) = state 6
- Action taken: Right (action 3)
- Next state: (1,2) = state 7
- Reward received: 0 (not at goal)
Q-value update calculation:
```python
# Old Q-value
current_q = q_table[6, 3]        # let's say it's 0.5

# Best achievable value from the next state
next_max_q = max(q_table[7, :])  # let's say it's 1.0

# Q-learning formula
new_q = current_q + learning_rate * (reward + discount_factor * next_max_q - current_q)
#     = 0.5 + 0.1 * (0 + 0.9 * 1.0 - 0.5)
#     = 0.5 + 0.1 * 0.4
#     = 0.54
```
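Wrapped into a small self-contained helper (the function name `q_update` is just for illustration), the same numbers reproduce the 0.54 result:

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new Q-value."""
    best_next = np.max(q_table[next_state])
    q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])
    return q_table[state, action]

q = np.zeros((25, 4))
q[6, 3] = 0.5   # old Q-value from the example
q[7, 3] = 1.0   # so the best value reachable from state 7 is 1.0

print(round(q_update(q, state=6, action=3, reward=0.0, next_state=7), 4))  # 0.54
```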
### Action Selection Example
With exploration_rate = 0.3:

- Generate a random number: rand() = 0.4
- Since 0.4 > 0.3, exploit (choose the best-known action)
- For state 6, look at q_table[6]: [0.2, 0.1, 0.3, 0.54] (including the value updated above)
- Choose action 3 (right), since it has the highest Q-value
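The same decision rule written as a small ε-greedy helper (a sketch; the function name and the random-number source are illustrative):

```python
import numpy as np

def choose_action(q_table, state, exploration_rate, n_actions=4):
    """ε-greedy: explore with probability ε, otherwise exploit the best-known action."""
    if np.random.rand() < exploration_rate:
        return np.random.randint(n_actions)   # explore: pick a random action
    return int(np.argmax(q_table[state]))     # exploit: pick the highest Q-value

# With exploration_rate = 0.3 and a random draw of 0.4, the exploit branch runs,
# and argmax over [0.2, 0.1, 0.3, 0.54] returns action 3 (right).
```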
### Exploration Rate Decay
After each episode, the exploration rate decays linearly, with a floor of 0.01:

```python
exploration_rate = max(0.01, exploration_rate - 0.001)
# Example: 0.5 → 0.499 → 0.498 → ...
```

Starting from ε = 1.0, the rate reaches the 0.01 floor after 990 episodes.
## 3. Complete Episode Example
Starting position: (0,0) = state 0

Step 1:
- State: 0
- Action: Down (1)
- New state: 5
- Reward: 0
- Q-update: q_table[0, 1] updated

Step 2:
- State: 5
- Action: Right (3)
- New state: 6
- Reward: 0
- Q-update: q_table[5, 3] updated

... and so on until the agent reaches the goal (state 24) or hits the maximum number of steps.
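Putting the pieces together, a minimal end-to-end training loop might look like the sketch below. It reuses the hypothetical `step` and `choose_action` helpers from earlier rather than the exact code in rl_gridworld.py, and the episode count and step cap are illustrative assumptions:

```python
import numpy as np

n_states, n_actions = 25, 4
q_table = np.zeros((n_states, n_actions))
learning_rate, discount_factor = 0.1, 0.9
exploration_rate, min_rate, decay = 1.0, 0.01, 0.001
max_steps = 100                                  # assumed per-episode step cap

for episode in range(1000):                      # assumed number of episodes
    state = 0                                    # start at the top-left corner
    for _ in range(max_steps):
        action = choose_action(q_table, state, exploration_rate)
        next_state, reward, done = step(state, action)

        # Tabular Q-learning update
        best_next = np.max(q_table[next_state])
        q_table[state, action] += learning_rate * (
            reward + discount_factor * best_next - q_table[state, action]
        )

        state = next_state
        if done:                                 # reached the goal, state 24
            break

    # Decay exploration after each episode, never below the floor
    exploration_rate = max(min_rate, exploration_rate - decay)
```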
The Q-table gradually fills with values representing expected future rewards for each state-action pair. Higher values indicate better actions for reaching the goal.