# Q-learning algorithm: episodes and steps
Let me explain episodes and steps using concrete examples from the GridWorld code:
An episode is one complete attempt by the agent to reach the goal - like one game or one try. It ends when either:
- The agent reaches the goal (success!)
- The maximum steps (100) are reached (failure)
A step is one single move within an episode - like one turn in a game.
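For context, here is a minimal sketch of a GridWorld environment consistent with the examples on this page: a 5×5 grid, start at (0,0), goal at (4,4), reward 1 only at the goal, and moves into a wall leave the agent in place. The `reset()` / `step()` signatures match the training loop shown further down; the class name and the integer action encoding are illustrative assumptions, not the page's actual code.

```python
class GridWorld:
    """5x5 grid: start at (0,0), goal at (4,4); moving into a wall leaves the agent in place."""
    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.state = (0, 0)

    def reset(self):
        # Start every episode back at the top-left corner
        self.state = (0, 0)
        return self.state

    def step(self, action):
        # Action encoding (assumed): 0=UP, 1=DOWN, 2=LEFT, 3=RIGHT
        row, col = self.state
        if action == 0:
            row = max(row - 1, 0)
        elif action == 1:
            row = min(row + 1, self.size - 1)
        elif action == 2:
            col = max(col - 1, 0)
        elif action == 3:
            col = min(col + 1, self.size - 1)
        self.state = (row, col)
        done = self.state == self.goal
        reward = 1 if done else 0   # reward only for reaching the goal
        return self.state, reward, done
```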
Here's a concrete example of an episode with its steps:
```text
Episode 1:
  Step 1: Agent at (0,0) → moves RIGHT → (0,1), reward = 0
  Step 2: Agent at (0,1) → moves DOWN  → (1,1), reward = 0
  Step 3: Agent at (1,1) → moves RIGHT → (1,2), reward = 0
  Step 4: Agent at (1,2) → moves DOWN  → (2,2), reward = 0
  ...
  Step 8: Agent at (3,4) → moves DOWN  → (4,4), reward = 1  ← GOAL REACHED!
  Episode ends after 8 steps (Success)

Episode 2:
  Step 1: Agent at (0,0) → moves UP   → stays at (0,0), reward = 0
  Step 2: Agent at (0,0) → moves LEFT → stays at (0,0), reward = 0
  Step 3: Agent at (0,0) → moves DOWN → (1,0), reward = 0
  ...
  Step 100: Maximum steps reached without finding the goal
  Episode ends after 100 steps (Failure)
```
In the code, this structure appears here:
```python
for episode in range(episodes):          # do 1000 episodes
    state = env.reset()                  # start each episode at (0,0)
    total_reward = 0
    for step in range(max_steps):        # maximum 100 steps per episode
        action = agent.choose_action(state)             # pick a move
        next_state, reward, done = env.step(action)     # make the move
        agent.learn(state, action, reward, next_state)  # learn from what happened
        state = next_state               # continue from the new position
        total_reward += reward
        if done:                         # we reached the goal
            break                        # end this episode early
```
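The loop calls `agent.choose_action()` and `agent.learn()`, which the snippet doesn't define. A minimal tabular Q-learning agent along these lines might look like the sketch below; the class name `QLearningAgent`, the ε-greedy exploration strategy, and the hyperparameter values (α, γ, ε) are illustrative assumptions, not the page's actual code.

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    def __init__(self, n_actions=4, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.n_actions = n_actions
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.epsilon = epsilon        # exploration probability
        self.q = defaultdict(float)   # Q[(state, action)] -> value, defaults to 0

    def choose_action(self, state):
        # Explore with probability epsilon, otherwise act greedily on current Q-values
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        values = [self.q[(state, a)] for a in range(self.n_actions)]
        return values.index(max(values))

    def learn(self, state, action, reward, next_state):
        # Standard Q-learning update:
        # Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]
        best_next = max(self.q[(next_state, a)] for a in range(self.n_actions))
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])
```

Note that `learn()` runs once per step, which is exactly why each game gives multiple learning opportunities: an 8-step episode produces eight Q-table updates.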
Think of it like:
- Episodes = Number of games played
- Steps = Number of moves in each game
- The agent gets better by playing many games (episodes)
- Each game gives multiple learning opportunities (steps)
The `max_steps = 100` limit is important because:
- It prevents infinite loops if the agent gets stuck
- It encourages the agent to find efficient paths
- It gives the agent a fresh start if it's wandering aimlessly (a full training sketch follows this list)
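Putting the pieces together, here is a hedged sketch of a complete training run that also reports how long each episode takes. Using the GridWorld and QLearningAgent sketches above, the reported step counts typically shrink from near the 100-step cap toward the optimal 8 moves as learning progresses; the reporting interval and print format are illustrative assumptions.

```python
# Assumes the GridWorld and QLearningAgent sketches defined earlier on this page.
env = GridWorld()
agent = QLearningAgent()
episodes, max_steps = 1000, 100

for episode in range(episodes):
    state = env.reset()
    for step in range(max_steps):
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        state = next_state
        if done:
            break
    if (episode + 1) % 100 == 0:
        # step counts from 0, so step + 1 is the number of moves taken this episode
        print(f"Episode {episode + 1}: finished in {step + 1} steps "
              f"({'reached goal' if done else 'hit the step limit'})")
```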