# Q learning algorithm: episodes and steps

Let me explain episodes and steps using concrete examples from the GridWorld code:

An episode is one complete attempt by the agent to reach the goal - like one game or one try. It ends when either:

  1. The agent reaches the goal (success!)
  2. The maximum steps (100) are reached (failure)

A step is one single move within an episode - like one turn in a game.

Here are two concrete examples of episodes and their steps:

```
Episode 1:
Step 1: Agent at (0,0) → moves RIGHT → (0,1), reward = 0
Step 2: Agent at (0,1) → moves DOWN → (1,1), reward = 0
Step 3: Agent at (1,1) → moves RIGHT → (1,2), reward = 0
Step 4: Agent at (1,2) → moves DOWN → (2,2), reward = 0
...
Step 8: Agent at (4,3) → moves RIGHT → (4,4), reward = 1 (GOAL REACHED!)
Episode ends after 8 steps (Success)
```

```
Episode 2:
Step 1: Agent at (0,0) → moves UP → stays at (0,0), reward = 0
Step 2: Agent at (0,0) → moves LEFT → stays at (0,0), reward = 0
Step 3: Agent at (0,0) → moves DOWN → (1,0), reward = 0
...
Step 100: Maximum steps reached without finding goal
Episode ends after 100 steps (Failure)
```
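
These traces follow directly from how the environment moves the agent. Here is a minimal sketch of a `GridWorld` that would reproduce them; the 5×5 size, the (row, column) coordinates, and the string action names are assumptions inferred from the examples above, not the original code:

```python
class GridWorld:
    """Minimal 5x5 grid sketch: start at (0,0), goal at (4,4)."""

    MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.state = (0, 0)

    def reset(self):
        # Called at the start of every episode.
        self.state = (0, 0)
        return self.state

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.state
        # Moves off the grid leave the agent in place,
        # as in steps 1-2 of Episode 2 above.
        nr = min(max(r + dr, 0), self.size - 1)
        nc = min(max(c + dc, 0), self.size - 1)
        self.state = (nr, nc)
        done = self.state == self.goal  # Reaching the goal ends the episode...
        reward = 1 if done else 0       # ...and is the only source of reward.
        return self.state, reward, done
```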

In the code, this structure appears here:

```python
for episode in range(episodes):  # Do 1000 episodes
    state = env.reset()  # Start at (0,0)
    total_reward = 0

    for step in range(max_steps):  # Maximum 100 steps per episode
        action = agent.choose_action(state)  # Pick a move
        next_state, reward, done = env.step(action)  # Make the move

        # Learn from what happened
        agent.learn(state, action, reward, next_state)

        total_reward += reward  # Accumulate this episode's reward
        state = next_state  # Continue from the new state

        if done:  # If we reached the goal
            break  # End this episode early
```
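
The `agent` itself isn't shown above. Here is a plausible sketch of what its two methods could look like for a tabular Q-learning agent with epsilon-greedy exploration; the class name, hyperparameter values, and string actions are illustrative assumptions:

```python
import random
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions        # e.g., ["UP", "DOWN", "LEFT", "RIGHT"]
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.epsilon = epsilon        # exploration rate
        self.q = defaultdict(float)   # Q[(state, action)] -> estimated value

    def choose_action(self, state):
        # Epsilon-greedy: usually exploit the best known move, sometimes explore.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def learn(self, state, action, reward, next_state):
        # Standard tabular Q-learning update:
        # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

Note that `learn` is called once per step, which is why a single episode with many steps gives the agent many small updates.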

Think of it like:

  • Episodes = Number of games played
  • Steps = Number of moves in each game
  • The agent gets better by playing many games (episodes), as the sketch after this list shows
  • Each game gives multiple learning opportunities (steps)
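
One way to watch this improvement happen, assuming the hypothetical `GridWorld` and `QLearningAgent` sketches above: record how many steps each episode takes and compare early training with late training. The step counts should fall from near the 100-step cap toward the 8-step optimum:

```python
env = GridWorld()
agent = QLearningAgent(actions=["UP", "DOWN", "LEFT", "RIGHT"])
steps_taken = []

for episode in range(1000):
    state = env.reset()
    for step in range(100):
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        state = next_state
        if done:
            break
    steps_taken.append(step + 1)  # 100 if the episode was truncated

# Average steps per episode, early vs. late in training.
print("first 100 episodes:", sum(steps_taken[:100]) / 100)
print("last 100 episodes:", sum(steps_taken[-100:]) / 100)
```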

The `max_steps = 100` limit is important because:

  1. It prevents infinite loops if the agent gets stuck
  2. It encourages the agent to find efficient paths
  3. It gives the agent a fresh start if it's wandering aimlessly (the sketch below shows how the loop tells the two outcomes apart)
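
A small sketch (reusing `env`, `agent`, and `max_steps` from above) that makes the success/failure distinction explicit; Python's `for`/`else` branch runs only when the loop exhausts all 100 steps without a `break`:

```python
state = env.reset()
for step in range(max_steps):
    action = agent.choose_action(state)
    next_state, reward, done = env.step(action)
    agent.learn(state, action, reward, next_state)
    state = next_state
    if done:
        print(f"Success: reached the goal in {step + 1} steps")
        break
else:
    # No break occurred: the step budget ran out (the failure case above),
    # and the next env.reset() gives the agent its fresh start.
    print("Failure: max_steps reached without finding the goal")
```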

Would you like me to:

  1. Show more detailed episode examples?
  2. Explain how learning happens within each step?
  3. Demonstrate how episodes help improve the agent's performance over time?