Bellman equation

Related pages: Q-learning algorithm, mathematical proof of convergence

I'll explain the Bellman equation in an intuitive way for beginners, using simple examples.

The Bellman equation is like a recipe for making good decisions by thinking about both immediate and future rewards. Let's break it down:

V(s) = R(s) + γ * max[V(s')]

Think of it like planning a road trip:

  1. Current Value V(s)

    • This is like asking: "How good is my current position?"
    • Example: You're in Las Vegas
  2. Immediate Reward R(s)

    • This is what you get right now
    • Example: Las Vegas has casinos and shows, so R(Vegas) = 10 fun points
  3. Future Value max[V(s')]

    • This is the best possible outcome from your next stop
    • Example: From Vegas, you could go to:
      • Grand Canyon (value = 15)
      • Los Angeles (value = 12)
      • Salt Lake City (value = 8)
    • max[V(s')] would be 15 (Grand Canyon)
  4. Discount Factor γ (gamma)

    • This is how much you care about future rewards
    • Usually between 0 and 1 (like 0.9)
    • Example:
      • γ = 0.9 means future rewards are 90% as important as immediate ones
      • γ = 0.1 means you mostly care about immediate rewards

Let's solve this example:

V(Vegas) = R(Vegas) + γ * max[V(next_cities)]
V(Vegas) = 10 + 0.9 * 15
V(Vegas) = 10 + 13.5
V(Vegas) = 23.5
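
As a quick check, here is a minimal Python sketch of that calculation; the reward and city values are just the made-up numbers from the example above.

# Toy Bellman backup for the road-trip example (all numbers are illustrative)
gamma = 0.9
R_vegas = 10  # immediate reward for being in Vegas
next_city_values = {
    "Grand Canyon": 15,
    "Los Angeles": 12,
    "Salt Lake City": 8,
}

V_vegas = R_vegas + gamma * max(next_city_values.values())
print(V_vegas)  # 10 + 0.9 * 15 = 23.5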

For Q-learning specifically, we use a variation:

Q(s,a) = R(s,a) + γ * max[Q(s',a')]

Here's the difference (sketched in code after this list):

  • Instead of just states (V), we look at state-action pairs (Q)
  • Example:
    • Q(Vegas, "go to Grand Canyon")
    • Q(Vegas, "go to Los Angeles")
    • Q(Vegas, "go to Salt Lake City")
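
To make the state-action pairs concrete, here is a hypothetical slice of a Q-table for the trip example; the numbers are illustrative, not learned values.

# Hypothetical Q-table entries, keyed by (state, action) pairs
Q = {
    ("Vegas", "go to Grand Canyon"): 13.5,
    ("Vegas", "go to Los Angeles"): 10.8,
    ("Vegas", "go to Salt Lake City"): 7.2,
}

# The greedy choice in a state is the action with the highest Q-value
vegas_actions = {a: q for (s, a), q in Q.items() if s == "Vegas"}
best_action = max(vegas_actions, key=vegas_actions.get)
print(best_action)  # go to Grand Canyon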

Real example from our grid world:

# Let's say you're one step away from the goal
current_state = (4, 3)  # Next to the goal
action = "move right"   # Moves onto the goal
reward = 0              # Step reward for this move
next_state = (4, 4)     # Goal state, where the best available Q-value is 1.0

Q(current_state, action) = 0 + 0.9 * 1.0
                         = 0.9

Key insights:

  1. The equation balances immediate vs future rewards
  2. It helps calculate the true value of each decision
  3. By repeatedly applying this equation, the agent learns optimal behavior

Think of it like this:

  • If you're planning a vacation, you consider:
    • How fun is this place? (immediate reward)
    • Where can I go from here? (future value)
    • How much do I care about the next destination? (discount factor)

In practice, we add a learning rate α (alpha). The update is driven by the term:

R + γ max(Q(s', a')) - Q(s, a)

  • the difference between the new estimate and the old one
  • the update scales this difference by the learning rate α (alpha)

Let's break down the Q-learning formula in simple terms:

Formula:
Q(s, a) = Q(s, a) + α [R + γ max(Q(s', a')) - Q(s, a)]

Imagine you're learning to navigate a maze:

  1. Q(s, a) = Your current "score" for taking action a in state s
    (Like remembering "If I turn left at this corridor, how good is that?")

  2. α (Learning Rate) = How quickly you update your knowledge
    (Small α = slow learner, big α = quick to change opinions)

  3. R (Reward) = Immediate candy you get for your action
    (Found a cookie? +1. Hit a wall? -1)

  4. γ (Discount Factor) = How much you care about future rewards
    (γ = 0.9 means "Future candy is 90% as valuable as candy now")

  5. max(Q(s', a')) = Best possible score in the new state s'
    (Looking ahead: "From my new position, what's my best move?")

How it works:

  1. Take action a in state s ➔ get reward R and move to s'
  2. Calculate error: (What you got + Future potential) - What you expected
  3. Update your original guess slightly using this error (see the sketch below)
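
Those three steps map onto just a few lines of Python. This is a minimal sketch with made-up numbers for the maze example (a cookie reward, plus a guessed best value for the new position):

# One Q-learning update for the maze example (numbers are illustrative)
alpha = 0.1    # learning rate: how fast we update
gamma = 0.9    # discount factor: how much future reward counts

q_old = 0.5        # current estimate Q(s, a) for "turn left at this corridor"
reward = 1.0       # immediate reward: found a cookie
max_q_next = 0.8   # best Q-value estimate from the new position s'

td_error = reward + gamma * max_q_next - q_old   # (what you got + future potential) - what you expected
q_new = q_old + alpha * td_error
print(round(q_new, 3))  # 0.622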

Real-life analogy:
Imagine guessing the price of a used bike:

  • Original guess: $200 (your Q-value)
  • Someone tells you: "It actually sold for $250, and similar bikes sell for $300"
  • You adjust: New price = 200 + 0.1 * (250 + 0.9 * 300 - 200)
    = 200 + 0.1 * (250 + 270 - 200)
    = 200 + 0.1 * 320
    = 200 + 32 = $232
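
The same arithmetic, checked in Python:

# Bike-price analogy: same update rule, different numbers
alpha, gamma = 0.1, 0.9
old_guess = 200        # your current Q-value
sale_price = 250       # the immediate "reward" you just observed
similar_bikes = 300    # best estimate of future value

new_guess = old_guess + alpha * (sale_price + gamma * similar_bikes - old_guess)
print(new_guess)  # 232.0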

Key takeaways:

  • The formula helps learn from both immediate results (R) and future possibilities (max(Q(s', a')))
  • Updates happen gradually (controlled by α)
  • Future rewards matter less than immediate ones (controlled by γ)
  • The agent constantly improves its "cheat sheet" (Q-table) of what actions work best

This is how AI learns to make good decisions through trial and error!
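
Putting the pieces together, here is a minimal, self-contained sketch of tabular Q-learning on a tiny made-up corridor world (5 cells in a row, the goal at the right end gives reward 1). The corridor, its rewards, and the hyperparameters are illustrative choices for this sketch, not something defined earlier on this page.

import random

# Tiny hypothetical 1-D corridor: states 0..4, goal at state 4; actions: 0 = left, 1 = right
N_STATES, GOAL = 5, 4
ACTIONS = [0, 1]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}  # the Q-table ("cheat sheet")

def step(state, action):
    # Move left or right, stay inside the corridor; reward 1.0 only for reaching the goal
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def greedy(state):
    # Best-known action in this state, breaking ties randomly
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for episode in range(200):
    state = 0
    while state != GOAL:
        # Epsilon-greedy: mostly exploit the Q-table, sometimes explore
        action = random.choice(ACTIONS) if random.random() < epsilon else greedy(state)
        next_state, reward = step(state, action)
        # The update from above: Q(s,a) += alpha * (R + gamma * max Q(s',a') - Q(s,a))
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, print the learned greedy action for each non-goal cell
print({s: greedy(s) for s in range(N_STATES - 1)})

Running it prints the greedy action per non-goal cell, which should be 1 ("move right") everywhere once the Q-table has converged.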