

tags:

  • 🌱
  • AI
  • ComputerScience

date: 20-Feb-2023

Q learning

  • Learn the optimal action for each state
    • State transition probability not known
    • Use experience to learn the best action
  • Utilise temporal difference to make updates
    • At state s, take action a
    • Observe the next state s' and the reward r
    • Only a one-step lookahead is used (the temporal difference between successive steps)
  • $\displaystyle Q_{new}(s,a) = Q_{old}(s,a) + \alpha\left(R + \gamma \max_{a'} Q_{old}(s',a') - Q_{old}(s,a)\right)$ (a short Python sketch follows this list)
    • Equivalently, $\displaystyle Q_{new}(s,a) = (1-\alpha)Q_{old}(s,a) + \alpha\left(R + \gamma \max_{a'} Q_{old}(s',a')\right)$
    • $(1-\alpha)Q_{old}(s,a)$
      • Controls how quickly the old value is forgotten (a larger $\alpha$ discards it faster)
    • $\alpha\left(R + \gamma \max_{a'} Q_{old}(s',a')\right)$
      • Controls how quickly the new target (immediate reward plus discounted best next-state value) is learned
  • Epsilon-soft policy
    • Used in action_function(s) to select an action: with probability $\epsilon$ a random action is explored, otherwise the action with the highest Q value is chosen, so every action keeps some chance of being tried
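
A minimal Python sketch of the update rule and an epsilon-soft action selection. It assumes the Q-table is a plain dict keyed by (state, action) and that the set of actions is known; the names `q_update` and `epsilon_soft_action` and the default values of `alpha`, `gamma`, and `epsilon` are illustrative, not from the notes.

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning (temporal difference) update for the pair (s, a)."""
    # Best achievable value from the next state under the current estimates
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    # Equivalent to Q_new = (1 - alpha) * Q_old + alpha * (R + gamma * max_a' Q_old(s', a'))
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

def epsilon_soft_action(Q, s, actions, epsilon=0.1):
    """Epsilon-soft policy: explore with probability epsilon, else act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

The two functions map directly onto the bullets above: `q_update` implements the temporal-difference formula, and `epsilon_soft_action` plays the role of action_function(s).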

Pseudocode

Initialise Q(s,a) arbitrarily for all states s and actions a
For each episode
    Initialise s
    For each step in episode (until s is terminal)
        a <- action_function(s)                    (epsilon-soft selection)
        Take action a, observe reward R and next state s'
        Q(s,a) <- Q(s,a) + alpha*(R + gamma*max_a'(Q(s',a')) - Q(s,a))
        s <- s'
    End
End
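
A runnable end-to-end sketch of the pseudocode above on a small hand-made chain environment. The environment, its reward of 1 for reaching the rightmost state, and all hyperparameter values are illustrative assumptions, not part of the original notes.

```python
import random

# Hypothetical 5-state chain: actions move LEFT/RIGHT, reward 1 only for reaching state 4.
N_STATES, ACTIONS = 5, [0, 1]          # 0 = left, 1 = right

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    done = s_next == N_STATES - 1
    return s_next, reward, done

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    s = 0
    done = False
    while not done:
        # Epsilon-soft action selection
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s_next, R, done = step(s, a)
        # Temporal-difference update
        target = R + gamma * max(Q[(s_next, x)] for x in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

# After learning, the greedy policy should move right in every non-terminal state
print([max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(N_STATES - 1)])
```

Because the transition function `step` is never read by the learner, only sampled, this matches the model-free setting described at the top of the page.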
