Q learning
tags:
- 🌱
- AI
- ComputerScience
date: 20-Feb-2023
- Learn the optimal action for each state
- State transition probability not known
- Use experience to learn the best action
- Utilise temporal difference to make updates
- At state s, take action a
- Observe the next state s' and the reward r
- A single-step lookahead is used (the temporal difference at each step)
- $\displaystyle Q_{new}(s,a) = Q_{old}(s,a) + \alpha \left(R + \gamma \max_{a'} Q_{old}(s',a') - Q_{old}(s,a)\right)$
- Equivalently, $\displaystyle Q_{new}(s,a) = (1-\alpha)Q_{old}(s,a) + \alpha \left(R + \gamma \max_{a'} Q_{old}(s',a')\right)$ (see the sketch after this list)
  - $(1-\alpha)Q_{old}(s,a)$ - represents how fast the old value is forgotten
  - $\alpha \left(R + \gamma \max_{a'} Q_{old}(s',a')\right)$ - represents how fast the new values are learned
- Epsilon-soft policy
  - Used in action_function(s) to select an action, keeping a non-zero probability for every action so exploration continues (as shown in the sketch below)
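
A minimal Python sketch of the update rule and the epsilon-soft selection above, assuming a tabular Q stored as a NumPy array; the state/action counts and the values of `alpha`, `gamma`, and `epsilon` are illustrative placeholders, not part of the notes.

```python
import numpy as np

n_states, n_actions = 5, 2           # illustrative sizes, not from the notes
Q = np.zeros((n_states, n_actions))  # tabular Q(s, a)
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def action_function(s):
    # Epsilon-soft selection: explore with probability epsilon,
    # otherwise pick the greedy action for state s.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(s, a, R, s_next):
    # Q_new(s,a) = Q_old(s,a) + alpha * (R + gamma * max_a' Q_old(s',a') - Q_old(s,a))
    td_target = R + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```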
```
For each episode
    Initialise state s
    For each step in episode
        action <- action_function(s)
        Take action, observe s' and reward R
        Q_new(s,a) <- Q_old(s,a) + alpha*(R + gamma*max_a'(Q_old(s',a')) - Q_old(s,a))
        s <- s'
    End
End
```
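
The loop above, sketched in Python under the assumption of an environment object with Gymnasium-style `reset()` and `step(action)` methods; it reuses the hypothetical `Q`, `action_function`, and `q_update` from the earlier sketch, and the episode count is an arbitrary placeholder.

```python
def train(env, n_episodes=500):
    # Q-learning: run episodes, updating Q after every step.
    for _ in range(n_episodes):
        s, _ = env.reset()                    # initialise state s
        done = False
        while not done:
            a = action_function(s)            # epsilon-soft action selection
            s_next, R, terminated, truncated, _ = env.step(a)
            q_update(s, a, R, s_next)         # temporal-difference update
            s = s_next                        # move to the next state
            done = terminated or truncated
    return Q
```

Decaying `epsilon` and `alpha` over episodes is a common refinement, but fixed values keep the sketch close to the pseudocode above.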
Links: