Q learning
tags:
- 🌱
- AI
- ComputerScience
date: 20-Feb-2023
- Learn the optimal action for each state
- State transition probability not known
- Use experience to learn the best action
- Utilise temporal difference to make updates
- At state s, take action a
- Observe the next state s' and the reward r
- A single-step lookahead is used (the temporal difference at each step)
- $\displaystyle Q_{new}(s,a) = Q_{old}(s,a) + \alpha \left(R + \gamma \max_{a'} Q_{old}(s',a') - Q_{old}(s,a)\right)$
- Equivalently, $\displaystyle Q_{new}(s,a) = (1-\alpha)Q_{old}(s,a) + \alpha \left(R + \gamma \max_{a'} Q_{old}(s',a')\right)$ (see the sketch after this list)
  - $(1-\alpha)Q_{old}(s,a)$ - represents how fast the old value is forgotten
  - $\alpha \left(R + \gamma \max_{a'} Q_{old}(s',a')\right)$ - represents how fast the new values are learned
- Epsilon-soft policy
  - Used in action_function(s) to select an action, keeping a non-zero probability for every action so exploration continues (as shown in the sketch below)
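
A minimal Python sketch of the update rule and the epsilon-soft selection above, assuming a tabular Q stored as a NumPy array; the state/action counts and the values of `alpha`, `gamma`, and `epsilon` are illustrative placeholders, not part of the notes.

```python
import numpy as np

n_states, n_actions = 5, 2           # illustrative sizes, not from the notes
Q = np.zeros((n_states, n_actions))  # tabular Q(s, a)
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def action_function(s):
    # Epsilon-soft selection: explore with probability epsilon,
    # otherwise pick the greedy action for state s.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(s, a, R, s_next):
    # Q_new(s,a) = Q_old(s,a) + alpha * (R + gamma * max_a' Q_old(s',a') - Q_old(s,a))
    td_target = R + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```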
```
For each episode
    Initialise state s
    For each step in episode
        action <- action_function(s)
        Take action, observe s' and reward R
        Q_new(s,a) <- Q_old(s,a) + alpha*(R + gamma*max_a'(Q_old(s',a')) - Q_old(s,a))
        s <- s'
    End
End
```
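
The loop above, sketched in Python under the assumption of an environment object with Gymnasium-style `reset()` and `step(action)` methods; it reuses the hypothetical `Q`, `action_function`, and `q_update` from the earlier sketch, and the episode count is an arbitrary placeholder.

```python
def train(env, n_episodes=500):
    # Q-learning: run episodes, updating Q after every step.
    for _ in range(n_episodes):
        s, _ = env.reset()                    # initialise state s
        done = False
        while not done:
            a = action_function(s)            # epsilon-soft action selection
            s_next, R, terminated, truncated, _ = env.step(a)
            q_update(s, a, R, s_next)         # temporal-difference update
            s = s_next                        # move to the next state
            done = terminated or truncated
    return Q
```

Decaying `epsilon` and `alpha` over episodes is a common refinement, but fixed values keep the sketch close to the pseudocode above.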
Links: