Q Learning - Falmouth-Games-Academy/comp250-wiki GitHub Wiki
What is Q-Learning
Q-Learning is a reinforcement learning algorithm centered around finding the best possible action to take in the current state. It is considered off-policy because it can learn from actions that are outside the current policy, such as actions chosen at random. More specifically, the algorithm seeks to learn a policy that maximizes total reward.
The Q
The 'Q' in Q-Learning stands for quality. In this context, quality represents how useful a given action is for gaining future reward.
Q-Tables
When Q-Learning is executed, a table (or matrix) is created with the shape [state, action], and all of its values are initialized to zero. The values (referred to as Q-Values) are updated and stored in the Q-Table after each iteration of the algorithm. The agent then uses the table as a reference to select the action with the maximum expected reward.
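As a minimal sketch of the table described above, the snippet below builds a zero-initialized Q-Table from plain Python lists and looks up the best action for a state. The sizes `N_STATES` and `N_ACTIONS`, the helper names, and the hand-written value `0.5` are assumptions made for illustration only.

```python
# Hypothetical sizes, chosen only for illustration.
N_STATES = 4
N_ACTIONS = 2


def make_q_table(n_states, n_actions):
    """Return an n_states x n_actions table with every Q-Value set to zero."""
    return [[0.0] * n_actions for _ in range(n_states)]


def best_action(q_table, state):
    """Return the index of the action with the highest Q-Value in `state`."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])


q_table = make_q_table(N_STATES, N_ACTIONS)

# Pretend the agent has already learned that action 1 is useful in state 2.
q_table[2][1] = 0.5

chosen = best_action(q_table, 2)
```

Here `chosen` is the action the agent would pick in state 2 if it always followed the table. When several actions share the same Q-Value, as in a freshly initialized row, this helper returns the first of them.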
Why Q-Learning
Imagine that, before performing an action, we knew the expected reward for every step. This could be considered a 'cheat sheet' for the agent, allowing it to select the most rewarding action to perform.
Eventually the agent will perform the sequence of actions that generates the greatest possible total reward. The strategy to achieve this is outlined as:
$$Q(s,a) = r(s,a) + \gamma \max_{a} Q(s',a)$$
Source: Ankit Choudhary, https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/
In this formula, the Q-Value obtained by being in state s and performing action a is equal to the immediate reward r(s,a) plus the highest Q-Value possible from the next state s', multiplied by gamma. Gamma is the discount factor that controls how much rewards in the future contribute.
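The relationship described above can be checked with a small calculation. This is only a sketch: the function name `bellman_target`, the discount factor of 0.9, and the example numbers are assumptions, not values from the source.

```python
GAMMA = 0.9  # assumed discount factor


def bellman_target(reward, next_q_values, gamma=GAMMA):
    """Immediate reward plus the discounted best Q-Value of the next state."""
    return reward + gamma * max(next_q_values)


# Made-up example: reward of 1.0, and three actions available in the next
# state with Q-Values 0.0, 2.0 and 0.5. The best of those is 2.0.
target = bellman_target(1.0, [0.0, 2.0, 0.5])  # 1.0 + 0.9 * 2.0, roughly 2.8
```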
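Q(s',a) in turn depends on Q(s'',a), which will then have a coefficient of gamma (the discount factor) squared. Therefore the Q-Value depends on the Q-Values of future states, as shown below:
$$Q(s,a) \rightarrow \gamma Q(s',a) + \gamma^{2} Q(s'',a) + \dots + \gamma^{n} Q(s^{(n)},a)$$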
Source: Ankit Choudhary, https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/
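Adjusting the value of the discount factor (gamma) will increase or diminish the contribution of future rewards. A value close to 0 makes the agent focus on immediate rewards, while a value close to 1 makes it weigh distant rewards almost as heavily as immediate ones.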
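To illustrate how gamma scales rewards that arrive later, the sketch below weights the reward at step k by gamma to the power k and sums the result. The helper name `discounted_return` and the reward sequence are made up for this example.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step k by gamma ** k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


rewards = [1.0, 1.0, 1.0, 1.0]  # made-up sequence: one unit of reward per step

# With a small gamma, later rewards barely count (total is about 1.111).
short_sighted = discounted_return(rewards, 0.1)

# With a large gamma, later rewards count almost fully (total is about 3.439).
far_sighted = discounted_return(rewards, 0.9)
```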
As this equation is recursive, we can begin by making arbitrary assumptions for all Q-Values. With time and iterations, they will, in theory, converge to the Q-Values of the optimal policy. This can be expressed as:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r(s,a) + \gamma \max_{a} Q(s',a) - Q(s,a) \right]$$
Source: Ankit Choudhary, https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/
Alpha is the learning rate. It determines the extent to which new information overrides old information: with alpha at 0 the Q-Values never change, and with alpha at 1 each update replaces the old value entirely.
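Putting the pieces together, the following is a minimal sketch of tabular Q-Learning, not a definitive implementation. The environment is a hypothetical five-state corridor invented for this example, in which the agent starts at state 0 and receives a reward of 1.0 on reaching state 4. The values of `ALPHA` and `GAMMA`, the episode count, and all function names are likewise assumptions.

```python
import random

ALPHA = 0.5       # learning rate (assumed value)
GAMMA = 0.9       # discount factor (assumed value)
N_STATES = 5      # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical environment: reward 1.0 only when the goal is reached."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_update(q_table, state, action, reward, next_state,
             alpha=ALPHA, gamma=GAMMA):
    """Move Q(s,a) a fraction alpha of the way towards the new estimate."""
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])


def train(episodes=200, seed=0):
    """Fill a Q-Table while the agent acts completely at random."""
    rng = random.Random(seed)
    q_table = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # random behaviour, not the greedy one
            next_state, reward = step(state, action)
            q_update(q_table, state, action, reward, next_state)
            state = next_state
    return q_table


q_table = train()

# Read the learned policy off the table: the best action in each non-goal state.
greedy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

Although the agent only ever moves at random during training, the policy read from the table afterwards favours moving right towards the goal. This is the off-policy property in practice: the policy being learned differs from the one used to gather experience.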