Q Learning - Falmouth-Games-Academy/comp250-wiki GitHub Wiki

What is Q-Learning

Q-Learning is a reinforcement learning algorithm centered around finding the best possible action to take in the current state. It is considered off-policy because it learns from actions that are outside the current policy, such as random exploratory actions. More specifically, the algorithm seeks to learn a policy that maximizes the total reward.

The Q

The 'Q' in Q-Learning stands for quality. In this context, quality represents how useful a given action is for gaining future reward.

Q-Tables

When Q-Learning is executed, a table or matrix is created with the shape [state, action], and its values are initialized to zero. These values (referred to as Q-Values) are then updated in the Q-Table after each iteration of the algorithm. The table then serves as a reference from which the agent selects the action with the highest expected reward.
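As a minimal sketch of the table described above, the snippet below builds a zero-initialized Q-Table with NumPy and looks up the best action for a state. The sizes `N_STATES` and `N_ACTIONS`, the helper `best_action`, and the hand-written Q-Value are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical sizes for illustration; a real task defines its own.
N_STATES = 5
N_ACTIONS = 2

# Q-Table of shape [state, action], with every Q-Value initialized to zero.
q_table = np.zeros((N_STATES, N_ACTIONS))


def best_action(q_table, state):
    """Return the index of the action with the highest Q-Value in `state`."""
    return int(np.argmax(q_table[state]))


# Pretend the agent has already learned that action 1 is useful in state 3.
q_table[3, 1] = 0.8

print(q_table.shape)            # (5, 2)
print(best_action(q_table, 3))  # 1
```

Note that `np.argmax` returns the first index on ties, so in a state whose Q-Values are all still zero this sketch always picks action 0.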

Why Q-Learning

Imagine that we know the expected reward of every action at each step. This could be considered a 'cheat sheet' for the agent, allowing it to select the most rewarding action to perform.

Eventually, the agent will perform the sequence of actions that generates the greatest possible total reward. The strategy to achieve this is outlined as:

Q-Learning Formula 1:

$$Q(s, a) = r(s, a) + \gamma \max_{a} Q(s', a)$$

Source: Ankit Choudhary, https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/

In this formula, the Q-Value of being in state s and performing action a is equal to the immediate reward r(s, a) plus the highest Q-Value possible from the next state s', scaled by gamma. Gamma is the discount factor, which controls the contribution of future rewards.
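The relationship described above can be sketched in a few lines of Python. The function name `q_target` and the example numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def q_target(reward, next_q_values, gamma):
    """Immediate reward plus the discounted best Q-Value of the next state."""
    return reward + gamma * float(np.max(next_q_values))


# Hypothetical Q-Values for the three actions available in the next state s'.
next_q_values = np.array([1.0, 3.0, 2.0])

# 1.0 + 0.5 * 3.0
print(q_target(reward=1.0, next_q_values=next_q_values, gamma=0.5))  # 2.5
```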

Q(s', a) in turn depends on Q(s'', a), which therefore contributes to Q(s, a) with a coefficient of gamma (the discount factor) squared. The Q-Value thus depends on the Q-Values of future states, as shown below:

Q-Learning Formula 2:

$$Q(s, a) \rightarrow \gamma Q(s', a) + \gamma^{2} Q(s'', a) + \dots + \gamma^{n} Q(s^{(n)}, a)$$

Source: Ankit Choudhary, https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/

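Adjusting the value of the discount factor (gamma) will increase or diminish the contribution of future rewards: a value close to 0 makes the agent favor immediate rewards, while a value close to 1 gives distant rewards nearly as much weight as immediate ones.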
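To illustrate the effect of gamma, the sketch below sums the same reward sequence under three discount factors, weighting the reward at step k by gamma to the power k. The helper `discounted_return` and the reward sequence are assumptions for illustration, not part of any particular library.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step k by gamma ** k."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total


# Hypothetical sequence: a reward of 1.0 on each of four consecutive steps.
rewards = [1.0, 1.0, 1.0, 1.0]

print(discounted_return(rewards, 0.0))  # 1.0   (only the immediate reward counts)
print(discounted_return(rewards, 0.5))  # 1.875 (later rewards count for less)
print(discounted_return(rewards, 1.0))  # 4.0   (all rewards count equally)
```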

As this equation is recursive, we can begin by making arbitrary assumptions for all Q-Values. With time and iterations, the estimates will, in theory, eventually converge to the optimal Q-Values, from which the optimal policy follows. This can be expressed as:

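Q-Learning Formula 3:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a} Q(s', a) - Q(s, a) \right]$$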

Source: Ankit Choudhary, https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/

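Alpha is the learning rate. It determines the extent to which new information overrides old information: with alpha set to 0 the Q-Values never change, while with alpha set to 1 each update fully replaces the old value with the new estimate.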
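Putting the pieces together, the following is a minimal sketch of tabular Q-Learning, not a definitive implementation. The corridor environment, the `step` and `q_update` helpers, the epsilon-greedy exploration scheme, and the values chosen for alpha, gamma, and epsilon are all assumptions made for illustration.

```python
import numpy as np


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-Learning update to q_table in place."""
    target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (target - q_table[state, action])


# Hypothetical toy environment: a corridor of 5 states with the goal at the right end.
N_STATES, N_ACTIONS = 5, 2  # actions: 0 = left, 1 = right
GOAL = N_STATES - 1


def step(state, action):
    """Move left or right; the reward is 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # assumed values, not tuned

for episode in range(200):
    state = 0
    done = False
    while not done:
        # Epsilon-greedy: mostly exploit the table, sometimes explore at random.
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done = step(state, action)
        q_update(q_table, state, action, reward, next_state, alpha, gamma)
        state = next_state

# Best action in each non-goal state after training.
print(np.argmax(q_table[:GOAL], axis=1))  # [1 1 1 1]
```

The random exploratory actions are what make this off-policy: the agent sometimes behaves randomly, yet every update uses the maximum Q-Value of the next state, as if the greedy action would be taken from there on.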