Part 6: Reinforcement Learning

What is Reinforcement Learning in ML?

(Source: Ali Mousavi, 2020)

Reinforcement learning is learning what to do, i.e. how to map situations to actions, so as to maximize a numerical reward signal for a given task or problem statement. The learner is not told which actions to take but instead "learns" to discover which actions yield the maximum reward.

Some of the key terms are,

  • Agent: Entity that perceives and explores the environment and acts upon it.
  • Environment: Situation in which the agent is present or surrounded by and learns from.
  • Action: Moves taken by the agent within the environment.
  • State: Situation returned by the environment after each action taken by the agent.
  • Reward: Feedback returned to the agent from the environment to evaluate the action of the agent.
  • Policy: Strategy applied by the agent for the next action based on the current state, and it can be a simple function or a lookup table.
  • Value: Expected long-term return under discounting, as opposed to the short-term reward.
  • Discount: Factor that scales down future rewards (e.g. an adjustment for a short-sighted robot that cannot optimize the total reward because of uncertainty about when it will be powered off).
  • Model: Mimics the behavior of the environment so that inferences can be made on how the environment will behave with the help of the model, which is useful for planning in consideration of all future situations before implementation.

Types of tasks,

  • Episodic: Have a terminal state and each episode is independent (e.g. racing game).
  • Continuous: Does not have a terminal state and the task goes on indefinitely (e.g. continually improving at coding).

Approaches in Implementation

(Source: PRATHIMA23, 2021)

(Source: Lilian Weng, 2018)

There are mainly three ways to implement,

  1. Value-based: The approach is to find the optimal value function, i.e. the one that gives the maximum value at a state under any policy; the value is the long-term return the agent expects from any state $s$ under its policy.

    • Examples are Q-learning, SARSA, and TD learning.
  2. Policy-based: The approach is to find the optimal policy that gives the maximum future rewards without using the value function. The agent tries to apply such a policy that the action performed in each step helps to maximize the future reward.

    • Algorithms that learn about a policy other than the one used to generate behaviour are called off-policy (and the converse, learning about the policy being followed, is called on-policy).
    • There are mainly two types of policy:
      • Deterministic (Determined): The same action is produced by the policy at any state.
      • Stochastic (Random): Probability determines the produced action.
    • An example is the REINFORCE method.
  3. Model-based: The approach is to use a virtual model created for the environment, and let the agent explore the environment to learn from it.

    • Examples are the policy and value iterative methods.

A combined value- and policy-based approach is the Actor-Critic algorithm (a policy gradient algorithm), where the policy (the actor) determines the action to take in any state, and the state-action value function (the critic) "criticizes" the actor's action by assigning it a quality value.

Why does it differ from Supervised and Unsupervised learning?

(Source: Faizan Shaikh, 2017)

In supervised learning, there is an external "supervisor" who tells the agent how to complete the task, but in reinforcement learning the problem is a combination of subtasks that the agent must perform to achieve the objective. Although there is still a mapping between input and output, reinforcement learning differs in that feedback comes from a reward function rather than from labelled examples as in supervised learning.

In unsupervised learning, there isn't a mapping from input to output; the algorithm's objective is to find patterns in the data set, whereas reinforcement learning is about building a "knowledge graph" based on the constant feedback (positive or negative) received from the reward function.

Markov decision process (MDP)

(Source: Steve Roberts, 2021)

The basic idea is that the entire task is modeled as an MDP, defined by the tuple $(S, A, P, R, \gamma)$.

$S$ is the set of environment and agent states.

$A$ is the set of actions the agent can take.

$P$ is the state transition probability matrix (i.e. the probability that one state transitions to another).

The probability that taking action $a$ in state $s$ at time $t$ leads to state $s'$ at time $t+1$ can be written as

$$P_{ss'}^{a} = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$$

The Markov process is a memory-less random process (e.g. a sequence of random states). The agent starts at state $S_0$ and chooses action $A_0$, which causes a random transition to the next state $S_1$ drawn from $P_{S_0 S_1}^{A_0}$. This process continues from state $S_1$ until the terminal state $S_T$, which gives the following process: $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \ldots, S_T$.

$R$ is the reward function that returns the immediate reward, $R_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.

It takes into account the current state and current action (and, via the matrices provided, the next state) to return the immediate reward. $R_{t+1}$ represents the reward returned at time $t+1$.

$\gamma$ is the discount factor used to scale down long-term rewards.

It has a value between 0 and 1; a value of 0 means that all importance is given to the immediate reward, and a value approaching 1 means more importance is given to future rewards.
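
To make the pieces concrete, here is a minimal sketch of how such an MDP could be laid out in code; the states, actions, transition probabilities, and rewards below are invented purely for illustration.

```python
# A tiny hand-crafted MDP: transition probabilities P[s][a] -> {next_state: probability},
# immediate rewards R[s][a], and a discount factor gamma.
states = ["s0", "s1", "terminal"]
actions = ["left", "right"]

P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s1": 1.0}},
    "s1": {"left": {"s0": 1.0},            "right": {"terminal": 1.0}},
}
R = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 0.0, "right": 5.0},
}
gamma = 0.9  # discount factor, between 0 and 1
```

Later sketches on this page reuse these dictionary shapes.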

What are we trying to achieve?

The objective is for the agent to learn the optimal policy that maximizes the cumulative reward rather than just the immediate reward, and this total sum of rewards is called the return,

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Where $R_{t+1}$ is the reward received by the agent at step $t$ while performing action $A_t$ to transition to the next state $S_{t+1}$.

The discount factor $\gamma$ is added for continuous tasks (think of a geometric series): since the task never ends, the undiscounted sum of rewards may grow to infinity, so a factor is needed to keep the return finite and make the agent weigh future rewards appropriately.

Now onto the policy: it is a probability distribution over actions for each state,

$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$$

This describes that if an agent follows the policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that the agent takes action $a$ at that particular time step.

(Source: Jonathan Hui, 2018 - Value function represented as a matrix of rewards after several iterations)

The value function (also known as the state-value function) determines how good it is for an agent to be in a particular state $s$. When an agent follows the policy $\pi$ for the next states until it reaches the terminal state, it is denoted for any state $s$ as,

$$v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$$

This gives the expected return starting from the initial state and moving to the successor states under the policy $\pi$. The value of a state is not stochastic even if the returns are stochastic, as it is merely the expectation of the returns. The terminal state always has a value of 0.

(Source: baeldung, 2021 - The idea of an action transitioning to the next state and its relevant reward)

The q-function (also known as the state-action value function) determines how good it is for an agent to take a particular action $a$ in a state $s$,

$$q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$$

This gives us the value of performing action $a$ in state $s$ under policy $\pi$.
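
As a small worked example, here is a sketch of computing the discounted return $G_t$ from a list of observed rewards (the reward values are made up):

```python
# Discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
# computed by folding from the end of the reward sequence backwards.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 5.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*5 = 5.05
```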

Bellman Equation

(source: algodaily.com, n.d.)

Dynamic programming (DP) is an algorithmic technique for solving an optimization problem by breaking it down into simpler subproblems and then solving these sub-problems recursively (e.g. fibonacci where fib(n) = fib(n-1) + fib(n-2), for n > 1).

The Bellman equation is a mathematical optimization method associated with DP that expresses a recursive relationship between the values of states, that is, the value of being in state $s$ in terms of the values of the states $s'$ we might transition into. This relationship allows us to form update rules to approximate the state-value and state-action functions of the MDP.

(Source: Yash Bonde, 2019 - State-value function representation)

The Bellman equation for the state-value function is represented as such,

$$v_{\pi}(s) = \sum_{a} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s'} P_{ss'}^{a} \, v_{\pi}(s') \right)$$

(Source: Lilian Weng, 2018 - State-action function representation)

The Bellman equation for the state-action function is represented as such,

$$q_{\pi}(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^{a} \sum_{a'} \pi(a' \mid s') \, q_{\pi}(s', a')$$

Seeing the above diagram, we can see the relations between $v_{\pi}$ and $q_{\pi}$,

$$v_{\pi}(s) = \sum_{a} \pi(a \mid s) \, q_{\pi}(s, a) \qquad q_{\pi}(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^{a} \, v_{\pi}(s')$$
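
As a sketch of how these relations turn into computation, the following performs a single Bellman expectation backup for $v_{\pi}(s)$, assuming the dictionary-shaped P, R, and gamma from the toy MDP above and a stochastic policy stored as pi[s][a] -> probability (all names here are illustrative assumptions):

```python
# One Bellman expectation backup:
# v_pi(s) = sum_a pi(a|s) * (R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')).
def bellman_backup(s, V, pi, P, R, gamma=0.9):
    value = 0.0
    for a, p_a in pi[s].items():
        q_sa = R[s][a] + gamma * sum(prob * V.get(s2, 0.0)  # terminal states default to 0
                                     for s2, prob in P[s][a].items())
        value += p_a * q_sa
    return value
```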

Finding the Optimal Function

(Source: Yash Bonde, 2019 - Backup diagrams. Left is the state-value function, right is the state-action function.)

(Source: Aditya Rastogi, 2019)

The general idea is to find the optimal policy $\pi_*$ such that,

$$v_{\pi_*}(s) = v_*(s) = \max_{\pi} v_{\pi}(s) \quad \text{for all states } s$$

Implementations and Approaches

Here's a general overview of what each approach aims to achieve,

  1. Given a policy $\pi$, what are $v_{\pi}$ and $q_{\pi}$?

  2. Given a policy $\pi$, can we create another policy $\pi'$ such that $v_{\pi'}(s) \ge v_{\pi}(s)$ for all states $s$?

  3. How can we find the optimal policy $\pi_*$?

Types of policies

  • Deterministic: For every state $s$ there is a single clear action $a$ the agent will take (e.g. we know with 100% certainty that action $a$ will be taken from state $s$).

    • One example is $a = \pi(s)$.
  • Stochastic: For every state $s$, there is no single clearly defined action $a$; the actions taken follow a probability distribution (e.g. a 70% chance of action $a_1$, 20% of $a_2$, and 10% of $a_3$).

    • One example is $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$.
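
A quick sketch of how the two kinds of policy might be represented in code (the states, actions, and probabilities are made up):

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_pi = {"s0": "right", "s1": "right"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_pi = {"s0": {"left": 0.3, "right": 0.7}, "s1": {"left": 0.1, "right": 0.9}}

def sample_action(pi, s):
    """Draw an action from a stochastic policy for state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]
```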

Types of approaches

(Source: George Christopoulos et., 2017)

(Source: Sadid Hasan, 2018)

The problem is that we often don't know the model parameters, such as the transition probabilities $P$ and the reward function $R$, so the ways to approach it are,

  • Model-based approach: attempts to learn the model parameters and understand the environment in order to estimate the value function by iterating over every state and action.

    • Examples are Policy Iteration and Value Iteration.
  • Model-free approach: attempts to estimate the best value function through experience, based on the rewards received for each state and action.

    • Examples are the Monte Carlo method and Temporal Difference learning.

Policy Iteration

(Source: Barnabás Póczos, Carnegie Mellon, 2017 - Scenario for 4 states (A,B,C,D) and 2 actions (1,2), where the policies are to choose only action 1, only action 2, or a mix)

The general idea can be summed up as alternating policy evaluation and policy improvement until the policy is stable,

$$\pi_0 \xrightarrow{\text{evaluate}} v_{\pi_0} \xrightarrow{\text{improve}} \pi_1 \xrightarrow{\text{evaluate}} v_{\pi_1} \xrightarrow{\text{improve}} \cdots \longrightarrow \pi_*$$

The algorithm works as such,

  1. Set the initial parameters (e.g. discount factor, convergence threshold).
  2. Start with a new (e.g. arbitrary) policy matrix.
  3. Find the state-value matrix for each state $s$ under the current policy,
    1. Obtain the state value $v(s)$ derived from all actions $a$ that can be taken under the policy.
    2. Update the delta if the difference from the prior value (e.g. $|v_{\text{new}}(s) - v_{\text{old}}(s)|$) is greater than the existing delta.
    3. Tabulate the state-value matrix for the particular state $s$.
    4. Check whether delta has converged to within the convergence threshold.
      • Repeat Step 3 if not converged (i.e. delta > threshold).
  4. Find the greedy policy for each state $s$ and compare it with the current policy,
    1. Obtain the action that maximizes the state-action value $q(s, a)$ over all actions $a$ that can be taken.
    2. Check whether the policy's action matches that maximizing action,
      • Repeat from Step 3 with the improved policy if not stable (i.e. the policy's action does not match the maximizing state-action value for some state $s$ and action $a$).
  5. Return the policy and the state-value matrix.
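
Here is a minimal sketch of the procedure over the illustrative dictionary-based MDP from earlier (P, R, gamma); it uses a deterministic policy and is an assumption-laden toy, not a production implementation:

```python
# Policy iteration: alternate policy evaluation and greedy policy improvement.
def policy_evaluation(pi, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            a = pi[s]
            v_new = R[s][a] + gamma * sum(p * V.get(s2, 0.0) for s2, p in P[s][a].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:              # converged
            return V

def policy_improvement(V, P, R, gamma=0.9):
    pi = {}
    for s in P:
        q = {a: R[s][a] + gamma * sum(p * V.get(s2, 0.0) for s2, p in P[s][a].items())
             for a in P[s]}
        pi[s] = max(q, key=q.get)      # greedy action for each state
    return pi

def policy_iteration(P, R, gamma=0.9):
    pi = {s: next(iter(P[s])) for s in P}   # arbitrary initial policy
    while True:
        V = policy_evaluation(pi, P, R, gamma)
        new_pi = policy_improvement(V, P, R, gamma)
        if new_pi == pi:                    # policy stable -> done
            return pi, V
        pi = new_pi
```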

Value iteration

The general idea can be summed up as,

$$v_{k+1}(s) = \max_{a} \left( R_s^a + \gamma \sum_{s'} P_{ss'}^{a} \, v_k(s') \right)$$

Notice it can be written as such too,

$$v_{k+1}(s) = \max_{a} q_k(s, a)$$

The algorithm works as such,

  1. Set the initial parameters (e.g. discount factor, convergence threshold).
  2. Start with a new (e.g. zero-initialized) state-value matrix.
  3. Find the maximum state-action value for each state $s$,
    1. Obtain the maximum state-action value derived over all actions $a$ that can be taken.
    2. Update the delta if the difference from the prior value (e.g. $|v_{\text{new}}(s) - v_{\text{old}}(s)|$) is greater than the existing delta.
    3. Tabulate the state-value matrix for the particular state $s$.
  4. Check whether delta has converged to within the convergence threshold.
    • Repeat Step 3 if not converged (i.e. delta > threshold).
  5. Obtain the greedy policy from the state-value matrix and return the policy and state-value matrix.
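
A matching sketch of value iteration over the same illustrative dictionary-based MDP:

```python
# Value iteration: repeatedly apply the Bellman optimality backup, then extract the greedy policy.
def value_iteration(P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(R[s][a] + gamma * sum(p * V.get(s2, 0.0) for s2, p in P[s][a].items())
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                   # converged
            break
    # Extract the greedy policy from the converged state values.
    pi = {}
    for s in P:
        q = {a: R[s][a] + gamma * sum(p * V.get(s2, 0.0) for s2, p in P[s][a].items())
             for a in P[s]}
        pi[s] = max(q, key=q.get)
    return pi, V
```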

Monte Carlo method (MC)

(Source: Jason Brownlee, 2019 - Example of Monte Carlo sampling for the value function probability distribution)

(Source: Sagi Shaier, 2020 - Example of a first-visit MC)

Monte Carlo methods (or experiments) are algorithms that rely on repeated random sampling to obtain numerical results; the underlying concept is to use randomness to solve problems that might be deterministic in principle. They estimate the value function for a policy $\pi$ by sampling complete episodes (until the terminal state) and averaging the observed returns over the number of visits.

The general idea can be summed up as,

$$Q(s, a) = \frac{S(s, a)}{N(s, a)}$$

Where $N(s, a)$ is the count of steps $t$ at which state $s$ and action $a$ were visited.

$S(s, a)$ is the summation of the returns $G_t$ observed for state $s$ and action $a$ at each such step $t$.
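
A minimal first-visit Monte Carlo sketch, assuming episodes are supplied as lists of (state, action, reward) tuples (that format is an assumption for illustration):

```python
from collections import defaultdict

# First-visit Monte Carlo estimation of Q(s, a): average the returns observed
# from the first time each (state, action) pair is visited in an episode.
def first_visit_mc(episodes, gamma=0.9):
    returns_sum = defaultdict(float)   # S(s, a): running sum of returns
    visit_count = defaultdict(int)     # N(s, a): number of first visits
    Q = {}
    for episode in episodes:
        first_index = {}
        for i, (s, a, _) in enumerate(episode):
            first_index.setdefault((s, a), i)
        g = 0.0
        for i in reversed(range(len(episode))):   # accumulate returns backwards
            s, a, r = episode[i]
            g = r + gamma * g
            if first_index[(s, a)] == i:          # only the first visit counts
                returns_sum[(s, a)] += g
                visit_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / visit_count[(s, a)]
    return Q
```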

Temporal Difference

(Source: Lilian Weng, 2018 - Differences between the MC and TD approaches, where MC needs to reach the terminal state to generate an estimate, and dynamic programming shown as a model-based approach)

Temporal Difference (TD) learning estimates the value function for a policy by learning from episodes of experience (it is able to learn from incomplete episodes) and need not track an episode all the way to termination (i.e. it estimates from short-term differences). During each episode, it explores or exploits randomly through the environment until it reaches the terminal state.

The general idea can be summed up as,

$$V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$$
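
A minimal sketch of the TD(0) state-value update applied to a single observed transition (the dictionary-based V and argument names are illustrative assumptions):

```python
# TD(0) update: nudge V(s) towards the bootstrapped target r + gamma * V(s').
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```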

Q-learning

The most commonly used version is Q-learning, which seeks to optimise the state-action function instead of the state-value function through several episodes of experience,

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right)$$

Once the algorithm reaches the terminal state, the episode ends, and a new episode starts until all episodes are complete. This allows the agent to learn the expected $Q$ matrix (a minimal sketch of such a loop follows the parameter list below).

As for the parameters, the update is affected by,

  • Learning Rate ($\alpha$): Determines the extent to which newly acquired information overrides old information. A value of 0 means the agent learns nothing, and a value of 1 makes the agent consider only the most recent information.

  • Discount Factor ($\gamma$): Determines the importance of future rewards. A value of 0 makes the agent consider only current rewards (short-sighted), while a value approaching 1 makes it strive for long-term high reward.
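
Here is the minimal tabular Q-learning sketch referred to above. The `env` object with `reset()` and `step(action)` returning `(next_state, reward, done)` is a hypothetical interface assumed for illustration (similar in spirit to common RL environment APIs), not a specific library:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:        # explore occasionally
                a = random.choice(actions)
            else:                                # otherwise exploit current estimates
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```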

Exploration-Exploitation Dilemma

(Source: Lilian Weng, 2018 - Example of exploration vs exploitation)

Your favorite restaurant is right around the corner; if you go there every day, you can be confident of what you will get, but you miss the chance of discovering an even better option. If you try new places all the time, you are likely to end up eating unpleasant food from time to time. Similarly, online advertisers try to balance the known most attractive ads against new ads that might be even more successful.

The exploration-exploitation dilemma is a general problem encountered in most data-driven decision-making processes whenever there is some kind of feedback loop between data gathering and decision making.

One famous example is the multi-armed bandit problem.

Multi-Armed Bandit Problem

(Source: Lilian Weng, 2018)

This is a classic problem that well demonstrates the exploration vs exploitation dilemma. Imagine you are in a casino facing multiple slot machines and each is configured with an unknown probability of how likely you can get a reward at one play.

Then, what is the best strategy to achieve the highest long-term rewards?

These are the strategies you can try,

  • No exploration: Might miss out on good opportunities.
  • Exploration at random: Might not necessarily yield the best results.
  • Balanced: Explore smartly, with a preference for uncertain options.

There are several methods to create the balance,

  • ε-Greedy Algorithm
  • Upper Confidence Bounds
  • Thompson Sampling

ε-Greedy

(Source: oreilly.com, n.d.)

The action selection takes the best (greedy) action most of the time, and a random action with a small probability $\epsilon$.

During action selection, it can be summarised as,

$$a_t = \begin{cases} \arg\max_{a} Q(s_t, a) & \text{if } p > \epsilon \\ a_{\text{random}} & \text{otherwise} \end{cases}$$

Where $p$ is a random value drawn at each step taken, and $a_{\text{random}}$ is a randomly chosen action.

When tabulating the policy it becomes,

$$\pi(a \mid s) = \begin{cases} 1 - \epsilon + \epsilon / |A| & \text{if } a = \arg\max_{a'} Q(s, a') \\ \epsilon / |A| & \text{otherwise} \end{cases}$$
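
A minimal sketch of ε-greedy selection from a table of estimated action values (the dictionary format is an assumption):

```python
import random

# Epsilon-greedy: explore with probability epsilon, otherwise pick the greedy action.
def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))     # random action (explore)
    return max(q_values, key=q_values.get)       # best-known action (exploit)
```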

Softmax version

For the softmax version, the action selection process is,

$$\pi(a \mid s) = \frac{e^{Q(s, a)/\tau}}{\sum_{a'} e^{Q(s, a')/\tau}}$$

Where the denominator iterates through all available actions $a'$ in the current state $s$, and $\tau$ is a temperature parameter controlling how greedy the selection is.
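
A corresponding softmax selection sketch (again over an illustrative action-value dictionary):

```python
import math
import random

# Softmax (Boltzmann) selection: higher-valued actions get exponentially higher
# probability; tau is the temperature controlling how greedy the choice is.
def softmax_action(q_values, tau=1.0):
    exps = {a: math.exp(q / tau) for a, q in q_values.items()}
    total = sum(exps.values())
    weights = [exps[a] / total for a in exps]
    return random.choices(list(exps), weights=weights, k=1)[0]
```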

Upper-Confidence Bound

(Source: samishawl, 2020)

The action selection uses the uncertainty in the state-action value estimates to balance exploration and exploitation. The brackets in the above diagram represent the confidence interval, which says how confident we are that the actual state-action value of action $a$ lies somewhere in that region.

The lower bracket is the lower bound, and the upper bracket the upper bound. The region in between these brackets is the confidence interval, which represents the uncertainty in the estimates. If the region is small, there is high certainty that the actual value of action $a$ is near the estimated value. If the region is large, there is great uncertainty about whether the value of action $a$ is near the estimated value.

From the above diagram, UCB picks the action with the highest upper bound, even though another action has a smaller (more certain) region. Over time, however, UCB explores to reduce uncertainty and eventually obtains a greater reward on average than other algorithms.

(Source: samishawl, 2020)

The action selection can be summarized like this,

$$a_t = \arg\max_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

Where $c$ is a constant parameter with $c > 0$ that controls the degree of exploration, and $N_t(a)$ is the number of times action $a$ has been selected so far.
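
A minimal UCB selection sketch for a bandit setting, where `q[a]` is the current value estimate and `n[a]` counts how often action `a` has been taken (both illustrative assumptions):

```python
import math

# UCB: pick the action with the highest optimistic estimate Q(a) + c * sqrt(ln t / N(a)).
def ucb_action(q, n, t, c=2.0):
    def upper_bound(a):
        if n[a] == 0:
            return float("inf")       # force every action to be tried at least once
        return q[a] + c * math.sqrt(math.log(t) / n[a])
    return max(q, key=upper_bound)
```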

Thompson Sampling

(Source: Steve Roberts, 2020)

Thompson sampling starts with all actions having a distribution over their probability of success (e.g. a Beta distribution). An observation is then obtained from an action, and based on the reward obtained, a new distribution with an updated probability of success is generated for that slot machine. Further observations, built on the prior probabilities, eventually give each action an accurate success distribution.

In summary,

  1. For each action $a$, create a distribution $\text{Beta}(\alpha_a, \beta_a)$, where $\beta_a$ counts the plays with no reward and $\alpha_a$ counts the plays with a reward.
  2. Take a random sample from each action's distribution, and depending on whether playing the chosen action results in success or failure, update that action's distribution accordingly.
  3. The action with the highest sample is the one selected.
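
A minimal Beta-Bernoulli Thompson sampling sketch for a multi-armed bandit; the per-arm success probabilities passed in are invented for illustration and are of course unknown to the agent in a real setting:

```python
import random

def thompson_sampling(true_probs, steps=1000):
    n_arms = len(true_probs)
    alpha = [1] * n_arms          # success counts (with a +1 uniform prior)
    beta = [1] * n_arms           # failure counts (with a +1 uniform prior)
    for _ in range(steps):
        # Sample a plausible success rate for each arm, then play the best sample.
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        a = samples.index(max(samples))
        reward = 1 if random.random() < true_probs[a] else 0
        alpha[a] += reward
        beta[a] += 1 - reward
    return alpha, beta

# Example: three illustrative slot machines with different payout probabilities.
print(thompson_sampling([0.2, 0.5, 0.7]))
```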