Epsilon greedy policy


Epsilon-Greedy Policy in Reinforcement Learning

How does it get its name?

The epsilon-greedy policy gets its name from the Greek letter ε (epsilon), which represents a small probability used for random exploration. The "greedy" part of the name refers to the policy's tendency to exploit the best-known action most of the time while occasionally exploring other actions with probability ε.


What does it do?

The epsilon-greedy policy is a method used in reinforcement learning (RL) to balance exploration (trying new actions to discover better rewards) and exploitation (choosing the best-known action to maximize immediate rewards). It helps an agent learn an optimal policy while avoiding getting stuck in suboptimal choices.


How does it do it?

The epsilon-greedy policy works as follows:

  1. Select an action:

    • With probability 1 - ε, choose the action that has the highest estimated value (greedy action).
    • With probability ε, choose a random action uniformly from the available actions (exploration).
  2. Update action values:

    • After taking the action, the agent observes the reward and updates its knowledge of that action's value.
  3. Repeat the process:

    • Over time, the agent refines its estimates of action values, gradually converging towards the optimal policy.

A common approach is to decrease ε over time (e.g., ε = 1/t where t is the timestep), allowing more exploration in the beginning and more exploitation later as the agent becomes more confident in its learned values.
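
As a quick sketch, the selection step can be written in a few lines of Python (the function name epsilon_greedy_action and the use of NumPy are illustrative choices, not something prescribed by the policy itself):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """Choose an action index using the epsilon-greedy rule.

    q_values : 1-D array of estimated action values Q(a)
    epsilon  : probability of taking a uniformly random action
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return int(rng.integers(len(q_values)))
    # Exploit: pick the action with the highest estimated value
    return int(np.argmax(q_values))
```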


Examples

Example 1: Multi-Armed Bandit Problem

Imagine a gambler playing a row of slot machines (each called a "one-armed bandit"). The gambler doesn't know which machine gives the highest payout. The epsilon-greedy policy would:

  • Exploit the best-performing slot machine most of the time (1 - ε probability).
  • Explore other slot machines occasionally (ε probability) to see if there is a better one.

This ensures that the gambler eventually finds and plays the most profitable slot machine while still checking others just in case.
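
As an illustration only, a tiny bandit simulation under this policy might look like the sketch below; the three payout probabilities and the variable names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
true_payouts = np.array([0.2, 0.5, 0.7])   # hypothetical win probability of each machine
q_estimates = np.zeros(len(true_payouts))  # Q(a): estimated value of each machine
counts = np.zeros(len(true_payouts))       # N(a): how often each machine was played
epsilon = 0.1

for t in range(1000):
    if rng.random() < epsilon:
        a = int(rng.integers(len(true_payouts)))   # explore: random machine
    else:
        a = int(np.argmax(q_estimates))            # exploit: best machine so far
    reward = float(rng.random() < true_payouts[a]) # 1 if this machine pays out
    counts[a] += 1
    q_estimates[a] += (reward - q_estimates[a]) / counts[a]  # sample-average update

print(q_estimates)  # should end up close to the true payout probabilities
```

With ε = 0.1 the agent keeps sampling every machine occasionally, so it finds the most profitable one (payout 0.7 here) even if an unlucky early run made it look bad.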


Example 2: Robot Learning to Navigate a Maze

A robot navigating a maze uses reinforcement learning to reach the goal. If the robot always chooses the best-known path, it might miss shortcuts or better strategies. If it only explores randomly, it won't efficiently learn the best route. Using an epsilon-greedy policy:

  • It mostly follows paths it has learned are good (exploitation).
  • It sometimes tries new paths (exploration), which might lead to discovering a shorter route to the goal.

Over time, the robot optimizes its path efficiently.


Why is the Epsilon-Greedy Policy Useful?

  1. Prevents getting stuck in local optima: Purely greedy policies might miss better options.
  2. Simple to implement: Requires only a probability check for choosing between exploration and exploitation.
  3. Ensures sufficient exploration: The agent gathers enough information to make informed decisions.


Mathematical Explanation of the Epsilon-Greedy Policy

The epsilon-greedy policy involves probabilistic decision-making and incremental updates of action values based on received rewards. Let's break down the math behind it.


1. Action Selection in Epsilon-Greedy

At each time step $t$, an agent must choose an action $a$ from a set of $n$ possible actions:

$A = \{a_1, a_2, \dots, a_n\}$

The selection follows this rule:

$a_t = \begin{cases} \arg\max_a Q_t(a), & \text{with probability } 1 - \epsilon_t \quad \text{(exploit)} \\ \text{a random action from } A, & \text{with probability } \epsilon_t \quad \text{(explore)} \end{cases}$

Where:

  • $Q_t(a)$ is the estimated value of action $a$ at time $t$.
  • $\epsilon_t$ is the probability of exploring at time $t$.
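
Note that the greedy action can also be picked during the exploration step, so with $n$ actions its total selection probability is $1 - \epsilon_t + \epsilon_t / n$, while each non-greedy action is chosen with probability $\epsilon_t / n$. For example, with $n = 4$ and $\epsilon_t = 0.1$, the greedy action is taken with probability $0.9 + 0.025 = 0.925$ and each remaining action with probability $0.025$.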

The exploration probability $\epsilon$ often decays over time to encourage more exploitation as learning progresses:

$\epsilon_t = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min}) e^{-\lambda t}$

where:

  • $\epsilon_{\max}$ is the initial exploration probability (e.g., 1.0),
  • $\epsilon_{\min}$ is the minimum exploration probability (e.g., 0.01),
  • $\lambda$ is the decay rate.

Plot 1: $\epsilon_t$ decreases over time, shifting the balance from exploration to exploitation.
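
A small sketch of this schedule in Python (the names epsilon_max, epsilon_min, and decay_rate stand in for $\epsilon_{\max}$, $\epsilon_{\min}$, and $\lambda$; the default values are just examples):

```python
import numpy as np

def epsilon_schedule(t, epsilon_max=1.0, epsilon_min=0.01, decay_rate=0.01):
    """Exponentially decaying exploration rate:
    eps_min + (eps_max - eps_min) * exp(-lambda * t)."""
    return epsilon_min + (epsilon_max - epsilon_min) * np.exp(-decay_rate * t)

# epsilon_schedule(0) == 1.0; the value approaches 0.01 as t grows
```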


2. Updating Action-Value Estimates

Once an action is taken, the agent receives a reward $R_t$ and updates the estimated value $Q_t(a)$ using the incremental mean update rule:

$Q_{t+1}(a) = Q_t(a) + \alpha (R_t - Q_t(a))$

where:

  • $\alpha$ is the learning rate (commonly $\alpha = \frac{1}{N_t(a)}$, the reciprocal of how many times action $a$ has been selected, which turns the update into a running average),
  • $R_t$ is the reward received after choosing action $a$.

Alternatively, using an averaging approach:

$Q_{t+1}(a) = \frac{1}{N_t(a)} \sum_{i=1}^{N_t(a)} R_i$

where $N_t(a)$ is the number of times action $a$ has been chosen up to time $t$, and $R_i$ are the rewards received on those selections.

Plot 2: the estimated values $Q_t(a)$ for different actions evolve over time. Initially the estimates fluctuate, but as the agent gathers more experience they converge to approximate the true action values.
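
A minimal sketch of this update in Python, showing that the rule with $\alpha = 1/N_t(a)$ reproduces the running average (the reward sequence below is hypothetical):

```python
def update_q(q_value, reward, alpha):
    """Incremental mean update: Q <- Q + alpha * (R - Q)."""
    return q_value + alpha * (reward - q_value)

# With alpha = 1/N(a), this is exactly the running average of the rewards for action a:
q, n = 0.0, 0
for reward in [1.0, 0.0, 1.0, 1.0]:   # hypothetical rewards observed for one action
    n += 1
    q = update_q(q, reward, alpha=1.0 / n)
print(q)  # 0.75, the sample mean of the four rewards
```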


3. Key Insights from the Plots

  • Exploration decreases over time (first plot), ensuring early randomness but favoring learned actions later.
  • Action value estimates stabilize (second plot), showing the agent learns which actions are best.
