Epsilon greedy policy


Epsilon-Greedy Policy in Reinforcement Learning

How does it get its name?

The epsilon-greedy policy gets its name from the Greek letter ε (epsilon), which represents a small probability used for random exploration. The "greedy" part of the name refers to the policy's tendency to exploit the best-known action most of the time while occasionally exploring other actions with probability ε.


What does it do?

The epsilon-greedy policy is a method used in reinforcement learning (RL) to balance exploration (trying new actions to discover better rewards) and exploitation (choosing the best-known action to maximize immediate rewards). It helps an agent learn an optimal policy while avoiding getting stuck in suboptimal choices.


How does it do it?

The epsilon-greedy policy works as follows:

  1. Select an action:

    • With probability 1 - ε, choose the action that has the highest estimated value (greedy action).
    • With probability ε, choose a random action uniformly from the available actions (exploration).
  2. Update action values:

    • After taking the action, the agent observes the reward and updates its knowledge of that action's value.
  3. Repeat the process:

    • Over time, the agent refines its estimates of action values, gradually converging towards the optimal policy.

A common approach is to decrease ε over time (e.g., ε = 1/t where t is the timestep), allowing more exploration in the beginning and more exploitation later as the agent becomes more confident in its learned values.
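
As a quick sketch, the selection step can be written in a few lines of Python (the function name epsilon_greedy_action and the use of NumPy are illustrative choices, not something prescribed by the policy itself):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """Choose an action index using the epsilon-greedy rule.

    q_values : 1-D array of estimated action values Q(a)
    epsilon  : probability of taking a uniformly random action
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return int(rng.integers(len(q_values)))
    # Exploit: pick the action with the highest estimated value
    return int(np.argmax(q_values))
```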


Examples

Example 1: Multi-Armed Bandit Problem

Imagine a gambler playing a row of slot machines (each called a "one-armed bandit"). The gambler doesn't know which machine gives the highest payout. The epsilon-greedy policy would:

  • Exploit the best-performing slot machine most of the time (1 - ε probability).
  • Explore other slot machines occasionally (ε probability) to see if there is a better one.

This ensures that the gambler eventually finds and plays the most profitable slot machine while still checking others just in case.
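
As an illustration only, a tiny bandit simulation under this policy might look like the sketch below; the three payout probabilities and the variable names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
true_payouts = np.array([0.2, 0.5, 0.7])   # hypothetical win probability of each machine
q_estimates = np.zeros(len(true_payouts))  # Q(a): estimated value of each machine
counts = np.zeros(len(true_payouts))       # N(a): how often each machine was played
epsilon = 0.1

for t in range(1000):
    if rng.random() < epsilon:
        a = int(rng.integers(len(true_payouts)))   # explore: random machine
    else:
        a = int(np.argmax(q_estimates))            # exploit: best machine so far
    reward = float(rng.random() < true_payouts[a]) # 1 if this machine pays out
    counts[a] += 1
    q_estimates[a] += (reward - q_estimates[a]) / counts[a]  # sample-average update

print(q_estimates)  # should end up close to the true payout probabilities
```

With ε = 0.1 the agent keeps sampling every machine occasionally, so it finds the most profitable one (payout 0.7 here) even if an unlucky early run made it look bad.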


Example 2: Robot Learning to Navigate a Maze

A robot navigating a maze uses reinforcement learning to reach the goal. If the robot always chooses the best-known path, it might miss shortcuts or better strategies. If it only explores randomly, it won't efficiently learn the best route. Using an epsilon-greedy policy:

  • It mostly follows paths it has learned are good (exploitation).
  • It sometimes tries new paths (exploration), which might lead to discovering a shorter route to the goal.

Over time, the robot optimizes its path efficiently.


Why is the Epsilon-Greedy Policy Useful?

  1. Prevents getting stuck in local optima: Purely greedy policies might miss better options.
  2. Simple to implement: Requires only a probability check for choosing between exploration and exploitation.
  3. Ensures sufficient exploration: The agent gathers enough information to make informed decisions.


Mathematical Explanation of the Epsilon-Greedy Policy

The epsilon-greedy policy involves probabilistic decision-making and incremental updates of action values based on received rewards. Let's break down the math behind it.


1. Action Selection in Epsilon-Greedy

At each time step $t$, an agent must choose an action $a$ from a set of $n$ possible actions:

$A = \{a_1, a_2, \dots, a_n\}$

The selection follows this rule:

$a_t = \begin{cases} \arg\max_a Q_t(a), & \text{with probability } 1 - \epsilon_t \quad \text{(exploit)} \\ \text{a random action from } A, & \text{with probability } \epsilon_t \quad \text{(explore)} \end{cases}$

Where:

  • $Q_t(a)$ is the estimated value of action $a$ at time $t$.
  • $\epsilon_t$ is the probability of exploring at time $t$.
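
Note that the greedy action can also be picked during the exploration step, so with $n$ actions its total selection probability is $1 - \epsilon_t + \epsilon_t / n$, while each non-greedy action is chosen with probability $\epsilon_t / n$. For example, with $n = 4$ and $\epsilon_t = 0.1$, the greedy action is taken with probability $0.9 + 0.025 = 0.925$ and each remaining action with probability $0.025$.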

The exploration probability $\epsilon$ often decays over time to encourage more exploitation as learning progresses:

$\epsilon_t = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min}) e^{-\lambda t}$

where:

  • $\epsilon_{\max}$ is the initial exploration probability (e.g., 1.0),
  • $\epsilon_{\min}$ is the minimum exploration probability (e.g., 0.01),
  • $\lambda$ is the decay rate.

Plot 1: $\epsilon_t$ decreases over time, shifting the balance from exploration to exploitation.
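
A small sketch of this schedule in Python (the names epsilon_max, epsilon_min, and decay_rate stand in for $\epsilon_{\max}$, $\epsilon_{\min}$, and $\lambda$; the default values are just examples):

```python
import numpy as np

def epsilon_schedule(t, epsilon_max=1.0, epsilon_min=0.01, decay_rate=0.01):
    """Exponentially decaying exploration rate:
    eps_min + (eps_max - eps_min) * exp(-lambda * t)."""
    return epsilon_min + (epsilon_max - epsilon_min) * np.exp(-decay_rate * t)

# epsilon_schedule(0) == 1.0; the value approaches 0.01 as t grows
```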


2. Updating Action-Value Estimates

Once an action is taken, the agent receives a reward $R_t$ and updates the estimated value $Q_t(a)$ using the incremental mean update rule:

$Q_{t+1}(a) = Q_t(a) + \alpha (R_t - Q_t(a))$

where:

  • $\alpha$ is the learning rate (commonly $\alpha = \frac{1}{N_t(a)}$, the reciprocal of how many times action $a$ has been selected, which turns the update into a running average),
  • $R_t$ is the reward received after choosing action $a$.

Alternatively, using an averaging approach:

$Q_{t+1}(a) = \frac{1}{N_t(a)} \sum_{i=1}^{N_t(a)} R_i$

where $N_t(a)$ is the number of times action $a$ has been chosen up to time $t$, and $R_i$ are the rewards received on those selections.

Plot 2: the estimated values $Q_t(a)$ for different actions evolve over time. Initially the estimates fluctuate, but as the agent gathers more experience they converge to approximate the true action values.
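
A minimal sketch of this update in Python, showing that the rule with $\alpha = 1/N_t(a)$ reproduces the running average (the reward sequence below is hypothetical):

```python
def update_q(q_value, reward, alpha):
    """Incremental mean update: Q <- Q + alpha * (R - Q)."""
    return q_value + alpha * (reward - q_value)

# With alpha = 1/N(a), this is exactly the running average of the rewards for action a:
q, n = 0.0, 0
for reward in [1.0, 0.0, 1.0, 1.0]:   # hypothetical rewards observed for one action
    n += 1
    q = update_q(q, reward, alpha=1.0 / n)
print(q)  # 0.75, the sample mean of the four rewards
```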


3. Key Insights from the Plots

  • Exploration decreases over time (first plot), ensuring early randomness but favoring learned actions later.
  • Action value estimates stabilize (second plot), showing the agent learns which actions are best.
