Epsilon-Greedy Policy in Reinforcement Learning
How does it get its name?
The epsilon-greedy policy gets its name from the Greek letter ε (epsilon), which represents a small probability used for random exploration. The "greedy" part of the name refers to the policy's tendency to exploit the best-known action most of the time while occasionally exploring other actions with probability ε.
What does it do?
The epsilon-greedy policy is a method used in reinforcement learning (RL) to balance exploration (trying new actions to discover better rewards) and exploitation (choosing the best-known action to maximize immediate rewards). It helps an agent learn an optimal policy while avoiding getting stuck in suboptimal choices.
How does it do it?
The epsilon-greedy policy works as follows:
- Select an action:
  - With probability 1 - ε, choose the action that has the highest estimated value (greedy action).
  - With probability ε, choose a random action uniformly from the available actions (exploration).
- Update action values:
  - After taking the action, the agent observes the reward and updates its estimate of that action's value.
- Repeat the process:
  - Over time, the agent refines its estimates of action values, gradually converging towards the optimal policy.
A common approach is to decrease ε over time (e.g., ε = 1/t where t is the timestep), allowing more exploration in the beginning and more exploitation later as the agent becomes more confident in its learned values.
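As a minimal sketch, the selection rule can be written as a short Python function. The function name, the use of NumPy's random generator, and the example values below are illustrative choices, not taken from any particular library:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:                      # explore
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))                 # exploit

# Illustrative usage: 4 actions, action 1 currently looks best, epsilon = 0.1.
rng = np.random.default_rng(0)
q = np.array([0.2, 0.5, 0.1, 0.4])
choices = [epsilon_greedy_action(q, 0.1, rng) for _ in range(10_000)]
print("fraction choosing action 1:", choices.count(1) / len(choices))  # ≈ 0.9 + 0.1/4
```

Note that `np.argmax` breaks ties by returning the lowest index; some implementations break ties randomly so that equally valued actions are explored evenly.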
Examples
Example 1: Multi-Armed Bandit Problem
Imagine a gambler playing a row of slot machines (each called a "one-armed bandit"). The gambler doesn't know which machine gives the highest payout. The epsilon-greedy policy would:
- Exploit the best-performing slot machine most of the time (1 - ε probability).
- Explore other slot machines occasionally (ε probability) to see if there is a better one.
This ensures that the gambler eventually finds and plays the most profitable slot machine while still checking others just in case.
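To make the slot-machine story concrete, here is a small simulation sketch. The payout probabilities, seed, and number of plays are made up for illustration; the agent does not know the probabilities and must discover the best machine through ε-greedy play:

```python
import numpy as np

# Hypothetical 5-armed bandit: each "machine" pays 1 with a fixed, unknown probability.
true_payout_probs = np.array([0.20, 0.50, 0.35, 0.10, 0.45])   # machine at index 1 is best

rng = np.random.default_rng(42)
n_arms = len(true_payout_probs)
Q = np.zeros(n_arms)        # estimated value of each machine
N = np.zeros(n_arms)        # how many times each machine has been played
epsilon = 0.1

for t in range(10_000):
    if rng.random() < epsilon:
        a = int(rng.integers(n_arms))       # explore: try a random machine
    else:
        a = int(np.argmax(Q))               # exploit: play the best-looking machine
    reward = float(rng.random() < true_payout_probs[a])   # Bernoulli payout
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]          # incremental sample-average update

print("estimated payouts:", np.round(Q, 2))
print("most-played machine:", int(np.argmax(N)))   # usually index 1, the best machine
```

With ε = 0.1 the agent ends up playing the best machine roughly 90% of the time while still sampling the others occasionally.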
Example 2: Robot Learning to Navigate a Maze
A robot navigating a maze uses reinforcement learning to reach the goal. If the robot always chooses the best-known path, it might miss shortcuts or better strategies. If it only explores randomly, it won't efficiently learn the best route. Using an epsilon-greedy policy:
- It mostly follows paths it has learned are good (exploitation).
- It sometimes tries new paths (exploration), which might lead to discovering a shorter route to the goal.
Over time, the robot optimizes its path efficiently.
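The same idea plugs directly into tabular Q-learning. The sketch below uses a deliberately tiny, made-up "maze" (a one-dimensional corridor of six states) so that the ε-greedy choice inside the learning loop is easy to see; it is not meant as a realistic maze environment:

```python
import numpy as np

# Toy "maze": a corridor of states 0..5; the agent starts at 0, the goal is state 5.
# Actions: 0 = step left, 1 = step right. Each step costs -1 until the goal is reached.
# (The environment and all hyperparameters here are invented for illustration.)
N_STATES, GOAL = 6, 5
rng = np.random.default_rng(1)
Q = np.zeros((N_STATES, 2))                 # Q[state, action]
alpha, gamma, epsilon = 0.1, 0.95, 0.2

for episode in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
        a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = -1.0
        # Q-learning update toward the reward plus the discounted best value of the next state.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

# Roll out the purely greedy policy after training (capped at 20 steps, just in case).
s, path = 0, [0]
for _ in range(20):
    if s == GOAL:
        break
    s = max(0, s - 1) if np.argmax(Q[s]) == 0 else min(GOAL, s + 1)
    path.append(s)
print("greedy path after training:", path)   # expected: [0, 1, 2, 3, 4, 5]
```

Because untried actions start at a value of 0 while tried-and-penalized ones turn negative, even a small ε is enough here for the agent to discover and then consistently follow the shortest route.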
Why is the Epsilon-Greedy Policy Useful?
- Prevents getting stuck in local optima: Purely greedy policies might miss better options.
- Simple to implement: Requires only a probability check for choosing between exploration and exploitation.
- Ensures sufficient exploration: The agent gathers enough information to make informed decisions.
Mathematical Explanation of the Epsilon-Greedy Policy
The epsilon-greedy policy involves probabilistic decision-making and incremental updates of action values based on received rewards. Let's break down the math behind it.
1. Action Selection in Epsilon-Greedy
At each time step $t$, an agent must choose an action $a$ from a set of $n$ possible actions:
$A = \{a_1, a_2, \dots, a_n\}$
The selection follows this rule:
$a_t = \begin{cases} \arg\max_a Q_t(a), & \text{with probability } 1 - \epsilon_t \quad \text{(exploit)} \\ \text{random action from } A, & \text{with probability } \epsilon_t \quad \text{(explore)} \end{cases}$
where:
- $Q_t(a)$ is the estimated value of action $a$ at time $t$.
- $\epsilon_t$ is the probability of exploring at time $t$.
The exploration probability $\epsilon$ often decays over time to encourage more exploitation as learning progresses:
$\epsilon_t = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min}) e^{-\lambda t}$
where:
- $\epsilon_{\max}$ is the initial exploration probability (e.g., 1.0),
- $\epsilon_{\min}$ is the minimum exploration probability (e.g., 0.01),
- $\lambda$ is the decay rate.
Plot 1 (figure not included here): $\epsilon_t$ decreases over time, shifting the balance from exploration to exploitation.
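A quick numerical sketch of this schedule, using the example values $\epsilon_{\max} = 1.0$ and $\epsilon_{\min} = 0.01$ from above and an illustrative decay rate $\lambda = 0.01$:

```python
import numpy as np

eps_max, eps_min, lam = 1.0, 0.01, 0.01   # example values; lambda chosen for illustration

def epsilon_at(t):
    """Exponentially decayed exploration probability at timestep t."""
    return eps_min + (eps_max - eps_min) * np.exp(-lam * t)

for t in (0, 100, 500, 1000):
    print(f"t={t:4d}  epsilon={epsilon_at(t):.3f}")
# t=   0  epsilon=1.000
# t= 100  epsilon=0.374
# t= 500  epsilon=0.017
# t=1000  epsilon=0.010
```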
2. Updating Action-Value Estimates
Once an action is taken, the agent receives a reward $R_t$ and updates the estimated value $Q_t(a)$ using the incremental update rule:
$Q_{t+1}(a) = Q_t(a) + \alpha (R_t - Q_t(a))$
where:
- $\alpha$ is the learning rate (choosing $\alpha = \frac{1}{N_t}$, where $N_t$ is the number of times action $a$ has been selected, turns the update into a running sample average),
- $R_t$ is the reward received after choosing action $a$.
Alternatively, using an averaging approach:
$Q_{t+1}(a) = \frac{1}{N_t} \sum_{i=1}^{N_t} R_i$
where $N_t$ is the number of times action $a$ has been chosen and $R_i$ is the $i$-th reward observed for that action.
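The two forms are equivalent: running the incremental rule with $\alpha = \frac{1}{N_t}$ reproduces the plain sample average. A short sketch checking this numerically (the reward distribution and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
rewards = rng.normal(loc=1.0, scale=0.5, size=100)   # hypothetical rewards for one action

# Incremental update with alpha = 1/N.
Q, N = 0.0, 0
for r in rewards:
    N += 1
    Q += (r - Q) / N          # Q_{t+1} = Q_t + (1/N)(R_t - Q_t)

print(np.isclose(Q, rewards.mean()))   # True: both give the running mean
```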
Plot 2 (figure not included here): the estimated values $Q_t(a)$ for different actions fluctuate at first, then converge toward the true action values as the agent gathers more experience.
3. Key Insights from the Plots
- Exploration decreases over time (first plot), ensuring early randomness but favoring learned actions later.
- Action value estimates stabilize (second plot), showing the agent learns which actions are best.