PPO
- Actor-critic algorithm
- PPO:implementation
- GRPO Proximal Policy Optimization
Proximal Policy Optimization (PPO) Explained
Proximal Policy Optimization (PPO) is a reinforcement learning (RL) algorithm that improves on traditional policy gradient methods. It is an on-policy algorithm that seeks to maximize the expected cumulative reward while keeping policy updates constrained, ensuring stability in training.
Mathematical Formulation of PPO
PPO improves policy gradient methods by using a clipped surrogate objective function to prevent overly large policy updates. The main objective function in PPO is:
$$J(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]$$
where:
- $\theta$ are the parameters of the policy $\pi_{\theta}$.
- $r_t(\theta)$ is the probability ratio between the new and old policies:
$$r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$
It represents how much the new policy deviates from the old one.
- $A_t$ is the advantage function, which measures how much better an action is compared to the expected value.
- Clipping Mechanism: The $\text{clip}$ function ensures that the ratio $r_t(\theta)$ stays within the range $[1-\epsilon, 1+\epsilon]$, preventing excessively large policy updates.
- $\epsilon$ (typically 0.1–0.2) is a small hyperparameter controlling the allowed deviation from the old policy.
The clipping mechanism stabilizes training by discouraging extreme updates that could destabilize learning.
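A quick numerical illustration (arbitrary numbers, not from any experiment): with $\epsilon = 0.2$, an advantage $A_t = 2$, and a ratio $r_t(\theta) = 1.5$,
$$\min\big(1.5 \times 2,\; \text{clip}(1.5,\, 0.8,\, 1.2) \times 2\big) = \min(3.0,\, 2.4) = 2.4,$$
so the objective credits the policy with only $2.4$: pushing the action's probability beyond the $1+\epsilon$ band earns no additional reward.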
Breaking Down the PPO Objective Function in Detail
The main objective function in Proximal Policy Optimization (PPO) is:
$$J(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]$$
Now, let's pronounce and explain each term in detail.
1. Understanding the Notation
Each component of the equation has a specific meaning:
Symbol | Pronunciation | Meaning |
---|---|---|
$J(\theta)$ | "J of theta" | The objective function we aim to maximize, which represents policy performance. |
$\mathbb{E}_t$ | "Expectation over time step t" | The expectation (average) over all timesteps in a trajectory. |
$\min$ | "Minimum" | Ensures conservative policy updates by choosing the smaller value. |
$r_t(\theta)$ | "R sub t of theta" | The probability ratio between the new and old policy, representing how much the policy has changed. |
$A_t$ | "A sub t" | The advantage function, which quantifies whether an action was better or worse than expected. |
$\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ | "Clipped r sub t of theta between one minus epsilon and one plus epsilon" | A mechanism that limits how much the policy can change, preventing instability. |
$\epsilon$ | "Epsilon" | A small hyperparameter (e.g., 0.1 - 0.2) that controls the allowed deviation from the old policy. |
2. Step-by-Step Explanation
The objective function $J(\theta)$ ensures that updates to the policy are controlled and stable. Let's break it down in four steps:
Step 1: Compute the Probability Ratio
We first calculate the probability ratio $r_t(\theta)$, which tells us how much the new policy $\pi_{\theta}$ differs from the old policy $\pi_{\theta_{\text{old}}}$ for a given action:
$$r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$
- If $r_t(\theta) = 1$, the new and old policies are identical.
- If $r_t(\theta) > 1$, the new policy assigns higher probability to the action than before.
- If $r_t(\theta) < 1$, the new policy assigns lower probability to the action than before.
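In code, the ratio is typically computed from log-probabilities for numerical stability. A minimal PyTorch illustration with placeholder numbers (not taken from the original text):

```python
import torch

# Log-probabilities of the same actions under the new and old policies
# (placeholder values; in practice they come from the policy network
# before and after the update).
log_probs_new = torch.tensor([-0.9, -1.2, -2.0])
log_probs_old = torch.tensor([-1.0, -1.0, -1.0])

# r_t(theta) = pi_new(a_t|s_t) / pi_old(a_t|s_t) = exp(log pi_new - log pi_old)
ratio = torch.exp(log_probs_new - log_probs_old)
print(ratio)  # > 1 where the new policy favors the action more, < 1 where less
```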
Step 2: Compute the Policy Improvement Term
Next, we multiply $r_t(\theta)$ by the advantage function $A_t$:
$$r_t(\theta) A_t$$
- If $A_t > 0$, the action was better than expected, and we want to increase its probability.
- If $A_t < 0$, the action was worse than expected, and we want to decrease its probability.
Step 3: Clipping to Prevent Instability
PPO also computes a clipped version of this term, in which the ratio itself is restricted to $[1-\epsilon, 1+\epsilon]$ before being multiplied by the advantage:
$$\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \, A_t$$
- If $r_t(\theta)$ rises above $1+\epsilon$ (the new policy increases the action's probability too sharply), it is clipped to $1+\epsilon$.
- If $r_t(\theta)$ falls below $1-\epsilon$ (the new policy decreases the action's probability too sharply), it is clipped to $1-\epsilon$.
- This keeps any single update from changing the policy too drastically.
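Steps 2 and 3 amount to an element-wise multiply and a clamp. Continuing the previous sketch with placeholder tensors:

```python
import torch

eps = 0.2
ratio = torch.tensor([1.5, 0.7, 1.05])       # r_t(theta), e.g. from the ratio snippet above
advantages = torch.tensor([2.0, -1.0, 0.5])  # A_t (placeholder values)

unclipped = ratio * advantages                               # r_t * A_t
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # clip(r_t, 1-eps, 1+eps) * A_t
print(unclipped)  # tensor([ 3.0000, -0.7000,  0.5250])
print(clipped)    # tensor([ 2.4000, -0.8000,  0.5250])
```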
Step 4: Taking the Minimum
PPO chooses the more conservative update by taking the minimum of the two terms:
$$\min \left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right)$$
- If $r_t(\theta)$ stays inside the clipping range, both terms are equal and the ordinary surrogate $r_t(\theta) A_t$ is used.
- If moving $r_t(\theta)$ outside the range would inflate the objective, the clipped (smaller) term is used instead, so the policy gains nothing from pushing further.
This keeps each policy update small and stabilizes training.
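One detail the minimum makes explicit: clipping only limits how much credit the policy can claim, never how much it is penalized. With $\epsilon = 0.2$, $A_t = -2$, and $r_t(\theta) = 1.5$ (the policy raised the probability of a worse-than-expected action),
$$\min\big(1.5 \times (-2),\; 1.2 \times (-2)\big) = \min(-3.0,\, -2.4) = -3.0,$$
the unclipped, more negative term wins, so the clipped objective is always a pessimistic lower bound on $r_t(\theta) A_t$.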
3. Why This Works
PPO improves policy gradient methods by:
- Avoiding excessively large updates that could destabilize learning.
- Approximating the monotonic-improvement idea behind trust-region methods, so the policy tends to get better step by step.
- Reducing the variance of updates while staying reasonably sample efficient for an on-policy method.
4. Summary
- $J(\theta)$ is the objective function we optimize.
- $r_t(\theta)$ is the probability ratio that tells us how much the policy has changed.
- $A_t$ measures whether an action was better or worse than expected.
- Clipping ensures stability by limiting how much the policy can change in one update.
- Taking the minimum value prevents overly aggressive updates, ensuring smoother training.
This makes PPO one of the most reliable and stable reinforcement learning algorithms used in robotics, game AI, and continuous control tasks.
High-Level Python Code for PPO
Below is a minimal, illustrative sketch of PPO's core update in PyTorch, using a toy policy network and placeholder rollout data rather than a full training loop:
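```python
# Minimal PPO-style update sketch (clipped surrogate only).
# Assumptions not taken from the page: network sizes, hyperparameters, and the
# fake rollout data below are placeholders; a real agent would collect
# trajectories from an environment and also fit a value function for A_t.
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Tiny categorical policy: observation -> distribution over discrete actions."""

    def __init__(self, obs_dim: int = 4, n_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))


def ppo_clip_loss(policy, obs, actions, old_log_probs, advantages, eps=0.2):
    """Negative clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    dist = policy(obs)
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)                 # r_t(theta)
    unclipped = ratio * advantages                               # r_t * A_t
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # clipped term
    return -torch.min(unclipped, clipped).mean()                 # minimize -J(theta)


if __name__ == "__main__":
    torch.manual_seed(0)
    policy = PolicyNet()
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

    # Placeholder "rollout": random observations, actions, and advantages.
    obs = torch.randn(32, 4)
    actions = torch.randint(0, 2, (32,))
    advantages = torch.randn(32)
    with torch.no_grad():
        old_log_probs = policy(obs).log_prob(actions)            # pi_theta_old(a_t|s_t)

    # Several gradient steps on the same on-policy batch, as PPO typically does.
    for epoch in range(4):
        loss = ppo_clip_loss(policy, obs, actions, old_log_probs, advantages)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"final clipped-surrogate loss: {loss.item():.4f}")
```

A full implementation would also collect rollouts from an environment, estimate advantages (e.g., with GAE), and add a value-function loss and an entropy bonus; those pieces are omitted here to keep the focus on the clipped objective.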
Comparison of PPO, Q-Learning (DQN), and TRPO
The table below compares PPO with Q-learning (DQN) and with Trust Region Policy Optimization (TRPO), the classic algorithm that PPO improves upon.
Feature | PPO (Policy-Based) | DQN (Q-Learning, Value-Based) | TRPO (Policy-Based, Classic Algorithm PPO Improves On) |
---|---|---|---|
Approach | Directly optimizes the policy | Estimates action-value function (Q-values) | Directly optimizes the policy with trust region constraints |
Exploration | Uses stochastic policies | Uses an ε-greedy policy | Uses stochastic policies |
Update Stability | Clipped policy updates prevent instability | Uses a target network and experience replay to stabilize | Uses a KL-divergence trust region constraint to limit policy changes |
Action Space | Works well for continuous and discrete action spaces | Works best for discrete action spaces | Works well for continuous and discrete action spaces |
Training Efficiency | On-policy: needs fresh rollouts for each update, but updates are cheap and easy to parallelize | Off-policy: experience replay lets it reuse past data, with low per-update cost | On-policy like PPO, but each update is much more expensive to compute |
Computational Complexity | Lower than TRPO, does not require second-order optimization | Lower complexity, only requires backpropagation | High complexity, involves solving a constrained optimization problem |
Update Mechanism | Uses clipping to constrain policy updates | Uses Bellman equation to update Q-values | Uses KL-divergence constraint to ensure smooth updates |
Use Cases | Robotics, continuous control, large action spaces | Simple discrete action problems like Atari, CartPole | Robotics, continuous control, but with higher computational cost |
Main Improvement Over Previous | Matches TRPO's stability with a much simpler clipped objective and only first-order optimization | More stable than basic Q-learning due to experience replay and target network | Solves traditional policy gradient issues but requires expensive second-order optimization |
Why PPO Is an Improvement Over TRPO
PPO was designed as a simpler alternative to TRPO:
- TRPO uses a KL-divergence constraint, which requires solving a complicated constrained optimization problem.
- PPO replaces this with a simple clipping mechanism, which removes the need for second-order optimization while still maintaining stability.
- PPO is easier to implement, computationally cheaper, and more scalable than TRPO.
Thus, PPO retains the stability of TRPO but is more efficient and widely used in real-world applications.
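For reference, a sketch of the constrained problem TRPO solves at each update (the standard formulation from the TRPO paper, not something stated on this page):
$$\max_{\theta} \; \mathbb{E}_t\!\left[ r_t(\theta)\, A_t \right] \quad \text{subject to} \quad \mathbb{E}_t\!\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_{\theta}(\cdot \mid s_t) \right) \right] \le \delta$$
PPO folds the same "stay close to $\pi_{\theta_{\text{old}}}$" idea into the clipped objective $J(\theta)$ above, which can be optimized with ordinary first-order methods such as Adam or SGD.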