PPO - chunhualiao/public-docs GitHub Wiki


Proximal Policy Optimization (PPO) Explained

Proximal Policy Optimization (PPO) is a reinforcement learning (RL) algorithm that improves on traditional policy gradient methods. It is an on-policy algorithm that seeks to maximize the expected cumulative reward while keeping policy updates constrained, ensuring stability in training.

Mathematical Formulation of PPO

PPO improves policy gradient methods by using a clipped surrogate objective function to prevent overly large policy updates. The main objective function in PPO is:

$$
J(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]
$$

where:

  • $\theta$ are the parameters of the policy $\pi_{\theta}$.
  • $r_t(\theta)$ is the probability ratio between the new and old policies:
$$
r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}
$$

It represents how much the new policy deviates from the old one.

  • $A_t$ is the advantage function, which measures how much better an action is compared to the expected value.
  • Clipping Mechanism: The clip function ensures that the ratio $r_t(\theta)$ stays within the range $[1-\epsilon, 1+\epsilon]$, preventing excessively large policy updates.
  • $\epsilon$ (typically 0.1 - 0.2) is a small hyperparameter controlling the allowed deviation from the old policy.

The clipping mechanism stabilizes training by discouraging extreme updates that could destabilize learning.
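
As a concrete illustration with made-up numbers: take $\epsilon = 0.2$, an advantage $A_t = 1$, and a ratio $r_t(\theta) = 1.5$. The unclipped term would be $1.5$, but the clipped term is capped at $1.2$, and the minimum keeps the smaller value:

$$
\min\left( 1.5 \times 1,\ \text{clip}(1.5,\, 0.8,\, 1.2) \times 1 \right) = \min(1.5,\ 1.2) = 1.2
$$

So even though the new policy made this action much more likely, the objective only credits a change of up to $1+\epsilon$, which keeps the gradient step modest.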

Breaking Down the PPO Objective Function in Detail

The main objective function in Proximal Policy Optimization (PPO) is:

$$
J(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]
$$

Now, let's pronounce and explain each term in detail.


1. Understanding the Notation

Each component of the equation has a specific meaning:

| Symbol | Pronunciation | Meaning |
|---|---|---|
| $J(\theta)$ | "J of theta" | The objective function we aim to maximize, which represents policy performance. |
| $\mathbb{E}_t$ | "Expectation over time step t" | The expectation (average) over all timesteps in a trajectory. |
| $\min$ | "Minimum" | Ensures conservative policy updates by choosing the smaller value. |
| $r_t(\theta)$ | "R sub t of theta" | The probability ratio between the new and old policy, representing how much the policy has changed. |
| $A_t$ | "A sub t" | The advantage function, which quantifies whether an action was better or worse than expected. |
| $\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ | "Clipped r sub t of theta between one minus epsilon and one plus epsilon" | A mechanism that limits how much the policy can change, preventing instability. |
| $\epsilon$ | "Epsilon" | A small hyperparameter (e.g., 0.1 - 0.2) that controls the allowed deviation from the old policy. |

2. Step-by-Step Explanation

The objective function $J(\theta)$ ensures that updates to the policy are controlled and stable. Let's break it down into four key steps:

Step 1: Compute the Probability Ratio

We first calculate the probability ratio $r_t(\theta)$, which tells us how much the new policy $\pi_{\theta}$ differs from the old policy $\pi_{\theta_{\text{old}}}$ for a given action:

$$
r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}
$$

  • If $r_t(\theta) = 1$, the new and old policies are identical.
  • If $r_t(\theta) > 1$, the new policy assigns higher probability to the action than before.
  • If $r_t(\theta) < 1$, the new policy assigns lower probability to the action than before.
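
In code, this ratio is typically computed from log-probabilities for numerical stability. Below is a minimal PyTorch sketch; the log-probability values are made-up numbers standing in for evaluations of the current and old policies on the sampled actions.

```python
import torch

# Assumed example values of log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t)
# for a small batch of three timesteps.
new_log_probs = torch.tensor([-0.9, -1.2, -2.0])
old_log_probs = torch.tensor([-1.0, -1.0, -2.3])

# r_t(theta) = exp(log pi_new - log pi_old)
ratio = torch.exp(new_log_probs - old_log_probs)
print(ratio)  # > 1: action became more likely; < 1: less likely; 1: unchanged
```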

Step 2: Compute the Policy Improvement Term

Next, we multiply $r_t(\theta)$ by the advantage function $A_t$:

$$
r_t(\theta) A_t
$$

  • If $A_t > 0$, the action was better than expected, and we want to increase its probability.
  • If $A_t < 0$, the action was worse than expected, and we want to decrease its probability.
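
Continuing the same sketch, the advantage values below are made-up numbers; in practice they would come from an advantage estimator such as GAE.

```python
# Assumed advantage estimates for the same three timesteps.
advantages = torch.tensor([0.5, -0.3, 1.0])

# Unclipped surrogate term: positive advantages reward raising the action's
# probability, negative advantages reward lowering it.
unclipped_term = ratio * advantages
```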

Step 3: Clipping to Prevent Instability

Instead of applying $r_t(\theta) A_t$ directly, PPO clips it to prevent excessively large updates:

$$
\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t
$$

  • If $r_t(\theta)$ is too large (i.e., the policy update is too aggressive), it gets clipped to $1+\epsilon$.
  • If $r_t(\theta)$ is too small (i.e., the new policy has reduced the action's probability too sharply), it gets clipped to $1-\epsilon$.
  • This prevents unstable updates and ensures the policy does not change too drastically in one step.
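
In the same sketch, clipping is a single `torch.clamp` call; the value of `clip_eps` here is an assumed typical choice.

```python
clip_eps = 0.2  # epsilon; 0.1 - 0.2 is the usual range

# Restrict the ratio to [1 - eps, 1 + eps] before multiplying by the advantage.
clipped_term = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
```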

Step 4: Taking the Minimum

PPO chooses the more conservative update by taking the minimum of the two terms:

$$
\min \left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right)
$$

  • If the unclipped term $r_t(\theta) A_t$ is the smaller of the two, it is used unchanged.
  • If it is larger than the clipped term (an overly optimistic estimate of the improvement), the clipped version is used instead.

This guarantees stability by ensuring that the policy update is not too drastic.
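
Putting the pieces of the sketch together, the per-timestep objective is the elementwise minimum of the two terms; negating its mean gives a loss that a gradient-based optimizer can minimize.

```python
# Element-wise minimum of the unclipped and clipped terms, averaged over the batch.
objective = torch.min(unclipped_term, clipped_term)
policy_loss = -objective.mean()  # minimize the negative to maximize J(theta)
```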


3. Why This Works

PPO improves policy gradient methods by:

  1. Avoiding excessively large updates that could destabilize learning.
  2. Encouraging approximately monotonic improvement, so the policy tends to get better step by step.
  3. Reducing variance in training while maintaining reasonable sample efficiency.

4. Summary

  • $J(\theta)$ is the objective function we optimize.
  • $r_t(\theta)$ is the probability ratio that tells us how much the policy has changed.
  • $A_t$ measures whether an action was better or worse than expected.
  • Clipping ensures stability by limiting how much the policy can change in one update.
  • Taking the minimum value prevents overly aggressive updates, ensuring smoother training.

This makes PPO one of the most reliable and stable reinforcement learning algorithms used in robotics, game AI, and continuous control tasks.

High-Level Python Code for PPO

Below is a simplified, illustrative PPO update step in PyTorch. It is a sketch rather than a production implementation: the network architecture, hyperparameters, and data-collection code are assumed.
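
```python
import torch
import torch.nn as nn

# Hypothetical actor-critic network; layer sizes and architecture are illustrative choices.
class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # logits over discrete actions
        self.value_head = nn.Linear(hidden, 1)           # state-value estimate V(s)

    def forward(self, obs):
        h = self.shared(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)


def ppo_update(model, optimizer, obs, actions, old_log_probs, advantages, returns,
               clip_eps=0.2, value_coef=0.5, entropy_coef=0.01, epochs=4):
    """One PPO update over a batch of collected experience (clipped surrogate objective)."""
    for _ in range(epochs):
        logits, values = model(obs)
        dist = torch.distributions.Categorical(logits=logits)
        new_log_probs = dist.log_prob(actions)

        # Probability ratio r_t(theta) = pi_new(a|s) / pi_old(a|s), via log-probs for stability.
        ratio = torch.exp(new_log_probs - old_log_probs)

        # Clipped surrogate objective: take the minimum of the unclipped and clipped terms.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()

        # Value-function loss and entropy bonus, common additions in practice.
        value_loss = (returns - values).pow(2).mean()
        entropy = dist.entropy().mean()

        loss = policy_loss + value_coef * value_loss - entropy_coef * entropy

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Collecting rollouts, computing returns and advantages (e.g., with GAE), and normalizing advantages are left out of this sketch; `obs`, `actions`, `old_log_probs`, `advantages`, and `returns` are assumed to be tensors produced by that data-collection phase.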


The table below compares PPO with Q-learning (DQN) and with Trust Region Policy Optimization (TRPO), the classic algorithm that PPO improves upon.

Comparison of PPO, Q-Learning (DQN), and TRPO

| Feature | PPO (Policy-Based) | DQN (Q-Learning, Value-Based) | TRPO (Policy-Based, Classic Algorithm PPO Improves On) |
|---|---|---|---|
| Approach | Directly optimizes the policy | Estimates the action-value function (Q-values) | Directly optimizes the policy with trust region constraints |
| Exploration | Uses stochastic policies | Uses an ε-greedy policy | Uses stochastic policies |
| Update Stability | Clipped policy updates prevent instability | Uses a target network and experience replay to stabilize | Uses a KL-divergence trust region constraint to limit policy changes |
| Action Space | Works well for continuous and discrete action spaces | Works best for discrete action spaces | Works well for continuous and discrete action spaces |
| Training Efficiency | More sample efficient and parallelizable | Needs more interactions but less computational cost per update | Less sample efficient due to complex constraint calculations |
| Computational Complexity | Lower than TRPO; does not require second-order optimization | Lower complexity; only requires backpropagation | High complexity; involves solving a constrained optimization problem |
| Update Mechanism | Uses clipping to constrain policy updates | Uses the Bellman equation to update Q-values | Uses a KL-divergence constraint to ensure smooth updates |
| Use Cases | Robotics, continuous control, large action spaces | Simple discrete-action problems like Atari, CartPole | Robotics, continuous control, but with higher computational cost |
| Main Improvement Over Predecessor | More stable than TRPO due to the clipped objective function | More stable than basic Q-learning due to experience replay and target networks | Solves traditional policy gradient issues but requires expensive second-order optimization |

Why PPO Is an Improvement Over TRPO

PPO was designed as a simpler alternative to TRPO:

  • TRPO uses a KL-divergence constraint, which requires solving a complicated constrained optimization problem.
  • PPO replaces this with a simple clipping mechanism, which removes the need for second-order optimization while still maintaining stability.
  • PPO is easier to implement, computationally cheaper, and more scalable than TRPO.

Thus, PPO retains the stability of TRPO but is more efficient and widely used in real-world applications.