PPO
- Actor-critic algorithm
- PPO:implementation
- GRPO Proximal Policy Optimization
Proximal Policy Optimization (PPO) Explained
Proximal Policy Optimization (PPO) is a reinforcement learning (RL) algorithm that improves on traditional policy gradient methods. It is an on-policy algorithm that seeks to maximize the expected cumulative reward while keeping policy updates constrained, ensuring stability in training.
Mathematical Formulation of PPO
PPO improves policy gradient methods by using a clipped surrogate objective function to prevent overly large policy updates. The main objective function in PPO is:
$$J(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]$$
where:
- $\theta$ are the parameters of the policy $\pi_{\theta}$.
- $r_t(\theta)$ is the probability ratio between the new and old policies:
$$r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$
It represents how much the new policy deviates from the old one.
- $A_t$ is the advantage function, which measures how much better an action is compared to the expected value.
- Clipping Mechanism: The $\text{clip}$ function ensures that the ratio $r_t(\theta)$ stays within the range $[1-\epsilon, 1+\epsilon]$, preventing excessively large policy updates.
- $\epsilon$ (typically 0.1–0.2) is a small hyperparameter controlling the allowed deviation from the old policy.
The clipping mechanism stabilizes training by discouraging extreme updates that could destabilize learning.
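A quick numerical illustration (arbitrary numbers, not from any experiment): with $\epsilon = 0.2$, an advantage $A_t = 2$, and a ratio $r_t(\theta) = 1.5$,
$$\min\big(1.5 \times 2,\; \text{clip}(1.5,\, 0.8,\, 1.2) \times 2\big) = \min(3.0,\, 2.4) = 2.4,$$
so the objective credits the policy with only $2.4$: pushing the action's probability beyond the $1+\epsilon$ band earns no additional reward.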
Breaking Down the PPO Objective Function in Detail
The main objective function in Proximal Policy Optimization (PPO) is:
$$J(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]$$
Now, let's pronounce and explain each term in detail.
1. Understanding the Notation
Each component of the equation has a specific meaning:
Symbol | Pronunciation | Meaning |
---|---|---|
$J(\theta)$ | "J of theta" | The objective function we aim to maximize, which represents policy performance. |
$\mathbb{E}_t$ | "Expectation over time step t" | The expectation (average) over all timesteps in a trajectory. |
$\min$ | "Minimum" | Ensures conservative policy updates by choosing the smaller value. |
$r_t(\theta)$ | "R sub t of theta" | The probability ratio between the new and old policy, representing how much the policy has changed. |
$A_t$ | "A sub t" | The advantage function, which quantifies whether an action was better or worse than expected. |
$\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ | "Clipped r sub t of theta between one minus epsilon and one plus epsilon" | A mechanism that limits how much the policy can change, preventing instability. |
$\epsilon$ | "Epsilon" | A small hyperparameter (e.g., 0.1 - 0.2) that controls the allowed deviation from the old policy. |
2. Step-by-Step Explanation
The objective function $J(\theta)$ ensures that updates to the policy are controlled and stable. Let's break it down in four steps:
Step 1: Compute the Probability Ratio
We first calculate the probability ratio $r_t(\theta)$, which tells us how much the new policy $\pi_{\theta}$ differs from the old policy $\pi_{\theta_{\text{old}}}$ for a given action:
$$r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$
- If $r_t(\theta) = 1$, the new and old policies are identical.
- If $r_t(\theta) > 1$, the new policy assigns higher probability to the action than before.
- If $r_t(\theta) < 1$, the new policy assigns lower probability to the action than before.
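In code, the ratio is typically computed from log-probabilities for numerical stability. A minimal PyTorch illustration with placeholder numbers (not taken from the original text):

```python
import torch

# Log-probabilities of the same actions under the new and old policies
# (placeholder values; in practice they come from the policy network
# before and after the update).
log_probs_new = torch.tensor([-0.9, -1.2, -2.0])
log_probs_old = torch.tensor([-1.0, -1.0, -1.0])

# r_t(theta) = pi_new(a_t|s_t) / pi_old(a_t|s_t) = exp(log pi_new - log pi_old)
ratio = torch.exp(log_probs_new - log_probs_old)
print(ratio)  # > 1 where the new policy favors the action more, < 1 where less
```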
Step 2: Compute the Policy Improvement Term
Next, we multiply $r_t(\theta)$ by the advantage function $A_t$:
$$r_t(\theta) A_t$$
- If $A_t > 0$, the action was better than expected, and we want to increase its probability.
- If $A_t < 0$, the action was worse than expected, and we want to decrease its probability.
Step 3: Clipping to Prevent Instability
PPO also computes a clipped version of this term, in which the ratio itself is restricted to $[1-\epsilon, 1+\epsilon]$ before being multiplied by the advantage:
$$\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \, A_t$$
- If $r_t(\theta)$ rises above $1+\epsilon$ (the new policy increases the action's probability too sharply), it is clipped to $1+\epsilon$.
- If $r_t(\theta)$ falls below $1-\epsilon$ (the new policy decreases the action's probability too sharply), it is clipped to $1-\epsilon$.
- This keeps any single update from changing the policy too drastically.
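Steps 2 and 3 amount to an element-wise multiply and a clamp. Continuing the previous sketch with placeholder tensors:

```python
import torch

eps = 0.2
ratio = torch.tensor([1.5, 0.7, 1.05])       # r_t(theta), e.g. from the ratio snippet above
advantages = torch.tensor([2.0, -1.0, 0.5])  # A_t (placeholder values)

unclipped = ratio * advantages                               # r_t * A_t
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # clip(r_t, 1-eps, 1+eps) * A_t
print(unclipped)  # tensor([ 3.0000, -0.7000,  0.5250])
print(clipped)    # tensor([ 2.4000, -0.8000,  0.5250])
```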
Step 4: Taking the Minimum
PPO chooses the more conservative update by taking the minimum of the two terms:
$$\min \left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right)$$
- If $r_t(\theta)$ stays inside the clipping range, both terms are equal and the ordinary surrogate $r_t(\theta) A_t$ is used.
- If moving $r_t(\theta)$ outside the range would inflate the objective, the clipped (smaller) term is used instead, so the policy gains nothing from pushing further.
This keeps each policy update small and stabilizes training.
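One detail the minimum makes explicit: clipping only limits how much credit the policy can claim, never how much it is penalized. With $\epsilon = 0.2$, $A_t = -2$, and $r_t(\theta) = 1.5$ (the policy raised the probability of a worse-than-expected action),
$$\min\big(1.5 \times (-2),\; 1.2 \times (-2)\big) = \min(-3.0,\, -2.4) = -3.0,$$
the unclipped, more negative term wins, so the clipped objective is always a pessimistic lower bound on $r_t(\theta) A_t$.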
3. Why This Works
PPO improves policy gradient methods by:
- Avoiding excessively large updates that could destabilize learning.
- Approximating the monotonic-improvement idea behind trust-region methods, so the policy tends to get better step by step.
- Reducing the variance of updates while staying reasonably sample efficient for an on-policy method.
4. Summary
- $J(\theta)$ is the objective function we optimize.
- $r_t(\theta)$ is the probability ratio that tells us how much the policy has changed.
- $A_t$ measures whether an action was better or worse than expected.
- Clipping ensures stability by limiting how much the policy can change in one update.
- Taking the minimum value prevents overly aggressive updates, ensuring smoother training.
This makes PPO one of the most reliable and stable reinforcement learning algorithms used in robotics, game AI, and continuous control tasks.
High-Level Python Code for PPO
Below is a minimal, illustrative sketch of PPO's core update in PyTorch, using a toy policy network and placeholder rollout data rather than a full training loop:
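```python
# Minimal PPO-style update sketch (clipped surrogate only).
# Assumptions not taken from the page: network sizes, hyperparameters, and the
# fake rollout data below are placeholders; a real agent would collect
# trajectories from an environment and also fit a value function for A_t.
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Tiny categorical policy: observation -> distribution over discrete actions."""

    def __init__(self, obs_dim: int = 4, n_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))


def ppo_clip_loss(policy, obs, actions, old_log_probs, advantages, eps=0.2):
    """Negative clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    dist = policy(obs)
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)                 # r_t(theta)
    unclipped = ratio * advantages                               # r_t * A_t
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # clipped term
    return -torch.min(unclipped, clipped).mean()                 # minimize -J(theta)


if __name__ == "__main__":
    torch.manual_seed(0)
    policy = PolicyNet()
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

    # Placeholder "rollout": random observations, actions, and advantages.
    obs = torch.randn(32, 4)
    actions = torch.randint(0, 2, (32,))
    advantages = torch.randn(32)
    with torch.no_grad():
        old_log_probs = policy(obs).log_prob(actions)            # pi_theta_old(a_t|s_t)

    # Several gradient steps on the same on-policy batch, as PPO typically does.
    for epoch in range(4):
        loss = ppo_clip_loss(policy, obs, actions, old_log_probs, advantages)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"final clipped-surrogate loss: {loss.item():.4f}")
```

A full implementation would also collect rollouts from an environment, estimate advantages (e.g., with GAE), and add a value-function loss and an entropy bonus; those pieces are omitted here to keep the focus on the clipped objective.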
Comparison of PPO, Q-Learning (DQN), and TRPO
The table below compares PPO with Q-learning (DQN) and with Trust Region Policy Optimization (TRPO), the classic algorithm that PPO improves upon.
Feature | PPO (Policy-Based) | DQN (Q-Learning, Value-Based) | TRPO (Policy-Based, Classic Algorithm PPO Improves On) |
---|---|---|---|
Approach | Directly optimizes the policy | Estimates action-value function (Q-values) | Directly optimizes the policy with trust region constraints |
Exploration | Uses stochastic policies | Uses an ε-greedy policy | Uses stochastic policies |
Update Stability | Clipped policy updates prevent instability | Uses a target network and experience replay to stabilize | Uses a KL-divergence trust region constraint to limit policy changes |
Action Space | Works well for continuous and discrete action spaces | Works best for discrete action spaces | Works well for continuous and discrete action spaces |
Training Efficiency | On-policy: needs fresh rollouts for each update, but updates are cheap and easy to parallelize | Off-policy: experience replay lets it reuse past data, with low per-update cost | On-policy like PPO, but each update is much more expensive to compute |
Computational Complexity | Lower than TRPO, does not require second-order optimization | Lower complexity, only requires backpropagation | High complexity, involves solving a constrained optimization problem |
Update Mechanism | Uses clipping to constrain policy updates | Uses Bellman equation to update Q-values | Uses KL-divergence constraint to ensure smooth updates |
Use Cases | Robotics, continuous control, large action spaces | Simple discrete action problems like Atari, CartPole | Robotics, continuous control, but with higher computational cost |
Main Improvement Over Previous | Matches TRPO's stability with a much simpler clipped objective and only first-order optimization | More stable than basic Q-learning due to experience replay and target network | Solves traditional policy gradient issues but requires expensive second-order optimization |
Why PPO Is an Improvement Over TRPO
PPO was designed as a simpler alternative to TRPO:
- TRPO uses a KL-divergence constraint, which requires solving a complicated constrained optimization problem.
- PPO replaces this with a simple clipping mechanism, which removes the need for second-order optimization while still maintaining stability.
- PPO is easier to implement, computationally cheaper, and more scalable than TRPO.
Thus, PPO retains the stability of TRPO but is more efficient and widely used in real-world applications.
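For reference, a sketch of the constrained problem TRPO solves at each update (the standard formulation from the TRPO paper, not something stated on this page):
$$\max_{\theta} \; \mathbb{E}_t\!\left[ r_t(\theta)\, A_t \right] \quad \text{subject to} \quad \mathbb{E}_t\!\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_{\theta}(\cdot \mid s_t) \right) \right] \le \delta$$
PPO folds the same "stay close to $\pi_{\theta_{\text{old}}}$" idea into the clipped objective $J(\theta)$ above, which can be optimized with ordinary first-order methods such as Adam or SGD.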