Research Problem: How to optimize complex, non-linear policies (like neural networks) in reinforcement learning while guaranteeing monotonic performance improvement and avoiding catastrophic drops caused by overly large update steps.
Key Contributions:
Proposes TRPO, an iterative algorithm theoretically guaranteeing monotonic policy improvement by optimizing a surrogate objective function.
Introduces the use of a trust region constraint, specifically bounding the KL divergence between the old and new policies, to control the update step size. This is shown to be more robust than using a fixed penalty coefficient (the bound and the resulting constrained problem are sketched after this list).
Develops practical approximations (e.g., using average KL divergence instead of maximum KL divergence) and sample-based estimation techniques (single-path and vine methods) to make the algorithm feasible.
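For quick reference, the improvement bound and the practical trust-region problem behind these contributions take roughly the following form (paraphrasing the paper's Eq. 9 and Eq. 12 in its notation; η is the true expected return and L the local surrogate around the old policy, so treat this as a sketch rather than an exact transcription):

```latex
% Improvement bound (used as a penalized objective in Eq. 9):
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad C = \frac{4\epsilon\gamma}{(1-\gamma)^{2}}

% Local surrogate around the old policy:
L_{\theta_{\mathrm{old}}}(\theta) \;=\; \eta(\theta_{\mathrm{old}})
  \;+\; \sum_{s} \rho_{\theta_{\mathrm{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s)\, A_{\theta_{\mathrm{old}}}(s, a)

% Practical trust-region problem (Eq. 12): average KL replaces the max
\max_{\theta}\; L_{\theta_{\mathrm{old}}}(\theta)
\quad \text{s.t.} \quad
\bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \;\le\; \delta
```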
Methodology/Approach:
Builds upon the theory of conservative policy iteration.
Derives a lower bound on the true policy performance in terms of a local surrogate objective minus a penalty proportional to the KL divergence between the old and new policies.
Formulates the optimization as maximizing a surrogate objective subject to a constraint on the average KL divergence (the trust region).
Uses Monte Carlo sampling (single-path or vine) to estimate objectives (Q-values/advantages) and constraints.
Employs the conjugate gradient algorithm with Fisher-vector products to approximately solve the constrained optimization problem efficiently.
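A minimal sketch of that conjugate-gradient step, assuming a sampled policy gradient `g` and a callable `fvp(v)` that returns the Fisher-vector product F·v (e.g., via a Hessian-vector product of the average KL); the full algorithm additionally backtracks with a line search on the surrogate and the KL constraint, which is omitted here:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g for the natural-gradient direction x,
    touching the Fisher matrix F only through products fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()            # residual r = g - F x (x starts at 0)
    p = r.copy()            # conjugate search direction
    r_dot_r = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot_r / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot_r = r @ r
        if new_r_dot_r < tol:
            break
        p = r + (new_r_dot_r / r_dot_r) * p
        r_dot_r = new_r_dot_r
    return x

def trust_region_step(g, fvp, delta=0.01):
    """Scale the natural-gradient direction so the quadratic approximation
    of the average KL, 0.5 * s^T F s, equals the trust-region radius delta."""
    step_dir = conjugate_gradient(fvp, g)
    scale = np.sqrt(2.0 * delta / (step_dir @ fvp(step_dir)))
    return scale * step_dir   # candidate update, to be vetted by a line search
```

The names here and the default `delta=0.01` are illustrative choices for this sketch, not values taken from the paper.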
Results:
TRPO demonstrates strong and robust performance on diverse tasks, including continuous control (simulated robotics: swimming, hopping, walking) and high-dimensional discrete control (Atari games from pixels).
Outperforms methods like standard natural gradient, especially on complex tasks, due to more stable step sizes.
Achieves results competitive with specialized algorithms like DQN and MCTS-based methods on Atari, despite being a general-purpose policy search method.
Strengths:
Empirically robust performance across different types of RL tasks.
The KL constraint offers a more principled and stable way to control step size compared to fixed penalties or learning rates.
A general algorithm applicable to various policy representations and problems.
Weaknesses:
Theoretical underpinnings and proofs (especially in the appendices) are complex and require significant background.
Practical implementation relies on approximations (average KL, sampling) that deviate from the strict theory.
The 'vine' sampling method requires simulation state resets, limiting its use outside simulators.
Can be computationally more intensive than simpler policy gradient methods (though optimizations like Fisher-vector products help).
Key Questions:
The justification for using average KL divergence as the constraint instead of the theoretically derived maximum KL divergence, and its practical implications.
Understanding the relationship between the large penalty coefficient 'C' in the theoretical bound (Eq. 9) and the fixed KL constraint 'delta' (Eq. 11/12) – how the constraint effectively forces small KL divergence.
Clarification on replacing the advantage function (A) with Q-values and seemingly treating the value function (V) as a constant during the sample-based optimization step (a short identity addressing this is sketched after this list).
The specific reasons why different sampling distributions (q) work better for continuous vs. discrete tasks in the vine method.
How TRPO might allow for reusing data samples more effectively than standard on-policy methods (mentioned briefly in comparison to PPO).
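On the A-versus-Q question above, a one-line identity (standard algebra, not from the notes) shows why V(s) can be treated as a per-state constant in the sample-based objective: because π_θ(·|s) sums to one, subtracting V only shifts the objective by a term that does not depend on θ:

```latex
\sum_{a} \pi_{\theta}(a \mid s)\, A_{\pi_{\mathrm{old}}}(s, a)
  \;=\; \sum_{a} \pi_{\theta}(a \mid s)\, Q_{\pi_{\mathrm{old}}}(s, a) \;-\; V_{\pi_{\mathrm{old}}}(s),
\qquad \text{using } \textstyle\sum_{a} \pi_{\theta}(a \mid s) = 1
```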
Applications: Complex control tasks in robotics, game AI (especially from high-dimensional input like pixels), potentially any sequential decision problem needing stable policy updates.
Connections: Directly improves upon Natural Policy Gradient; theoretically based on Conservative Policy Iteration; PPO (Proximal Policy Optimization) was developed later as a simplification of TRPO.
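As context for the PPO connection, the later simplification drops the hard KL constraint and the conjugate-gradient machinery in favor of a clipped surrogate that plain SGD can optimize (the standard PPO objective, quoted here for comparison rather than taken from these notes):

```latex
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[
  \min\!\Big( r_t(\theta)\, \hat{A}_t,\;
              \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big)
\right]
```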
Notes and Reflections:
Interesting Insights:
The core idea of constraining the change in the policy distribution (via KL divergence) is key to stability, acting like an adaptive step size.
The theoretical penalty formulation (Eq. 9 with large 'C') strongly motivates the practical constraint formulation (Eq. 11/12): with the penalty coefficient the theory prescribes, maximizing the penalized objective only permits very small KL divergence per step, so a hard constraint with a tunable delta is the practical way to take larger, still-stable steps.
The appendix proofs using policy "coupling" and "disagreement probability" provide a different perspective on bounding the performance difference.
Lessons Learned:
Explicitly constraining policy updates within a "trust region" is a powerful technique for stable learning with expressive function approximators.
Bridging RL theory and practice often involves well-motivated approximations.
Understanding the derivation helps appreciate why the algorithm is designed the way it is (e.g., the KL constraint).
Future Directions: Algorithm simplification (leading to PPO), applying TRPO/related ideas to more complex perception and control problems, combining with model-based approaches.