[25.05.03] Trust Region Policy Optimization

Paper Reading Study Notes

General Information

  • Paper Title: Trust Region Policy Optimization (TRPO)
  • Authors: John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, Pieter Abbeel
  • Published In: ICML (International Conference on Machine Learning)
  • Year: 2015 (arXiv version discussed: v5, 2017)
  • Link: https://arxiv.org/abs/1502.05477
  • Date of Discussion: 2025.05.03

Summary

  • Research Problem: How to optimize complex, non-linear policies (like neural networks) in reinforcement learning while guaranteeing monotonic performance improvement and avoiding catastrophic drops caused by overly large update steps.
  • Key Contributions:
    • Proposes TRPO, an iterative algorithm theoretically guaranteeing monotonic policy improvement by optimizing a surrogate objective function.
    • Introduces the use of a trust region constraint, specifically bounding the KL divergence between the old and new policies, to control the update step size. This is shown to be more robust than using a fixed penalty coefficient.
    • Develops practical approximations (e.g., using average KL divergence instead of maximum KL divergence) and sample-based estimation techniques (single-path and vine methods) to make the algorithm feasible.
  • Methodology/Approach:
    • Builds upon the theory of conservative policy iteration.
    • Derives a lower bound for policy improvement involving the KL divergence.
    • Formulates the optimization as maximizing a surrogate objective subject to a constraint on the average KL divergence (the trust region); the bound and the resulting constrained problem are written out just after this summary.
    • Uses Monte Carlo sampling (single-path or vine) to estimate the objective (Q-values/advantages) and the KL constraint; a minimal single-path return estimate is sketched below.
    • Employs the conjugate gradient algorithm with Fisher-vector products to approximately solve the constrained optimization problem efficiently (see the conjugate-gradient sketch after this summary).
  • Results:
    • TRPO demonstrates strong and robust performance on diverse tasks, including continuous control (simulated robotics: swimming, hopping, walking) and high-dimensional discrete control (Atari games from pixels).
    • Outperforms methods like standard natural gradient, especially on complex tasks, due to more stable step sizes.
    • Achieves results competitive with specialized algorithms like DQN and MCTS-based methods on Atari, despite being a general-purpose policy search method.
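
For reference, the formulas the summary keeps pointing at, written out as we read them (notation and equation numbers follow the arXiv v5 version):

```latex
% Surrogate objective (Eq. 3): a local approximation to \eta around \pi
L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) \, A_{\pi}(s, a)

% Theoretical lower bound (Eq. 9): improving the right-hand side
% guarantees improving the true objective \eta
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) - C \, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad C = \frac{4 \epsilon \gamma}{(1 - \gamma)^2}

% Practical trust-region problem (Eq. 12): the penalty is replaced by a hard
% constraint on the average KL divergence under the old visitation distribution
\max_{\theta} \; L_{\theta_{\mathrm{old}}}(\theta)
\quad \text{s.t.} \quad
\bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \le \delta
```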
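
A minimal sketch of the single-path estimation mentioned above: roll out whole trajectories with the current policy and use discounted returns as Monte Carlo Q-value samples. The function name here is illustrative, not from the paper's code.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Single-path Monte Carlo estimate: the discounted return from each
    timestep of one trajectory is a Q-value sample for the state-action
    pair visited at that timestep."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a 4-step trajectory with rewards [1, 0, 0, 1] and gamma = 0.9
print(discounted_returns(np.array([1.0, 0.0, 0.0, 1.0]), gamma=0.9))
# prints approximately [1.729, 0.81, 0.9, 1.0]
```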
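
And a sketch of the conjugate-gradient step from the last Methodology bullet, assuming an `fvp(v)` callback that returns the Fisher-vector product (e.g. via a Hessian-vector product of the average KL). This is a schematic reconstruction under those assumptions, not the authors' implementation.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, where fvp(v) returns F @ v,
    so the Fisher matrix F is never formed explicitly."""
    x = np.zeros_like(g)
    r = g.copy()          # residual of g - F x (x starts at zero)
    p = r.copy()          # conjugate search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def trpo_direction(policy_grad, fvp, delta=0.01):
    """Natural-gradient direction s = F^{-1} g, rescaled so the quadratic
    KL model (1/2) s^T F s hits the trust-region radius delta."""
    s = conjugate_gradient(fvp, policy_grad)
    scale = np.sqrt(2.0 * delta / (s @ fvp(s) + 1e-8))
    return scale * s

# Toy usage with an explicit 2x2 "Fisher matrix" (illustration only)
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
print(trpo_direction(g, lambda v: F @ v, delta=0.01))
```

The full algorithm then backtracks along this direction (a simple line search) until the sampled surrogate actually improves and the KL constraint is satisfied.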

Discussion Points

  • Strengths:
    • Provides theoretical monotonic improvement guarantees.
    • Empirically robust performance across different types of RL tasks.
    • The KL constraint offers a more principled and stable way to control step size compared to fixed penalties or learning rates.
    • A general algorithm applicable to various policy representations and problems.
  • Weaknesses:
    • Theoretical underpinnings and proofs (especially in appendices) are complex and require significant background.
    • Practical implementation relies on approximations (average KL, sampling) that deviate from the strict theory.
    • The 'vine' sampling method requires simulation state resets, limiting its use outside simulators.
    • Can be computationally more intensive than simpler policy gradient methods (though optimizations like Fisher-vector products help).
  • Key Questions:
    • The justification for using average KL divergence as the constraint instead of the theoretically derived maximum KL divergence, and its practical implications.
    • Understanding the relationship between the large penalty coefficient 'C' in the theoretical bound (Eq. 9) and the fixed KL constraint 'delta' (Eq. 11/12): with the theoretical C, the penalty effectively forces the KL divergence, and hence the step size, to be very small, which motivates switching to the fixed constraint.
    • Clarification on replacing the advantage function (A) with Q-values and seemingly treating the value function (V) as a constant during the sample-based optimization step (both points are worked out briefly after this list).
    • The specific reasons why different sampling distributions (q) work better for continuous vs. discrete tasks in the vine method.
    • How TRPO might allow for reusing data samples more effectively than standard on-policy methods (mentioned briefly in comparison to PPO).
  • Applications: Complex control tasks in robotics, game AI (especially from high-dimensional input like pixels), potentially any sequential decision problem needing stable policy updates.
  • Connections: Directly improves upon Natural Policy Gradient; theoretically based on Conservative Policy Iteration; PPO (Proximal Policy Optimization) was developed later as a simplification of TRPO.
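
Two of the questions above can be settled on paper. The theoretical penalty coefficient grows like 1/(1-gamma)^2, so it blows up as the discount factor approaches 1; maximizing the penalized objective with that C would force a nearly zero KL step, and fixing a constraint delta instead is what permits reasonably sized, robust updates. And the A-to-Q substitution only shifts the objective by a theta-independent constant:

```latex
\sum_{a} \pi_{\theta}(a \mid s) \, A_{\theta_{\mathrm{old}}}(s, a)
  = \sum_{a} \pi_{\theta}(a \mid s) \left[ Q_{\theta_{\mathrm{old}}}(s, a) - V_{\theta_{\mathrm{old}}}(s) \right]
  = \sum_{a} \pi_{\theta}(a \mid s) \, Q_{\theta_{\mathrm{old}}}(s, a) \;-\; V_{\theta_{\mathrm{old}}}(s)
```

Since the old policy's value V(s) does not depend on theta (and the policy probabilities sum to one), dropping it leaves the maximizer over theta unchanged, which is why the sample-based objective can use Q-value estimates and treat V as a constant.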

Notes and Reflections

  • Interesting Insights:
    • The core idea of constraining the change in the policy distribution (via KL divergence) is key to stability, acting like an adaptive step size.
    • The theoretical penalty formulation (Eq. 9 with large 'C') strongly motivates the practical constraint formulation (Eq. 11/12): maximizing the penalized objective with that C would keep the KL divergence, and therefore the step size, impractically small (see the monotonic-improvement sketch at the end of these notes).
    • The appendix proofs using policy "coupling" and "disagreement probability" provide a different perspective on bounding the performance difference.
  • Lessons Learned:
    • Explicitly constraining policy updates within a "trust region" is a powerful technique for stable learning with expressive function approximators.
    • Bridging RL theory and practice often involves well-motivated approximations.
    • Understanding the derivation helps appreciate why the algorithm is designed the way it is (e.g., the KL constraint).
  • Future Directions: Algorithm simplification (leading to PPO), applying TRPO/related ideas to more complex perception and control problems, combining with model-based approaches.
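
A compact way to state the minorization-maximization reading of Eq. 9 that came up under Interesting Insights (notation as in the paper):

```latex
% Surrogate used in the monotonic-improvement argument
M_i(\pi) = L_{\pi_i}(\pi) - C \, D_{\mathrm{KL}}^{\max}(\pi_i, \pi)

% Eq. 9 gives \eta(\pi) \ge M_i(\pi) for every \pi, and M_i(\pi_i) = \eta(\pi_i)
% because the KL term vanishes at \pi = \pi_i. Hence, for any \pi_{i+1}
% that improves M_i:
\eta(\pi_{i+1}) - \eta(\pi_i) \;\ge\; M_i(\pi_{i+1}) - M_i(\pi_i) \;\ge\; 0
```

So improving the penalized surrogate at each iteration guarantees the true objective does not decrease, which is the monotonic-improvement guarantee the notes mention; the practical algorithm trades this exact guarantee for the average-KL trust region.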