Policy gradient method
{{Anchor|TRPO}}
 
'''Trust Region Policy Optimization''' (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a [[trust region]] constraint on policy updates.<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> Developed by Schulman et al. in 2015, TRPO ensures stable policy improvements by limiting the KL divergence between successive policies, addressing key challenges in natural policy gradient methods.
 
TRPO builds on the natural policy gradient by incorporating a trust region constraint. The natural gradient provides a theoretically optimal direction under the assumption that the objective is truly a quadratic function, but this is only an approximation. TRPO attempts to restrict the solution to within a "trust region" in which this approximation does not break down. This makes TRPO more robust in practice, particularly for high-dimensional policies.
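The trust-region idea can be illustrated with a minimal sketch on a toy softmax policy: compute the natural gradient direction, scale it so the quadratic KL approximation predicts a divergence of at most δ, then backtrack until the exact KL constraint holds and the surrogate objective actually improves. This is an illustrative simplification, not the paper's implementation (real TRPO estimates the Fisher matrix from samples and uses conjugate gradients rather than an explicit pseudo-inverse); the step size `delta` and the toy advantages are assumptions.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def kl(p, q):
    # KL divergence between two categorical distributions
    return float(np.sum(p * np.log(p / q)))

def surrogate(theta, advantages):
    # Surrogate objective for a single state: E_{a ~ pi_theta}[A(a)]
    return float(np.dot(softmax(theta), advantages))

def trpo_step(theta, advantages, delta=0.01, backtrack=0.8, max_iters=20):
    old_probs = softmax(theta)
    # Gradient of the surrogate: pi(a) * (A(a) - E[A])
    g = old_probs * (advantages - np.dot(old_probs, advantages))
    # Exact Fisher information matrix of a softmax policy: diag(p) - p p^T
    F = np.diag(old_probs) - np.outer(old_probs, old_probs)
    # Natural gradient direction (pseudo-inverse, since F is singular)
    s = np.linalg.pinv(F) @ g
    # Largest step for which the quadratic KL approximation stays within delta
    step = np.sqrt(2.0 * delta / (s @ F @ s + 1e-12)) * s
    old_L = surrogate(theta, advantages)
    # Backtracking line search: shrink until the *exact* KL constraint holds
    # and the surrogate objective improves (this is what guards against
    # the quadratic approximation breaking down)
    for i in range(max_iters):
        theta_new = theta + (backtrack ** i) * step
        if (kl(old_probs, softmax(theta_new)) <= delta
                and surrogate(theta_new, advantages) > old_L):
            return theta_new
    return theta  # no acceptable step found; keep the old policy
```

The backtracking loop is what distinguishes this from a plain natural-gradient step: even if the quadratic model of the KL divergence is inaccurate, the accepted update is guaranteed to satisfy the exact constraint and to improve the surrogate.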
 
=== Formulation ===