Policy gradient method

{{Anchor|TRPO}}
 
'''Trust Region Policy Optimization''' (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a [[trust region]] constraint on policy updates.<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> Developed by Schulman et al. in 2015, TRPO ensures stable policy improvements by limiting the KL divergence between successive policies, addressing key challenges of the natural policy gradient method.
 
TRPO builds on the natural policy gradient by incorporating a trust region constraint. Natural gradient descent is theoretically optimal ''if'' the objective is truly a quadratic function, but this is only an approximation. TRPO's line search and KL constraint attempt to restrict the solution to a "trust region" in which this approximation does not break down. This makes TRPO more robust in practice, particularly for high-dimensional policies.
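The backtracking line search described above can be illustrated with a minimal sketch. This is not the full TRPO algorithm (no conjugate-gradient solve for the natural gradient direction); it only shows the acceptance test: a proposed step is repeatedly halved until the KL divergence between the old and new policies stays within the trust region bound <code>delta</code> ''and'' the surrogate objective improves. The toy categorical policy, the advantage values, and the function names are illustrative assumptions, not part of the original method's reference code.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over action logits.
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def kl(p, q):
    # KL divergence KL(p || q) between two categorical distributions.
    return float(np.sum(p * np.log(p / q)))

def trpo_line_search(theta, full_step, surrogate, kl_between, delta,
                     max_backtracks=10):
    """Backtracking line search: accept the largest halved step whose
    KL divergence stays within the trust region (<= delta) and whose
    surrogate objective improves; otherwise keep the old parameters."""
    old_obj = surrogate(theta)
    for k in range(max_backtracks):
        new_theta = theta + (0.5 ** k) * full_step
        if kl_between(theta, new_theta) <= delta and surrogate(new_theta) > old_obj:
            return new_theta
        # Step left the trust region or did not improve: shrink and retry.
    return theta  # no acceptable step found; fall back to old policy

# Toy example: a 3-action categorical policy with fixed advantages
# (hypothetical numbers, for illustration only).
adv = np.array([1.0, -0.5, 0.2])
surrogate = lambda th: float(softmax(th) @ adv)          # expected advantage
kl_between = lambda a, b: kl(softmax(a), softmax(b))
theta0 = np.zeros(3)
theta1 = trpo_line_search(theta0, full_step=np.array([2.0, -1.0, 0.0]),
                          surrogate=surrogate, kl_between=kl_between,
                          delta=0.01)
```

The full proposed step here would move the policy far outside the trust region, so the search halves it several times before accepting; the accepted update satisfies both the KL bound and monotone improvement of the surrogate, which is exactly the robustness property the constraint is meant to provide.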
 
=== Formulation ===