Policy gradient method
{{Anchor|TRPO}}
 
'''Trust Region Policy Optimization''' (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a [[trust region]] constraint on policy updates.<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> Developed by Schulman et al. in 2015, TRPO ensures stable policy improvements by limiting the KL divergence between successive policies, addressing key challenges in natural policy gradient methods.
 
TRPO builds on the natural policy gradient by incorporating a trust region constraint. The natural gradient provides a theoretically optimal direction under the assumption that the objective is truly a quadratic function, but this is only an approximation. TRPO attempts to restrict the solution to within a "trust region" in which this approximation does not break down. This makes TRPO more robust in practice, particularly for high-dimensional policies.
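The trust-region idea can be illustrated with a minimal sketch on a toy softmax policy: compute the natural gradient direction, scale it so the quadratic KL approximation predicts a divergence of at most δ, then backtrack until the exact KL constraint holds and the surrogate objective actually improves. This is an illustrative simplification, not the paper's implementation (real TRPO estimates the Fisher matrix from samples and uses conjugate gradients rather than an explicit pseudo-inverse); the step size `delta` and the toy advantages are assumptions.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def kl(p, q):
    # KL divergence between two categorical distributions
    return float(np.sum(p * np.log(p / q)))

def surrogate(theta, advantages):
    # Surrogate objective for a single state: E_{a ~ pi_theta}[A(a)]
    return float(np.dot(softmax(theta), advantages))

def trpo_step(theta, advantages, delta=0.01, backtrack=0.8, max_iters=20):
    old_probs = softmax(theta)
    # Gradient of the surrogate: pi(a) * (A(a) - E[A])
    g = old_probs * (advantages - np.dot(old_probs, advantages))
    # Exact Fisher information matrix of a softmax policy: diag(p) - p p^T
    F = np.diag(old_probs) - np.outer(old_probs, old_probs)
    # Natural gradient direction (pseudo-inverse, since F is singular)
    s = np.linalg.pinv(F) @ g
    # Largest step for which the quadratic KL approximation stays within delta
    step = np.sqrt(2.0 * delta / (s @ F @ s + 1e-12)) * s
    old_L = surrogate(theta, advantages)
    # Backtracking line search: shrink until the *exact* KL constraint holds
    # and the surrogate objective improves (this is what guards against
    # the quadratic approximation breaking down)
    for i in range(max_iters):
        theta_new = theta + (backtrack ** i) * step
        if (kl(old_probs, softmax(theta_new)) <= delta
                and surrogate(theta_new, advantages) > old_L):
            return theta_new
    return theta  # no acceptable step found; keep the old policy
```

The backtracking loop is what distinguishes this from a plain natural-gradient step: even if the quadratic model of the KL divergence is inaccurate, the accepted update is guaranteed to satisfy the exact constraint and to improve the surrogate.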
 
=== Formulation ===