Policy gradient method

{{Anchor|TRPO}}
 
'''Trust Region Policy Optimization''' (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a [[trust region]] constraint on policy updates.<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> Developed by Schulman et al. in 2015, TRPO ensures stable policy improvements by limiting the KL divergence between successive policies, addressing key challenges of the natural policy gradient method.
 
TRPO builds on the natural policy gradient by incorporating a trust region constraint. Natural gradient descent is theoretically optimal ''if'' the objective is truly a quadratic function, but this is only an approximation. TRPO's line search and KL constraint attempt to restrict the solution to a "trust region" in which this approximation does not break down. This makes TRPO more robust in practice, particularly for high-dimensional policies.
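The backtracking line search described above can be illustrated with a minimal sketch. This is not the full TRPO algorithm (no conjugate-gradient solve for the natural gradient direction); it only shows the acceptance test: a proposed step is repeatedly halved until the KL divergence between the old and new policies stays within the trust region bound <code>delta</code> ''and'' the surrogate objective improves. The toy categorical policy, the advantage values, and the function names are illustrative assumptions, not part of the original method's reference code.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over action logits.
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def kl(p, q):
    # KL divergence KL(p || q) between two categorical distributions.
    return float(np.sum(p * np.log(p / q)))

def trpo_line_search(theta, full_step, surrogate, kl_between, delta,
                     max_backtracks=10):
    """Backtracking line search: accept the largest halved step whose
    KL divergence stays within the trust region (<= delta) and whose
    surrogate objective improves; otherwise keep the old parameters."""
    old_obj = surrogate(theta)
    for k in range(max_backtracks):
        new_theta = theta + (0.5 ** k) * full_step
        if kl_between(theta, new_theta) <= delta and surrogate(new_theta) > old_obj:
            return new_theta
        # Step left the trust region or did not improve: shrink and retry.
    return theta  # no acceptable step found; fall back to old policy

# Toy example: a 3-action categorical policy with fixed advantages
# (hypothetical numbers, for illustration only).
adv = np.array([1.0, -0.5, 0.2])
surrogate = lambda th: float(softmax(th) @ adv)          # expected advantage
kl_between = lambda a, b: kl(softmax(a), softmax(b))
theta0 = np.zeros(3)
theta1 = trpo_line_search(theta0, full_step=np.array([2.0, -1.0, 0.0]),
                          surrogate=surrogate, kl_between=kl_between,
                          delta=0.01)
```

The full proposed step here would move the policy far outside the trust region, so the search halves it several times before accepting; the accepted update satisfies both the KL bound and monotone improvement of the surrogate, which is exactly the robustness property the constraint is meant to provide.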
 
=== Formulation ===