This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{g^T F(\theta_t)^{-1} g}} \, F(\theta_t)^{-1} g
</math>where <math>g = \nabla_\theta J(\theta_t)</math> is the policy gradient, <math>F(\theta_t)</math> is the Fisher information matrix, and <math>\epsilon</math> is the trust-region size. So far, this is essentially the same as the natural gradient method. However, TRPO improves upon it with two modifications:
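As an illustration of the update above, the following sketch works through a toy two-parameter problem where the Fisher matrix is small enough to invert directly (the matrix and gradient values are invented for illustration; in practice <math>F</math> is never formed explicitly):

```python
import numpy as np

# Illustrative stand-ins, not from any real policy: a 2-parameter problem.
F = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # Fisher information matrix (symmetric positive definite)
g = np.array([1.0, 2.0])        # policy gradient at theta_t
epsilon = 0.01                  # trust-region size (bound on the KL divergence)

x = np.linalg.solve(F, g)       # natural gradient direction F^{-1} g
step = np.sqrt(2 * epsilon / (g @ x)) * x
theta_t = np.zeros(2)
theta_next = theta_t + step
# By construction, the quadratic KL estimate 0.5 * step^T F step equals epsilon,
# i.e. the update lands exactly on the trust-region boundary.
```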
* Use the [[conjugate gradient method]] to solve for <math>x</math> in <math>Fx = g</math> iteratively, without explicitly inverting the matrix <math>F</math>.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure both the KL constraint and policy improvement, by repeatedly trying <math display="block">
\theta_{t+1} = \theta_t + \alpha^j \sqrt{\frac{2\epsilon}{x^T F(\theta_t)\, x}} \, x \qquad \text{for } j = 0, 1, 2, \dots
</math>until one is found that both satisfies the KL constraint and results in an improvement in <math>
L(\theta_t, \theta)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
A further improvement is [[proximal policy optimization]] (PPO), which avoids even computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> via a first-order approximation using clipped probability ratios.<ref name=":0" />
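The clipped probability ratio at the heart of PPO can be sketched as follows (the ratio and advantage values are invented for illustration; <code>eps = 0.2</code> is a commonly used clipping range):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

# ratio = pi_theta(a|s) / pi_theta_old(a|s) for sampled state-action pairs
ratios = np.array([0.5, 1.0, 1.5])
advantages = np.array([1.0, 1.0, 1.0])
obj = ppo_clip_objective(ratios, advantages)
# Ratios outside [0.8, 1.2] contribute only their clipped value, which removes
# the incentive to move the policy far from pi_theta_old in a single update.
```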