</math>As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">
\begin{aligned}
\mathcal{L}(\theta_t, \theta) &\approx g^T (\theta - \theta_t), \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) &\approx \frac{1}{2} (\theta - \theta_t)^T H (\theta - \theta_t),
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta \mathcal{L}(\theta_t, \theta) \big|_{\theta = \theta_t}</math> is the policy gradient.
* <math>H = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) \big|_{\theta = \theta_t}</math> is the Hessian of the average KL divergence, which equals the [[Fisher information matrix]] of the policy.
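Given <math>g</math> and <math>H</math>, the constrained problem — maximize <math>g^T (\theta - \theta_t)</math> subject to <math>\frac{1}{2} (\theta - \theta_t)^T H (\theta - \theta_t) \le \delta</math> — has a closed-form solution that uses the full KL budget exactly. A minimal NumPy sketch (the matrices here are arbitrary stand-ins; in practice <math>g</math> and <math>H</math> are estimated from sampled trajectories):

```python
import numpy as np

# Toy stand-ins for the policy gradient g and the KL Hessian H
# (arbitrary values for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4 * np.eye(4)   # symmetric positive definite
g = rng.standard_normal(4)
delta = 0.01                  # trust-region size

# Closed-form maximizer of g^T d subject to (1/2) d^T H d <= delta
x = np.linalg.solve(H, g)                 # x = H^{-1} g
step = np.sqrt(2 * delta / (g @ x)) * x   # natural policy gradient step

# The quadratic KL approximation is met with equality at the optimum.
print(np.isclose(0.5 * step @ H @ step, delta))  # True
```

The step direction is <math>H^{-1} g</math>; the scalar in front rescales it so the quadratic KL term equals <math>\delta</math>.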
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\delta}{g^T H^{-1} g}} H^{-1} g
</math>So far, this is essentially the same as the natural policy gradient method. However, TRPO improves on it with two modifications:
* Use the [[conjugate gradient method]] to solve for <math>x = H^{-1} g</math> in <math>H x = g</math>, which requires only Hessian-vector products <math>H v</math> and avoids forming or inverting <math>H</math> explicitly.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it shrinks the step size until the KL constraint holds and the policy improves:<math display="block">
\theta_{t+1} = \theta_t + \alpha^j \sqrt{\frac{2\delta}{x^T H x}} \, x
</math>where <math>\alpha \in (0, 1)</math> is a backtracking coefficient, and <math>j</math> is the smallest nonnegative integer for which the KL constraint is satisfied and the surrogate advantage is positive.
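Putting the two modifications together, one TRPO-style update can be sketched as below. This is a toy illustration on a synthetic quadratic model: the names `fisher_vector_product` and `surrogate` and all numeric values are invented for the example, and a real implementation would compute Hessian-vector products by differentiating the KL divergence rather than forming <math>H</math>:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
H = A @ A.T + 6 * np.eye(6)      # stand-in for the KL Hessian (Fisher matrix)
g = rng.standard_normal(6)       # stand-in for the policy gradient
delta, alpha = 0.01, 0.8         # trust-region size, backtracking coefficient

def fisher_vector_product(v):
    # In TRPO this is obtained without forming H, via automatic
    # differentiation of the KL divergence; here H is explicit.
    return H @ v

def conjugate_gradient(fvp, b, iters=20, tol=1e-12):
    # Solve H x = b using only Hessian-vector products.
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = fvp(p)
        a = rs / (p @ Hp)
        x += a * p
        r -= a * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def surrogate(step):
    # Quadratic stand-in for the surrogate advantage L(theta_t, theta).
    return g @ step - 0.5 * step @ H @ step

x = conjugate_gradient(fisher_vector_product, g)        # x ~= H^{-1} g
full_step = np.sqrt(2 * delta / (x @ fisher_vector_product(x))) * x

# Backtracking line search: shrink by alpha until the KL constraint
# holds and the surrogate advantage is positive.
for j in range(10):
    step = alpha**j * full_step
    kl = 0.5 * step @ fisher_vector_product(step)
    if kl <= delta and surrogate(step) > 0:
        break

# theta_{t+1} = theta_t + step
```

The conjugate gradient solve replaces the explicit inverse in the natural-gradient formula, and the line search guards against the quadratic approximations being inaccurate for the full step.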