Policy gradient method
</math>''' equals the policy gradient derived from the advantage function:
<math display="block">
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{j=0}^T \nabla_\theta \ln \pi_\theta(A_j | S_j) \cdot A^{\pi_\theta}(S_j, A_j) \Big| S_0 = s_0 \right] = \nabla_\theta L(\theta, \theta_t)
</math>However, when <math>\theta \neq \theta_t</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
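The equality of the two gradients at <math>\theta = \theta_t</math> can be checked numerically. The following sketch (a hypothetical single-state softmax policy with made-up advantage values, not part of the article) compares the exact policy gradient against a finite-difference gradient of the importance-sampling surrogate, evaluated at <math>\theta = \theta_t</math>:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

rng = np.random.default_rng(0)
theta_t = rng.normal(size=4)   # current policy parameters (illustrative)
A = rng.normal(size=4)         # hypothetical advantage estimates per action

def surrogate(theta):
    # L(theta, theta_t) = E_{a ~ pi_{theta_t}}[ pi_theta(a)/pi_{theta_t}(a) * A(a) ]
    pi_t = softmax(theta_t)
    return np.sum(pi_t * (softmax(theta) / pi_t) * A)

# Exact policy gradient at theta_t:
#   sum_a pi(a) * grad log pi(a) * A(a), using
#   d pi_a / d theta_b = pi_a * (delta_ab - pi_b) for softmax.
pi = softmax(theta_t)
policy_grad = pi * A - pi * np.dot(pi, A)

# Central finite-difference gradient of the surrogate at theta = theta_t.
eps = 1e-6
fd_grad = np.array([
    (surrogate(theta_t + eps * e) - surrogate(theta_t - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

# The two gradients agree at theta = theta_t, as the text claims.
print(np.allclose(policy_grad, fd_grad, atol=1e-5))
```

Away from <math>\theta_t</math>, the two objectives diverge, which is exactly why the surrogate is only trusted within a small region around the current policy.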

As with the natural policy gradient method, for small policy updates, TRPO approximates the surrogate advantage and the KL divergence by Taylor expansion around <math>\theta_t</math>:<math display="block">
\begin{aligned}
L(\theta, \theta_t) &\approx g^T (\theta - \theta_t), \\