At <math>\theta = \theta_t</math>, the gradient of the surrogate advantage '''<math display="inline">\nabla_\theta L(\theta_t, \theta)</math>''' equals the policy gradient derived from the advantage function:
<math display="block">\nabla_\theta J(\theta_t) = \mathbb{E}_{\pi_{\theta_t}}\left[\sum_{j=0}^T \nabla_\theta \ln \pi_\theta(A_j \mid S_j)\Big|_{\theta = \theta_t} \cdot A^{\pi_{\theta_t}}(S_j, A_j) \,\Big|\, S_0 = s_0\right] = \nabla_\theta L(\theta_t, \theta)\Big|_{\theta = \theta_t}</math>
As with the natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and the KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">L(\theta_t, \theta) \approx g^T (\theta - \theta_t), \qquad \bar{D}_{KL}(\pi_{\theta_t} \| \pi_\theta) \approx \frac{1}{2} (\theta - \theta_t)^T H (\theta - \theta_t)</math>where <math>g = \nabla_\theta L(\theta_t, \theta)\big|_{\theta = \theta_t}</math> and <math>H</math> is the Hessian of the KL divergence with respect to <math>\theta</math>, evaluated at <math>\theta_t</math>.
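Under these Taylor approximations, the constrained step has a closed form: maximizing <math>g^T d</math> subject to <math>\tfrac{1}{2} d^T H d \le \delta</math> gives a step along the natural-gradient direction <math>H^{-1} g</math>, scaled to saturate the trust region. A minimal numeric sketch (the vectors <code>g</code>, matrix <code>H</code>, and radius <code>delta</code> are made-up toy values, not outputs of any real policy network):

```python
import numpy as np

# Toy TRPO step under the quadratic trust-region model:
#   maximize  g^T d   subject to  1/2 d^T H d <= delta
g = np.array([0.5, -1.0])                  # gradient of the surrogate at theta_t
H = np.array([[2.0, 0.3], [0.3, 1.0]])     # Hessian of the KL at theta_t (positive definite)
delta = 0.01                               # trust-region radius

# Natural-gradient direction H^-1 g (in practice solved by conjugate gradient
# to avoid forming H explicitly; here we solve the small system directly).
x = np.linalg.solve(H, g)

# Scale so the step exactly saturates the KL constraint:
#   d = sqrt(2 delta / (g^T H^-1 g)) * H^-1 g
step = np.sqrt(2.0 * delta / (g @ x)) * x

theta_t = np.zeros(2)
theta_next = theta_t + step

# By construction, 1/2 step^T H step == delta
kl_quad = 0.5 * step @ H @ step
print(theta_next, kl_quad)
```

The scaling factor follows from substituting <math>d = c\, H^{-1} g</math> into the constraint and solving for <math>c</math>; full TRPO then backtracks along this step with a line search on the exact surrogate and KL.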