Policy gradient method: Difference between revisions

\nabla_\theta L(\theta, \theta_t)
</math>''' equals the policy gradient derived from the advantage function:
<math display="block">
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{j=0}^T \nabla_\theta \ln \pi_\theta(A_j | S_j) \cdot A^{\pi_\theta}(S_j, A_j) \Big| S_0 = s_0 \right] = \nabla_\theta L(\theta, \theta_t)
</math>However, when <math>\theta \neq \theta_t</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
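The identity above, and its failure away from <math>\theta_t</math>, can be checked numerically. The following is a minimal sketch on a hypothetical two-step MDP (all names and reward values are illustrative, not from the article): finite-difference gradients of the true objective <math>J</math> and of the surrogate <math>L(\cdot, \theta_t)</math> agree at <math>\theta = \theta_t</math> but diverge elsewhere, because the surrogate freezes the state-visitation distribution and advantages at the old policy.

```python
import numpy as np

# Toy two-step MDP (hypothetical): from s0, action a gives reward 0 and moves
# to s1 (a=0) or s2 (a=1); there a second action earns a reward and the episode
# ends. theta holds one softmax logit per state: [theta_s0, theta_s1, theta_s2].
R = {1: np.array([1.0, 0.0]), 2: np.array([0.2, 0.8])}

def pi(logit):
    """Two-action softmax probabilities for logits [logit, 0]."""
    p0 = 1.0 / (1.0 + np.exp(-logit))
    return np.array([p0, 1.0 - p0])

def J(theta):
    """True objective: expected total reward under pi_theta."""
    v = np.array([pi(theta[1]) @ R[1], pi(theta[2]) @ R[2]])
    return pi(theta[0]) @ v

def L(theta, theta_t):
    """Surrogate: new-policy action probabilities weighted by the OLD policy's
    state-visitation distribution and advantages A^{pi_{theta_t}}."""
    v = np.array([pi(theta_t[1]) @ R[1], pi(theta_t[2]) @ R[2]])
    p0_old = pi(theta_t[0])
    adv = {0: v - p0_old @ v,                   # A(s0, a) = V(s_{a+1}) - V(s0)
           1: R[1] - pi(theta_t[1]) @ R[1],
           2: R[2] - pi(theta_t[2]) @ R[2]}
    d = {0: 1.0, 1: p0_old[0], 2: p0_old[1]}    # old state-visitation weights
    return sum(d[s] * (pi(theta[s]) @ adv[s]) for s in (0, 1, 2))

def grad(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

theta_t = np.array([0.3, -0.5, 0.7])
# Gradients agree at theta = theta_t ...
same = np.allclose(grad(J, theta_t), grad(lambda th: L(th, theta_t), theta_t))
# ... but not at a distant theta, where the frozen visitation weights are stale.
theta_far = theta_t + 1.5
diff = not np.allclose(grad(J, theta_far),
                       grad(lambda th: L(th, theta_t), theta_far))
print(same, diff)
```

Running this prints `True True`: the surrogate gradient matches the policy gradient exactly at <math>\theta_t</math>, motivating TRPO's trust region, which keeps updates close enough to <math>\theta_t</math> that the approximation remains useful.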
 
As with the natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">