Policy gradient method: Difference between revisions

\nabla_\theta L(\theta, \theta_t)
</math>''' equals the policy gradient derived from the advantage function:
<math display="block">
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{j=0}^T \nabla_\theta \ln \pi_\theta(A_j | S_j) \cdot A^{\pi_\theta}(S_j, A_j) \Big| S_0 = s_0 \right] = \nabla_\theta L(\theta, \theta_t)
</math>However, when <math>\theta \neq \theta_t</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
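The identity above, and its failure away from <math>\theta_t</math>, can be checked numerically. The following is a minimal sketch on a hypothetical two-step MDP (all names and reward values are illustrative, not from the article): finite-difference gradients of the true objective <math>J</math> and of the surrogate <math>L(\cdot, \theta_t)</math> agree at <math>\theta = \theta_t</math> but diverge elsewhere, because the surrogate freezes the state-visitation distribution and advantages at the old policy.

```python
import numpy as np

# Toy two-step MDP (hypothetical): from s0, action a gives reward 0 and moves
# to s1 (a=0) or s2 (a=1); there a second action earns a reward and the episode
# ends. theta holds one softmax logit per state: [theta_s0, theta_s1, theta_s2].
R = {1: np.array([1.0, 0.0]), 2: np.array([0.2, 0.8])}

def pi(logit):
    """Two-action softmax probabilities for logits [logit, 0]."""
    p0 = 1.0 / (1.0 + np.exp(-logit))
    return np.array([p0, 1.0 - p0])

def J(theta):
    """True objective: expected total reward under pi_theta."""
    v = np.array([pi(theta[1]) @ R[1], pi(theta[2]) @ R[2]])
    return pi(theta[0]) @ v

def L(theta, theta_t):
    """Surrogate: new-policy action probabilities weighted by the OLD policy's
    state-visitation distribution and advantages A^{pi_{theta_t}}."""
    v = np.array([pi(theta_t[1]) @ R[1], pi(theta_t[2]) @ R[2]])
    p0_old = pi(theta_t[0])
    adv = {0: v - p0_old @ v,                   # A(s0, a) = V(s_{a+1}) - V(s0)
           1: R[1] - pi(theta_t[1]) @ R[1],
           2: R[2] - pi(theta_t[2]) @ R[2]}
    d = {0: 1.0, 1: p0_old[0], 2: p0_old[1]}    # old state-visitation weights
    return sum(d[s] * (pi(theta[s]) @ adv[s]) for s in (0, 1, 2))

def grad(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

theta_t = np.array([0.3, -0.5, 0.7])
# Gradients agree at theta = theta_t ...
same = np.allclose(grad(J, theta_t), grad(lambda th: L(th, theta_t), theta_t))
# ... but not at a distant theta, where the frozen visitation weights are stale.
theta_far = theta_t + 1.5
diff = not np.allclose(grad(J, theta_far),
                       grad(lambda th: L(th, theta_t), theta_far))
print(same, diff)
```

Running this prints `True True`: the surrogate gradient matches the policy gradient exactly at <math>\theta_t</math>, motivating TRPO's trust region, which keeps updates close enough to <math>\theta_t</math> that the approximation remains useful.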
 
As with the natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">