Policy gradient method
</math>''' equals the policy gradient derived from the advantage function:
<math display="block">
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{j=0}^T \nabla_\theta \ln \pi_\theta(A_j | S_j) \cdot A^{\pi_\theta}(S_j, A_j) \Big| S_0 = s_0 \right] = \nabla_\theta L(\theta, \theta_t)
</math>However, when <math>\theta \neq \theta_t</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
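The equality of the two gradients at <math>\theta = \theta_t</math> can be checked numerically. The following sketch (a hypothetical single-state softmax policy with made-up advantage values, not part of the article) compares the exact policy gradient against a finite-difference gradient of the importance-sampling surrogate, evaluated at <math>\theta = \theta_t</math>:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

rng = np.random.default_rng(0)
theta_t = rng.normal(size=4)   # current policy parameters (illustrative)
A = rng.normal(size=4)         # hypothetical advantage estimates per action

def surrogate(theta):
    # L(theta, theta_t) = E_{a ~ pi_{theta_t}}[ pi_theta(a)/pi_{theta_t}(a) * A(a) ]
    pi_t = softmax(theta_t)
    return np.sum(pi_t * (softmax(theta) / pi_t) * A)

# Exact policy gradient at theta_t:
#   sum_a pi(a) * grad log pi(a) * A(a), using
#   d pi_a / d theta_b = pi_a * (delta_ab - pi_b) for softmax.
pi = softmax(theta_t)
policy_grad = pi * A - pi * np.dot(pi, A)

# Central finite-difference gradient of the surrogate at theta = theta_t.
eps = 1e-6
fd_grad = np.array([
    (surrogate(theta_t + eps * e) - surrogate(theta_t - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

# The two gradients agree at theta = theta_t, as the text claims.
print(np.allclose(policy_grad, fd_grad, atol=1e-5))
```

Away from <math>\theta_t</math>, the two objectives diverge, which is exactly why the surrogate is only trusted within a small region around the current policy.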

As with the natural policy gradient method, for small policy updates, TRPO approximates the surrogate advantage and the KL divergence by Taylor expansion around <math>\theta_t</math>:<math display="block">
\begin{aligned}
L(\theta, \theta_t) &\approx g^T (\theta - \theta_t), \\