Policy gradient method: Difference between revisions

Content deleted Content added
Line 173:
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t)\\
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}</math>where the KL divergence between the two policies is '''averaged''' over the state distribution induced by the policy <math>\pi_{\theta_t}</math>. That is,<math display="block">\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) := \mathbb E_{s \sim \pi_{\theta_t}}[D_{KL}( \pi_{\theta_{t+1}}(\cdot | s) \| \pi_{\theta_{t}}(\cdot | s) )]</math>This constraint makes the update invariant to invertible affine transformations of the parameter space.
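The averaged KL constraint can be checked numerically once both policies and a batch of states sampled under the old policy are available. Below is a minimal Python sketch for discrete-action softmax policies; the linear parameterization, feature dimension, batch size, and threshold <math>\epsilon</math> are illustrative assumptions, not part of the method's specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    """Row-wise softmax, numerically stabilized by subtracting the max logit."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def averaged_kl(p_new, p_old):
    """D_KL(pi_new(.|s) || pi_old(.|s)) per state, averaged over the batch.

    The batch of states stands in for the expectation over the state
    distribution of the old policy pi_{theta_t}.
    """
    kl = np.sum(p_new * (np.log(p_new) - np.log(p_old)), axis=-1)
    return kl.mean()

# Hypothetical setup: 4 actions, 3-dimensional state features,
# 128 states assumed to be sampled while following the old policy.
theta_old = rng.normal(size=(4, 3))                       # old parameters
theta_new = theta_old + 0.01 * rng.normal(size=(4, 3))    # candidate update
states = rng.normal(size=(128, 3))

p_old = softmax(states @ theta_old.T)   # pi_{theta_t}(.|s) for each state
p_new = softmax(states @ theta_new.T)   # pi_{theta_{t+1}}(.|s)

d_kl = averaged_kl(p_new, p_old)
epsilon = 0.01
feasible = d_kl <= epsilon   # is the candidate inside the trust region?
```

A small parameter perturbation yields a small averaged KL; in a trust-region update the step size would be scaled so that `feasible` holds.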
 
=== Fisher information approximation ===