Revision as of 00:55, 28 January 2025 by Cosmia Nebula. →Proximal Policy Optimization (PPO): kl penalty estimator. Tag: Visual edit
Revision as of 13:57, 28 January 2025 by 2409:408c:ad1f:c2f:6a1e:9de8:7740:8285. →Group Relative Policy Optimization (GRPO): Fixed condition on the advantage. Tags: Mobile edit, Mobile web edit
Line 328: \min \left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_t}(a_i|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a_i) & \text{ if } A^{\pi_{\theta_t}}(s, a_i) > 0 \\ \max \left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_t}(a_i|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a_i) & \text{ if } A^{\pi_{\theta_t}}(s, a_i) < 0 \end{cases} \right]
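The effect of the revised condition can be checked numerically. The following is a minimal sketch, not part of the article: the function names `clipped_term` and `clipped_term_min_form`, the default `eps = 0.2`, and the handling of a zero advantage are all assumptions made for illustration. It evaluates the two cases for a single sample, with `ratio` standing for \pi_\theta(a_i|s) / \pi_{\theta_t}(a_i|s), and compares them against the commonly used single-expression form min(r·A, clip(r, 1 − ε, 1 + ε)·A).

```python
def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Case-wise clipped term for one sample (eps = 0.2 is an assumed default)."""
    if advantage > 0:
        # Positive advantage: the ratio is capped from above at 1 + eps.
        return min(ratio, 1.0 + eps) * advantage
    if advantage < 0:
        # Negative advantage: the ratio is floored from below at 1 - eps.
        return max(ratio, 1.0 - eps) * advantage
    # Assumption: a zero advantage contributes nothing under either case.
    return 0.0


def clipped_term_min_form(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Equivalent single-expression form: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Print both forms side by side for a few (ratio, advantage) pairs.
    for r, a in [(1.5, 2.0), (0.5, 2.0), (1.5, -2.0), (0.5, -2.0)]:
        print(r, a, clipped_term(r, a), clipped_term_min_form(r, a))
```

With the condition on the second case written as "> 0", both branches would apply to the same samples and negative advantages would be left undefined; with "< 0", the two forms agree on every pair in the loop.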

Policy gradient method: Difference between revisions