Revision as of 13:57, 28 January 2025 by 2409:408c:ad1f:c2f:6a1e:9de8:7740:8285 (talk) →Group Relative Policy Optimization (GRPO): Fixed condition on the advantage.
Revision as of 13:59, 28 January 2025 by 2409:408c:ad1f:c2f:6a1e:9de8:7740:8285 (talk) →Proximal Policy Optimization (PPO): Fixed inequality on advantage function.
Line 245: \min \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) > 0 \\ \max \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) < 0 \end{cases} \right]
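For readers checking the corrected inequality, the two cases on Line 245 can be sketched per sample in Python. This is a minimal illustration and not part of the article: the function name `clipped_term`, its arguments, and the default `epsilon=0.2` are assumptions chosen for the example, and `ratio` stands for the probability ratio \pi_\theta(a|s) / \pi_{\theta_t}(a|s).

```python
def clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample clipped surrogate term, following the two cases of the formula.

    ratio     -- pi_theta(a|s) / pi_theta_t(a|s), assumed to be precomputed
    advantage -- A^{pi_theta_t}(s, a)
    epsilon   -- clipping parameter (0.2 is an assumed illustrative default)
    """
    if advantage > 0:
        # Positive advantage: the ratio is capped from above at 1 + epsilon.
        return min(ratio, 1.0 + epsilon) * advantage
    if advantage < 0:
        # Negative advantage: the ratio is floored from below at 1 - epsilon.
        return max(ratio, 1.0 - epsilon) * advantage
    # Zero advantage contributes nothing in either case.
    return 0.0


if __name__ == "__main__":
    print(clipped_term(1.5, 2.0))   # ratio above 1 + epsilon, positive advantage
    print(clipped_term(0.5, -1.0))  # ratio below 1 - epsilon, negative advantage
```

With the second case conditioned on a negative advantage, this piecewise form agrees with the commonly written expression min(r A, clip(r, 1 - epsilon, 1 + epsilon) A), where r is the probability ratio.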

Policy gradient method: Difference between revisions