→Proximal Policy Optimization (PPO): KL penalty estimator
→Group Relative Policy Optimization (GRPO): Fixed condition on the advantage.
Line 328:
\min \left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_t}(a_i|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a_i) & \text{ if } A^{\pi_{\theta_t}}(s, a_i) > 0
\\
\max \left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_t}(a_i|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a_i) & \text{ if } A^{\pi_{\theta_t}}(s, a_i) < 0
\end{cases}
\right]
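The two-case expression in this hunk can be checked numerically. The following is a minimal sketch, not taken from the article: the function name `clipped_term`, its parameter names, and the default `epsilon = 0.2` are all illustrative assumptions. Here `ratio` stands for \pi_\theta(a_i|s) / \pi_{\theta_t}(a_i|s) and `advantage` for A^{\pi_{\theta_t}}(s, a_i).

```python
def clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample clipped term, following the two-case form in the formula.

    ratio     -- probability ratio pi_theta / pi_theta_t (hypothetical name)
    advantage -- advantage estimate for the sampled action (hypothetical name)
    epsilon   -- clipping width; 0.2 is an assumed default, not from the source
    """
    if advantage > 0:
        # Positive advantage: the ratio is capped from above at 1 + epsilon.
        return min(ratio, 1.0 + epsilon) * advantage
    if advantage < 0:
        # Negative advantage: the ratio is floored from below at 1 - epsilon.
        return max(ratio, 1.0 - epsilon) * advantage
    # Zero advantage: the term vanishes whichever case is used.
    return 0.0
```

With `epsilon = 0.2`, a ratio of 1.5 and an advantage of 2.0 give a ratio capped at 1.2, while a ratio of 0.5 and an advantage of -1.0 give a ratio floored at 0.8. Ratios already inside the interval [0.8, 1.2] pass through unchanged.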