→Group Relative Policy Optimization (GRPO): Fixed condition on the advantage.
→Proximal Policy Optimization (PPO): Fixed Inequality on Advantage Function
Line 245:
\min \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) > 0
\\
\max \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) < 0
\end{cases}
\right]
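As a rough illustration of the two cases in the hunk above, the per-sample term can be sketched in plain Python. The names `ppo_clip_term`, `ppo_clip_term_min_form`, `ratio`, `advantage`, and the default `epsilon=0.2` are assumptions made for this sketch and do not come from the article. `ratio` stands for \pi_\theta(a|s) / \pi_{\theta_t}(a|s), and the second function is the commonly used min/clip form, included only to check that it agrees with the case split.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Per-sample clipped term, written as the two cases on the advantage sign.

    ratio     -- pi_theta(a|s) / pi_theta_t(a|s)  (hypothetical argument name)
    advantage -- A^{pi_theta_t}(s, a)             (hypothetical argument name)
    epsilon   -- clipping width; 0.2 is an assumed default, not from the source
    """
    if advantage > 0:
        # Positive advantage: the ratio is capped from above at 1 + epsilon.
        return min(ratio, 1.0 + epsilon) * advantage
    if advantage < 0:
        # Negative advantage: the ratio is floored from below at 1 - epsilon.
        return max(ratio, 1.0 - epsilon) * advantage
    # Zero advantage contributes nothing under either case.
    return 0.0


def ppo_clip_term_min_form(ratio, advantage, epsilon=0.2):
    """Equivalent min/clip form: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)
```

Calling both functions with the same `ratio`, `advantage`, and `epsilon` should give the same value, which is a quick way to see why the inequality on the advantage matters: swapping the two conditions would cap the ratio on the wrong side.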