Policy gradient method: Difference between revisions

Content deleted Content added
Group Relative Policy Optimization (GRPO): Fixed condition on the advantage.
Tags: Mobile edit Mobile web edit
Line 328:
\min \left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_t}(a_i|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a_i) & \text{ if } A^{\pi_{\theta_t}}(s, a_i) > 0
\\
\max \left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_t}(a_i|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a_i) & \text{ if } A^{\pi_{\theta_t}}(s, a_i) >< 0
\end{cases}
\right]