Policy gradient method: Difference between revisions

Group Relative Policy Optimization (GRPO): Fixed condition on the advantage.
Tags: Mobile edit Mobile web edit
Proximal Policy Optimization (PPO): Fixed Inequality on Advantage Function
Tags: Mobile edit Mobile web edit
Line 245:
\min \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) > 0
\\
\max \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) < 0
\end{cases}
\right]
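
The two cases above can be checked numerically. The following is a minimal sketch, not taken from the article: the function names `ppo_term_cases` and `ppo_term_clipped` and the default `eps=0.2` are illustrative assumptions. It evaluates the per-sample term inside the expectation, once as the piecewise formula and once in the commonly used form `min(r * A, clip(r, 1 - eps, 1 + eps) * A)`, where `r` stands for the probability ratio and `A` for the advantage.

```python
def ppo_term_cases(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample surrogate written as the two cases of the formula.

    `ratio` is pi_theta(a|s) / pi_theta_t(a|s); `advantage` is A(s, a).
    An advantage of exactly zero yields 0 in either branch, so it is
    handled by the second branch here (an implementation choice).
    """
    if advantage > 0:
        # Positive advantage: the ratio is capped from above at 1 + eps.
        return min(ratio, 1.0 + eps) * advantage
    # Negative advantage: the ratio is floored from below at 1 - eps.
    return max(ratio, 1.0 - eps) * advantage


def ppo_term_clipped(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Same quantity in the min/clip form often seen in implementations."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Hypothetical sample values chosen to hit both sides of each clip bound.
    for r, a in [(0.5, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
        print(r, a, ppo_term_cases(r, a), ppo_term_clipped(r, a))
```

With a positive advantage the term stops growing once the ratio exceeds `1 + eps`, and with a negative advantage it stops improving once the ratio drops below `1 - eps`, which is why the second case carries the condition `A < 0`.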