Content deleted Content added
Line 333:
</math>Intuitively, each policy update step in GRPO makes the policy more likely to respond to each state with an action that performed relatively better than other actions tried at that state, and less likely to respond with one that performed relatively worse.
As before, the KL penalty term can be applied to encourage the trained policy to stay close to a reference policy.
== See also ==
|