</math>Intuitively, each policy update step in GRPO makes the policy more likely to respond to each state with an action that performed relatively better than other actions tried at that state, and less likely to respond with one that performed relatively worse.
As before, the KL penalty term can be applied to encourage the trained policy to stay close to a reference policy. GRPO was first proposed in the context of training [[reasoning language model|reasoning language models]] by researchers at [[DeepSeek]].<ref name=":1" />
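The group-relative comparison described above can be sketched as follows. This is a minimal illustration, not DeepSeek's implementation: it assumes one scalar reward per sampled response in a group, and standardizes each reward against the group's mean and standard deviation to obtain the per-response advantages that scale the policy-gradient update.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages (illustrative sketch): each sampled
    response's reward is standardized against the other responses
    sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# Example: four responses to the same prompt with scalar rewards.
# Responses scoring above the group mean receive positive advantages
# (their actions are made more likely), below-mean ones negative.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline is the empirical mean of the group itself, the advantages always sum to zero within a group: the update only shifts probability mass between the sampled responses, which is the "relatively better / relatively worse" intuition stated above.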
== See also ==