Policy gradient method

\theta_t
</math> is necessary.
 
If there is a reference policy <math>
\pi_{\text{ref}}
</math> that the trained policy should not diverge too far from, then an additional KL divergence penalty can be added:
 
<math display="block">\mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[-\beta \log\left(\frac{\pi_{\theta}(a|s)}{\pi_{\text{ref}}(a|s)}\right) \right]</math>
 
where <math>
\beta
</math> adjusts the strength of the penalty. This has been used in training [[Reasoning language model|reasoning language models]] with [[reinforcement learning from human feedback]].<ref name="summarizationpaper">{{cite journal |author=Nisan Stiennon |author2=Long Ouyang |author3=Jeffrey Wu |author4=Daniel Ziegler |author5=Ryan Lowe |author6=Chelsea Voss |author7=Alec Radford |author8=Dario Amodei |author9=Paul F. Christiano |date=2020 |title=Learning to summarize with human feedback |url=https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html |journal=Advances in Neural Information Processing Systems |language=en |volume=33}}</ref>
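As a rough sketch (not code from the cited paper), the per-sample penalty above can be computed from the log-probabilities that the trained and reference policies assign to a sampled action; averaging it over actions drawn from the current policy estimates the expectation in the formula. The function name and the example probabilities are illustrative assumptions.

```python
import math

def kl_penalty(logp_theta: float, logp_ref: float, beta: float) -> float:
    """Per-sample penalty -beta * log(pi_theta(a|s) / pi_ref(a|s)).

    logp_theta: log pi_theta(a|s) under the trained policy.
    logp_ref:   log pi_ref(a|s) under the reference policy.
    beta:       strength of the penalty.
    """
    return -beta * (logp_theta - logp_ref)

# If the trained policy makes the sampled action more likely than the
# reference policy does, the penalty is negative (it reduces the reward),
# discouraging further divergence from the reference policy.
penalty = kl_penalty(math.log(0.5), math.log(0.25), beta=0.1)
# penalty == -0.1 * log(2), i.e. about -0.0693
```

In practice the per-token version of this quantity is subtracted from the reward during RLHF training, so the optimizer trades off reward against staying close to the reference policy.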
 
=== Group Relative Policy Optimization (GRPO) ===
</math>Intuitively, each policy update step in GRPO makes the policy more likely to respond to each state with an action that performed relatively better than other actions tried at that state, and less likely to respond with one that performed relatively worse.
 
As before, the KL penalty term can be applied to encourage the trained policy to stay close to a reference policy. This form of GRPO was first proposed in the context of training [[reasoning language model]]s by researchers at [[DeepSeek]].<ref name=":1" />
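The group-relative idea above can be sketched as follows: for one state (e.g. one prompt), several actions are sampled and each reward is standardized against the group, so relatively better actions receive positive advantages and relatively worse ones negative advantages. This is a minimal illustration of the standardization step, not DeepSeek's implementation.

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standardize a group of rewards obtained by sampling several
    actions at the same state: advantage_i = (r_i - mean) / std.

    Positive advantages mark actions that did relatively better than
    the rest of the group; negative ones mark relatively worse actions.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # all-equal rewards give zero advantages
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt, two rewarded 1.0 and two 0.0:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# → [1.0, -1.0, -1.0, 1.0]
```

Because the baseline is the group mean rather than a learned value function, no critic network is needed, which is part of GRPO's appeal for large language models.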
 
== See also ==