If there is a reference policy <math>
\pi_{\text{ref}}
</math> that the trained policy should not diverge too far from, then an additional KL divergence penalty can be added:<math display="block">-\beta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\log\left(\frac{\pi_{\theta}(a|s)}{\pi_{\text{ref}}(a|s)}\right) \right]</math>where <math>
\beta
</math> adjusts the strength of the penalty. This has been used in training [[Reasoning language model|reasoning language models]] with [[reinforcement learning from human feedback]].<ref name="summarizationpaper">{{cite journal |author=Nisan Stiennon |author2=Long Ouyang |author3=Jeffrey Wu |author4=Daniel Ziegler |author5=Ryan Lowe |author6=Chelsea Voss |author7=Alec Radford |author8=Dario Amodei |author9=Paul F. Christiano |date=2020 |title=Learning to summarize with human feedback |url=https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html |journal=Advances in Neural Information Processing Systems |language=en |volume=33}}</ref>
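As a minimal sketch of how the per-sample penalty term is computed for one sampled state-action pair (the probability values and the value of <math>\beta</math> below are illustrative, not taken from the cited work):

```python
import math

beta = 0.1  # illustrative penalty strength (assumed value)

# Illustrative probabilities of the sampled action a in state s
# under the trained policy and the reference policy.
p_theta = 0.6  # pi_theta(a|s)
p_ref = 0.5    # pi_ref(a|s)

# Per-sample penalty term: -beta * log(pi_theta(a|s) / pi_ref(a|s)).
# Averaging this over (s, a) ~ pi_theta gives the penalty in the objective.
penalty = -beta * (math.log(p_theta) - math.log(p_ref))
print(penalty)
```

When the trained policy assigns the sampled action more probability than the reference policy does, the log-ratio is positive and the penalty term is negative, pulling the objective down.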
The KL divergence penalty can be estimated with lower variance using the equivalent form (see [[f-divergence]] for details):<ref name=":1" /><math display="block">-\beta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \log\left(\frac{\pi_{\theta}(a|s)}{\pi_{\text{ref}}(a|s)}\right) + \frac{\pi_{\text{ref}}(a|s)}{\pi_{\theta}(a|s)} - 1 \right]</math>
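The variance reduction can be checked numerically. A minimal sketch, assuming two toy categorical policies over three actions at a fixed state (all values illustrative), compares the plain log-ratio estimator with the equivalent form:

```python
import math
import random

random.seed(0)

# Toy categorical policies over 3 actions at a fixed state s (illustrative values).
pi_theta = [0.5, 0.3, 0.2]  # trained policy
pi_ref = [0.4, 0.4, 0.2]    # reference policy

def sample(probs):
    """Draw an action index from a categorical distribution."""
    r, c = random.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if r < c:
            return i
    return len(probs) - 1

n = 100_000
naive_sum = 0.0    # plain estimator: log(pi_theta/pi_ref)
low_var_sum = 0.0  # equivalent form: log(pi_theta/pi_ref) + pi_ref/pi_theta - 1
for _ in range(n):
    a = sample(pi_theta)  # actions are drawn from the trained policy
    log_ratio = math.log(pi_theta[a] / pi_ref[a])
    naive_sum += log_ratio
    low_var_sum += log_ratio + pi_ref[a] / pi_theta[a] - 1

mean_naive = naive_sum / n
mean_low_var = low_var_sum / n
exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
print(exact_kl, mean_naive, mean_low_var)
```

Both averages are unbiased estimates of the same KL divergence, but each per-sample term of the second form is nonnegative, so its variance is much smaller; this matters when the penalty multiplies a noisy per-sample estimate during training.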
=== Group Relative Policy Optimization (GRPO) ===