If there is a reference policy <math>
\pi_{\text{ref}}
</math> that the trained policy should not diverge too far from, then an additional KL divergence penalty can be added:<math display="block">-\beta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\log\left(\frac{\pi_{\theta}(a|s)}{\pi_{\text{ref}}(a|s)}\right) \right]</math>where <math>
\beta
</math> adjusts the strength of the penalty. This has been used in training [[Reasoning language model|reasoning language models]] with [[reinforcement learning from human feedback]].<ref name="summarizationpaper">{{cite journal |author=Nisan Stiennon |author2=Long Ouyang |author3=Jeffrey Wu |author4=Daniel Ziegler |author5=Ryan Lowe |author6=Chelsea Voss |author7=Alec Radford |author8=Dario Amodei |author9=Paul F. Christiano |date=2020 |title=Learning to summarize with human feedback |url=https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html |journal=Advances in Neural Information Processing Systems |language=en |volume=33}}</ref>
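As a minimal sketch of how the per-sample penalty term is computed for one sampled state-action pair (the probability values and the value of <math>\beta</math> below are illustrative, not taken from the cited work):

```python
import math

beta = 0.1  # illustrative penalty strength (assumed value)

# Illustrative probabilities of the sampled action a in state s
# under the trained policy and the reference policy.
p_theta = 0.6  # pi_theta(a|s)
p_ref = 0.5    # pi_ref(a|s)

# Per-sample penalty term: -beta * log(pi_theta(a|s) / pi_ref(a|s)).
# Averaging this over (s, a) ~ pi_theta gives the penalty in the objective.
penalty = -beta * (math.log(p_theta) - math.log(p_ref))
print(penalty)
```

When the trained policy assigns the sampled action more probability than the reference policy does, the log-ratio is positive and the penalty term is negative, pulling the objective down.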
The KL divergence penalty can be estimated with lower variance using the equivalent form (see [[f-divergence]] for details):<ref name=":1" /><math display="block">-\beta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \log\left(\frac{\pi_{\theta}(a|s)}{\pi_{\text{ref}}(a|s)}\right) + \frac{\pi_{\text{ref}}(a|s)}{\pi_{\theta}(a|s)} - 1 \right]</math>
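The variance reduction can be checked numerically. A minimal sketch, assuming two toy categorical policies over three actions at a fixed state (all values illustrative), compares the plain log-ratio estimator with the equivalent form:

```python
import math
import random

random.seed(0)

# Toy categorical policies over 3 actions at a fixed state s (illustrative values).
pi_theta = [0.5, 0.3, 0.2]  # trained policy
pi_ref = [0.4, 0.4, 0.2]    # reference policy

def sample(probs):
    """Draw an action index from a categorical distribution."""
    r, c = random.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if r < c:
            return i
    return len(probs) - 1

n = 100_000
naive_sum = 0.0    # plain estimator: log(pi_theta/pi_ref)
low_var_sum = 0.0  # equivalent form: log(pi_theta/pi_ref) + pi_ref/pi_theta - 1
for _ in range(n):
    a = sample(pi_theta)  # actions are drawn from the trained policy
    log_ratio = math.log(pi_theta[a] / pi_ref[a])
    naive_sum += log_ratio
    low_var_sum += log_ratio + pi_ref[a] / pi_theta[a] - 1

mean_naive = naive_sum / n
mean_low_var = low_var_sum / n
exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
print(exact_kl, mean_naive, mean_low_var)
```

Both averages are unbiased estimates of the same KL divergence, but each per-sample term of the second form is nonnegative, so its variance is much smaller; this matters when the penalty multiplies a noisy per-sample estimate during training.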
=== Group Relative Policy Optimization (GRPO) ===