Policy gradient method: Difference between revisions

{{Anchor|GRPO}}
 
Group Relative Policy Optimization (GRPO) is a variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s_i </math>, it samples multiple actions <math>a_{i,1}, \dots, a_{i,G} </math> from the policy <math>\pi_{\theta_t} </math>, then calculates the group-relative advantage<ref name=":1">{{Citation |last=Shao |first=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |url=https://arxiv.org/abs/2402.03300 |publisher=arXiv |doi=10.48550/arXiv.2402.03300 |id=arXiv:2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s_{i}, a_{i,j}) = \frac{r(s_i, a_{i,j}) - \mu}{\sigma} </math>where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s_i, a_{i,1}), \dots, r(s_i, a_{i,G}) </math>. That is, it is the [[standard score]] of the rewards within the group.
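The group-relative advantage can be sketched as a z-score over the group's rewards (a minimal NumPy illustration; the function name is not from the cited paper):

```python
import numpy as np

def group_relative_advantages(rewards):
    """Standard score of each sampled action's reward within its group.

    rewards: array of r(s_i, a_{i,1}), ..., r(s_i, a_{i,G}) for one state s_i.
    Returns the group-relative advantages A(s_i, a_{i,j}).
    """
    rewards = np.asarray(rewards, dtype=float)
    mu = rewards.mean()          # group mean
    sigma = rewards.std()        # group standard deviation
    return (rewards - mu) / sigma
```

By construction the advantages within a group have zero mean and unit standard deviation, so actions are scored only relative to the other samples at the same state.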
 
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">
\max_\theta \mathbb{E}_{s_i,\, a_{i,1}, \dots, a_{i,G}}\left[
\frac{1}{G} \sum_{j=1}^G \min\left(
\frac{\pi_\theta(a_{i,j} \mid s_i)}{\pi_{\theta_t}(a_{i,j} \mid s_i)} A^{\pi_{\theta_t}}(s_i, a_{i,j}),\;
\operatorname{clip}\!\left( \frac{\pi_\theta(a_{i,j} \mid s_i)}{\pi_{\theta_t}(a_{i,j} \mid s_i)}, 1-\epsilon, 1+\epsilon \right) A^{\pi_{\theta_t}}(s_i, a_{i,j})
\right)
\right]
</math>Intuitively, each policy update step in GRPO makes the policy more likely to respond to each state with an action that performed relatively better than other actions tried at that state, and less likely to respond with one that performed relatively worse.
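The clipped surrogate above can be sketched in NumPy as follows (illustrative only; names and the log-probability inputs are assumptions, not from the cited paper):

```python
import numpy as np

def grpo_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped surrogate, averaged over the G actions of a group.

    logp_new, logp_old: log-probabilities of each sampled action under the
        current policy pi_theta and the old policy pi_{theta_t}.
    advantages: group-relative advantages A(s_i, a_{i,j}).
    eps: clipping parameter epsilon.
    """
    ratio = np.exp(logp_new - logp_old)                  # probability ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()         # average over group
```

When the new and old policies coincide, every ratio is 1 and the objective reduces to the mean group-relative advantage; the clip bounds how far one update can push the ratio for any single action.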
 
GRPO was first proposed in the context of [[reasoning language model]]s by researchers at [[DeepSeek]].<ref name=":1" />
 
== See also ==