Policy gradient method: Difference between revisions

{{Anchor|GRPO}}
 
Group Relative Policy Optimization (GRPO) is a variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s_i </math>, it samples multiple actions <math>a_{i,1}, \dots, a_{i,G} </math> from the policy <math>\pi_{\theta_t} </math>, then calculates the group-relative advantage<ref name=":1">{{Citation |last=Shao |first=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |url=https://arxiv.org/abs/2402.03300 |publisher=arXiv |doi=10.48550/arXiv.2402.03300 |id=arXiv:2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s_{i}, a_{i,j}) = \frac{r(s_i, a_{i,j}) - \mu}{\sigma} </math>where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s_i, a_{i,1}), \dots, r(s_i, a_{i,G}) </math>. That is, it is the [[standard score]] of the rewards within the group.
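The group-relative advantage can be sketched as a z-score over the group's rewards (a minimal NumPy illustration; the function name is not from the cited paper):

```python
import numpy as np

def group_relative_advantages(rewards):
    """Standard score of each sampled action's reward within its group.

    rewards: array of r(s_i, a_{i,1}), ..., r(s_i, a_{i,G}) for one state s_i.
    Returns the group-relative advantages A(s_i, a_{i,j}).
    """
    rewards = np.asarray(rewards, dtype=float)
    mu = rewards.mean()          # group mean
    sigma = rewards.std()        # group standard deviation
    return (rewards - mu) / sigma
```

By construction the advantages within a group have zero mean and unit standard deviation, so actions are scored only relative to the other samples at the same state.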
 
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">
\max_\theta \mathbb{E}_{s_i,\, a_{i,1}, \dots, a_{i,G}}\left[
\frac{1}{G} \sum_{j=1}^G \min\left(
\frac{\pi_\theta(a_{i,j} \mid s_i)}{\pi_{\theta_t}(a_{i,j} \mid s_i)} A^{\pi_{\theta_t}}(s_i, a_{i,j}),\;
\operatorname{clip}\!\left( \frac{\pi_\theta(a_{i,j} \mid s_i)}{\pi_{\theta_t}(a_{i,j} \mid s_i)}, 1-\epsilon, 1+\epsilon \right) A^{\pi_{\theta_t}}(s_i, a_{i,j})
\right)
\right]
</math>Intuitively, each policy update step in GRPO makes the policy more likely to respond to each state with an action that performed relatively better than other actions tried at that state, and less likely to respond with one that performed relatively worse.
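The clipped surrogate above can be sketched in NumPy as follows (illustrative only; names and the log-probability inputs are assumptions, not from the cited paper):

```python
import numpy as np

def grpo_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped surrogate, averaged over the G actions of a group.

    logp_new, logp_old: log-probabilities of each sampled action under the
        current policy pi_theta and the old policy pi_{theta_t}.
    advantages: group-relative advantages A(s_i, a_{i,j}).
    eps: clipping parameter epsilon.
    """
    ratio = np.exp(logp_new - logp_old)                  # probability ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()         # average over group
```

When the new and old policies coincide, every ratio is 1 and the objective reduces to the mean group-relative advantage; the clip bounds how far one update can push the ratio for any single action.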
 
GRPO was first proposed in the context of [[reasoning language model]]s by researchers at [[DeepSeek]].<ref name=":1" />
 
== See also ==