{{Anchor|GRPO}}
 
Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s</math>, it samples multiple actions <math>a_1, \dots, a_G</math> from the policy <math>\pi_{\theta_t}</math>, then calculates the group-relative advantage<ref name=":1">{{Citation |last1=Shao |first1=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |arxiv=2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s, a_j) = \frac{r(s, a_j) - \mu}{\sigma} </math>where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s, a_1), \dots, r(s, a_G) </math>. That is, it is the [[standard score]] of the rewards within the group.
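The group-relative advantage above can be sketched as follows; this is a minimal illustration, not DeepSeek's implementation, and the small <code>eps</code> guard against zero variance is an added assumption, not part of the formula:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO's group-relative advantages for one state.

    rewards: the G scalar rewards r(s, a_1), ..., r(s, a_G) obtained by
    sampling G actions from the current policy at the same state s.
    Returns the standard score of each reward relative to the group:
    (r - mean) / std. The eps term is an illustrative guard against a
    zero-variance group (all rewards equal).
    """
    rewards = np.asarray(rewards, dtype=float)
    mu = rewards.mean()        # group mean
    sigma = rewards.std()      # group (population) standard deviation
    return (rewards - mu) / (sigma + eps)
```

By construction the advantages within a group have mean zero, so roughly half the sampled actions are reinforced and half are penalized, which plays the variance-reduction role of the omitted value baseline <math>V</math>.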
 
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">