{{Anchor|GRPO}}
 
Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s</math>, it samples multiple actions <math>a_1, \dots, a_G</math> from the policy <math>\pi_{\theta_t}</math>, then calculates the group-relative advantage<ref name=":1">{{Citation |last1=Shao |first1=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |arxiv=2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s, a_j) = \frac{r(s, a_j) - \mu}{\sigma} </math>where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s, a_1), \dots, r(s, a_G) </math>. That is, it is the [[standard score]] of the rewards within the group.
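The group-relative advantage above can be sketched as follows; this is a minimal illustration, not DeepSeek's implementation, and the small <code>eps</code> guard against zero variance is an added assumption, not part of the formula:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO's group-relative advantages for one state.

    rewards: the G scalar rewards r(s, a_1), ..., r(s, a_G) obtained by
    sampling G actions from the current policy at the same state s.
    Returns the standard score of each reward relative to the group:
    (r - mean) / std. The eps term is an illustrative guard against a
    zero-variance group (all rewards equal).
    """
    rewards = np.asarray(rewards, dtype=float)
    mu = rewards.mean()        # group mean
    sigma = rewards.std()      # group (population) standard deviation
    return (rewards - mu) / (sigma + eps)
```

By construction the advantages within a group have mean zero, so roughly half the sampled actions are reinforced and half are penalized, which plays the variance-reduction role of the omitted value baseline <math>V</math>.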
 
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">