Line 31:
\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t \mid S_t)\; \sum_{t \in 0:T} (\gamma^t R_t)
\Big|S_0 = s_0
\right]</math> which can be improved via the "causality trick"<ref>{{Cite journal |last1=Sutton |first1=Richard S |last2=McAllester |first2=David |last3=Singh |first3=Satinder |last4=Mansour |first4=Yishay |date=1999 |title=Policy Gradient Methods for Reinforcement Learning with Function Approximation |url=https://proceedings.neurips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref><math display="block">
\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t\mid S_t)\sum_{\tau \in t:T} (\gamma^\tau R_\tau)
\Big|S_0 = s_0 \right]
Line 126:
The algorithm uses the modified gradient estimator<math display="block">g_t \leftarrow
\frac 1N \sum_{k=1}^N \left[\sum_{j\in 0:T} \nabla_{\theta_t}\ln\pi_\theta(A_{j,k}| S_{j,k})\left(\sum_{i \in j:T} (\gamma^i R_{i,k}) - b_t(S_{j,k})\right) \right]</math> and the original REINFORCE algorithm is the special case where <math>b_t = 0</math>.
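A minimal NumPy sketch of this estimator, assuming trajectories have already been collected into arrays (all function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def reinforce_baseline_gradient(grad_log_pi, rewards, baselines, gamma):
    """Baseline-corrected REINFORCE gradient estimator (sketch).

    grad_log_pi : array (N, T, d) of score vectors grad_theta ln pi(A_j | S_j)
                  for N trajectories of length T.
    rewards     : array (N, T) of rewards R_j.
    baselines   : array (N, T) of baseline values b(S_j).
    gamma       : discount factor.
    """
    N, T, d = grad_log_pi.shape
    # Discounted reward-to-go: sum_{i in j:T} gamma^i R_i
    # (note gamma^i, not gamma^(i-j), matching the estimator above).
    disc = rewards * gamma ** np.arange(T)
    to_go = np.cumsum(disc[:, ::-1], axis=1)[:, ::-1]  # suffix sums over i
    weight = to_go - baselines                          # subtract baseline b(S_j)
    # Sum the weighted scores over time steps, average over trajectories.
    return (grad_log_pi * weight[:, :, None]).sum(axis=1).mean(axis=0)
```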
=== Actor-critic methods ===
Line 135:
The '''Q-function''' <math>Q^\pi</math> can also be used as the critic, since<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \gamma^j \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot Q^{\pi_\theta}(S_j, A_j)
\Big|S_0 = s_0 \right]</math> by a similar argument using the tower law.
Subtracting the value function as a baseline, we find that the '''advantage function''' <math>A^{\pi}(S,A) = Q^{\pi}(S,A) - V^{\pi}(S)</math> can be used as the critic as well:<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \gamma^j \nabla_\theta\ln\pi_\theta(A_j| S_j)
Line 171:
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t)\\
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}</math>where the KL divergence between two policies is '''averaged''' over the state distribution under policy <math>\pi_{\theta_t}</math>. That is,<math display="block">\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) := \mathbb E_{s \sim \pi_{\theta_t}}[D_{KL}( \pi_{\theta_{t+1}}(\cdot | s) \| \pi_{\theta_{t}}(\cdot | s) )]</math> This ensures updates are invariant to invertible affine parameter transformations.
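Assuming a finite action space, the averaged KL divergence can be estimated from a batch of sampled states as follows (a sketch; names are illustrative):

```python
import numpy as np

def averaged_kl(new_probs, old_probs):
    """Mean KL divergence D_KL(pi_new || pi_old), averaged over sampled states.

    new_probs, old_probs : arrays (S, A) of action probabilities per state,
    where the S states are assumed to be drawn from the state distribution
    under the old policy.
    """
    kl_per_state = np.sum(new_probs * np.log(new_probs / old_probs), axis=1)
    return kl_per_state.mean()
```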
=== Fisher information approximation ===
Line 178:
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math> This turns the optimization into a [[quadratic programming]] problem, yielding the natural policy gradient update:<math display="block">
\theta_{t+1} = \theta_t + \alpha F(\theta_t)^{-1} \nabla_\theta J(\theta_t)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_t))^T F(\theta_t)^{-1} \nabla_\theta J(\theta_t)}}</math>.
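A sketch of the resulting update, assuming the Fisher matrix is estimated empirically from sampled score vectors (the small damping term is an added numerical safeguard for invertibility, not part of the formula above):

```python
import numpy as np

def natural_gradient_step(theta, grad_J, scores, epsilon=0.01):
    """One natural policy gradient update (sketch).

    theta  : current parameter vector, shape (d,)
    grad_J : estimated policy gradient, shape (d,)
    scores : array (B, d) of sampled score vectors grad_theta ln pi(a|s),
             used to form the empirical Fisher matrix F = E[score score^T].
    """
    d = theta.shape[0]
    F = scores.T @ scores / scores.shape[0]      # empirical Fisher matrix
    F += 1e-6 * np.eye(d)                        # damping (assumption, for stability)
    x = np.linalg.solve(F, grad_J)               # F^{-1} grad J
    alpha = np.sqrt(2 * epsilon / (grad_J @ x))  # step size from the KL budget
    return theta + alpha * x
```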
Line 227:
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure the KL constraint and policy improvement by repeatedly trying<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{x^T F x}} x, \theta_t + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \dots
</math> until a <math>\theta_{t+1}</math> is found that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon </math> and improves the surrogate advantage, <math>
L(\theta_{t+1}, \theta_t) \geq L(\theta_t, \theta_t)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
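The backtracking search above can be sketched as follows, assuming callables that evaluate the averaged KL and the surrogate advantage at a candidate parameter vector (all names are illustrative):

```python
import numpy as np

def trpo_backtrack(theta, x, F, epsilon, kl_fn, surrogate_fn,
                   alpha=0.8, max_iters=10):
    """Backtracking line search along the natural gradient direction x (sketch).

    kl_fn(theta_new)        -> averaged KL between the candidate and current policy
    surrogate_fn(theta_new) -> surrogate advantage L(theta_new, theta)
    """
    beta = np.sqrt(2 * epsilon / (x @ F @ x))  # maximal step along x
    L_old = surrogate_fn(theta)
    for i in range(max_iters):
        # Shrink the step by alpha each iteration: beta, alpha*beta, alpha^2*beta, ...
        candidate = theta + (alpha ** i) * beta * x
        if kl_fn(candidate) <= epsilon and surrogate_fn(candidate) >= L_old:
            return candidate
    return theta  # fall back to no update if no candidate qualifies
```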
Line 238:
Specifically, instead of maximizing the surrogate advantage<math display="block">
\max_\theta L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]
</math> under a KL divergence constraint, it directly inserts the constraint into the surrogate advantage:<math display="block">
\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[
\begin{cases}
Line 246:
\end{cases}
\right]
</math> PPO then maximizes this surrogate advantage by stochastic gradient ascent, as usual.
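In the standard clipped form from the PPO paper, the per-sample objective is <math>\min(\rho A, \operatorname{clip}(\rho, 1-\epsilon, 1+\epsilon) A)</math> with probability ratio <math>\rho = \pi_\theta(a|s)/\pi_{\theta_t}(a|s)</math>. A sketch of this surrogate, assuming ratios and advantages have been precomputed (names are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped PPO surrogate, averaged over sampled (s, a) pairs (sketch).

    ratio     : array of pi_theta(a|s) / pi_theta_t(a|s)
    advantage : array of A^{pi_theta_t}(s, a)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # Taking the pointwise minimum removes any incentive to move the
    # ratio outside [1 - clip_eps, 1 + clip_eps].
    return np.minimum(unclipped, clipped).mean()
```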
In words, gradient-ascending the new surrogate advantage function means that, at some state <math>
Line 319:
{{Anchor|GRPO}}
Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s </math>, it samples multiple actions <math>a_1, \dots, a_G </math> from the policy <math>\pi_{\theta_t} </math>, then calculates the group-relative advantage<ref name=":1">{{Citation |last1=Shao |first1=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |arxiv=2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s, a_{j}) = \frac{r(s, a_{j}) - \mu}{\sigma} </math> where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s, a_1), \dots, r(s, a_G) </math>. That is, the advantage is the [[standard score]] of the rewards.
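The group-relative advantage is just a standard-score computation over the group's rewards; a minimal sketch:

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO advantages: the standard score of the group's rewards (sketch).

    rewards : array (G,) of rewards r(s, a_1), ..., r(s, a_G) for G actions
              sampled at the same state s.
    """
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / sigma
```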
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">