Line 31:
\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t \mid S_t)\; \sum_{t \in 0:T} (\gamma^t R_t)
\Big|S_0 = s_0
\right]</math> which can be improved via the "causality trick"<ref>{{Cite journal |last1=Sutton |first1=Richard S |last2=McAllester |first2=David |last3=Singh |first3=Satinder |last4=Mansour |first4=Yishay |date=1999 |title=Policy Gradient Methods for Reinforcement Learning with Function Approximation |url=https://proceedings.neurips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref><math display="block">
\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t\mid S_t)\sum_{\tau \in t:T} (\gamma^\tau R_\tau)
\Big|S_0 = s_0 \right]
Line 126:
The algorithm uses the modified gradient estimator<math display="block">g_t \leftarrow
\frac 1N \sum_{k=1}^N \left[\sum_{j\in 0:T} \nabla_{\theta_t}\ln\pi_\theta(A_{j,k}| S_{j,k})\left(\sum_{i \in j:T} (\gamma^i R_{i,k}) - b_t(S_{j,k})\right) \right]</math> and the original REINFORCE algorithm is the special case where <math>b_t = 0</math>.
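A minimal NumPy sketch of this estimator, assuming trajectories have already been collected into arrays (all function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def reinforce_baseline_gradient(grad_log_pi, rewards, baselines, gamma):
    """Baseline-corrected REINFORCE gradient estimator (sketch).

    grad_log_pi : array (N, T, d) of score vectors grad_theta ln pi(A_j | S_j)
                  for N trajectories of length T.
    rewards     : array (N, T) of rewards R_j.
    baselines   : array (N, T) of baseline values b(S_j).
    gamma       : discount factor.
    """
    N, T, d = grad_log_pi.shape
    # Discounted reward-to-go: sum_{i in j:T} gamma^i R_i
    # (note gamma^i, not gamma^(i-j), matching the estimator above).
    disc = rewards * gamma ** np.arange(T)
    to_go = np.cumsum(disc[:, ::-1], axis=1)[:, ::-1]  # suffix sums over i
    weight = to_go - baselines                          # subtract baseline b(S_j)
    # Sum the weighted scores over time steps, average over trajectories.
    return (grad_log_pi * weight[:, :, None]).sum(axis=1).mean(axis=0)
```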
=== Actor-critic methods ===
Line 135:
The '''Q-function''' <math>Q^\pi</math> can also be used as the critic, since<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \gamma^j \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot Q^{\pi_\theta}(S_j, A_j)
\Big|S_0 = s_0 \right]</math> by a similar argument using the tower law.
Subtracting the value function as a baseline, we find that the '''advantage function''' <math>A^{\pi}(S,A) = Q^{\pi}(S,A) - V^{\pi}(S)</math> can be used as the critic as well:<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \gamma^j \nabla_\theta\ln\pi_\theta(A_j| S_j)
Line 171:
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t)\\
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}</math>where the KL divergence between two policies is '''averaged''' over the state distribution under policy <math>\pi_{\theta_t}</math>. That is,<math display="block">\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) := \mathbb E_{s \sim \pi_{\theta_t}}[D_{KL}( \pi_{\theta_{t+1}}(\cdot | s) \| \pi_{\theta_{t}}(\cdot | s) )]</math> This ensures updates are invariant to invertible affine parameter transformations.
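Assuming a finite action space, the averaged KL divergence can be estimated from a batch of sampled states as follows (a sketch; names are illustrative):

```python
import numpy as np

def averaged_kl(new_probs, old_probs):
    """Mean KL divergence D_KL(pi_new || pi_old), averaged over sampled states.

    new_probs, old_probs : arrays (S, A) of action probabilities per state,
    where the S states are assumed to be drawn from the state distribution
    under the old policy.
    """
    kl_per_state = np.sum(new_probs * np.log(new_probs / old_probs), axis=1)
    return kl_per_state.mean()
```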
=== Fisher information approximation ===
Line 178:
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math> This turns the optimization into a [[quadratic programming]] problem, yielding the natural policy gradient update:<math display="block">
\theta_{t+1} = \theta_t + \alpha F(\theta_t)^{-1} \nabla_\theta J(\theta_t)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_t))^T F(\theta_t)^{-1} \nabla_\theta J(\theta_t)}}</math>.
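A sketch of the resulting update, assuming the Fisher matrix is estimated empirically from sampled score vectors (the small damping term is an added numerical safeguard for invertibility, not part of the formula above):

```python
import numpy as np

def natural_gradient_step(theta, grad_J, scores, epsilon=0.01):
    """One natural policy gradient update (sketch).

    theta  : current parameter vector, shape (d,)
    grad_J : estimated policy gradient, shape (d,)
    scores : array (B, d) of sampled score vectors grad_theta ln pi(a|s),
             used to form the empirical Fisher matrix F = E[score score^T].
    """
    d = theta.shape[0]
    F = scores.T @ scores / scores.shape[0]      # empirical Fisher matrix
    F += 1e-6 * np.eye(d)                        # damping (assumption, for stability)
    x = np.linalg.solve(F, grad_J)               # F^{-1} grad J
    alpha = np.sqrt(2 * epsilon / (grad_J @ x))  # step size from the KL budget
    return theta + alpha * x
```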
Line 227:
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure the KL constraint and policy improvement by repeatedly trying<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{x^T F x}} x, \theta_t + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \dots
</math> until a <math>\theta_{t+1}</math> is found that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon </math> and improves the surrogate advantage, <math>
L(\theta_{t+1}, \theta_t) \geq L(\theta_t, \theta_t)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
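The backtracking search above can be sketched as follows, assuming callables that evaluate the averaged KL and the surrogate advantage at a candidate parameter vector (all names are illustrative):

```python
import numpy as np

def trpo_backtrack(theta, x, F, epsilon, kl_fn, surrogate_fn,
                   alpha=0.8, max_iters=10):
    """Backtracking line search along the natural gradient direction x (sketch).

    kl_fn(theta_new)        -> averaged KL between the candidate and current policy
    surrogate_fn(theta_new) -> surrogate advantage L(theta_new, theta)
    """
    beta = np.sqrt(2 * epsilon / (x @ F @ x))  # maximal step along x
    L_old = surrogate_fn(theta)
    for i in range(max_iters):
        # Shrink the step by alpha each iteration: beta, alpha*beta, alpha^2*beta, ...
        candidate = theta + (alpha ** i) * beta * x
        if kl_fn(candidate) <= epsilon and surrogate_fn(candidate) >= L_old:
            return candidate
    return theta  # fall back to no update if no candidate qualifies
```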
Line 238:
Specifically, instead of maximizing the surrogate advantage<math display="block">
\max_\theta L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]
</math> under a KL divergence constraint, it directly inserts the constraint into the surrogate advantage:<math display="block">
\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[
\begin{cases}
Line 246:
\end{cases}
\right]
</math> PPO then maximizes this surrogate advantage by stochastic gradient ascent, as usual.
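In the standard clipped form from the PPO paper, the per-sample objective is <math>\min(\rho A, \operatorname{clip}(\rho, 1-\epsilon, 1+\epsilon) A)</math> with probability ratio <math>\rho = \pi_\theta(a|s)/\pi_{\theta_t}(a|s)</math>. A sketch of this surrogate, assuming ratios and advantages have been precomputed (names are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped PPO surrogate, averaged over sampled (s, a) pairs (sketch).

    ratio     : array of pi_theta(a|s) / pi_theta_t(a|s)
    advantage : array of A^{pi_theta_t}(s, a)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # Taking the pointwise minimum removes any incentive to move the
    # ratio outside [1 - clip_eps, 1 + clip_eps].
    return np.minimum(unclipped, clipped).mean()
```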
In words, gradient-ascending the new surrogate advantage function means that, at some state <math>
Line 319:
{{Anchor|GRPO}}
Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s </math>, it samples multiple actions <math>a_1, \dots, a_G </math> from the policy <math>\pi_{\theta_t} </math>, then calculates the group-relative advantage<ref name=":1">{{Citation |last1=Shao |first1=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |arxiv=2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s, a_{j}) = \frac{r(s, a_{j}) - \mu}{\sigma} </math> where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s, a_1), \dots, r(s, a_G) </math>. That is, the advantage is the [[standard score]] of the rewards.
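The group-relative advantage is just a standard-score computation over the group's rewards; a minimal sketch:

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO advantages: the standard score of the group's rewards (sketch).

    rewards : array (G,) of rewards r(s, a_1), ..., r(s, a_G) for G actions
              sampled at the same state s.
    """
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / sigma
```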
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">