== Overview ==
In policy-based RL, the actor is a parameterized policy function <math>\pi_\theta</math>, where <math>\theta</math> are the parameters of the actor. The actor takes as argument the state of the environment <math>s</math> and produces a [[probability distribution]] <math>\pi_\theta(\cdot \mid s)</math> over actions.
If the action space is discrete, then <math>\sum_{a \in \text{Actions}} \pi_\theta(a \mid s) = 1</math>. If the action space is continuous, then <math>\int_{a \in \text{Actions}} \pi_\theta(a \mid s) \,\mathrm{d}a = 1</math>.
The goal of policy optimization is to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \gamma^t R_t \Big| S_0 = s_0 \right]</math>where <math>\gamma</math> is the [[discount factor]], <math>R_t</math> is the reward at step <math>t</math>, <math>s_0</math> is the starting state, and <math>T</math> is the time horizon (which can be infinite).
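For a concrete reading of this objective, the inner discounted sum can be evaluated directly from a list of per-step rewards. The following NumPy sketch is illustrative only and not part of the formal definition:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Total discounted reward sum_t gamma^t * R_t for one episode."""
    weights = gamma ** np.arange(len(rewards))  # [1, gamma, gamma^2, ...]
    return float(np.dot(weights, rewards))
```

For example, a reward of 1 at each of three steps with <math>\gamma = 0.5</math> gives <math>1 + 0.5 + 0.25 = 1.75</math>.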
=== Policy gradient ===
The '''REINFORCE algorithm''' was the first policy gradient method.<ref>{{Cite journal |last=Williams |first=Ronald J. |date=May 1992 |title=Simple statistical gradient-following algorithms for connectionist reinforcement learning |url=http://link.springer.com/10.1007/BF00992696 |journal=Machine Learning |language=en |volume=8 |issue=3–4 |pages=229–256 |doi=10.1007/BF00992696 |issn=0885-6125}}</ref> It is based on the identity for the policy gradient<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[
\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j) \sum_{0 \leq i \leq T} \gamma^i R_i
\Big|S_0 = s_0
\right]</math> which can be improved via the "causality trick"<ref>{{Cite journal |last1=Sutton |first1=Richard S |last2=McAllester |first2=David |last3=Singh |first3=Satinder |last4=Mansour |first4=Yishay |date=1999 |title=Policy Gradient Methods for Reinforcement Learning with Function Approximation |url=https://proceedings.neurips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref><math display="block">
\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j) \sum_{j \leq i \leq T} \gamma^i R_i
\Big|S_0 = s_0 \right]
</math>
}}
{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=Proofs}}
{{Math proof|title=Proof of the lemma|proof=
Use the [[reparameterization trick#REINFORCE estimator|reparameterization trick]].
\end{aligned}
</math>
}}
{{Math proof|title=Proof of the two identities|proof=
Applying the [[reparameterization trick#REINFORCE estimator|reparameterization trick]],
}}
{{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block">
\nabla_\theta J(\theta) \approx \frac 1N \sum_{n=1}^N \left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_\theta(A_{j,n} \mid S_{j,n}) \sum_{j \leq i \leq T} \gamma^i R_{i,n} \right]
</math>where the index <math>n</math> ranges over the <math>N</math> rollout trajectories sampled from <math>\pi_\theta</math>.
The [[Score (statistics)|score function]] <math>\nabla_\theta \ln \pi_\theta (A_j \mid S_j)</math> is the direction in parameter space that increases the probability of taking action <math>A_j</math> in state <math>S_j</math>; the estimator weights each such direction by the discounted reward that followed the action.
=== Algorithm ===
# Rollout <math>N</math> trajectories in the environment, using <math>\pi_{\theta_t}</math> as the policy function.
# Compute the policy gradient estimation: <math>g_t \leftarrow \frac 1N \sum_{n=1}^N \left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_{\theta_t}(A_{j,n} \mid S_{j,n}) \sum_{j \leq i \leq T} \gamma^i R_{i,n} \right]</math>
# Update the policy by gradient ascent: <math>\theta_{t+1} \leftarrow \theta_t + \alpha_t g_t</math>
Here, <math>\alpha_t</math> is the learning rate at update step <math>t</math>.
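The gradient-estimation step above can be sketched as follows. This is an illustrative NumPy fragment: the score function is supplied through a hypothetical caller-provided callback `grad_log_pi(s, a)` rather than a real policy network.

```python
import numpy as np

def reinforce_gradient(trajectories, grad_log_pi, gamma):
    """Monte Carlo policy gradient estimate with the causality trick.

    trajectories: list of rollouts, each a list of (state, action, reward).
    grad_log_pi(s, a): hypothetical callback returning the score
                       d/dtheta ln pi_theta(a|s) for the current policy.
    """
    grads = []
    for traj in trajectories:
        rewards = [r for _, _, r in traj]
        g = 0.0
        for j, (s, a, _) in enumerate(traj):
            # reward-to-go: sum over i >= j of gamma^i * R_i
            rtg = sum(gamma ** i * rewards[i] for i in range(j, len(rewards)))
            g = g + grad_log_pi(s, a) * rtg
        grads.append(g)
    # average over the N sampled trajectories
    return sum(grads) / len(grads)
```

With `grad_log_pi` returning vectors (one entry per parameter), the same code produces the full gradient estimate.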
== Variance reduction ==
REINFORCE is an '''on-policy''' algorithm, meaning that the trajectories used for the update must be sampled from the current policy <math>\pi_\theta</math>. This can lead to high variance in the updates, as the returns <math>R(\tau)</math> can vary significantly between trajectories. Many variants of REINFORCE have been introduced under the title of '''variance reduction''' to address this issue.
=== REINFORCE with baseline ===
A common way for reducing variance is the '''REINFORCE with baseline''' algorithm, based on the following identity:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j) \left(\sum_{j \leq i \leq T} \gamma^i R_i - \gamma^j b(S_j)\right)
\Big|S_0 = s_0 \right]</math>for any function <math>b: \text{States} \to \R</math>. This can be proven by applying the previous lemma.
The algorithm uses the modified gradient estimator<math display="block">
\frac 1N \sum_{n=1}^N \left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_{\theta_t}(A_{j,n} \mid S_{j,n}) \left(\sum_{j \leq i \leq T} \gamma^i R_{i,n} - \gamma^j b(S_{j,n})\right)\right]
</math>and the original REINFORCE algorithm is the special case where <math>b \equiv 0</math>.
=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">b</math> is chosen well, such that <math>b(S_j) \approx V^{\pi_{\theta_t}}(S_j)</math>, this could significantly decrease variance in the gradient estimation. Using the value function as the baseline gives<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j) \left(\sum_{j \leq i \leq T} \gamma^i R_i - \gamma^j V^{\pi_\theta}(S_j)\right)
\Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_t}</math> updates, the value function <math>V^{\pi_{\theta_t}}</math> updates as well, so the critic must be trained concurrently with the actor. This gives rise to the family of '''actor-critic methods''', in which the actor is the policy and the critic is an estimate of the value function.
The '''Q-function''' <math>Q^\pi</math> can also be used as the critic, since<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j)
\cdot \gamma^j Q^{\pi_\theta}(S_j, A_j)
\Big|S_0 = s_0 \right]</math> by a similar argument using the tower law.
Subtracting the value function as a baseline, we find that the '''advantage function''' <math>A^{\pi}(S,A) = Q^{\pi}(S,A) - V^{\pi}(S)</math> can be used as the critic as well:<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j)
\cdot \gamma^j A^{\pi_\theta}(S_j, A_j)
\Big|S_0 = s_0 \right]</math>In summary, there are many unbiased estimators for <math display="inline">\nabla_\theta J_\theta</math>, all in the form of: <math display="block">\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j)
\cdot \Psi_j
\Big|S_0 = s_0 \right]</math> where <math display="inline">\Psi_j</math> is any linear sum of the following terms:
* <math display="inline">\sum_{0 \leq i \leq T} \gamma^i R_i</math>: the total reward of the trajectory, as in the original REINFORCE.
* <math display="inline">\gamma^j \sum_{j \leq i \leq T} \gamma^{i-j} R_i</math>: the reward following action <math>A_j</math> (the causality trick).
* <math display="inline">\gamma^j \left(\sum_{j \leq i \leq T} \gamma^{i-j} R_i - b(S_j)\right)</math>: the reward following the action, minus a baseline.
* <math display="inline">\gamma^j Q^{\pi_\theta}(S_j, A_j)</math>: the Q-function as critic.
* <math display="inline">\gamma^j A^{\pi_\theta}(S_j, A_j)</math>: the advantage function as critic.
* <math display="inline">\gamma^j \left(R_j + \gamma V^{\pi_\theta}(S_{j+1}) - V^{\pi_\theta}(S_j)\right)</math>: the one-step [[temporal difference learning|TD residual]].
Some more possible <math display="inline">\Psi_j</math> are as follows:
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \cdots + \gamma^{n-1} R_{j+n-1} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j)\right)</math>: the <math>n</math>-step TD residual.
* <math display="inline">\gamma^j \left(\sum_{j \leq i \leq T} \gamma^{i-j} R_i - V^{\pi_\theta}(S_j)\right)</math>: the Monte Carlo advantage estimate.
* <math display="inline">\gamma^j \sum_{n=0}^{\infty} (\gamma\lambda)^n \delta_{j+n}</math> with <math>\delta_i = R_i + \gamma V^{\pi_\theta}(S_{i+1}) - V^{\pi_\theta}(S_i)</math>: the Generalized Advantage Estimate, which interpolates between the one-step TD residual and the Monte Carlo estimate via <math>\lambda \in [0,1]</math>.
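One widely used estimator in this family, the generalized advantage estimator of Schulman et al., accumulates discounted one-step TD residuals. A minimal sketch, assuming a `values` array that includes a terminal value (taken to be 0 at episode end):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates for one episode.

    rewards: R_0 .. R_{T-1}; values: V(S_0) .. V(S_T) (length T+1,
    with the terminal entry assumed 0 for a finished episode).
    """
    T = len(rewards)
    # one-step TD residuals: delta_t = R_t + gamma*V(S_{t+1}) - V(S_t)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        # recursive form: A_t = delta_t + gamma*lambda*A_{t+1}
        acc = deltas[t] + gamma * lam * acc
        adv[t] = acc
    return adv
```

With <math>\lambda = 0</math> this reduces to the one-step TD residual; with <math>\lambda = 1</math> it recovers the Monte Carlo advantage estimate.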
== Natural policy gradient ==
=== Motivation ===
Standard policy gradient updates <math>\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)</math> solve a constrained optimization problem:<math display="block">
\begin{cases}
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t) \\
\|\theta_{t+1} - \theta_t\| \leq \epsilon
\end{cases}
</math>
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{t+1} - \theta_t\| \leq \epsilon</math> depends on how the policy is parameterized: the same constraint corresponds to different sets of allowed policies under different parameterizations. The natural policy gradient instead constrains the change in the policies themselves:<math display="block">\begin{cases}
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t) \\
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_t}) \leq \epsilon
\end{cases}</math>where the KL divergence between two policies is '''averaged''' over the state distribution under policy <math>\pi_{\theta_t}</math>:<math display="block">\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_t}) = \mathbb{E}_{s \sim \pi_{\theta_t}}\left[D_{KL}\big(\pi_{\theta_{t+1}}(\cdot \mid s) \,\|\, \pi_{\theta_t}(\cdot \mid s)\big)\right]</math>
=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_t}) \approx \frac{1}{2} (\theta_{t+1} - \theta_t)^T F(\theta_t) (\theta_{t+1} - \theta_t)
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math> This transforms the problem into a problem in [[quadratic programming]], yielding the natural policy gradient update:<math display="block">
\theta_{t+1} = \theta_t + \alpha F(\theta_t)^{-1} \nabla_\theta J(\theta_t)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_t))^T F(\theta_t)^{-1} \nabla_\theta J(\theta_t)}}</math>.
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations.
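One standard approximation is to avoid forming or inverting <math>F</math> at all, and instead solve <math>Fx = g</math> with the [[conjugate gradient method]], which needs only Fisher-vector products. An illustrative NumPy sketch:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only Fisher-vector products fvp(v) = F @ v,
    avoiding explicit storage or inversion of the Fisher matrix F."""
    x = np.zeros_like(g)
    r = g.copy()      # residual g - F x (x starts at 0)
    p = r.copy()      # search direction
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

In a neural-network setting, `fvp` would be implemented with automatic differentiation (a Hessian-vector product of the averaged KL), so the full matrix <math>F</math> is never materialized.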
== Trust Region Policy Optimization (TRPO) ==
{{Anchor|TRPO}}
'''Trust Region Policy Optimization''' (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a [[trust region]] constraint on policy updates.<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> Developed by Schulman et al. in 2015, TRPO stabilizes policy updates by explicitly enforcing the KL divergence constraint between successive policies, rather than relying on a fixed step size.
=== Formulation ===
Like natural policy gradient, TRPO iteratively updates the policy parameters <math>\theta</math> by solving a constrained optimization problem specified coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\theta, \theta_t) \\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_t}) \leq \epsilon
\end{cases}
</math>where
* <math>L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a)\right]</math> is the '''surrogate advantage''', measuring the performance of the new policy on states and actions sampled from the old one;
* <math>\epsilon</math> is the trust region radius.
Note that in general, other surrogate advantages are possible:<math display="block">L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} \Psi^{\pi_{\theta_t}}(s, a)\right]</math>where <math>\Psi</math> is any linear sum of the previously mentioned terms.
The surrogate advantage <math>L(\theta, \theta_t)</math> is designed to align with the policy gradient. Specifically, when <math>\theta = \theta_t</math>, '''the gradient of the surrogate advantage <math display="inline">
\nabla_\theta L(\theta, \theta_t)
</math>''' equals the policy gradient derived from the advantage function:
<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a | s) \cdot A^{\pi_\theta}(s, a) \right] = \nabla_\theta L(\theta, \theta_t)</math>However, when <math>\theta \neq \theta_t</math>, the equality no longer holds in general, and the surrogate advantage is only a local approximation of the true objective.
As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">
\begin{aligned}
L(\theta, \theta_t) &\approx g^T (\theta - \theta_t) \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) &\approx \frac{1}{2} (\theta - \theta_t)^T F (\theta - \theta_t)
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\theta, \theta_t)\big|_{\theta = \theta_t}</math> is the policy gradient;
* <math>F = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t})\big|_{\theta = \theta_t}</math> is the Fisher information matrix.
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{g^T F^{-1} g}}\, F^{-1} g
</math>So far, this is essentially the same as natural gradient method. However, TRPO improves upon it by two modifications:
* Use the [[conjugate gradient method]] to solve for <math display="inline">x</math> in <math>Fx = g</math> iteratively without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure the KL constraint and policy improvement, testing<math display="block">
\theta_{t+1} = \theta_t + \alpha^j \sqrt{\frac{2\epsilon}{x^T F x}}\, x
</math> for <math>j = 0, 1, 2, \dots</math> until both <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_t}) \leq \epsilon</math> and <math display="inline">
L(\theta_{t+1}, \theta_t) \geq 0
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
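The backtracking rule can be sketched as follows. This is illustrative: `kl_fn` and `surr_fn` stand for hypothetical evaluations of the averaged KL divergence and the surrogate advantage at a candidate parameter vector.

```python
def backtracking_step(theta, x, step0, kl_fn, surr_fn, eps, alpha=0.5, max_iters=10):
    """Shrink the step by alpha^j until the KL constraint and
    non-negative surrogate improvement both hold."""
    for j in range(max_iters):
        theta_new = theta + (alpha ** j) * step0 * x
        if kl_fn(theta_new) <= eps and surr_fn(theta_new) >= 0.0:
            return theta_new
    return theta  # no acceptable step found: keep the old parameters
```

Here `step0` plays the role of <math display="inline">\sqrt{2\epsilon / (x^T F x)}</math> and `x` the conjugate-gradient solution of <math>Fx = g</math>.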
Specifically, instead of maximizing the surrogate advantage<math display="block">
\max_\theta L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]
</math> under a KL divergence constraint, it directly inserts the constraint into the surrogate advantage:<math display="block">
\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[
\begin{cases}
\min\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)},\, 1+\epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{if } A^{\pi_{\theta_t}}(s, a) > 0 \\
\max\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)},\, 1-\epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{if } A^{\pi_{\theta_t}}(s, a) < 0
\end{cases}
\right]
</math> and PPO maximizes the surrogate advantage by stochastic gradient ascent, as usual.
In words, gradient-ascending the new surrogate advantage function means that, at some state <math>s</math>, if the advantage of an action is positive, the policy increases the probability of that action, but by no more than a factor of <math>1+\epsilon</math> over the old policy; if the advantage is negative, the policy decreases the probability, but by no more than a factor of <math>1-\epsilon</math>. This keeps the new policy close to the old one without an explicit KL constraint.
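The per-sample objective is straightforward to evaluate. The sketch below uses the equivalent form `min(r·A, clip(r, 1−ε, 1+ε)·A)`, which reproduces both cases at once:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample PPO-clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r is the probability ratio pi_theta / pi_theta_t."""
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # min picks the clipped branch exactly when clipping binds:
    # A > 0: min(r, 1+eps)*A;  A < 0: max(r, 1-eps)*A
    return np.minimum(ratio * adv, clipped * adv)
```

For example, a ratio of 2 with advantage +1 and <math>\epsilon = 0.2</math> is clipped to 1.2, while a ratio of 0.5 with advantage −1 yields −0.8.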
{{Anchor|GRPO}}
The Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s </math>, it samples multiple actions <math>a_1, \dots, a_G </math> from the policy <math>\pi_{\theta_t} </math>, then calculate the group-relative advantage<ref name=":1">{{Citation |last1=Shao |first1=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |arxiv=2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s, a_{j}) = \frac{r(s, a_{j}) - \mu}{\sigma} </math> where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s, a_1), \dots, r(s, a_G) </math>. That is, it is the [[standard score]] of the rewards.
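The group-relative advantage is simply the standard score of the sampled rewards within a group; for illustration:

```python
import numpy as np

def group_relative_advantage(rewards):
    """Standard score of each sampled action's reward within its group:
    (r_j - mean) / std, using the population standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()
```

Actions whose reward is above the group mean receive positive advantage, and those below receive negative advantage, without any learned value function.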
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">
\max_\theta \frac{1}{G} \sum_{j=1}^G \mathbb{E}_{s,\, a_{1:G} \sim \pi_{\theta_t}}\left[ \min\left( \frac{\pi_\theta(a_j|s)}{\pi_{\theta_t}(a_j|s)} A^{\pi_{\theta_t}}(s, a_j),\ \operatorname{clip}\left(\frac{\pi_\theta(a_j|s)}{\pi_{\theta_t}(a_j|s)},\, 1-\epsilon,\, 1+\epsilon\right) A^{\pi_{\theta_t}}(s, a_j) \right) \right]
</math>