Policy gradient method

 
== Overview ==
In policy-based RL, the actor is a parameterized policy function <math>\pi_\theta</math>, where <math>\theta</math> are the parameters of the actor. The actor takes as argument the state of the environment <math>s</math> and produces a [[probability distribution]] <math>\pi_\theta(\cdot \mid s)</math>.
 
If the action space is discrete, then <math>\sum_{a} \pi_\theta(a \mid s) = 1</math>. If the action space is continuous, then <math>\int_{a} \pi_\theta(a \mid s) \,\mathrm{d}a = 1</math>.
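For the discrete case, the normalization constraint is satisfied automatically by a softmax parameterization. The following is a minimal illustrative sketch (the linear-softmax form and all names are assumptions, not from the article) of a policy <math>\pi_\theta(\cdot \mid s)</math> over a discrete action space:

```python
import numpy as np

# Illustrative sketch: a linear-softmax policy pi_theta(a | s) for a
# discrete action space. theta holds one weight vector per action; the
# softmax guarantees sum_a pi_theta(a | s) = 1 by construction.
def softmax_policy(theta, s):
    """Return the action distribution pi_theta(. | s)."""
    logits = theta @ s                    # one logit per action
    z = np.exp(logits - logits.max())     # subtract max for stability
    return z / z.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))           # 3 actions, 4 state features
s = rng.normal(size=4)
probs = softmax_policy(theta, s)
print(probs.sum())                        # sums to 1 up to rounding
```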
 
The goal of policy optimization is to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \gamma^t R_t \Big| S_0 = s_0 \right]</math>where <math>\gamma</math> is the [[discount factor]], <math>R_t</math> is the reward at step <math>t</math>, <math>s_0</math> is the initial state, and <math>T</math> is the time horizon.
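The objective can be estimated by averaging the discounted return over sampled episodes. A minimal sketch, with made-up reward sequences standing in for rollouts of some policy <math>\pi_\theta</math>:

```python
import numpy as np

# Illustrative sketch: J(theta) is estimated by averaging the discounted
# return sum_t gamma^t * R_t over sampled episodes. The reward lists here
# are made-up data standing in for actual rollouts.
def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

episodes = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # R_0..R_T per episode
gamma = 0.9
J_hat = np.mean([discounted_return(r, gamma) for r in episodes])
print(J_hat)   # (1.81 + 1.71) / 2 = 1.76
```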
=== Policy gradient ===
The '''REINFORCE algorithm''' was the first policy gradient method.<ref>{{Cite journal |last=Williams |first=Ronald J. |date=May 1992 |title=Simple statistical gradient-following algorithms for connectionist reinforcement learning |url=http://link.springer.com/10.1007/BF00992696 |journal=Machine Learning |language=en |volume=8 |issue=3–4 |pages=229–256 |doi=10.1007/BF00992696 |issn=0885-6125}}</ref> It is based on the identity for the policy gradient<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[
\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t \mid S_t)\; \sum_{t \in 0:T} (\gamma^t R_t)
\Big|S_0 = s_0
\right]</math>which can be improved via the "causality trick"<ref>{{Cite journal |last1=Sutton |first1=Richard S |last2=McAllester |first2=David |last3=Singh |first3=Satinder |last4=Mansour |first4=Yishay |date=1999 |title=Policy Gradient Methods for Reinforcement Learning with Function Approximation |url=https://proceedings.neurips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref><math display="block">
\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t\mid S_t)\sum_{\tau \in t:T} (\gamma^\tau R_\tau)
\Big|S_0 = s_0 \right]
</math>
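The causality trick replaces the full discounted return with the reward-to-go <math>\sum_{\tau \in t:T} \gamma^\tau R_\tau</math>, since an action cannot influence rewards received before it was taken. A short sketch (illustrative names) computing this quantity for every step of an episode by a single backward pass:

```python
import numpy as np

# Sketch of the reward-to-go used by the "causality trick": for each step
# t it returns sum_{tau >= t} gamma^tau * R_tau (note the gamma^tau
# weighting, matching the expression in the article), via one backward pass.
def rewards_to_go(rewards, gamma):
    out = np.zeros(len(rewards))
    acc = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        acc = gamma**t * rewards[t] + acc
        out[t] = acc
    return out

print(rewards_to_go([1.0, 1.0, 1.0], 0.9))   # [2.71, 1.71, 0.81]
```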
}}
{{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block">
\nabla_\theta J(\theta) \approx \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^\tau R_{\tau ,n}) \right]
</math>where the index <math>n</math> ranges over <math>N</math> rollout trajectories using the policy <math>\pi_\theta </math>.
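The estimator above can be sketched in code. The linear-softmax policy and all names below are illustrative assumptions; for that policy the score function has the closed form <math>\nabla_\theta \ln \pi_\theta(a \mid s) = (\mathbf{1}_a - \pi_\theta(\cdot \mid s)) \otimes s</math>:

```python
import numpy as np

# Illustrative sketch of the Monte Carlo policy-gradient estimator for a
# linear-softmax policy (an assumption, not the article's setup).
def policy(theta, s):
    logits = theta @ s
    z = np.exp(logits - logits.max())
    return z / z.sum()

def score(theta, s, a):
    """Closed-form score: grad_theta ln pi(a | s) = (onehot(a) - pi) outer s."""
    p = policy(theta, s)
    onehot = np.eye(len(p))[a]
    return np.outer(onehot - p, s)

def pg_estimate(theta, trajectories, gamma):
    """trajectories: list of rollouts, each a list of (s, a, r) tuples."""
    g = np.zeros_like(theta)
    for traj in trajectories:
        rewards = [r for _, _, r in traj]
        for t, (s, a, _) in enumerate(traj):
            # reward-to-go: sum_{tau >= t} gamma^tau * R_tau
            rtg = sum(gamma**tau * rewards[tau] for tau in range(t, len(traj)))
            g += score(theta, s, a) * rtg
    return g / len(trajectories)
```

Averaging over more trajectories <math>N</math> reduces the variance of the estimate without introducing bias.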
 
The [[Score (statistics)|score function]] <math>\nabla_\theta \ln \pi_\theta (A_t \mid S_t)</math> can be interpreted as the direction in parameter space that increases the probability of taking action <math>A_t</math> in state <math>S_t</math>. The policy gradient is then a [[weighted average]] of these directions, weighted by the reward signals: if taking a certain action in a certain state is associated with high reward, that direction is strongly reinforced, and vice versa.
 
=== Algorithm ===
 
# Rollout <math>N</math> trajectories in the environment, using <math>\pi_{\theta_i}</math> as the policy function.
# Compute the policy gradient estimation: <math>g_i \leftarrow \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_{\theta_i}\ln\pi_{\theta_i}(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^\tau R_{\tau,n}) \right]</math>
# Update the policy by gradient ascent: <math>\theta_{i+1} \leftarrow \theta_i + \alpha_i g_i</math>
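The loop above can be sketched end to end on a toy problem. The two-armed bandit below (a single state, <math>T = 0</math>) is an illustrative assumption, not from the article; arm 1 pays 1 and arm 0 pays 0, so gradient ascent should drive the probability of arm 1 toward 1:

```python
import numpy as np

# Runnable sketch of the algorithm above on a two-armed bandit with a
# softmax policy over logits theta (all names and values illustrative).
rng = np.random.default_rng(0)
theta = np.zeros(2)                      # policy logits
alpha, N = 0.5, 32                       # step size, rollouts per update

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for step in range(50):
    g = np.zeros_like(theta)
    for _ in range(N):                   # 1. rollout N one-step trajectories
        a = rng.choice(2, p=pi(theta))
        r = float(a == 1)                # arm 1 pays 1, arm 0 pays 0
        g += (np.eye(2)[a] - pi(theta)) * r   # 2. score * reward
    theta += alpha * g / N               # 3. gradient ascent
print(pi(theta)[1])                      # close to 1 after training
```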