== Natural policy gradient ==
The natural policy gradient method is a variant of the policy gradient method, proposed by [[Sham Kakade|Kakade]] in 2001.<ref>{{Cite journal |last=Kakade |first=Sham M |date=2001 |title=A Natural Policy Gradient |url=https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=14}}</ref> Unlike standard policy gradient methods, which depend on the particular choice of parameters <math>\theta</math> (making updates coordinate-dependent), the natural policy gradient aims to provide a [[coordinate-free]] update, which is geometrically "natural".
 
=== Motivation ===
The standard policy gradient method is "unnatural" in the geometric sense: its updates depend on the choice of parameter coordinates. A "natural" policy gradient modifies it so that the policy updates are [[coordinate-free]].
Standard policy gradient updates <math>\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)</math> solve a constrained optimization problem:<math display="block">
\begin{cases}
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t)\\
\|\theta_{t+1} - \theta_{t}\|\leq \alpha \cdot \|\nabla_\theta J(\theta_t)\|
\end{cases}
</math>
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{t+1} - \theta_t\| </math> introduces coordinate dependence. To address this, the natural policy gradient replaces the Euclidean constraint with a [[Kullback–Leibler divergence]] (KL) constraint:<math display="block">
\begin{cases}
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t)\\
D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>This ensures updates are invariant to invertible affine parameter transformations.
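The coordinate dependence of the Euclidean constraint can be seen in a minimal numeric sketch (the setup below is our own illustration, not from the article): a Bernoulli policy with success probability <math>\sigma(\theta)</math>, where rescaling the parameter as <math>\theta' = 2\theta</math> halves the gradient by the chain rule, so a plain gradient step moves the policy by a different amount, while the KL divergence between two policies depends only on the distributions themselves:

```python
import numpy as np

# Illustrative sketch (our own example, not from the article): a Bernoulli
# policy with P(action=1) = sigmoid(theta) and objective J = sigmoid(theta)
# (reward 1 for action 1). Reparameterizing theta' = 2*theta halves the
# gradient, so a fixed-learning-rate Euclidean step yields a *different*
# new policy, while KL divergence is parameterization-free.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q).
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def grad_J(theta):
    # dJ/dtheta for J = sigmoid(theta), i.e. p * (1 - p).
    p = sigmoid(theta)
    return p * (1 - p)

theta, alpha = 0.5, 1.0

# Gradient step in the original coordinates ...
p_new_a = sigmoid(theta + alpha * grad_J(theta))

# ... and in rescaled coordinates theta' = 2*theta, where the chain rule
# gives dJ/dtheta' = (1/2) dJ/dtheta; map the result back to theta-scale.
theta_p = 2 * theta + alpha * 0.5 * grad_J(theta)
p_new_b = sigmoid(theta_p / 2)

print(p_new_a, p_new_b)  # different new policies: the update is coordinate-dependent
print(kl_bernoulli(sigmoid(theta), p_new_a))  # KL depends only on the distributions
```

The two resulting policies differ even though both steps used the same learning rate, which is exactly the coordinate dependence the KL constraint removes.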
 
=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \approx \frac{1}{2} (\theta_{t+1} - \theta_t)^T F(\theta_t) (\theta_{t+1} - \theta_t)
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math>This turns the problem into a [[quadratic programming]] problem, yielding the natural policy gradient update:<math display="block">
\theta_{t+1} = \theta_t + \alpha F(\theta_t)^{-1} \nabla_\theta J(\theta_t)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \propto \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_t))^T F(\theta_t)^{-1} \nabla_\theta J(\theta_t)}}</math>.
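The update above can be sketched concretely for a softmax policy over a few discrete actions in a single state (a minimal construction of our own, assuming per-action rewards; a small damping term is added because the softmax Fisher matrix is singular under logit shifts):

```python
import numpy as np

# Minimal sketch (our own construction): softmax policy over K actions
# parameterized by logits theta. The score of action a is e_a - pi, the
# Fisher matrix is the expected outer product of scores under pi, and the
# natural gradient is F^{-1} g with the KL-budget step size from the text.

rng = np.random.default_rng(0)
K = 4
theta = rng.normal(size=K)
rewards = np.array([1.0, 0.0, 0.5, 0.2])  # assumed per-action rewards

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)
scores = np.eye(K) - pi  # row a is grad_theta log pi(a)

# Fisher matrix: F = E_{a ~ pi}[ s(a) s(a)^T ].
F = sum(pi[a] * np.outer(scores[a], scores[a]) for a in range(K))

# Policy gradient of J = E_{a ~ pi}[r(a)]: sum_a pi(a) r(a) s(a).
g = sum(pi[a] * rewards[a] * scores[a] for a in range(K))

# F is singular for softmax logits (shift invariance), so damp it slightly.
damping = 1e-6
nat_g = np.linalg.solve(F + damping * np.eye(K), g)

# Step size chosen to spend the KL budget epsilon, per the formula above.
eps = 0.01
alpha = np.sqrt(2 * eps / (g @ nat_g))
theta_new = theta + alpha * nat_g
```

Under the quadratic approximation, this step size makes <math display="inline">\tfrac{1}{2}(\theta_{t+1}-\theta_t)^T F (\theta_{t+1}-\theta_t) \approx \epsilon</math>, i.e. the update exactly spends the KL budget.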
 
=== Practical considerations ===
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations:
* [[Conjugate gradient method]] to compute <math display="inline">F(\theta)^{-1} \nabla_\theta J(\theta) </math> without explicit inversion.<ref name=":3" />
* Trust region methods like [[Trust region policy optimization]] (TRPO), which enforce KL constraints via constrained optimization.<ref name=":3" />
* [[Proximal policy optimization]] (PPO), which approximates the natural gradient with clipped probability ratios.<ref name=":0" />
 
These methods address the trade-off between inversion complexity and policy update stability, making natural policy gradients feasible in large-scale applications.
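The conjugate-gradient trick from the first bullet can be sketched in a standalone form (our own toy setup: in deep-RL implementations the Fisher-vector product comes from automatic differentiation, whereas here it is computed from stored score vectors, using <math display="inline">Fv = \sum_i w_i (s_i^T v)\, s_i</math>):

```python
import numpy as np

# Hedged sketch: solve F x = g with conjugate gradient, using only
# Fisher-vector products and never forming the n_params x n_params matrix F.
# Here F = sum_i w_i s_i s_i^T for sample score vectors s_i and weights w_i.

def fisher_vector_product(scores, weights, v, damping=1e-3):
    # Computes F v + damping * v without materializing F.
    Fv = (weights * (scores @ v)) @ scores
    return Fv + damping * v

def conjugate_gradient(fvp, g, iters=20, tol=1e-10):
    # Standard CG for the SPD system (F + damping I) x = g.
    x = np.zeros_like(g)
    r = g.copy()   # residual g - F x (x starts at zero)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        a = rs / (p @ Fp)
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy data: random score vectors standing in for sampled grad log pi terms.
rng = np.random.default_rng(1)
n_samples, n_params = 64, 10
scores = rng.normal(size=(n_samples, n_params))
weights = np.full(n_samples, 1.0 / n_samples)
g = rng.normal(size=n_params)

x = conjugate_gradient(lambda v: fisher_vector_product(scores, weights, v), g)
# x approximates (F + damping I)^{-1} g
```

Each iteration costs only one Fisher-vector product, which is why this scales to neural-network policies where explicitly inverting <math>F(\theta)</math> is infeasible.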
 
== See also ==