== Natural policy gradient ==
The natural policy gradient method is a variant of the policy gradient method, proposed by [[Sham Kakade|Kakade]] in 2001.<ref>{{Cite journal |last=Kakade |first=Sham M |date=2001 |title=A Natural Policy Gradient |url=https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=14}}</ref> Unlike standard policy gradient methods, which depend on the particular choice of parameters <math>\theta</math> (making updates coordinate-dependent), the natural policy gradient aims to provide a [[coordinate-free]] update, which is geometrically "natural".
 
=== Motivation ===
The standard policy gradient method is "unnatural" in the geometric sense: its updates depend on the choice of parameter coordinates. A "natural" policy gradient modifies it so that the policy updates are [[coordinate-free]].
Standard policy gradient updates <math>\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)</math> solve a constrained optimization problem:<math display="block">
\begin{cases}
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t)\\
\|\theta_{t+1} - \theta_{t}\|\leq \alpha \cdot \|\nabla_\theta J(\theta_t)\|
\end{cases}
</math>
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{t+1} - \theta_t\| </math> introduces coordinate dependence. To address this, the natural policy gradient replaces the Euclidean constraint with a [[Kullback–Leibler divergence]] (KL) constraint:<math display="block">
\begin{cases}
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t)\\
D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>This ensures updates are invariant to invertible affine parameter transformations.
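The coordinate dependence of the Euclidean constraint can be seen in a minimal numeric sketch (the setup below is our own illustration, not from the article): a Bernoulli policy with success probability <math>\sigma(\theta)</math>, where rescaling the parameter as <math>\theta' = 2\theta</math> halves the gradient by the chain rule, so a plain gradient step moves the policy by a different amount, while the KL divergence between two policies depends only on the distributions themselves:

```python
import numpy as np

# Illustrative sketch (our own example, not from the article): a Bernoulli
# policy with P(action=1) = sigmoid(theta) and objective J = sigmoid(theta)
# (reward 1 for action 1). Reparameterizing theta' = 2*theta halves the
# gradient, so a fixed-learning-rate Euclidean step yields a *different*
# new policy, while KL divergence is parameterization-free.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q).
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def grad_J(theta):
    # dJ/dtheta for J = sigmoid(theta), i.e. p * (1 - p).
    p = sigmoid(theta)
    return p * (1 - p)

theta, alpha = 0.5, 1.0

# Gradient step in the original coordinates ...
p_new_a = sigmoid(theta + alpha * grad_J(theta))

# ... and in rescaled coordinates theta' = 2*theta, where the chain rule
# gives dJ/dtheta' = (1/2) dJ/dtheta; map the result back to theta-scale.
theta_p = 2 * theta + alpha * 0.5 * grad_J(theta)
p_new_b = sigmoid(theta_p / 2)

print(p_new_a, p_new_b)  # different new policies: the update is coordinate-dependent
print(kl_bernoulli(sigmoid(theta), p_new_a))  # KL depends only on the distributions
```

The two resulting policies differ even though both steps used the same learning rate, which is exactly the coordinate dependence the KL constraint removes.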
 
=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \approx \frac{1}{2} (\theta_{t+1} - \theta_t)^T F(\theta_t) (\theta_{t+1} - \theta_t)
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math>This turns the problem into a [[quadratic programming]] problem, yielding the natural policy gradient update:<math display="block">
\theta_{t+1} = \theta_t + \alpha F(\theta_t)^{-1} \nabla_\theta J(\theta_t)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \propto \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_t))^T F(\theta_t)^{-1} \nabla_\theta J(\theta_t)}}</math>.
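The update above can be sketched concretely for a softmax policy over a few discrete actions in a single state (a minimal construction of our own, assuming per-action rewards; a small damping term is added because the softmax Fisher matrix is singular under logit shifts):

```python
import numpy as np

# Minimal sketch (our own construction): softmax policy over K actions
# parameterized by logits theta. The score of action a is e_a - pi, the
# Fisher matrix is the expected outer product of scores under pi, and the
# natural gradient is F^{-1} g with the KL-budget step size from the text.

rng = np.random.default_rng(0)
K = 4
theta = rng.normal(size=K)
rewards = np.array([1.0, 0.0, 0.5, 0.2])  # assumed per-action rewards

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)
scores = np.eye(K) - pi  # row a is grad_theta log pi(a)

# Fisher matrix: F = E_{a ~ pi}[ s(a) s(a)^T ].
F = sum(pi[a] * np.outer(scores[a], scores[a]) for a in range(K))

# Policy gradient of J = E_{a ~ pi}[r(a)]: sum_a pi(a) r(a) s(a).
g = sum(pi[a] * rewards[a] * scores[a] for a in range(K))

# F is singular for softmax logits (shift invariance), so damp it slightly.
damping = 1e-6
nat_g = np.linalg.solve(F + damping * np.eye(K), g)

# Step size chosen to spend the KL budget epsilon, per the formula above.
eps = 0.01
alpha = np.sqrt(2 * eps / (g @ nat_g))
theta_new = theta + alpha * nat_g
```

Under the quadratic approximation, this step size makes <math display="inline">\tfrac{1}{2}(\theta_{t+1}-\theta_t)^T F (\theta_{t+1}-\theta_t) \approx \epsilon</math>, i.e. the update exactly spends the KL budget.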
 
=== Practical considerations ===
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations:
* [[Conjugate gradient method]] to compute <math display="inline">F(\theta)^{-1} \nabla_\theta J(\theta) </math> without explicit inversion.<ref name=":3" />
* Trust region methods like [[Trust region policy optimization]] (TRPO), which enforce KL constraints via constrained optimization.<ref name=":3" />
* [[Proximal policy optimization]] (PPO), which approximates the natural gradient with clipped probability ratios.<ref name=":0" />
 
These methods address the trade-off between inversion complexity and policy update stability, making natural policy gradients feasible in large-scale applications.
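The conjugate-gradient trick from the first bullet can be sketched in a standalone form (our own toy setup: in deep-RL implementations the Fisher-vector product comes from automatic differentiation, whereas here it is computed from stored score vectors, using <math display="inline">Fv = \sum_i w_i (s_i^T v)\, s_i</math>):

```python
import numpy as np

# Hedged sketch: solve F x = g with conjugate gradient, using only
# Fisher-vector products and never forming the n_params x n_params matrix F.
# Here F = sum_i w_i s_i s_i^T for sample score vectors s_i and weights w_i.

def fisher_vector_product(scores, weights, v, damping=1e-3):
    # Computes F v + damping * v without materializing F.
    Fv = (weights * (scores @ v)) @ scores
    return Fv + damping * v

def conjugate_gradient(fvp, g, iters=20, tol=1e-10):
    # Standard CG for the SPD system (F + damping I) x = g.
    x = np.zeros_like(g)
    r = g.copy()   # residual g - F x (x starts at zero)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        a = rs / (p @ Fp)
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy data: random score vectors standing in for sampled grad log pi terms.
rng = np.random.default_rng(1)
n_samples, n_params = 64, 10
scores = rng.normal(size=(n_samples, n_params))
weights = np.full(n_samples, 1.0 / n_samples)
g = rng.normal(size=n_params)

x = conjugate_gradient(lambda v: fisher_vector_product(scores, weights, v), g)
# x approximates (F + damping I)^{-1} g
```

Each iteration costs only one Fisher-vector product, which is why this scales to neural-network policies where explicitly inverting <math>F(\theta)</math> is infeasible.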
 
== See also ==