Policy gradient method

</math>This transforms the problem into one of [[quadratic programming]], yielding the natural policy gradient update:<math display="block">
\theta_{t+1} = \theta_t + \alpha F(\theta_t)^{-1} \nabla_\theta J(\theta_t)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_t))^T F(\theta_t)^{-1} \nabla_\theta J(\theta_t)}}</math>.
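The update above can be sketched in a few lines of NumPy. This is a minimal illustration, not an implementation from the cited sources; the helper name <code>natural_gradient_step</code> and the explicit solve are assumptions, reasonable only when the parameter dimension is small enough to form <math>F(\theta)</math> directly.

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, epsilon=0.01):
    """One natural policy gradient update (illustrative sketch).

    theta:   parameter vector
    grad:    estimate of the policy gradient, grad J(theta)
    fisher:  Fisher information matrix F(theta)
    epsilon: KL-divergence trust-region radius
    """
    # Natural gradient direction: F(theta)^{-1} grad, via a linear solve
    # rather than forming the inverse explicitly.
    nat_grad = np.linalg.solve(fisher, grad)
    # Step size so the quadratic KL approximation equals epsilon:
    # alpha = sqrt(2*epsilon / (grad^T F^{-1} grad)).
    alpha = np.sqrt(2.0 * epsilon / (grad @ nat_grad))
    return theta + alpha * nat_grad
```

When <math>F(\theta)</math> is the identity, this reduces to an ordinary gradient step whose length is set by the trust-region radius.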
 
=== Practical considerations ===
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations:
* [[Conjugate gradient method]] to compute <math display="inline">F(\theta)^{-1} \nabla_\theta J(\theta) </math> without explicit inversion.<ref name=":3" />
* Instead of computing <math>F(\theta_t)^{-1} </math>, iteratively solve <math>\nabla_\theta J(\theta_t) = F(\theta_t) w_t</math> for <math>w_t</math>, using the previous step's solution <math>w_{t-1}</math> as an initial guess, then update by <math display="inline">
\theta_{t+1} = \theta_t + \alpha w_t
</math> with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{w_t^T F(\theta_t)w_t }}</math>.
* Trust region methods like [[Trust region policy optimization]] (TRPO), which enforce KL constraints via constrained optimization.<ref name=":3" />
* [[Proximal policy optimization]] (PPO), which approximates the natural gradient with clipped probability ratios.<ref name=":0" />
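The first two bullet points can be combined into a short sketch: solve <math>F(\theta_t) w_t = \nabla_\theta J(\theta_t)</math> with the conjugate gradient method, using only Fisher-vector products and a warm start from the previous step. The function names (<code>conjugate_gradient</code>, <code>natural_step</code>, <code>fvp</code>) are illustrative assumptions, not APIs from the cited works.

```python
import numpy as np

def conjugate_gradient(fvp, g, x0=None, iters=10, tol=1e-10):
    """Solve F w = g given only Fisher-vector products fvp(v) = F v,
    avoiding explicit inversion of F. x0 warm-starts from the previous
    step's solution."""
    x = np.zeros_like(g) if x0 is None else x0.copy()
    r = g - fvp(x)          # residual
    p = r.copy()            # search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Fp = fvp(p)
        a = rs / (p @ Fp)
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def natural_step(theta, g, fvp, w_prev=None, epsilon=0.01):
    """One natural gradient update with alpha = sqrt(2*eps / (w^T F w))."""
    w = conjugate_gradient(fvp, g, x0=w_prev)
    alpha = np.sqrt(2.0 * epsilon / (w @ fvp(w)))
    return theta + alpha * w, w
```

In practice, <code>fvp</code> is computed from samples without ever materializing <math>F(\theta)</math>, which is what makes this approach feasible for neural network policies.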