Policy gradient method: Difference between revisions

=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \approx \frac{1}{2} (\theta_{i+1} - \theta_i)^T F(\theta_i) (\theta_{i+1} - \theta_i)
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math> This transforms the constrained optimization into a [[quadratic programming]] problem, yielding the natural policy gradient update:<math display="block">
\theta_{i+1} = \theta_i + \alpha F(\theta_i)^{-1} \nabla_\theta J(\theta_i)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_i))^T F(\theta_i)^{-1} \nabla_\theta J(\theta_i)}}</math>.
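The update and step-size rule above can be sketched in a few lines of NumPy. This is an illustrative sketch, not from the article: it assumes the Fisher matrix <math>F(\theta)</math> and the gradient <math>\nabla_\theta J(\theta)</math> have already been estimated, and that <math>F</math> is positive definite.

```python
import numpy as np

def natural_gradient_step(theta, grad_j, fisher, epsilon=0.01):
    """One natural policy gradient update under a KL trust region.

    theta   : current parameter vector
    grad_j  : estimated policy gradient, an approximation of grad_theta J(theta)
    fisher  : Fisher information matrix F(theta), assumed positive definite
    epsilon : per-step KL-divergence budget
    """
    # Natural gradient direction: F(theta)^{-1} grad_theta J(theta).
    nat_grad = np.linalg.solve(fisher, grad_j)
    # Step size alpha = sqrt(2 * epsilon / (grad_J^T F^{-1} grad_J)), which makes
    # the quadratic KL approximation (1/2) d^T F d equal to epsilon.
    alpha = np.sqrt(2.0 * epsilon / (grad_j @ nat_grad))
    return theta + alpha * nat_grad
```

By construction, the resulting displacement <math>d = \theta_{i+1} - \theta_i</math> satisfies <math>\tfrac{1}{2} d^T F d = \epsilon</math> under the quadratic approximation.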
 
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations therefore avoid forming the explicit inverse, for example by using the [[conjugate gradient method]] to approximately solve <math>F(\theta) x = \nabla_\theta J(\theta)</math> using only Fisher-vector products.
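As a minimal sketch of how the inverse can be avoided, the following solves <math>F x = g</math> by conjugate gradient using only a matrix-vector product routine `fvp` (in practice supplied by automatic differentiation), so <math>F</math> is never formed or inverted explicitly. The function name and arguments are illustrative, not from any particular library.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only Fisher-vector products.

    fvp : callable v -> F v for the (positive definite) Fisher matrix F
    g   : estimated policy gradient vector
    """
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at zero)
    p = r.copy()          # current search direction
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        a = rs / (p @ Fp)           # exact line search along p
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p   # conjugate direction update
        rs = rs_new
    return x
```

For an <math>n</math>-dimensional quadratic, conjugate gradient converges in at most <math>n</math> iterations, and a small fixed number of iterations usually gives an adequate natural-gradient direction.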
 
== Trust Region Policy Optimization (TRPO) ==
{{Anchor|TRPO}}