Policy gradient method: Difference between revisions

=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \approx \frac{1}{2} (\theta_{i+1} - \theta_i)^T F(\theta_i) (\theta_{i+1} - \theta_i)
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math> This transforms the constrained optimization into a [[quadratic programming]] problem, yielding the natural policy gradient update:<math display="block">
\theta_{i+1} = \theta_i + \alpha F(\theta_i)^{-1} \nabla_\theta J(\theta_i)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_i))^T F(\theta_i)^{-1} \nabla_\theta J(\theta_i)}}</math>.
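The update and step-size rule above can be sketched in a few lines of NumPy. This is an illustrative sketch, not from the article: it assumes the Fisher matrix <math>F(\theta)</math> and the gradient <math>\nabla_\theta J(\theta)</math> have already been estimated, and that <math>F</math> is positive definite.

```python
import numpy as np

def natural_gradient_step(theta, grad_j, fisher, epsilon=0.01):
    """One natural policy gradient update under a KL trust region.

    theta   : current parameter vector
    grad_j  : estimated policy gradient, an approximation of grad_theta J(theta)
    fisher  : Fisher information matrix F(theta), assumed positive definite
    epsilon : per-step KL-divergence budget
    """
    # Natural gradient direction: F(theta)^{-1} grad_theta J(theta).
    nat_grad = np.linalg.solve(fisher, grad_j)
    # Step size alpha = sqrt(2 * epsilon / (grad_J^T F^{-1} grad_J)), which makes
    # the quadratic KL approximation (1/2) d^T F d equal to epsilon.
    alpha = np.sqrt(2.0 * epsilon / (grad_j @ nat_grad))
    return theta + alpha * nat_grad
```

By construction, the resulting displacement <math>d = \theta_{i+1} - \theta_i</math> satisfies <math>\tfrac{1}{2} d^T F d = \epsilon</math> under the quadratic approximation.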
 
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations therefore avoid forming the explicit inverse, for example by using the [[conjugate gradient method]] to approximately solve <math>F(\theta) x = \nabla_\theta J(\theta)</math> using only Fisher-vector products.
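As a minimal sketch of how the inverse can be avoided, the following solves <math>F x = g</math> by conjugate gradient using only a matrix-vector product routine `fvp` (in practice supplied by automatic differentiation), so <math>F</math> is never formed or inverted explicitly. The function name and arguments are illustrative, not from any particular library.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only Fisher-vector products.

    fvp : callable v -> F v for the (positive definite) Fisher matrix F
    g   : estimated policy gradient vector
    """
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at zero)
    p = r.copy()          # current search direction
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        a = rs / (p @ Fp)           # exact line search along p
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p   # conjugate direction update
        rs = rs_new
    return x
```

For an <math>n</math>-dimensional quadratic, conjugate gradient converges in at most <math>n</math> iterations, and a small fixed number of iterations usually gives an adequate natural-gradient direction.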
 
== Trust Region Policy Optimization (TRPO) ==
{{Anchor|TRPO}}