Policy gradient method

</math>This transforms the problem into one of [[quadratic programming]], yielding the natural policy gradient update:<math display="block">
\theta_{t+1} = \theta_t + \alpha F(\theta_t)^{-1} \nabla_\theta J(\theta_t)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_t))^T F(\theta_t)^{-1} \nabla_\theta J(\theta_t)}}</math>.
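The update above can be sketched in a few lines of NumPy. This is a minimal illustration, not an implementation from the cited sources; the helper name <code>natural_gradient_step</code> and the explicit solve are assumptions, reasonable only when the parameter dimension is small enough to form <math>F(\theta)</math> directly.

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, epsilon=0.01):
    """One natural policy gradient update (illustrative sketch).

    theta:   parameter vector
    grad:    estimate of the policy gradient, grad J(theta)
    fisher:  Fisher information matrix F(theta)
    epsilon: KL-divergence trust-region radius
    """
    # Natural gradient direction: F(theta)^{-1} grad, via a linear solve
    # rather than forming the inverse explicitly.
    nat_grad = np.linalg.solve(fisher, grad)
    # Step size so the quadratic KL approximation equals epsilon:
    # alpha = sqrt(2*epsilon / (grad^T F^{-1} grad)).
    alpha = np.sqrt(2.0 * epsilon / (grad @ nat_grad))
    return theta + alpha * nat_grad
```

When <math>F(\theta)</math> is the identity, this reduces to an ordinary gradient step whose length is set by the trust-region radius.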
 
=== Practical considerations ===
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations:
* [[Conjugate gradient method]] to compute <math display="inline">F(\theta)^{-1} \nabla_\theta J(\theta) </math> without explicit inversion.<ref name=":3" />
* Instead of computing <math>F(\theta_t)^{-1} </math>, iteratively solve <math>\nabla_\theta J(\theta_t) = F(\theta_t) w_t</math> for <math>w_t</math>, using the previous step's solution <math>w_{t-1}</math> as an initial guess, then update by <math display="inline">
\theta_{t+1} = \theta_t + \alpha w_t
</math> with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{w_t^T F(\theta_t)w_t }}</math>.
* Trust region methods like [[Trust region policy optimization]] (TRPO), which enforce KL constraints via constrained optimization.<ref name=":3" />
* [[Proximal policy optimization]] (PPO), which approximates the natural gradient with clipped probability ratios.<ref name=":0" />
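The first two bullet points can be combined into a short sketch: solve <math>F(\theta_t) w_t = \nabla_\theta J(\theta_t)</math> with the conjugate gradient method, using only Fisher-vector products and a warm start from the previous step. The function names (<code>conjugate_gradient</code>, <code>natural_step</code>, <code>fvp</code>) are illustrative assumptions, not APIs from the cited works.

```python
import numpy as np

def conjugate_gradient(fvp, g, x0=None, iters=10, tol=1e-10):
    """Solve F w = g given only Fisher-vector products fvp(v) = F v,
    avoiding explicit inversion of F. x0 warm-starts from the previous
    step's solution."""
    x = np.zeros_like(g) if x0 is None else x0.copy()
    r = g - fvp(x)          # residual
    p = r.copy()            # search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Fp = fvp(p)
        a = rs / (p @ Fp)
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def natural_step(theta, g, fvp, w_prev=None, epsilon=0.01):
    """One natural gradient update with alpha = sqrt(2*eps / (w^T F w))."""
    w = conjugate_gradient(fvp, g, x0=w_prev)
    alpha = np.sqrt(2.0 * epsilon / (w @ fvp(w)))
    return theta + alpha * w, w
```

In practice, <code>fvp</code> is computed from samples without ever materializing <math>F(\theta)</math>, which is what makes this approach feasible for neural network policies.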