Policy gradient method: Difference between revisions

This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{g^T F^{-1} g}}\, F^{-1} g.
</math>So far, this is essentially the same as the natural gradient method. However, TRPO improves upon it with two modifications:
 
* Use the [[conjugate gradient method]] to solve for <math display="inline">x</math> in <math>Fx = g</math> iteratively, without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure the KL constraint and policy improvement by repeatedly trying<math display="block">
\theta_{t+1} = \theta_t + \alpha^j \sqrt{\frac{2\epsilon}{x^T F x}}\, x, \quad j = 0, 1, 2, \dots
</math>until one is found that both satisfies the KL constraint and results in an improvement in <math>L(\theta_t, \theta)</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient, and <math>j</math> is the smallest non-negative integer satisfying both conditions.
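The two modifications above can be sketched in code. The following is a minimal illustration, not TRPO's reference implementation: it assumes the Fisher matrix <math>F</math> is accessed only through Fisher-vector products (`fvp`), and the `surrogate` and `kl` callables standing in for <math>L(\theta_t, \theta)</math> and the KL divergence are hypothetical placeholders.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=50, tol=1e-10):
    # Solve F x = g using only Fisher-vector products fvp(v) = F @ v,
    # so the Fisher matrix is never formed or inverted explicitly.
    x = np.zeros_like(g)
    r = g.copy()              # residual g - F x (x = 0 initially)
    p = r.copy()              # search direction
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        a = rs / (p @ Fp)
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta, g, fvp, surrogate, kl,
              eps=0.01, alpha=0.5, max_backtracks=10):
    # One TRPO-style update: conjugate-gradient solve, then backtracking
    # line search over shrinking step sizes alpha**j for j = 0, 1, 2, ...
    x = conjugate_gradient(fvp, g)
    full_step = np.sqrt(2 * eps / (x @ fvp(x))) * x
    L_old = surrogate(theta)
    for j in range(max_backtracks):
        theta_new = theta + (alpha ** j) * full_step
        # Accept the first candidate that satisfies the KL constraint
        # and yields a positive surrogate advantage.
        if kl(theta, theta_new) <= eps and surrogate(theta_new) > L_old:
            return theta_new
    return theta  # no acceptable step found; keep the current parameters
```

On a toy problem where <math>F</math> is a fixed positive-definite matrix and the surrogate is linear, the accepted step satisfies the KL constraint while improving the surrogate.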
 
A further improvement is [[proximal policy optimization]] (PPO), which avoids computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> altogether by using a first-order approximation with clipped probability ratios.<ref name=":0" />
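The clipped-ratio idea can be illustrated with a short sketch. This is a simplified per-sample version with hypothetical names; actual implementations average the objective over a batch and typically negate it for use with a gradient-descent optimizer.

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    # PPO-clip surrogate for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A).
    # `ratio` is the probability ratio pi_theta(a|s) / pi_theta_old(a|s);
    # taking the minimum with the clipped term removes any incentive to
    # push the ratio outside [1 - clip_eps, 1 + clip_eps].
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.minimum(ratio * advantage, clipped * advantage)
```

For example, with advantage <math>A = 1</math> and ratio 1.5 the objective is capped at 1.2, while with <math>A = -1</math> and ratio 0.5 the pessimistic clipped value <math>-0.8</math> is taken.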