Policy gradient method: Difference between revisions

Line 194:
\begin{cases}
\max_{\theta} L(\theta_t, \theta)\\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where
Line 226:
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it shrinks the step size until both the KL constraint and policy improvement hold, repeatedly trying<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{x^T F x}} x, \theta_t + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \dots
</math>until a <math>\theta_{t+1}</math> is found that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon </math> and results in a higher surrogate objective: <math>
L(\theta_t, \theta_{t+1}) \geq L(\theta_t, \theta_t)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
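The backtracking step above can be sketched in a few lines of numpy. This is a minimal illustration, not TRPO itself: it assumes the search direction <math>x</math> (approximately <math>F^{-1} g</math>) and a Fisher-matrix estimate <math>F</math> have already been computed, and the names <code>backtracking_step</code>, <code>L</code>, and <code>kl</code> are illustrative placeholders for the surrogate objective <math>L(\theta_t, \cdot)</math> and the mean KL divergence.

```python
import numpy as np

def backtracking_step(theta_t, x, F, L, kl, epsilon=0.01, alpha=0.5,
                      max_backtracks=10):
    """Illustrative TRPO-style backtracking line search (not a full TRPO).

    theta_t : current parameter vector
    x       : precomputed search direction, roughly F^{-1} g (assumed given)
    F       : estimate of the Fisher information matrix (assumed given)
    L       : callable theta -> surrogate objective L(theta_t, theta)
    kl      : callable theta -> mean KL divergence from pi_{theta_t}
    """
    # Largest step that would saturate the KL constraint if the
    # quadratic approximation (1/2) d^T F d of the KL were exact.
    beta = np.sqrt(2.0 * epsilon / (x @ F @ x))
    for k in range(max_backtracks):
        # Candidate step sizes beta, alpha*beta, alpha^2*beta, ...
        theta_new = theta_t + (alpha ** k) * beta * x
        # Accept the first candidate that satisfies the *exact* KL
        # constraint and does not decrease the surrogate objective.
        if kl(theta_new) <= epsilon and L(theta_new) >= L(theta_t):
            return theta_new
    return theta_t  # no acceptable step found; keep current parameters
```

On a toy quadratic objective with <math>F = I</math> the very first candidate already satisfies both conditions, so no backtracking occurs; with a smaller trust region <math>\epsilon</math>, later iterates <math>\alpha^k</math> would be tried instead.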