Policy gradient method: Difference between revisions

Line 47:
}}
 
{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=Proofs}}
 
{{Math proof|title=Proof of the lemma|proof=
 
Use the [[reparameterization trick#REINFORCE estimator|reparameterization trick]].
Line 84:
}}
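The proofs above rely on the REINFORCE (score-function) estimator, which rewrites the gradient of an expectation as an expectation of the integrand times the gradient of the log-density. A minimal numerical sketch is below; the Gaussian family and the choice <math>f(x) = x^2</math> are illustrative assumptions, not part of the article's derivation:

```python
import numpy as np

# Score-function (REINFORCE) estimator of d/dtheta E_{x ~ N(theta, 1)}[f(x)]:
#   grad = E[ f(x) * d/dtheta log p_theta(x) ],
# where d/dtheta log N(x; theta, 1) = x - theta.
# Toy choice (an assumption for illustration): f(x) = x^2, so the true
# gradient of E[x^2] = theta^2 + 1 is 2 * theta.

def reinforce_gradient(theta, f, n_samples=200_000, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.normal(theta, 1.0, size=n_samples)  # samples from p_theta
    score = x - theta                           # gradient of log-density w.r.t. theta
    return np.mean(f(x) * score)                # Monte Carlo estimate

theta = 1.0
grad_est = reinforce_gradient(theta, lambda x: x ** 2)
# grad_est approximates the exact value 2 * theta = 2.0
```

Note that no gradient of <math>f</math> itself is needed; only samples and the score function are used, which is what makes the estimator applicable to non-differentiable rewards.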
 
{{Math proof|title=Proof of the two identities|proof=
Applying the [[reparameterization trick#REINFORCE estimator|reparameterization trick]],
 
Line 223:
x
</math> in <math>Fx = g</math> iteratively without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure both the KL constraint and policy improvement. That is, it tests each of the following candidate solutions<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_t + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_t + \alpha^2 \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \dots
</math> until it finds one that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon </math> and improves the surrogate objective: <math>
L(\theta_{t+1}, \theta_t) \geq L(\theta_t, \theta_t)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
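The two steps above (a conjugate-gradient solve of <math>Fx = g</math> using only matrix-vector products, followed by backtracking on the step size) can be sketched as follows. The toy quadratic <math>F</math>, gradient <math>g</math>, linearized surrogate, and quadratic KL approximation are assumptions chosen for illustration, not the article's full TRPO implementation:

```python
import numpy as np

def conjugate_gradient(mvp, g, iters=10, tol=1e-10):
    """Solve F x = g given only a matrix-vector product mvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - F x (x starts at zero)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = mvp(p)
        a = rs / (p @ Fp)
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta, g, mvp, kl_fn, surrogate_fn,
              eps=0.01, alpha=0.5, max_backtracks=10):
    """Natural-gradient step with backtracking line search (sketch)."""
    x = conjugate_gradient(mvp, g)
    step = np.sqrt(2 * eps / (x @ mvp(x))) * x  # largest step within trust region
    L0 = surrogate_fn(theta)
    for k in range(max_backtracks):
        theta_new = theta + (alpha ** k) * step  # shrink by alpha each retry
        if kl_fn(theta_new, theta) <= eps and surrogate_fn(theta_new) >= L0:
            return theta_new
    return theta  # no acceptable step found; keep the old parameters

# Toy problem (assumed for illustration): diagonal Fisher matrix and
# a quadratic approximation of the KL divergence around theta_t.
theta_t = np.zeros(2)
F = np.array([[2.0, 0.0], [0.0, 1.0]])
g = np.array([1.0, 1.0])
mvp = lambda v: F @ v
kl = lambda th_new, th_old: 0.5 * (th_new - th_old) @ F @ (th_new - th_old)
surr = lambda th: g @ (th - theta_t)  # linearized surrogate objective

theta_next = trpo_step(theta_t, g, mvp, kl, surr)
```

In this toy setting the accepted step stays within the KL trust region while increasing the linearized surrogate; in practice the Fisher-vector product `mvp` is computed from samples rather than an explicit matrix.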