merge hidden blocks |
mNo edit summary |
Line 47:
}}
{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=
{{Math proof|title=Proof of
Use the [[reparameterization trick#REINFORCE estimator|reparameterization trick]].
Line 84:
}}
{{Math proof|title=Proof of the two identities|proof=
Applying the [[reparameterization trick#REINFORCE estimator|reparameterization trick]],
Line 223:
x
</math> in <math>Fx = g</math> iteratively without explicit matrix inversion.
* Use [[backtracking line search]] to ensure that the trust-region constraint is satisfied. Specifically, the step size is shrunk geometrically, testing the candidate updates <math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_t + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_t + \alpha^2 \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \dots
</math> in order until one satisfies the KL constraint and does not decrease the surrogate objective, <math>
L(\theta_{t+1}, \theta_t) \geq L(\theta_t, \theta_t)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
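
The conjugate gradient solve and the backtracking line search described above can be illustrated with a minimal sketch. It is not a reference implementation: the matrix `F`, the vector `g`, and the functions `surrogate` and `kl` are toy stand-ins chosen for illustration (a linear surrogate and a quadratic approximation of the KL divergence), and the helper names `conjugate_gradient` and `backtracking_step` are hypothetical.

```python
import numpy as np

def conjugate_gradient(F, g, iters=50, tol=1e-10):
    """Approximately solve F x = g for symmetric positive-definite F,
    using only matrix-vector products (no explicit inverse)."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - F x, with x = 0 initially
    p = r.copy()              # current search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:          # residual is small enough: stop
            break
        Fp = F @ p
        a = rs / (p @ Fp)     # exact step length along p
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def backtracking_step(theta, x, F, surrogate, kl, eps, alpha=0.5, max_tries=10):
    """Try steps scaled by alpha**0, alpha**1, ... and return the first
    candidate that satisfies the KL constraint and does not decrease
    the surrogate objective."""
    full_step = np.sqrt(2 * eps / (x @ F @ x)) * x
    baseline = surrogate(theta, theta)
    for j in range(max_tries):
        candidate = theta + alpha**j * full_step
        if kl(candidate, theta) <= eps and surrogate(candidate, theta) >= baseline:
            return candidate
    return theta              # no acceptable step found: keep old parameters

# Toy problem (assumed values, for illustration only).
F = np.array([[2.0, 0.5], [0.5, 1.0]])   # stand-in for the Fisher matrix
g = np.array([1.0, -1.0])                # stand-in for the policy gradient
theta = np.zeros(2)
eps = 0.01

def surrogate(new, old):
    # Linear approximation of the surrogate advantage around `old`.
    return g @ (new - old)

def kl(new, old):
    # Quadratic approximation of the average KL divergence.
    d = new - old
    return 0.5 * d @ F @ d

x = conjugate_gradient(F, g)
new_theta = backtracking_step(theta, x, F, surrogate, kl, eps)
```

In this toy setting the quadratic KL approximation is exact, so the full step lies on the boundary of the trust region and is accepted unless floating-point rounding pushes it marginally outside, in which case the search falls back to a step scaled by `alpha`. In practice `F @ p` would be replaced by a Fisher-vector product computed without forming `F`.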