Policy gradient method: Difference between revisions

Line 194:
\begin{cases}
\max_{\theta} L(\theta_t, \theta)\\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where
Line 226:
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it shrinks the step size until both the KL constraint and policy improvement hold, repeatedly trying<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{x^T F x}} x, \theta_t + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \dots
</math>until a <math>\theta_{t+1}</math> is found that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon </math> and results in a higher surrogate objective: <math>
L(\theta_t, \theta_{t+1}) \geq L(\theta_t, \theta_t)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
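The backtracking step above can be sketched in a few lines of numpy. This is a minimal illustration, not TRPO itself: it assumes the search direction <math>x</math> (approximately <math>F^{-1} g</math>) and a Fisher-matrix estimate <math>F</math> have already been computed, and the names <code>backtracking_step</code>, <code>L</code>, and <code>kl</code> are illustrative placeholders for the surrogate objective <math>L(\theta_t, \cdot)</math> and the mean KL divergence.

```python
import numpy as np

def backtracking_step(theta_t, x, F, L, kl, epsilon=0.01, alpha=0.5,
                      max_backtracks=10):
    """Illustrative TRPO-style backtracking line search (not a full TRPO).

    theta_t : current parameter vector
    x       : precomputed search direction, roughly F^{-1} g (assumed given)
    F       : estimate of the Fisher information matrix (assumed given)
    L       : callable theta -> surrogate objective L(theta_t, theta)
    kl      : callable theta -> mean KL divergence from pi_{theta_t}
    """
    # Largest step that would saturate the KL constraint if the
    # quadratic approximation (1/2) d^T F d of the KL were exact.
    beta = np.sqrt(2.0 * epsilon / (x @ F @ x))
    for k in range(max_backtracks):
        # Candidate step sizes beta, alpha*beta, alpha^2*beta, ...
        theta_new = theta_t + (alpha ** k) * beta * x
        # Accept the first candidate that satisfies the *exact* KL
        # constraint and does not decrease the surrogate objective.
        if kl(theta_new) <= epsilon and L(theta_new) >= L(theta_t):
            return theta_new
    return theta_t  # no acceptable step found; keep current parameters
```

On a toy quadratic objective with <math>F = I</math> the very first candidate already satisfies both conditions, so no backtracking occurs; with a smaller trust region <math>\epsilon</math>, later iterates <math>\alpha^k</math> would be tried instead.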