Policy gradient method: Difference between revisions

Content deleted Content added
Line 173:
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t)\\
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}</math>where the KL divergence between the two policies is '''averaged''' over the state distribution induced by the policy <math>\pi_{\theta_t}</math>. That is,<math display="block">\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) := \mathbb E_{s \sim \pi_{\theta_t}}[D_{KL}( \pi_{\theta_{t+1}}(\cdot | s) \| \pi_{\theta_{t}}(\cdot | s) )]</math>This constraint makes the update invariant to invertible affine transformations of the parameter space.
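The averaged KL constraint can be checked numerically once both policies and a batch of states sampled under the old policy are available. Below is a minimal Python sketch for discrete-action softmax policies; the linear parameterization, feature dimension, batch size, and threshold <math>\epsilon</math> are illustrative assumptions, not part of the method's specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    """Row-wise softmax, numerically stabilized by subtracting the max logit."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def averaged_kl(p_new, p_old):
    """D_KL(pi_new(.|s) || pi_old(.|s)) per state, averaged over the batch.

    The batch of states stands in for the expectation over the state
    distribution of the old policy pi_{theta_t}.
    """
    kl = np.sum(p_new * (np.log(p_new) - np.log(p_old)), axis=-1)
    return kl.mean()

# Hypothetical setup: 4 actions, 3-dimensional state features,
# 128 states assumed to be sampled while following the old policy.
theta_old = rng.normal(size=(4, 3))                       # old parameters
theta_new = theta_old + 0.01 * rng.normal(size=(4, 3))    # candidate update
states = rng.normal(size=(128, 3))

p_old = softmax(states @ theta_old.T)   # pi_{theta_t}(.|s) for each state
p_new = softmax(states @ theta_new.T)   # pi_{theta_{t+1}}(.|s)

d_kl = averaged_kl(p_new, p_old)
epsilon = 0.01
feasible = d_kl <= epsilon   # is the candidate inside the trust region?
```

A small parameter perturbation yields a small averaged KL; in a trust-region update the step size would be scaled so that `feasible` holds.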
 
=== Fisher information approximation ===