Policy gradient method
D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where the KL divergence between the two policies is averaged over the state distribution induced by the current policy <math>\pi_{\theta_t}</math>. That is,<math display="block">D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) := \mathbb E_{s \sim \pi_{\theta_t}}[D_{KL}( \pi_{\theta_{t+1}}(\cdot | s) \| \pi_{\theta_{t}}(\cdot | s) )]</math>Because the constraint is expressed in terms of the policies themselves rather than their parameters, the resulting updates are invariant under invertible affine transformations of the parameters.
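The averaged KL constraint above can be estimated by Monte Carlo: sample states from the current policy's state distribution and average the per-state KL divergences. The sketch below illustrates this for a hypothetical linear-softmax policy (logits given by <code>features @ theta</code>); the policy parameterization and sampled states are illustrative assumptions, not part of the article.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def averaged_kl(theta_new, theta_old, states):
    """Monte Carlo estimate of E_{s ~ pi_old}[ KL(pi_new(.|s) || pi_old(.|s)) ].

    `states` stands in for states sampled while following pi_{theta_old};
    the linear-softmax policy here is an illustrative assumption.
    """
    kls = []
    for s in states:
        p_new = softmax(s @ theta_new)
        p_old = softmax(s @ theta_old)
        # KL(p_new || p_old) for discrete action distributions
        kls.append(np.sum(p_new * (np.log(p_new) - np.log(p_old))))
    return float(np.mean(kls))
```

The estimate is zero when the two parameter vectors define identical action distributions, and strictly positive otherwise, which is what the trust-region constraint <math>D_{KL} \leq \epsilon</math> bounds.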
 
=== Fisher information approximation ===