Policy gradient method
 
=== Motivation ===
Standard policy gradient updates <math>\theta_{i+1} = \theta_i + \alpha \nabla_\theta J(\theta_i)</math> solve a constrained optimization problem:<math display="block">
\begin{cases}
\max_{\theta_{i+1}} J(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta J(\theta_i)\\
\|\theta_{i+1} - \theta_i\|\leq \alpha \cdot \|\nabla_\theta J(\theta_i)\|
\end{cases}
</math>
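Maximizing the linearized objective over the Euclidean ball of radius <math>\alpha \|\nabla_\theta J(\theta_i)\|</math> places the step on the boundary, in the direction of the gradient, recovering the standard update. A minimal numerical check of this equivalence (NumPy; the gradient values are hypothetical):

```python
import numpy as np

# Hypothetical gradient of J at theta_i (for illustration only).
grad = np.array([3.0, -4.0])   # Euclidean norm is 5
alpha = 0.1

# The linear objective g^T (theta_{i+1} - theta_i) is maximized over the
# ball ||theta_{i+1} - theta_i|| <= alpha * ||g|| by stepping to the
# boundary along g:
step = alpha * np.linalg.norm(grad) * grad / np.linalg.norm(grad)

# This coincides with the standard update theta_{i+1} = theta_i + alpha * g.
assert np.allclose(step, alpha * grad)
```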
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{i+1} - \theta_i\| </math> introduces coordinate dependence. To address this, the natural policy gradient replaces the Euclidean constraint with a [[Kullback–Leibler divergence]] (KL) constraint:<math display="block">\begin{cases}
\max_{\theta_{i+1}} J(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta J(\theta_i)\\
\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \leq \epsilon
\end{cases}</math>where the KL divergence between the two policies is '''averaged''' over the state distribution under policy <math>\pi_{\theta_i}</math>. That is,<math display="block">\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) := \mathbb E_{s \sim \pi_{\theta_i}}[D_{KL}( \pi_{\theta_{i+1}}(\cdot | s) \| \pi_{\theta_i}(\cdot | s) )]</math> This ensures updates are invariant to invertible affine parameter transformations.
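The averaged KL divergence above can be estimated by sampling states from the current policy and averaging the per-state KL between the two action distributions. A sketch for discrete-action softmax policies (NumPy; the linear score parameterization and the sampled data here are illustrative assumptions, not part of the method):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax along the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def avg_kl(theta_new, theta_old, states):
    """Monte Carlo estimate of the averaged KL(pi_new || pi_old).

    Assumes (for illustration) softmax policies over linear scores:
    pi(a|s) = softmax(s @ theta). `states` are samples drawn under
    the current policy pi_{theta_i}.
    """
    p_new = softmax(states @ theta_new)   # shape (N, num_actions)
    p_old = softmax(states @ theta_old)
    kl = (p_new * (np.log(p_new) - np.log(p_old))).sum(axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
states = rng.normal(size=(100, 4))              # sampled states (synthetic)
theta_old = rng.normal(size=(4, 3))             # current parameters theta_i
theta_new = theta_old + 0.01 * rng.normal(size=(4, 3))  # proposed theta_{i+1}

# KL is always nonnegative, and small for a small parameter change.
print(avg_kl(theta_new, theta_old, states))
```

In a natural-gradient or trust-region update, this estimate is the quantity held below the threshold <math>\epsilon</math> when choosing the step.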
 
=== Fisher information approximation ===