Policy gradient method
 
=== Motivation ===
Standard policy gradient updates <math>\theta_{i+1} = \theta_i + \alpha \nabla_\theta J(\theta_i)</math> solve a constrained optimization problem:<math display="block">
\begin{cases}
\max_{\theta_{i+1}} J(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta J(\theta_i)\\
\|\theta_{i+1} - \theta_i\|\leq \alpha \cdot \|\nabla_\theta J(\theta_i)\|
\end{cases}
</math>
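Maximizing the linearized objective over the Euclidean ball of radius <math>\alpha \|\nabla_\theta J(\theta_i)\|</math> places the step on the boundary, in the direction of the gradient, recovering the standard update. A minimal numerical check of this equivalence (NumPy; the gradient values are hypothetical):

```python
import numpy as np

# Hypothetical gradient of J at theta_i (for illustration only).
grad = np.array([3.0, -4.0])   # Euclidean norm is 5
alpha = 0.1

# The linear objective g^T (theta_{i+1} - theta_i) is maximized over the
# ball ||theta_{i+1} - theta_i|| <= alpha * ||g|| by stepping to the
# boundary along g:
step = alpha * np.linalg.norm(grad) * grad / np.linalg.norm(grad)

# This coincides with the standard update theta_{i+1} = theta_i + alpha * g.
assert np.allclose(step, alpha * grad)
```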
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{i+1} - \theta_i\| </math> introduces coordinate dependence. To address this, the natural policy gradient replaces the Euclidean constraint with a [[Kullback–Leibler divergence]] (KL) constraint:<math display="block">\begin{cases}
\max_{\theta_{i+1}} J(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta J(\theta_i)\\
\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \leq \epsilon
\end{cases}</math>where the KL divergence between the two policies is '''averaged''' over the state distribution under policy <math>\pi_{\theta_i}</math>. That is,<math display="block">\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) := \mathbb E_{s \sim \pi_{\theta_i}}[D_{KL}( \pi_{\theta_{i+1}}(\cdot | s) \| \pi_{\theta_i}(\cdot | s) )]</math> This ensures updates are invariant to invertible affine parameter transformations.
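The averaged KL divergence above can be estimated by sampling states from the current policy and averaging the per-state KL between the two action distributions. A sketch for discrete-action softmax policies (NumPy; the linear score parameterization and the sampled data here are illustrative assumptions, not part of the method):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax along the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def avg_kl(theta_new, theta_old, states):
    """Monte Carlo estimate of the averaged KL(pi_new || pi_old).

    Assumes (for illustration) softmax policies over linear scores:
    pi(a|s) = softmax(s @ theta). `states` are samples drawn under
    the current policy pi_{theta_i}.
    """
    p_new = softmax(states @ theta_new)   # shape (N, num_actions)
    p_old = softmax(states @ theta_old)
    kl = (p_new * (np.log(p_new) - np.log(p_old))).sum(axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
states = rng.normal(size=(100, 4))              # sampled states (synthetic)
theta_old = rng.normal(size=(4, 3))             # current parameters theta_i
theta_new = theta_old + 0.01 * rng.normal(size=(4, 3))  # proposed theta_{i+1}

# KL is always nonnegative, and small for a small parameter change.
print(avg_kl(theta_new, theta_old, states))
```

In a natural-gradient or trust-region update, this estimate is the quantity held below the threshold <math>\epsilon</math> when choosing the step.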
 
=== Fisher information approximation ===