Policy gradient method
</math> with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{w_t^T F(\theta_t)w_t }}</math>.
* Trust region methods like [[Trust region policy optimization]] (TRPO), which enforce KL constraints via constrained optimization.<ref name=":3" />
* [[Proximal policy optimization]] (PPO), which avoids both <math display="inline">F(\theta)</math> and <math display="inline">F(\theta)^{-1}</math> entirely, using clipped probability ratios instead.<ref name=":0" />
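The step-size formula <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{w_t^T F(\theta_t)w_t }}</math> can be illustrated with a minimal numerical sketch. The Fisher matrix <code>F</code>, vanilla gradient <code>g</code>, and trust-region size <code>eps</code> below are hypothetical values chosen for illustration, not part of the original text:

```python
import numpy as np

# Hypothetical 2-parameter example of a natural-gradient update.
F = np.array([[2.0, 0.0],
              [0.0, 0.5]])      # assumed Fisher information matrix F(theta_t)
g = np.array([1.0, 1.0])        # assumed vanilla policy gradient
eps = 0.01                      # assumed trust-region size (max KL divergence)

# Natural-gradient direction w_t solves F w = g (here via a direct solve;
# large-scale methods use conjugate gradient to avoid forming F^{-1}).
w = np.linalg.solve(F, g)

# Step size alpha = sqrt(2*eps / (w^T F w)), so the quadratic KL
# approximation of the update equals eps.
alpha = np.sqrt(2 * eps / (w @ F @ w))
theta_update = alpha * w
```

With these values, <code>w = [0.5, 2.0]</code> and <code>w @ F @ w = 2.5</code>, giving <code>alpha = sqrt(0.008)</code>.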
 
These methods trade off the cost of computing and inverting <math display="inline">F(\theta)</math> against the stability of policy updates, making natural policy gradients feasible in large-scale applications.
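PPO's clipped probability ratios can be sketched as follows. This is a minimal NumPy illustration of the clipped surrogate objective <math display="inline">\min(r A, \operatorname{clip}(r, 1-\epsilon, 1+\epsilon) A)</math>; the function name and default clip range are chosen here for illustration:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective (per-sample, then averaged).

    ratio:     pi_theta(a|s) / pi_theta_old(a|s) for each sample
    advantage: advantage estimate for each sample
    epsilon:   clip range (0.2 is a common choice)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    # Taking the minimum removes the incentive to move the ratio
    # outside [1 - epsilon, 1 + epsilon], without any use of F(theta).
    return np.mean(np.minimum(unclipped, clipped))

# Example: a ratio of 1.5 with positive advantage is clipped to 1.2,
# while a ratio of 0.5 is left unclipped (0.5 < 0.8 after clipping).
value = ppo_clip_objective(np.array([1.5, 0.5]), np.array([1.0, 1.0]))
```

Because the clipping acts directly on probability ratios, the update needs only first-order gradients of this scalar objective, which is what makes PPO cheap compared to methods that work with <math display="inline">F(\theta)</math>.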