</math> with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{w_t^T F(\theta_t)w_t }}</math>.
* Trust region methods like [[Trust region policy optimization]] (TRPO), which enforce KL constraints via constrained optimization.<ref name=":3" />
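* [[Proximal policy optimization]] (PPO), which replaces the hard KL constraint with a clipped surrogate objective that can be optimized with first-order methods.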
These methods address the trade-off between the cost of inverting the Fisher information matrix and the stability of policy updates, making natural policy gradients feasible in large-scale applications.
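As an illustration of the step-size rule <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{w_t^T F(\theta_t)w_t }}</math>, the following is a minimal sketch in plain Python. The helper names <code>quad_form</code> and <code>kl_step_size</code>, and the small example matrix, are assumptions made for illustration and are not taken from any library. The sketch assumes that the Fisher matrix is available explicitly and is positive definite; practical implementations usually avoid forming it.

```python
import math

def quad_form(w, F):
    """Return w^T F w for a vector w and a square matrix F (nested lists)."""
    n = len(w)
    return sum(w[i] * F[i][j] * w[j] for i in range(n) for j in range(n))

def kl_step_size(w, F, epsilon):
    """Step size alpha = sqrt(2 * epsilon / (w^T F w)).

    With this alpha, the second-order KL estimate
    0.5 * (alpha * w)^T F (alpha * w) equals epsilon.
    """
    q = quad_form(w, F)
    if q <= 0.0:
        # Assumption: F is positive definite and w is non-zero.
        raise ValueError("w^T F w must be positive")
    return math.sqrt(2.0 * epsilon / q)

# Hypothetical 2-parameter example (values chosen for illustration only).
F = [[2.0, 0.0],
     [0.0, 0.5]]
w = [1.0, 2.0]      # update direction w_t
epsilon = 0.01      # KL budget

alpha = kl_step_size(w, F, epsilon)
step = [alpha * wi for wi in w]
kl_estimate = 0.5 * quad_form(step, F)  # quadratic approximation of the KL divergence
```

Here <code>kl_estimate</code> recovers the budget <code>epsilon</code> up to floating-point rounding, which is the property the step-size formula is chosen to satisfy.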