Policy gradient method
 
The algorithm uses the modified gradient estimator<math display="block">g_t \leftarrow
\frac 1N \sum_{k=1}^N \left[\sum_{j\in 0:T} \nabla_{\theta_t}\ln\pi_{\theta_t}(A_{j,k}\mid S_{j,k})\left(\sum_{i \in j:T} \gamma^i R_{i,k} - b_t(S_{j,k})\right) \right]</math>and the original REINFORCE algorithm is recovered as the special case <math>b_t = 0</math>.
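The estimator above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: the function name, the data layout (per-trajectory lists of per-step log-probability gradients, rewards, and baseline values), and the assumption of a fixed per-state baseline are all choices made for this example. Note that, following the formula, the discount <math>\gamma^i</math> is counted from the start of the episode, not from step <math>j</math>.

```python
import numpy as np

def reinforce_with_baseline_grad(grad_logps, rewards, baselines, gamma):
    """Monte Carlo policy-gradient estimate with a baseline.

    grad_logps[k][j] : gradient of ln pi_theta(A_{j,k} | S_{j,k}) w.r.t. theta
    rewards[k][j]    : reward R_{j,k}
    baselines[k][j]  : baseline value b(S_{j,k})
    gamma            : discount factor
    Returns the average over the N sampled trajectories.
    """
    N = len(grad_logps)
    g = np.zeros_like(grad_logps[0][0], dtype=float)
    for k in range(N):
        T = len(rewards[k])
        for j in range(T):
            # Discounted reward-to-go from step j; gamma^i uses the
            # absolute step index i, matching the formula in the text.
            ret = sum(gamma**i * rewards[k][i] for i in range(j, T))
            # Weight the score function by (return-to-go - baseline).
            g += grad_logps[k][j] * (ret - baselines[k][j])
    return g / N
```

Setting every baseline entry to zero recovers the plain REINFORCE estimator; subtracting a nonzero baseline leaves the estimator unbiased (the baseline term has zero expectation) while typically reducing its variance.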
 
=== Actor-critic methods ===