Policy gradient method
 
The algorithm uses the modified gradient estimator<math display="block">g_t \leftarrow
\frac 1N \sum_{k=1}^N \left[\sum_{j\in 0:T} \nabla_{\theta_t}\ln\pi_{\theta_t}(A_{j,k}\mid S_{j,k})\left(\sum_{i \in j:T} \gamma^i R_{i,k} - b_t(S_{j,k})\right) \right]</math>and the original REINFORCE algorithm is recovered as the special case <math>b_t = 0</math>.
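The estimator above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: the function name, the data layout (per-trajectory lists of per-step log-probability gradients, rewards, and baseline values), and the assumption of a fixed per-state baseline are all choices made for this example. Note that, following the formula, the discount <math>\gamma^i</math> is counted from the start of the episode, not from step <math>j</math>.

```python
import numpy as np

def reinforce_with_baseline_grad(grad_logps, rewards, baselines, gamma):
    """Monte Carlo policy-gradient estimate with a baseline.

    grad_logps[k][j] : gradient of ln pi_theta(A_{j,k} | S_{j,k}) w.r.t. theta
    rewards[k][j]    : reward R_{j,k}
    baselines[k][j]  : baseline value b(S_{j,k})
    gamma            : discount factor
    Returns the average over the N sampled trajectories.
    """
    N = len(grad_logps)
    g = np.zeros_like(grad_logps[0][0], dtype=float)
    for k in range(N):
        T = len(rewards[k])
        for j in range(T):
            # Discounted reward-to-go from step j; gamma^i uses the
            # absolute step index i, matching the formula in the text.
            ret = sum(gamma**i * rewards[k][i] for i in range(j, T))
            # Weight the score function by (return-to-go - baseline).
            g += grad_logps[k][j] * (ret - baselines[k][j])
    return g / N
```

Setting every baseline entry to zero recovers the plain REINFORCE estimator; subtracting a nonzero baseline leaves the estimator unbiased (the baseline term has zero expectation) while typically reducing its variance.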
 
=== Actor-critic methods ===