Content deleted Content added
No edit summary |
I fixed the formula so it would express "the total reward from time $ |
||
Line 102:
}}
{{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block">
\nabla_\theta J(\theta) \approx \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^{\tau-t} R_{\tau ,n}) \right]
</math>where the index <math>n</math> ranges over <math>N</math> rollout trajectories using the policy <math>\pi_\theta </math>.
|