Policy gradient method

 
=== REINFORCE with baseline ===
A common way to reduce variance is the '''REINFORCE with baseline''' algorithm, based on the following identity:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t| S_t)\left(\sum_{\tau \in t:T} \gamma^\tau R_\tau - b(S_t)\right)
\Big|S_0 = s_0 \right]</math>for any function <math>b: \text{States} \to \R</math>. This can be proven by applying the previous lemma.
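
In particular, the baseline term contributes nothing in expectation: conditioning on <math>S_t</math> and summing over actions, a sketch of the key step is<math display="block">\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\ln\pi_\theta(A_t|S_t)\, b(S_t) \,\Big|\, S_t\right] = b(S_t)\sum_a \nabla_\theta \pi_\theta(a|S_t) = b(S_t)\,\nabla_\theta 1 = 0,</math>since the action probabilities sum to one for every state.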
 
The algorithm uses the modified gradient estimator<math display="block">g \leftarrow
\frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_{t,n}| S_{t,n})\left(\sum_{\tau \in t:T} \gamma^\tau R_{\tau,n} - b(S_{t,n})\right) \right]</math> and the original REINFORCE algorithm is the special case where <math>b \equiv 0</math>.
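
As an illustrative sketch (not from the article itself), the estimator above can be written with NumPy arrays; the function name <code>reinforce_baseline_grad</code> and the array layout (episodes stacked along the first axis, all of equal length <math>T</math>) are assumptions made for this sketch:

```python
import numpy as np

def reinforce_baseline_grad(logp_grads, rewards, baselines, gamma):
    """Monte Carlo policy-gradient estimate with a state-dependent baseline.

    logp_grads: (N, T, d) array, grad_theta log pi(A_{t,n} | S_{t,n})
    rewards:    (N, T) array, R_{t,n}
    baselines:  (N, T) array, b(S_{t,n}); all-zeros recovers plain REINFORCE
    gamma:      discount factor in [0, 1]
    """
    N, T, d = logp_grads.shape
    # gamma^tau * R_{tau,n} for each step tau of each episode n
    disc = rewards * gamma ** np.arange(T)
    # reward-to-go: sum over tau in t:T of gamma^tau R_{tau,n}, shape (N, T)
    to_go = np.flip(np.cumsum(np.flip(disc, axis=1), axis=1), axis=1)
    # weight each score-function term by (reward-to-go - baseline)
    weights = to_go - baselines
    # average the weighted score-function gradients over the N episodes
    return np.einsum('ntd,nt->d', logp_grads, weights) / N
```

Passing <code>baselines=np.zeros((N, T))</code> gives the unbaselined REINFORCE estimator, and subtracting any baseline shifts the estimate by the (zero-mean) score-times-baseline term only, which is why the estimator stays unbiased while the variance can shrink.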
 
=== Actor-critic methods ===