Revision as of 10:07, 15 May 2025 by Migolan (no edit summary)
Revision as of 11:57, 15 May 2025 by Migolan (no edit summary)
Line 120:
=== REINFORCE with baseline ===
A common way to reduce variance is the '''REINFORCE with baseline''' algorithm, based on the following identity:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t| S_t)\left(\sum_{\tau \in t:T} (\gamma^\tau R_\tau) - b(S_t)\right) \Big|S_0 = s_0 \right]</math>for any function <math>b: \text{States} \to \R</math>. This can be proven by applying the previous lemma.

The algorithm uses the modified gradient estimator<math display="block">g_t \leftarrow \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_{\theta_t}\ln\pi_\theta(A_{t,n}| S_{t,n})\left(\sum_{\tau \in t:T} (\gamma^\tau R_{\tau,n}) - b(S_{t,n})\right) \right]</math> and the original REINFORCE algorithm is the special case where <math>b \equiv 0</math>.

=== Actor-critic methods ===
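The REINFORCE with baseline estimator can be illustrated with a minimal Python sketch. It is not a reference implementation: the function names (`discounted_tail_sums`, `reinforce_baseline_gradient`, `make_softmax_score`) and the single-state, two-action softmax policy are assumptions chosen for illustration only. The sketch follows the formula as written, weighting the reward at step τ by γ^τ and subtracting the baseline value of the state visited at time step t.

```python
import math


def discounted_tail_sums(rewards, gamma):
    """For each t, return sum over tau >= t of gamma**tau * rewards[tau]."""
    tails = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each tail sum reuses the one after it.
    for tau in range(len(rewards) - 1, -1, -1):
        running += (gamma ** tau) * rewards[tau]
        tails[tau] = running
    return tails


def reinforce_baseline_gradient(episodes, grad_log_pi, baseline, gamma):
    """Average over episodes of sum_t grad_log_pi(S_t, A_t) * (tail_t - b(S_t)).

    episodes: list of trajectories, each a list of (state, action, reward).
    grad_log_pi(state, action): list of floats, the score function.
    baseline(state): float; use `lambda s: 0.0` for plain REINFORCE.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = None
    for episode in episodes:
        rewards = [reward for (_, _, reward) in episode]
        tails = discounted_tail_sums(rewards, gamma)
        for (state, action, _), tail in zip(episode, tails):
            score = grad_log_pi(state, action)
            if total is None:
                total = [0.0] * len(score)
            weight = tail - baseline(state)
            for k, s_k in enumerate(score):
                total[k] += weight * s_k
    return [value / len(episodes) for value in total]


def make_softmax_score(theta):
    """Score function of a state-independent softmax policy with logits theta.

    Hypothetical toy policy: grad of ln pi(a) w.r.t. theta is onehot(a) - pi.
    """
    exps = [math.exp(x) for x in theta]
    norm = sum(exps)
    probs = [e / norm for e in exps]

    def score(state, action):
        return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

    return score


# Two one-step episodes in a single state 0: action 0 pays 1, action 1 pays 0.
episodes = [[(0, 0, 1.0)], [(0, 1, 0.0)]]
score = make_softmax_score([0.0, 0.0])
g_plain = reinforce_baseline_gradient(episodes, score, lambda s: 0.0, 0.9)
g_base = reinforce_baseline_gradient(episodes, score, lambda s: 0.5, 0.9)
```

Passing `lambda s: 0.0` as the baseline recovers the original REINFORCE estimator. A non-zero baseline changes the per-sample weights but leaves the expectation of the estimate unchanged, which is why it can reduce variance.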

Policy gradient method: Difference between revisions