\Big|S_0 = s_0 \right]</math> where <math display="inline">\Psi_j</math> is a linear sum of the following:
* <math display="inline">\sum_{0 \leq i\leq T} (\gamma^i R_i)</math>
* <math display="inline">\gamma^j\sum_{j \leq i\leq T} (\gamma^{i-j} R_i)</math>: used by the REINFORCE algorithm.
* <math display="inline">\gamma^j \sum_{j \leq i\leq T} (\gamma^{i-j} R_i) - b(S_j) </math>: used by REINFORCE with a baseline <math display="inline">b</math>.
* <math display="inline">\gamma^j \left(R_j + \gamma V^{\pi_\theta}( S_{j+1}) - V^{\pi_\theta}( S_{j})\right)</math>: one-step temporal-difference (TD) learning.
* <math display="inline">\gamma^j Q^{\pi_\theta}(S_j, A_j)</math>.
* <math display="inline">\gamma^j A^{\pi_\theta}(S_j, A_j)</math>: '''Advantage Actor-Critic (A2C)'''.<ref name=":0">{{Citation |last=Mnih |first=Volodymyr |title=Asynchronous Methods for Deep Reinforcement Learning |date=2016-06-16 |url=https://arxiv.org/abs/1602.01783 |doi=10.48550/arXiv.1602.01783 |last2=Badia |first2=Adrià Puigdomènech |last3=Mirza |first3=Mehdi |last4=Graves |first4=Alex |last5=Lillicrap |first5=Timothy P. |last6=Harley |first6=Tim |last7=Silver |first7=David |last8=Kavukcuoglu |first8=Koray}}</ref>
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: two-step TD learning.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: ''n''-step TD learning, TD(n).
* <math display="inline">\gamma^j \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last=Schulman |first=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |doi=10.48550/arXiv.1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying sum of the TD(n) terms, that is, the ''n''-step estimates of the preceding item, with weights <math display="inline">(1-\lambda)\lambda^{n-1}</math> that sum to 1.
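The relationship between the ''n''-step estimates and the λ-weighted estimate can be checked numerically. The following is a minimal sketch, not taken from the cited papers: the function names are hypothetical, it assumes a single finite episode that ends in a terminal state (so the last entry of <code>values</code> is 0 and all rewards after the episode are 0), and it omits the leading factor <math display="inline">\gamma^j</math>. It computes the λ-weighted estimate in two ways: by a backward recursion over one-step TD errors, and by the explicit sum over ''n''-step estimates with normalized weights <math display="inline">(1-\lambda)\lambda^{n-1}</math>, truncated at a large ''n''.

```python
def td_errors(rewards, values, gamma):
    """One-step TD errors: delta_j = R_j + gamma * V(S_{j+1}) - V(S_j).

    `values` must have one more entry than `rewards`; the last entry is the
    value of the state reached after the final reward.
    """
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values[:-1], values[1:])]


def n_step_advantage(rewards, values, gamma, j, n):
    """n-step estimate starting at step j.

    Assumption: the episode ends in a terminal state with values[-1] == 0,
    so any n that reaches past the end is clipped to the episode length.
    """
    n = min(n, len(rewards) - j)
    discounted = sum(gamma ** k * rewards[j + k] for k in range(n))
    return discounted + gamma ** n * values[j + n] - values[j]


def gae(rewards, values, gamma, lam):
    """Lambda-weighted advantage via a backward recursion over TD errors:
    A_j = delta_j + gamma * lam * A_{j+1}, with A = 0 after the episode.
    """
    deltas = td_errors(rewards, values, gamma)
    advantages = [0.0] * len(deltas)
    running = 0.0
    for j in reversed(range(len(deltas))):
        running = deltas[j] + gamma * lam * running
        advantages[j] = running
    return advantages


def lambda_weighted_sum(rewards, values, gamma, lam, j, n_max=1000):
    """Explicit sum over n-step estimates with weights (1 - lam) * lam**(n - 1).

    The infinite sum is truncated at n_max; the neglected tail has total
    weight lam**n_max, which is negligible for lam < 1 and large n_max.
    """
    return sum((1 - lam) * lam ** (n - 1)
               * n_step_advantage(rewards, values, gamma, j, n)
               for n in range(1, n_max + 1))
```

Under these assumptions, <code>gae(rewards, values, gamma, 0.0)</code> reduces to the one-step TD errors, <code>gae(rewards, values, gamma, 1.0)</code> reduces to the discounted return minus the value estimate, and <code>lambda_weighted_sum</code> agrees with <code>gae</code> at every step up to truncation error. To recover <math display="inline">\Psi_j</math> as written in the list, the ''j''-th entry would be multiplied by <math display="inline">\gamma^j</math>.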
== Variants ==