Actor-critic algorithm: Difference between revisions

* <math display="inline">\gamma^j \left(R_j + \gamma V^{\pi_\theta}( S_{j+1}) - V^{\pi_\theta}( S_{j})\right)</math>: [[Temporal difference learning|TD(1) learning]].
* <math display="inline">\gamma^j Q^{\pi_\theta}(S_j, A_j)</math>.
* <math display="inline">\gamma^j A^{\pi_\theta}(S_j, A_j)</math>: '''Advantage Actor-Critic (A2C)'''.<ref name=":0">{{Citation |last1=Mnih |first1=Volodymyr |title=Asynchronous Methods for Deep Reinforcement Learning |date=2016-06-16 |url=https://arxiv.org/abs/1602.01783 |arxiv=1602.01783 |last2=Badia |first2=Adrià Puigdomènech |last3=Mirza |first3=Mehdi |last4=Graves |first4=Alex |last5=Lillicrap |first5=Timothy P. |last6=Harley |first6=Tim |last7=Silver |first7=David |last8=Kavukcuoglu |first8=Koray}}</ref>
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(2) learning.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(n) learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty \frac{\lambda^{n-1}}{1-\lambda}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref name="arxiv.org">{{Citation |last1=Schulman |first1=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |arxiv=1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying sum of the TD(n) learning terms.
 
=== Critic ===
 
* '''Asynchronous Advantage Actor-Critic (A3C)''': [[Parallel computing|Parallel and asynchronous]] version of A2C.<ref name=":0" />
* '''Soft Actor-Critic (SAC)''': Incorporates entropy maximization for improved exploration; a schematic illustration of such an entropy bonus is sketched after this list.<ref>{{Citation |last1=Haarnoja |first1=Tuomas |title=Soft Actor-Critic Algorithms and Applications |date=2019-01-29 |url=https://arxiv.org/abs/1812.05905 |arxiv=1812.05905 |last2=Zhou |first2=Aurick |last3=Hartikainen |first3=Kristian |last4=Tucker |first4=George |last5=Ha |first5=Sehoon |last6=Tan |first6=Jie |last7=Kumar |first7=Vikash |last8=Zhu |first8=Henry |last9=Gupta |first9=Abhishek}}</ref>
* '''Deep Deterministic Policy Gradient (DDPG)''': Specialized for continuous action spaces.<ref>{{Citation |last1=Lillicrap |first1=Timothy P. |title=Continuous control with deep reinforcement learning |date=2019-07-05 |url=https://arxiv.org/abs/1509.02971 |arxiv=1509.02971 |last2=Hunt |first2=Jonathan J. |last3=Pritzel |first3=Alexander |last4=Heess |first4=Nicolas |last5=Erez |first5=Tom |last6=Tassa |first6=Yuval |last7=Silver |first7=David |last8=Wierstra |first8=Daan}}</ref>
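The snippet below only illustrates the general idea of adding an entropy bonus to an actor loss, here for a categorical policy. SAC itself instead uses soft Q-functions, a continuous (typically Gaussian) policy, and the reparameterization trick; the function and argument names are illustrative assumptions.

<syntaxhighlight lang="python">
import torch

def entropy_regularized_actor_loss(log_probs, advantages, dist_probs, alpha=0.01):
    """Illustrative actor loss with an entropy bonus.

    log_probs:  log pi(A_j | S_j) for the actions actually taken, shape (T,)
    advantages: advantage estimates A(S_j, A_j), shape (T,)
    dist_probs: full categorical action distributions pi(. | S_j), shape (T, num_actions)
    alpha:      entropy temperature (weight of the exploration bonus)
    """
    # Standard policy-gradient term: increase log-probabilities of advantageous actions.
    pg_loss = -(log_probs * advantages.detach()).mean()
    # Entropy of the policy at each state; maximizing it discourages premature determinism.
    entropy = -(dist_probs * torch.log(dist_probs + 1e-8)).sum(dim=-1).mean()
    return pg_loss - alpha * entropy
</syntaxhighlight>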
 
== See also ==
* {{Cite book |last=Bertsekas |first=Dimitri P. |title=Reinforcement learning and optimal control |date=2019 |publisher=Athena Scientific |isbn=978-1-886529-39-7 |edition=2 |___location=Belmont, Massachusetts}}
* {{Cite book |last=Szepesvári |first=Csaba |title=Algorithms for Reinforcement Learning |date=2010 |publisher=Springer International Publishing |isbn=978-3-031-00423-0 |edition=1 |series=Synthesis Lectures on Artificial Intelligence and Machine Learning |___location=Cham}}
* {{Cite journal |last1=Grondman |first1=Ivo |last2=Busoniu |first2=Lucian |last3=Lopes |first3=Gabriel A. D. |last4=Babuska |first4=Robert |date=November 2012 |title=A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients |url=https://ieeexplore.ieee.org/document/6392457/ |journal=IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) |volume=42 |issue=6 |pages=1291–1307 |doi=10.1109/TSMCC.2012.2218595 |issn=1094-6977}}
 
[[Category:Reinforcement learning]]