{{Short description|Machine learning that combines deep learning and reinforcement learning}}
{{Machine learning}}
'''Deep reinforcement learning''' ('''deep RL''') is a subfield of [[machine learning]] that combines [[reinforcement learning]] (RL) and [[deep learning]]. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the [[state space]]. Deep RL algorithms are able to take in very large inputs (e.g. every pixel rendered to the screen in a video game) and decide what actions to perform to optimize an objective (e.g. maximizing the game score). Deep reinforcement learning has been used for a diverse set of applications including but not limited to [[robotics]], [[video game]]s, [[natural language processing]], [[computer vision]],<ref>{{Cite journal |last1=Le |first1=Ngan |last2=Rathour |first2=Vidhiwar Singh |last3=Yamazaki |first3=Kashu |last4=Luu |first4=Khoa |last5=Savvides |first5=Marios |date=2022-04-01 |title=Deep reinforcement learning in computer vision: a comprehensive survey |url=https://doi.org/10.1007/s10462-021-10061-9 |journal=Artificial Intelligence Review |language=en |volume=55 |issue=4 |pages=2733–2819 |doi=10.1007/s10462-021-10061-9 |issn=1573-7462|arxiv=2108.11510 }}</ref> education, transportation, finance and [[Health care|healthcare]].<ref name="francoislavet2018"/>
 
== Overview ==
=== Deep learning ===
[[File:Neural_network_example.svg|thumb|241x241px|Depiction of a basic artificial neural network]]
[[Deep learning]] is a form of [[machine learning]] that uses an [[artificial neural network]] to transform a set of inputs into a set of outputs. Deep learning methods, often using [[supervised learning]] with labeled datasets, have been shown to solve tasks that involve handling complex, high-dimensional raw input data (such as images) with less manual [[feature engineering]] than prior methods, enabling significant progress in several fields including [[computer vision]] and [[natural language processing]]. In the past decade, deep RL has achieved remarkable results on a range of problems, from single-player and multiplayer games such as [[Go (game)|Go]], [[Atari]] games, and ''[[Dota 2]]'', to robotics.<ref>{{Cite web |last=Graesser |first=Laura |title=Foundations of Deep Reinforcement Learning: Theory and Practice in Python |url=https://openlibrary.telkomuniversity.ac.id/home/catalog/id/198650/slug/foundations-of-deep-reinforcement-learning-theory-and-practice-in-python.html |access-date=2023-07-01 |website=Open Library Telkom University}}</ref>
 
=== Reinforcement learning ===
Katsunari Shibata's group showed that various functions emerge in this framework,<ref name="Shibata3"/><ref name="Shibata4"/><ref name="Shibata2"/> including image recognition, color constancy, sensor motion (active recognition), hand-eye coordination and hand reaching movement, explanation of brain activities, knowledge transfer, memory,<ref name="Shibata5"/> selective attention, prediction, and exploration.<ref name="Shibata4"/><ref name="Shibata6"/>
 
Starting around 2012, the so-called [[Deep learning#Deep learning revolution|deep learning revolution]] led to an increased interest in using deep neural networks as function approximators across a variety of domains. This led to renewed interest in using deep neural networks to learn the policy, value, and/or Q functions present in existing reinforcement learning algorithms.
 
Beginning around 2013, [[DeepMind]] showed impressive learning results using deep RL to play [[Atari]] video games.<ref name="DQN1"/><ref name="DQN2"/> The computer player was a neural network trained using a deep RL algorithm, a deep version of [[Q-learning]] they termed deep Q-networks (DQN), with the game score as the reward. They used a deep [[convolutional neural network]] to process four frames of 84×84 RGB pixels as inputs. All 49 games were learned using the same network architecture and with minimal prior knowledge, outperforming competing methods on almost all the games and performing at a level comparable or superior to a professional human game tester.<ref name="DQN2" />
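The DQN architecture described above can be summarized in a short sketch. The following PyTorch code is an indicative reconstruction rather than DeepMind's own implementation: the layer sizes follow the published network, but it assumes the four stacked frames are each preprocessed to a single 84×84 channel, and the class and variable names are chosen purely for illustration.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Illustrative convolutional Q-network over stacked, preprocessed game frames."""
    def __init__(self, num_actions: int, in_frames: int = 4):
        super().__init__()
        # Convolutional feature extractor for four stacked 84x84 single-channel frames.
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Fully connected head producing one Q-value estimate per discrete action.
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84) tensor with pixel values in [0, 255]
        return self.head(self.features(frames / 255.0))

q_net = DQN(num_actions=18)                  # 18 actions covers the full Atari joystick set
q_values = q_net(torch.zeros(1, 4, 84, 84))  # Q-value estimates for one stacked observation
greedy_action = q_values.argmax(dim=1)       # act greedily with respect to the estimates
</syntaxhighlight>

During training, the predicted Q-value of the action actually taken is regressed toward the observed reward plus the discounted maximum Q-value of the next state, following the Q-learning update, with the game score providing the reward signal.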
 
Deep reinforcement learning reached another milestone in 2015 when [[AlphaGo]],<ref name="AlphaGo"/> a computer program trained with deep RL to play [[Go (game)|Go]], became the first computer Go program to beat a human professional Go player without handicap on a full-sized 19×19 board.
In a subsequent project in 2017, [[AlphaZero]] improved performance on Go while also demonstrating that the same algorithm could learn to play [[chess]] and [[shogi]] at a level competitive with or superior to existing computer programs for those games, and again improved in 2019 with [[MuZero]].<ref name="muzero"/> Separately, another milestone was achieved in 2019 by researchers from [[Carnegie Mellon University]], who developed [[Pluribus (poker bot)|Pluribus]], a computer program for playing [[poker]] that was the first to beat professionals at multiplayer games of no-limit [[Texas hold 'em]]. [[OpenAI Five]], a program for playing five-on-five ''[[Dota 2]]'', beat the previous world champions in a demonstration match in 2019.
 
Deep reinforcement learning has also been applied to many domains beyond games. In robotics, it has been used to let robots perform simple household tasks<ref name="levine2016"/> and solve a Rubik's cube with a robot hand.<ref name="openaihand"/><ref name="openaihandarxiv"/> Deep RL has also found sustainability applications, used to reduce energy consumption at data centers.<ref name="deepmindcooling"/> Deep RL for [[autonomous driving]] is an active area of research in academia and industry.<ref name="neurips2021ml4ad"/> [[Loon_LLC|Loon]] explored deep RL for autonomously navigating their high-altitude balloons.<ref name="loonrl"/>
 
== Algorithms ==
 
Deep reinforcement learning algorithms can start from a blank policy candidate and achieve superhuman performance in many complex tasks, including Atari games, StarCraft, and Go. Mainstream DRL algorithms include Deep Q-Network (DQN), Dueling DQN, Double DQN (DDQN), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), and Distributional SAC (DSAC). Each of these algorithms incorporates one or more of the tricks described below to alleviate particular challenges.<ref name="Li-2023"/>
 
{| class="wikitable sortable" style="font-size: 96%;"
!Algorithm || class=unsortable|Description || class=unsortable|Model || Policy || class=unsortable |Action Space || class=unsortable |State Space ||Operator
|-
! scope="row" | [[Q-learning#Deep Q-learning|DQN]]
| Deep Q Network || Model-Free || Off-policy || Discrete || Continuous || Q-value
|-
! scope="row" | [[Deep Deterministic Policy Gradient|DDPG]]
| Deep Deterministic Policy Gradient || Model-Free || Off-policy || Continuous || Continuous || Q-value
|-
! scope="row" | [[Asynchronous Advantage Actor-Critic Algorithm|A3C]]
| Asynchronous Advantage Actor-Critic Algorithm || Model-Free || On-policy || Continuous || Continuous || Advantage
|-
! scope="row" | [[Trust Region Policy Optimization|TRPO]]
| Trust Region Policy Optimization || Model-Free || On-policy || Continuous or Discrete || Continuous || Advantage
|-
! scope="row" | [[Proximal Policy Optimization|PPO]]
| Proximal Policy Optimization || Model-Free || On-policy || Continuous or Discrete || Continuous || Advantage
|-
! scope="row" | [[Twin Delayed Deep Deterministic Policy Gradient|TD3]]
| Twin Delayed Deep Deterministic Policy Gradient || Model-Free || Off-policy || Continuous || Continuous || Q-value
|-
! scope="row" | [[Soft Actor-Critic|SAC]]
| Soft Actor-Critic || Model-Free || Off-policy || Continuous || Continuous || Advantage
|-
!scope="row" |[[Distributional Soft Actor-Critic|DSAC]]
|Distributional Soft Actor-Critic ||Model-free ||Off-policy ||Continuous ||Continuous ||Value distribution
|}
 
[[File:Challenges and Tricks of Deep RL.jpg|thumb|Challenges and tricks in deep reinforcement learning algorithms]]
Previously, it was believed that deep reinforcement learning (DRL) was a natural product of combining tabular RL and deep neural networks, and that its design was a trivial task. In practice, deep reinforcement learning is fundamentally complicated because it inherits a few serious challenges from both reinforcement learning and deep learning. Some challenges, including non-i.i.d. sequential data, easy divergence, overestimation, and sample inefficiency, yield particularly destructive outcomes if they are not well treated. A few empirical but useful tricks have been proposed to address these prominent issues, and they form the basis of various advanced DRL algorithms. These tricks include experience replay (ExR), parallel exploration (PEx), separated target network (STN), delayed policy update (DPU), constrained policy update (CPU), clipped actor criterion (CAC), double Q-functions (DQF), bounded double Q-functions (BDQ), distributional return function (DRF), entropy regularization (EnR), and soft value function (SVF).<ref name="Li-2023">{{cite book |last1=Li |first1=Shengbo |title= Reinforcement Learning for Sequential Decision and Optimal Control |date=2023 |___location=Springer Verlag, Singapore |isbn=978-9-811-97783-1 |pages=1–460 |doi=10.1007/978-981-19-7784-8 |s2cid=257928563 |edition=First | url=https://link.springer.com/book/10.1007/978-981-19-7784-8}}</ref>
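Two of the most widely used of these tricks, experience replay and a separated target network, can be illustrated with a brief sketch. The following Python (PyTorch) code is only indicative: the buffer capacity and the Polyak coefficient <code>tau</code> are example values, not figures prescribed by the cited reference.

<syntaxhighlight lang="python">
import random
from collections import deque

import torch

class ReplayBuffer:
    """Experience replay (ExR): store past transitions and sample them at random."""
    def __init__(self, capacity: int = 100_000):   # capacity is an illustrative choice
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Random minibatches break the temporal correlation of sequential data.
        return random.sample(self.buffer, batch_size)

def update_target(online_net: torch.nn.Module, target_net: torch.nn.Module,
                  tau: float = 0.005) -> None:
    """Separated target network (STN): move the target parameters slowly toward
    the online parameters (Polyak averaging) so bootstrapped targets change gradually."""
    with torch.no_grad():
        for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)
</syntaxhighlight>

Sampling stored transitions at random counteracts the non-i.i.d. nature of sequential data, while updating the target network only gradually stabilizes the bootstrapped regression targets and thereby helps to prevent divergence.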
 
Various techniques exist to train policies to solve tasks with deep reinforcement learning algorithms, each with its own benefits. At the highest level, there is a distinction between model-based and model-free reinforcement learning, which refers to whether the algorithm attempts to learn a forward model of the environment dynamics.
 
=== Exploration ===
 
An RL agent must balance the exploration/exploitation tradeoff: the problem of deciding whether to pursue actions that are already known to yield high rewards or explore other actions in order to discover higher rewards. RL agents usually collect data with some type of stochastic policy, such as a [[Boltzmann distribution]] in discrete action spaces or a [[Normal distribution|Gaussian distribution]] in continuous action spaces, inducing basic exploration behavior. The idea behind novelty-based, or curiosity-driven, exploration is giving the agent a motive to explore unknown outcomes in order to find the best solutions. This is done by "modify[ing] the loss function (or even the network architecture) by adding terms to incentivize exploration".<ref>{{cite book|last1=Reizinger|first1=Patrik|last2=Szemenyei|first2=Márton|date=2019-10-23|title=ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)|chapter=Attention-Based Curiosity-Driven Exploration in Deep Reinforcement Learning |pages=3542–3546 |doi=10.1109/ICASSP40776.2020.9054546 |arxiv=1910.10840|isbn=978-1-5090-6631-5 |s2cid=204852215 }}</ref> An agent may also be aided in exploration by utilizing demonstrations of successful trajectories, or by reward shaping, in which the agent is given intermediate rewards customized to fit the task it is attempting to complete.<ref>{{Citation|last=Wiewiora|first=Eric|title=Reward Shaping|date=2010|url=https://doi.org/10.1007/978-0-387-30164-8_731|encyclopedia=Encyclopedia of Machine Learning|pages=863–865|editor-last=Sammut|editor-first=Claude|place=Boston, MA|publisher=Springer US|language=en|doi=10.1007/978-0-387-30164-8_731|isbn=978-0-387-30164-8|access-date=2020-11-16|editor2-last=Webb|editor2-first=Geoffrey I.|url-access=subscription}}</ref>
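As an illustration of the stochastic policies mentioned above, Boltzmann exploration over a discrete action space can be written in a few lines. The sketch below is a generic example rather than a published implementation; the temperature value is arbitrary.

<syntaxhighlight lang="python">
import numpy as np

def boltzmann_action(q_values: np.ndarray, temperature: float = 1.0) -> int:
    """Sample an action index from a Boltzmann (softmax) distribution over Q-value estimates."""
    logits = q_values / temperature
    logits = logits - logits.max()                 # subtract the maximum for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))

# Higher-valued actions are sampled more often; lowering the temperature approaches
# greedy exploitation, while raising it approaches uniform random exploration.
action = boltzmann_action(np.array([1.0, 2.0, 0.5]), temperature=0.5)
</syntaxhighlight>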
 
=== Off-policy reinforcement learning ===
<ref name="francoislavet2018">{{Cite journal|last1=Francois-Lavet|first1=Vincent|last2=Henderson|first2=Peter|last3=Islam|first3=Riashat|last4=Bellemare|first4=Marc G.|last5=Pineau|first5=Joelle|date=2018|title=An Introduction to Deep Reinforcement Learning|journal=Foundations and Trends in Machine Learning|volume=11|issue=3–4|pages=219–354|arxiv=1811.12560|bibcode=2018arXiv181112560F|doi=10.1561/2200000071|issn=1935-8237|s2cid=54434537}}</ref>
<ref name="Hassabis">{{cite speech |last1=Demis |first1=Hassabis | date=March 11, 2016 |title= Artificial Intelligence and the Future. |url= https://www.youtube.com/watch?v=8Z2eLTSCuBk}}</ref>
<ref name="TD-Gammon">{{cite journal | url=http://www.bkgm.com/articles/tesauro/tdl.html | title=Temporal Difference Learning and TD-Gammon | date=March 1995 | last=Tesauro | first=Gerald | journal=Communications of the ACM | volume=38 | issue=3 | doi=10.1145/203330.203343 | pages=58–68 | s2cid=8763243 | doi-access-date=2017-03-10 | archive-url=https://web.archive.org/web/20100209103427/http://www.bkgm.com/articles/tesauro/tdl.html | archive-date=2010-02-09 | url-status=deadfree }}</ref>
<ref name="sutton1996">{{cite book |last1=Sutton |first1=Richard |last2=Barto |first2=Andrew |date=September 1996 |title=Reinforcement Learning: An Introduction |publisher=Athena Scientific}}</ref>
<ref name="tsitsiklis1996">{{cite book |last1=Bertsekas |first2=Dimitri |last2=Tsitsiklis |first1=John |date=September 1996 |title=Neuro-Dynamic Programming |url=http://athenasc.com/ndpbook.html |publisher=Athena Scientific |isbn=1-886529-10-8}}</ref>
<ref name="AlphaGo">{{Cite journal|title = Mastering the game of Go with deep neural networks and tree search|journal = [[Nature (journal)|Nature]]| issn= 0028-0836|pages = 484–489|volume = 529|issue = 7587|doi = 10.1038/nature16961|pmid = 26819042|first1 = David|last1 = Silver|author-link1=David Silver (programmer)|first2 = Aja|last2 = Huang|author-link2=Aja Huang|first3 = Chris J.|last3 = Maddison|first4 = Arthur|last4 = Guez|first5 = Laurent|last5 = Sifre|first6 = George van den|last6 = Driessche|first7 = Julian|last7 = Schrittwieser|first8 = Ioannis|last8 = Antonoglou|first9 = Veda|last9 = Panneershelvam|first10= Marc|last10= Lanctot|first11= Sander|last11= Dieleman|first12=Dominik|last12= Grewe|first13= John|last13= Nham|first14= Nal|last14= Kalchbrenner|first15= Ilya|last15= Sutskever|author-link15=Ilya Sutskever|first16= Timothy|last16= Lillicrap|first17= Madeleine|last17= Leach|first18= Koray|last18= Kavukcuoglu|first19= Thore|last19= Graepel|first20= Demis |last20=Hassabis|author-link20=Demis Hassabis|date= 28 January 2016|bibcode = 2016Natur.529..484S|s2cid = 515925}}{{closed access}}</ref>
<ref name="levine2016">{{Cite journal |last1=Levine |first1=Sergey |last2=Finn |first2=Chelsea |author-link2=Chelsea Finn |last3=Darrell |first3=Trevor |last4=Abbeel |first4=Pieter |date=January 2016 |title=End-to-end training of deep visuomotor policies |url=https://www.jmlr.org/papers/volume17/15-389/15-389.pdf |journal=JMLR |volume=17 |arxiv=1504.00702}}</ref>
<ref name="openaihand">{{Cite web|title=OpenAI - Solving Rubik's Cube With A Robot Hand|url=https://openai.com/blog/solving-rubiks-cube/|website=OpenAI|date=5 January 2021 }}</ref>
<ref name="openaihandarxiv">{{Cite conference|title= Solving Rubik's Cube with a Robot Hand |last1=OpenAI |display-authors=etal|date=2019|arxiv=1910.07113 }}</ref>
<ref name="deepmindcooling">{{Cite web|title=DeepMind AI Reduces Google Data Centre Cooling Bill by 40% |url=https://deepmind.com/blog/article/deepmind-ai-reduces-google-data-centre-cooling-bill-40|website=DeepMind|date=14 May 2024 }}</ref>
<ref name="neurips2021ml4ad">{{Cite web|title=Machine Learning for Autonomous Driving Workshop @ NeurIPS 2021|url=https://ml4ad.github.io/|website=NeurIPS 2021|date=December 2021}}</ref>
<ref name="williams1992">{{Cite journal|last1=Williams|first1=Ronald J|journal=Machine Learning|pages=229–256|title = Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning|date=1992|volume=8|issue=3–4|doi=10.1007/BF00992696|s2cid=2332513|doi-access=free}}</ref>
<ref name="schaul2015uva">{{Cite conference| title=Universal Value Function Approximators|last1=Schaul|first1=Tom |last2=Horgan|first2=Daniel |last3=Gregor|first3=Karol |last4=Silver|first4=David |conference=International Conference on Machine Learning (ICML) |date=2015| url=http://proceedings.mlr.press/v37/schaul15.html}}</ref>
<ref name="muzero">{{cite journal |last1=Schrittwieser |first1=Julian |last2=Antonoglou |first2=Ioannis |last3=Hubert |first3=Thomas |last4=Simonyan |first4=Karen |last5=Sifre |first5=Laurent |last6=Schmitt |first6=Simon |last7=Guez |first7=Arthur |last8=Lockhart |first8=Edward |last9=Hassabis |first9=Demis |last10=Graepel |first10=Thore |last11=Lillicrap |first11=Timothy |last12=Silver |first12=David |title=Mastering Atari, Go, chess and shogi by planning with a learned model |journal=Nature |date=23 December 2020 |volume=588 |issue=7839 |pages=604–609 |doi=10.1038/s41586-020-03051-4 |pmid=33361790 |url=https://www.nature.com/articles/s41586-020-03051-4|arxiv=1911.08265 |bibcode=2020Natur.588..604S |s2cid=208158225 }}</ref>
<ref name="loonrl">{{cite journal |last1=Bellemare |first1=Marc |last2=Candido |first2=Salvatore |last3=Castro |first3=Pablo |last4=Gong |first4=Jun |last5=Machado |first5=Marlos |last6=Moitra |first6=Subhodeep |last7=Ponda |first7=Sameera |last8=Wang |first8=Ziyu |title=Autonomous navigation of stratospheric balloons using reinforcement learning |journal=Nature |date=2 December 2020 |volume=588 |issue=7836 |pages=77–82 |doi=10.1038/s41586-020-2939-8 |pmid=33268863 |bibcode=2020Natur.588...77B |s2cid=227260253 |url=https://www.nature.com/articles/s41586-020-2939-8|url-access=subscription }}</ref>
<ref name="deepirl">{{cite arXiv| last1=Wulfmeier|first1=Markus|last2=Ondruska|first2=Peter|last3=Posner|first3=Ingmar|date=2015|title= Maximum Entropy Deep Inverse Reinforcement Learning |class=cs.LG|eprint=1507.04888}}</ref>
</references>