{{Short description|Machine learning that combines deep learning and reinforcement learning}}
{{Machine learning}}
<!--Per WP:CITELEAD, references are not needed in the lead if it is sourced in the body of the article.-->
'''Deep reinforcement learning''' ('''deep RL''' or '''DRL''') is a subfield of [[machine learning]] that combines [[reinforcement learning]] (RL) and [[deep learning]]. RL considers the problem of a computational agent learning to make decisions by trial and error, interacting with an environment so as to maximize cumulative rewards. Deep RL incorporates deep learning into the solution, using [[Artificial neural networks|deep neural networks]] to represent policies, value functions, or models of the environment; this allows agents to make decisions from unstructured, high-dimensional input data (e.g. every pixel rendered to the screen in a video game) without manual engineering of the [[state space]]. Since the introduction of the [[Q-learning|deep Q-network (DQN)]] in 2015, deep RL has achieved significant successes in domains such as [[Video game|games]], [[robotics]], and [[Autonomous system|autonomous systems]], and has been applied to a diverse set of areas including [[natural language processing]], [[computer vision]],<ref>{{Cite journal |last1=Le |first1=Ngan |last2=Rathour |first2=Vidhiwar Singh |last3=Yamazaki |first3=Kashu |last4=Luu |first4=Khoa |last5=Savvides |first5=Marios |date=2022-04-01 |title=Deep reinforcement learning in computer vision: a comprehensive survey |url=https://doi.org/10.1007/s10462-021-10061-9 |journal=Artificial Intelligence Review |language=en |volume=55 |issue=4 |pages=2733–2819 |doi=10.1007/s10462-021-10061-9 |issn=1573-7462|arxiv=2108.11510 }}</ref> education, transportation, finance and [[Health care|healthcare]].<ref name="francoislavet2018"/>
 
== Overview ==
 
=== Deep learning ===
[[File:Neural_network_example.svg|thumb|241x241px|Depiction of a basic artificial neural network]]
[[Deep learning]] is a form of [[machine learning]] that transforms a set of inputs into a set of outputs via an [[artificial neural network]]. Deep learning methods, often using [[supervised learning]] with labeled datasets, have been shown to solve tasks that involve handling complex, high-dimensional raw input data (such as images) with less manual [[feature engineering]] than prior methods, enabling significant progress in several fields including [[computer vision]] and [[natural language processing]]. In the past decade, deep RL has achieved remarkable results on a range of problems, from single-player and multiplayer games such as [[Go (game)|Go]], [[Atari]] games, and ''[[Dota 2]]'' to robotics.<ref>{{Cite web |last=Graesser |first=Laura |title=Foundations of Deep Reinforcement Learning: Theory and Practice in Python |url=https://openlibrary.telkomuniversity.ac.id/home/catalog/id/198650/slug/foundations-of-deep-reinforcement-learning-theory-and-practice-in-python.html |access-date=2023-07-01 |website=Open Library Telkom University}}</ref>
 
=== Reinforcement learning ===
[[File:Markov_diagram_v2.svg|alt=Diagram explaining the loop recurring in reinforcement learning algorithms|thumb|Diagram of the loop recurring in reinforcement learning algorithms]]
[[Reinforcement learning]] is a process in which an agent learns to make decisions through trial and error. This problem is often modeled mathematically as a [[Markov decision process]] (MDP), where an agent at every timestep is in a state <math>s</math>, takes action <math>a</math>, receives a scalar reward and transitions to the next state <math>s'</math> according to environment dynamics <math>p(s'|s, a)</math>. The agent attempts to learn a policy <math>\pi(a|s)</math>, or map from observations to actions, in order to maximize its returns (expected sum of rewards). In reinforcement learning (as opposed to [[optimal control]]) the algorithm only has access to the dynamics <math>p(s'|s, a)</math> through sampling.
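
As a concrete illustration of this loop, the following sketch (an illustrative toy example, not code from the cited literature) defines a two-state MDP directly as transition probabilities <math>p(s'|s, a)</math>, samples transitions from it, and accumulates the discounted return of a random policy. All names in it (<code>transitions</code>, <code>rollout</code>, the state and action labels) are invented for the example.

<syntaxhighlight lang="python">
import random

# Toy MDP: transitions[(s, a)] lists (probability, next_state, reward) triples.
transitions = {
    ("low", "wait"):  [(1.0, "low", 0.0)],
    ("low", "work"):  [(0.7, "high", 1.0), (0.3, "low", 0.0)],
    ("high", "wait"): [(0.6, "high", 1.0), (0.4, "low", 0.0)],
    ("high", "work"): [(1.0, "high", 2.0)],
}
actions = ["wait", "work"]
gamma = 0.9  # discount factor

def step(state, action):
    """Sample (next_state, reward) from the dynamics p(s'|s, a)."""
    u, cumulative = random.random(), 0.0
    for prob, next_state, reward in transitions[(state, action)]:
        cumulative += prob
        if u <= cumulative:
            return next_state, reward
    return next_state, reward  # fall back to the last listed outcome

def rollout(policy, start_state="low", horizon=20):
    """Run one episode and return its discounted sum of rewards."""
    state, ret, discount = start_state, 0.0, 1.0
    for _ in range(horizon):
        state, reward = step(state, policy(state))
        ret += discount * reward
        discount *= gamma
    return ret

print(rollout(lambda s: random.choice(actions)))
</syntaxhighlight>

A reinforcement learning algorithm only observes such sampled transitions; it would adjust the policy to increase the return without ever reading the transition table directly.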
 
=== Deep reinforcement learning ===
In many practical decision-making problems, the states <math>s</math> of the MDP are high-dimensional (e.g., images from a camera or the raw sensor stream from a robot), which makes the problem intractable for traditional RL algorithms. Deep reinforcement learning algorithms incorporate deep learning to solve such MDPs, often representing the policy <math>\pi(a|s)</math> or other learned functions as a neural network and developing specialized algorithms that perform well in this setting.
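
To make "representing the policy as a neural network" concrete, the sketch below (an illustrative example only; the layer sizes and the stand-in observation are assumptions, not taken from the cited works) maps a flattened 84×84 observation to a probability distribution over four actions with a small fully connected network:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network representing pi(a|s) for a flattened 84x84 input.
obs_dim, hidden_dim, n_actions = 84 * 84, 128, 4
W1 = rng.normal(0.0, 0.01, size=(hidden_dim, obs_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(0.0, 0.01, size=(n_actions, hidden_dim))
b2 = np.zeros(n_actions)

def policy(observation):
    """Return action probabilities pi(a|s) for a flattened observation."""
    h = np.maximum(0.0, W1 @ observation + b1)   # ReLU hidden layer
    logits = W2 @ h + b2
    exp = np.exp(logits - logits.max())          # softmax over actions
    return exp / exp.sum()

observation = rng.random(obs_dim)   # stand-in for a preprocessed game frame
probs = policy(observation)
action = rng.choice(n_actions, p=probs)
print(action, probs.round(3))
</syntaxhighlight>

Deep RL algorithms differ mainly in how they estimate gradients for parameters such as <code>W1</code> and <code>W2</code> so that the expected return increases; the main families are outlined under [[#Algorithms|Algorithms]] below.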
 
== History ==
Reinforcement learning (RL) is a framework in which agents interact with environments by taking actions and learning from feedback in the form of rewards or penalties. Traditional RL methods, such as [[Q-learning]] and policy gradient techniques, rely on tabular representations or linear approximations, which often do not scale to high-dimensional or continuous input spaces.
 
Along with rising interest in neural networks beginning in the mid 1980s, interest grew in deep reinforcement learning, where a neural network is used in reinforcement learning to represent policies or value functions. Because in such a system, the entire decision making process from sensors to motors in a robot or agent involves a single [[neural network]], it is also sometimes called end-to-end reinforcement learning.<ref name="Hassabis"/> One of the first successful applications of reinforcement learning with neural networks was [[TD-Gammon]], a computer program developed in 1992 for playing [[backgammon]].<ref name="TD-Gammon"/> Four inputs were used for the number of pieces of a given color at a given ___location on the board, totaling 198 input signals. With zero knowledge built in, the network learned to play the game at an intermediate level by self-play and [[temporal difference learning|TD(<math>\lambda</math>)]].
 
Seminal textbooks by [[Richard S. Sutton|Sutton]] and [[Andrew Barto|Barto]] on reinforcement learning,<ref name="sutton1996"/> [[Dimitri Bertsekas|Bertsekas]] and [[John Tsitsiklis|Tsitsiklis]] on neuro-dynamic programming,<ref name="tsitsiklis1996"/> and others<ref name="miller1990"/> advanced knowledge and interest in the field.
 
Katsunari Shibata's group showed that various functions emerge in this framework,<ref name="Shibata3"/><ref name="Shibata4"/><ref name="Shibata2"/> including image recognition, color constancy, sensor motion (active recognition), hand-eye coordination and hand reaching movement, explanation of brain activities, knowledge transfer, memory,<ref name="Shibata5"/> selective attention, prediction, and exploration.<ref name="Shibata4"/><ref name="Shibata6"/>
 
Starting around 2012, the so-called [[Deep learning#Deep learning revolution |deep learning revolution]] led to an increased interest in using deep neural networks as function approximators across a variety of domains. This led to a renewed interest in researchers using deep neural networks to learn the policy, value, and/or Q functions present in existing reinforcement learning algorithms.
 
Beginning around 2013, [[DeepMind]] showed impressive learning results using deep RL to play [[Atari]] video games.<ref name="DQN1"/><ref name="DQN2"/> The computer player was a neural network trained using a deep RL algorithm, a deep version of [[Q-learning]] they termed deep Q-networks (DQN), with the game score as the reward. They used a deep [[convolutional neural network]] to process four frames of RGB pixels (84×84) as inputs. All 49 games were learned using the same network architecture and with minimal prior knowledge, outperforming competing methods on almost all the games and performing at a level comparable or superior to a professional human game tester.<ref name="DQN2" />
 
Deep reinforcement learning reached another milestone in 2015 when [[AlphaGo]],<ref name="AlphaGo"/> a computer program trained with deep RL to play [[Go (game)|Go]], became the first computer Go program to beat a human professional Go player without handicap on a full-sized 19×19 board.
In a subsequent project in 2017, [[AlphaZero]] improved performance on Go while also demonstrating that the same algorithm could learn to play [[chess]] and [[shogi]] at a level competitive or superior to existing computer programs for those games; it was improved again in 2019 with [[MuZero]].<ref name="muzero"/> Separately, another milestone was achieved by researchers from [[Carnegie Mellon University]] in 2019, who developed [[Pluribus (poker bot)|Pluribus]], a computer program to play [[poker]] that was the first to beat professionals at multiplayer games of no-limit [[Texas hold 'em]]. [[OpenAI Five]], a program for playing five-on-five ''[[Dota 2]]'', beat the previous world champions in a demonstration match in 2019.
 
Deep reinforcement learning has also been applied to many domains beyond games. In robotics, it has been used to let robots perform simple household tasks<ref name="levine2016"/> and solve a Rubik's cube with a robot hand.<ref name="openaihand"/><ref name="openaihandarxiv"/> Deep RL has also found sustainability applications, used to reduce energy consumption at data centers.<ref name="deepmindcooling"/> Deep RL for [[autonomous driving]] is an active area of research in academia and industry.<ref name="neurips2021ml4ad"/> [[Loon_LLC|Loon]] explored deep RL for autonomously navigating their high-altitude balloons.<ref name="loonrl"/>
 
== Algorithms ==
Various techniques exist to train policies to solve tasks with deep reinforcement learning algorithms, each having their own benefits. At the highest level, there is a distinction between model-based and model-free reinforcement learning, which refers to whether the algorithm attempts to learn a forward model of the environment dynamics.
 
In '''model-based''' deep reinforcement learning algorithms, a forward model of the environment dynamics is estimated, usually by [[supervised learning]] using a neural network. Then, actions are obtained by using [[model predictive control]] using the learned model. Since the true environment dynamics will usually diverge from the learned dynamics, the agent re-plans often when carrying out actions in the environment. The actions selected may be optimized using [[Monte Carlo methods]] such as the [[cross-entropy method]], or a combination of model-learning with model-free methods.
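
As a rough sketch of how planning with a learned model can work, the following code (illustrative only; <code>learned_model</code> and <code>reward_fn</code> are simple stand-ins for functions that would normally be fitted by supervised learning) uses the [[cross-entropy method]] to search for a short action sequence under the model, of which only the first action would be executed before re-planning:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def learned_model(state, action):
    """Stand-in for a learned forward model predicting the next state."""
    return 0.9 * state + np.array([0.1, -0.05]) * action

def reward_fn(state):
    """Stand-in reward: stay close to the origin."""
    return -float(np.sum(state ** 2))

def cem_plan(state, horizon=10, n_samples=64, n_elite=8, n_iters=5):
    """Cross-entropy method over action sequences, evaluated with the model."""
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        seqs = rng.normal(mean, std, size=(n_samples, horizon))
        returns = []
        for seq in seqs:
            s, total = state.copy(), 0.0
            for a in seq:
                s = learned_model(s, a)
                total += reward_fn(s)
            returns.append(total)
        elite = seqs[np.argsort(returns)[-n_elite:]]   # keep the best sequences
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean[0]   # first action of the planned sequence, as in MPC

print(cem_plan(np.array([1.0, -0.5])))
</syntaxhighlight>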
 
In '''model-free''' deep reinforcement learning algorithms, a policy <math>\pi(a|s)</math> is learned without explicitly modeling the forward dynamics. A policy can be optimized to maximize returns by directly estimating the policy gradient,<ref name="williams1992"/> but this estimate suffers from high variance, making it impractical for use with function approximation in deep RL. Subsequent policy-gradient algorithms, such as trust region policy optimization (TRPO) and proximal policy optimization (PPO), have been developed for more stable learning and are widely applied.<ref name="schulman2015trpo"/><ref name="schulman2017ppo"/> Another class of model-free deep reinforcement learning algorithms relies on [[dynamic programming]], inspired by [[temporal difference learning]] and [[Q-learning]]. In discrete action spaces, these algorithms usually learn a neural network Q-function <math>Q(s, a)</math> that estimates the future returns of taking action <math>a</math> from state <math>s</math>;<ref name="DQN1"/> the deep Q-network additionally introduced techniques such as experience replay and a periodically updated target network to stabilize training.<ref name="DQN2"/> In continuous spaces, these algorithms often learn both a value estimate and a policy, an approach known as actor-critic; examples include deep deterministic policy gradient (DDPG), asynchronous advantage actor-critic (A3C), and soft actor-critic (SAC).<ref name="lillicrap2015ddpg"/><ref name="mnih2016a3c"/><ref name="haarnoja2018sac"/>

Other methods include [[multi-agent reinforcement learning]], hierarchical reinforcement learning, and approaches that integrate planning or memory mechanisms, depending on the complexity of the task and environment.
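
A minimal sketch of the value-based, model-free family is shown below. It uses a linear Q-function on an invented chain environment rather than the convolutional network of DQN, but it illustrates the ingredients named above: an epsilon-greedy behavior policy, an experience replay buffer, a periodically synchronized target network, and the temporal-difference target <math>r + \gamma \max_{a'} Q(s', a')</math>. Every environment and parameter detail here is an assumption made for the example.

<syntaxhighlight lang="python">
import random
from collections import deque

import numpy as np

random.seed(0)
n_states, n_actions, gamma, lr = 8, 2, 0.95, 0.05

def featurize(s):
    """One-hot state features; a real DQN would use a deep network instead."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

W = np.zeros((n_actions, n_states))   # online Q-function weights
W_target = W.copy()                   # target network weights
replay = deque(maxlen=5000)           # experience replay buffer

def q_values(weights, s):
    return weights @ featurize(s)     # vector of Q(s, a) for all actions

def env_step(s, a):
    """Toy chain: action 1 moves right, 0 moves left; reward at the right end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

s = 0
for t in range(5000):
    # Epsilon-greedy behavior policy, with epsilon annealed from 1.0 to 0.05.
    eps = max(0.05, 1.0 - t / 2000)
    if random.random() < eps:
        a = random.randrange(n_actions)
    else:
        a = int(np.argmax(q_values(W, s)))
    s_next, r, done = env_step(s, a)
    replay.append((s, a, r, s_next, done))
    s = 0 if done else s_next

    # Sample past transitions and take a semi-gradient step toward the TD target.
    for bs, ba, br, bs_next, bdone in random.sample(list(replay), min(32, len(replay))):
        target = br if bdone else br + gamma * np.max(q_values(W_target, bs_next))
        td_error = target - q_values(W, bs)[ba]
        W[ba] += lr * td_error * featurize(bs)

    if t % 100 == 0:                  # periodically synchronize the target network
        W_target = W.copy()

print(np.argmax(W, axis=0))           # greedy action per state (1 = move right)
</syntaxhighlight>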

== Applications ==
Deep reinforcement learning has been applied to a wide range of domains that require sequential decision-making and the ability to learn from high-dimensional input data.

One of the most well-known applications is in [[Video game|games]], where DRL agents have demonstrated performance comparable to or exceeding human-level benchmarks. DeepMind's AlphaGo and AlphaStar, as well as OpenAI Five, are notable examples of DRL systems mastering complex games such as [[Go (game)|Go]], [[StarCraft II]], and ''[[Dota 2]]''.<ref>Arulkumaran, K. et al. "A brief survey of deep reinforcement learning." arXiv preprint arXiv:1708.05866 (2017). https://arxiv.org/abs/1708.05866</ref> While these systems have demonstrated high performance in constrained environments, their success often depends on extensive computational resources and may not generalize easily to tasks outside their training domains.

In [[robotics]], DRL has been used to train agents for tasks such as locomotion, manipulation, and navigation in both simulated and real-world environments. By learning directly from sensory input, DRL enables robots to adapt to complex dynamics without relying on hand-crafted control rules.<ref>Li, Yuxi. "Deep Reinforcement Learning: An Overview." arXiv preprint arXiv:1701.07274 (2018). https://arxiv.org/abs/1701.07274</ref>

Other growing areas of application include [[finance]] (e.g., portfolio optimization), [[healthcare]] (e.g., treatment planning and medical decision-making), [[natural language processing]] (e.g., dialogue systems), and [[autonomous vehicles]] (e.g., path planning and control). These applications illustrate how DRL is used for real-world problems involving uncertainty, sequential reasoning, and high-dimensional data.<ref>OpenAI et al. "Open-ended learning leads to generally capable agents." arXiv preprint arXiv:2302.06622 (2023). https://arxiv.org/abs/2302.06622</ref>

== Challenges and limitations ==
Despite its successes, DRL faces several significant challenges that limit its broader deployment.

One of the most prominent issues is sample inefficiency: DRL algorithms often require millions of interactions with the environment to learn effective policies, which is impractical in many real-world settings where data collection is expensive or time-consuming.<ref>Li, Yuxi. "Deep Reinforcement Learning: An Overview." arXiv preprint arXiv:1701.07274 (2018). https://arxiv.org/abs/1701.07274</ref>

Another challenge is the sparse or delayed reward problem, in which feedback signals are infrequent, making it difficult for agents to attribute outcomes to specific decisions. Techniques such as reward shaping and exploration strategies have been developed to address this issue.<ref>Arulkumaran, K. et al. "A brief survey of deep reinforcement learning." arXiv preprint arXiv:1708.05866 (2017). https://arxiv.org/abs/1708.05866</ref>

DRL systems also tend to be sensitive to hyperparameters and to lack robustness across tasks or environments. Models trained in simulation frequently fail when deployed in the real world because of discrepancies between simulated and real-world dynamics, a problem known as the "reality gap". Bias and fairness in DRL systems have also emerged as concerns, particularly in domains like healthcare and finance where imbalanced data can lead to unequal outcomes for underrepresented groups.

Additionally, concerns about safety, interpretability, and reproducibility have become increasingly important, especially in high-stakes domains such as healthcare or autonomous driving. These issues remain active areas of research in the DRL community.
 
== Research ==
Deep reinforcement learning is an active area of research, with several lines of inquiry.
 
=== Exploration ===
 
An RL agent must balance the exploration/exploitation tradeoff: the problem of deciding whether to pursue actions that are already known to yield high rewards or explore other actions in order to discover higher rewards. RL agents usually collect data with some type of stochastic policy, such as a [[Boltzmann distribution]] in discrete action spaces or a [[Normal distribution|Gaussian distribution]] in continuous action spaces, inducing basic exploration behavior. The idea behind novelty-based, or curiosity-driven, exploration is giving the agent a motive to explore unknown outcomes in order to find the best solutions. This is done by "modify[ing] the loss function (or even the network architecture) by adding terms to incentivize exploration".<ref>{{cite book|last1=Reizinger|first1=Patrik|last2=Szemenyei|first2=Márton|date=2019-10-23|title=ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)|chapter=Attention-Based Curiosity-Driven Exploration in Deep Reinforcement Learning |pages=3542–3546 |doi=10.1109/ICASSP40776.2020.9054546 |arxiv=1910.10840|isbn=978-1-5090-6631-5 |s2cid=204852215 }}</ref> An agent may also be aided in exploration by utilizing demonstrations of successful trajectories, or reward-shaping, giving an agent intermediate rewards that are customized to fit the task it is attempting to complete.<ref>{{Citation|last=Wiewiora|first=Eric|title=Reward Shaping|date=2010|url=https://doi.org/10.1007/978-0-387-30164-8_731|encyclopedia=Encyclopedia of Machine Learning|pages=863–865|editor-last=Sammut|editor-first=Claude|place=Boston, MA|publisher=Springer US|language=en|doi=10.1007/978-0-387-30164-8_731|isbn=978-0-387-30164-8|access-date=2020-11-16|editor2-last=Webb|editor2-first=Geoffrey I.|url-access=subscription}}</ref>
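
For instance, a Boltzmann (softmax) exploration rule turns estimated action values into a stochastic policy whose randomness is controlled by a temperature parameter, while Gaussian noise plays the analogous role for continuous actions. The snippet below is a self-contained illustration of these two rules, not code from the cited works.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                          # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs)

def gaussian_action(policy_mean, std=0.3):
    """Continuous-action analogue: add Gaussian noise around the policy mean."""
    return rng.normal(policy_mean, std)

q = [1.0, 1.5, 0.2]
print(boltzmann_action(q, temperature=2.0))   # high temperature: more exploration
print(boltzmann_action(q, temperature=0.1))   # low temperature: nearly greedy
print(gaussian_action(np.array([0.4, -0.2])))
</syntaxhighlight>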
 
=== Off-policy reinforcement learning ===
 
An important distinction in RL is the difference between on-policy algorithms, which require evaluating or improving the policy that collects the data, and off-policy algorithms, which can learn a policy from data generated by an arbitrary policy. Generally, value-function based methods such as [[Q-learning]] are better suited for off-policy learning and have better sample efficiency, since data can be re-used for learning, reducing the amount of data required to learn a task. At the extreme, offline (or "batch") RL considers learning a policy from a fixed dataset without additional interaction with the environment.
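
The distinction can be made concrete with a small batch example: below, tabular Q-learning is run over a fixed dataset of transitions gathered once by a uniformly random behavior policy, and the greedy policy extracted afterwards differs from the policy that collected the data. The chain environment and every constant in it are invented for illustration.

<syntaxhighlight lang="python">
import random

random.seed(0)
n_states, n_actions, gamma, lr = 5, 2, 0.9, 0.1

def env_step(s, a):
    """Toy chain: action 1 moves right (reward at the right end), 0 moves left."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

# Offline dataset collected once by a uniformly random behavior policy.
dataset, s = [], 0
for _ in range(2000):
    a = random.randrange(n_actions)
    s_next, r = env_step(s, a)
    dataset.append((s, a, r, s_next))
    s = 0 if s_next == n_states - 1 else s_next

# Off-policy Q-learning: repeatedly sweep the fixed dataset, no new interaction.
Q = [[0.0] * n_actions for _ in range(n_states)]
for _ in range(50):
    for s, a, r, s_next in dataset:
        target = r + gamma * max(Q[s_next])
        Q[s][a] += lr * (target - Q[s][a])

greedy = [Q[s].index(max(Q[s])) for s in range(n_states)]
print(greedy)   # non-terminal states prefer moving right, unlike the random data policy
</syntaxhighlight>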
 
=== Inverse reinforcement learning ===
 
Inverse RL refers to inferring the reward function of an agent given the agent's behavior. Inverse reinforcement learning can be used for learning from demonstrations (or [[apprenticeship learning]]) by inferring the demonstrator's reward and then optimizing a policy to maximize returns with RL. Deep learning approaches have been used for various forms of imitation learning and inverse RL.<ref name="deepirl"/>
 
=== Goal-conditioned reinforcement learning ===
 
Another active area of research is learning goal-conditioned policies, also called contextual or universal policies, <math>\pi(a|s, g)</math>, that take in an additional goal <math>g</math> as input to communicate a desired aim to the agent.<ref name="schaul2015uva"/> Hindsight experience replay is a method for goal-conditioned RL that involves storing and learning from previous failed attempts to complete a task.<ref name="andrychowicz2017her"/> While a failed attempt may not have reached the intended goal, it can serve as a lesson for how to achieve the unintended result through hindsight relabeling.
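
The relabeling step can be illustrated directly: a failed trajectory toward one goal is stored a second time as if a state it actually reached had been the goal, turning an unsuccessful episode into useful training signal. The sketch below is schematic; the 1-D environment, the transition format, and the trajectory are all invented for the example.

<syntaxhighlight lang="python">
def reward(state, goal):
    """Sparse goal-conditioned reward: 1 when the goal is reached, else 0."""
    return 1.0 if state == goal else 0.0

# A failed episode on a 1-D line: the agent aimed for position 5 but ended at 3.
goal = 5
states = [0, 1, 1, 2, 3]          # visited positions s_0 ... s_T
actions = [+1, 0, +1, +1]         # actions a_0 ... a_{T-1}

replay = []
achieved = states[-1]             # the goal the agent actually reached
for t in range(len(actions)):
    s, a, s_next = states[t], actions[t], states[t + 1]
    # Original transition: the true goal was never reached, so reward is 0.
    replay.append((s, goal, a, reward(s_next, goal), s_next))
    # Hindsight relabeling: pretend the finally reached state was the goal.
    replay.append((s, achieved, a, reward(s_next, achieved), s_next))

for transition in replay:
    print(transition)             # the relabeled copy of the last step has reward 1
</syntaxhighlight>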
 
=== Multi-agent reinforcement learning ===
 
Many applications of reinforcement learning do not involve just a single agent, but rather a collection of agents that learn together and co-adapt. These agents may be competitive, as in many games, or cooperative as in many real-world multi-agent systems. [[Multi-agent reinforcement learning]] studies the problems introduced in this setting.
 
=== Generalization ===
 
The promise of using deep learning tools in reinforcement learning is generalization: the ability to operate correctly on previously unseen inputs. For instance, neural networks trained for image recognition can recognize that a picture contains a bird even if they have never seen that particular image or even that particular bird. Since deep RL allows raw data (e.g. pixels) as input, there is a reduced need to predefine the environment, allowing the model to be generalized to multiple applications. With this layer of abstraction, deep reinforcement learning algorithms can be designed in a way that allows them to be general, so that the same model can be used for different tasks.<ref name="packer2019"/> One method of increasing the ability of policies trained with deep RL to generalize is to incorporate [[representation learning]].

== Recent advances ==
Recent developments in DRL have introduced new architectures and training strategies aimed at improving performance, efficiency, and generalization.

One key area of progress is model-based reinforcement learning, in which agents learn an internal model of the environment and use it to simulate outcomes before acting, improving sample efficiency and planning. An example is the Dreamer algorithm, which learns a latent space model to train agents more efficiently in complex environments.<ref>Hafner, D. et al. "Dream to control: Learning behaviors by latent imagination." arXiv preprint arXiv:1912.01603 (2019). https://arxiv.org/abs/1912.01603</ref>

Another major innovation is the use of transformer-based architectures in DRL. Unlike traditional models that rely on recurrent or convolutional networks, transformers can model long-term dependencies more effectively. The Decision Transformer and similar models treat RL as a sequence modeling problem, enabling agents to generalize better across tasks.<ref>Kostas, J. et al. "Transformer-based reinforcement learning agents." arXiv preprint arXiv:2209.00588 (2022). https://arxiv.org/abs/2209.00588</ref>

In addition, research into open-ended learning has led to agents that can solve a range of tasks without task-specific tuning. Systems such as those developed by OpenAI show that agents trained in diverse, evolving environments can generalize to new challenges, moving toward more adaptive and flexible intelligence.<ref>OpenAI et al. "Open-ended learning leads to generally capable agents." arXiv preprint arXiv:2302.06622 (2023). https://arxiv.org/abs/2302.06622</ref>

== Future directions ==
As deep reinforcement learning continues to evolve, researchers are exploring ways to make algorithms more efficient, robust, and generalizable across a wide range of tasks. Improving sample efficiency through model-based learning, enhancing generalization with open-ended training environments, and integrating foundation models are among the current research goals.

A similar area of interest is safe and ethical deployment, particularly in high-risk settings like healthcare, autonomous driving, and finance. Researchers are developing frameworks for safer exploration, interpretability, and better alignment with human values. Ensuring that DRL systems promote equitable outcomes remains an ongoing challenge, especially where historical data may under-represent marginalized populations.

The future of DRL may also involve closer integration with other subfields of machine learning, such as unsupervised learning, transfer learning, and large language models, enabling agents that can learn from diverse data modalities and interact more naturally with human users.<ref>OpenAI et al. "Open-ended learning leads to generally capable agents." arXiv preprint arXiv:2302.06622 (2023). https://arxiv.org/abs/2302.06622</ref>
 
== References ==
<references>
<ref name="packer2019">{{cite arXiv|last1=Packer|first1=Charles|last2=Gao|first2=Katelyn|last3=Kos|first3=Jernej|last4=Krähenbühl|first4=Philipp|last5=Koltun|first5=Vladlen|last6=Song|first6=Dawn|date=2019-03-15|title=Assessing Generalization in Deep Reinforcement Learning|class=cs.LG|eprint=1810.12282}}</ref>
<ref name="francoislavet2018">{{Cite journal|last1=Francois-Lavet|first1=Vincent|last2=Henderson|first2=Peter|last3=Islam|first3=Riashat|last4=Bellemare|first4=Marc G.|last5=Pineau|first5=Joelle|date=2018|title=An Introduction to Deep Reinforcement Learning|journal=Foundations and Trends in Machine Learning|volume=11|issue=3–4|pages=219–354|arxiv=1811.12560|bibcode=2018arXiv181112560F|doi=10.1561/2200000071|issn=1935-8237|s2cid=54434537}}</ref>
<ref name="Hassabis">{{cite speech |last1=Demis |first1=Hassabis | date=March 11, 2016 |title= Artificial Intelligence and the Future. |url= https://www.youtube.com/watch?v=8Z2eLTSCuBk}}</ref>
<ref name="TD-Gammon">{{cite journal | title=Temporal Difference Learning and TD-Gammon | date=March 1995 | last=Tesauro | first=Gerald | journal=Communications of the ACM | volume=38 | issue=3 | doi=10.1145/203330.203343 | pages=58–68 | s2cid=8763243 | doi-access=free }}</ref>
<ref name="sutton1996">{{cite book |last1=Sutton |first1=Richard |last2=Barto |first2=Andrew |date=September 1996 |title=Reinforcement Learning: An Introduction |publisher=Athena Scientific}}</ref>
<ref name="tsitsiklis1996">{{cite book |last1=Bertsekas |first2=Dimitri |last2=Tsitsiklis |first1=John |date=September 1996 |title=Neuro-Dynamic Programming |url=http://athenasc.com/ndpbook.html |publisher=Athena Scientific |isbn=1-886529-10-8}}</ref>
<ref name="miller1990">{{cite book |last1=Miller |first1=W. Thomas |last2=Werbos |first2=Paul |last3=Sutton |first3=Richard |date=1990 |title=Neural Networks for Control}}</ref>
<ref name="Shibata3">{{cite conference |first1= Katsunari |last1= Shibata |first2= Yoichi |last2= Okabe |year= 1997 |title= Reinforcement Learning When Visual Sensory Signals are Directly Given as Inputs |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/ICNN97.pdf |conference= International Conference on Neural Networks (ICNN) 1997 |access-date= 2020-12-01 |archive-date= 2020-12-09 |archive-url= https://web.archive.org/web/20201209090005/http://shws.cc.oita-u.ac.jp/~shibata/pub/ICNN97.pdf |url-status= dead }}</ref>
<ref name="Shibata4">{{cite conference |first1= Katsunari |last1= Shibata |first2= Masaru |last2= Iida |year= 2003 |title= Acquisition of Box Pushing by Direct-Vision-Based Reinforcement Learning |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/SICE03.pdf |conference= SICE Annual Conference 2003 |access-date= 2020-12-01 |archive-date= 2020-12-09 |archive-url= https://web.archive.org/web/20201209052433/http://shws.cc.oita-u.ac.jp/~shibata/pub/SICE03.pdf |url-status= dead }}</ref>
<ref name="Shibata2">{{cite arXiv |last=Shibata |first=Katsunari |title=Functions that Emerge through End-to-End Reinforcement Learning | date=March 7, 2017 |eprint=1703.02239 |class=cs.AI }}</ref>
<ref name="Shibata5">{{cite conference |first1= Hiroki |last1= Utsunomiya |first2= Katsunari |last2= Shibata |year= 2008 |title= Contextual Behavior and Internal Representations Acquired by Reinforcement Learning with a Recurrent Neural Network in a Continuous State and Action Space Task |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/ICONIP08Utsunomiya.pdf |conference= International Conference on Neural Information Processing (ICONIP) '08 |access-date= 2020-12-14 |archive-date= 2017-08-10 |archive-url= https://web.archive.org/web/20170810040023/http://shws.cc.oita-u.ac.jp/~shibata/pub/ICONIP08Utsunomiya.pdf |url-status= dead }}</ref>
<ref name="Shibata6">{{cite conference |first1= Katsunari |last1= Shibata |first2= Tomohiko |last2= Kawano |year= 2008 |title= Learning of Action Generation from Raw Camera Images in a Real-World-like Environment by Simple Coupling of Reinforcement Learning and a Neural Network |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/ICONIP98.pdf |conference= International Conference on Neural Information Processing (ICONIP) '08 |access-date= 2020-12-01 |archive-date= 2020-12-11 |archive-url= https://web.archive.org/web/20201211110750/http://shws.cc.oita-u.ac.jp/~shibata/pub/ICONIP98.pdf |url-status= dead }}</ref>
<ref name="DQN1">{{cite conference |first= Volodymyr|display-authors=etal|last= Mnih |date=December 2013 |title= Playing Atari with Deep Reinforcement Learning |url= https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf |conference= NIPS Deep Learning Workshop 2013}}</ref>
<ref name="DQN2">{{cite journal |first= Volodymyr|display-authors=etal|last= Mnih |year=2015 |title= Human-level control through deep reinforcement learning |journal=Nature|volume=518 |issue=7540 |pages=529–533 |doi=10.1038/nature14236|pmid=25719670|bibcode=2015Natur.518..529M |s2cid=205242740}}</ref>
<ref name="AlphaGo">{{Cite journal|title = Mastering the game of Go with deep neural networks and tree search|journal = [[Nature (journal)|Nature]]| issn= 0028-0836|pages = 484–489|volume = 529|issue = 7587|doi = 10.1038/nature16961|pmid = 26819042|first1 = David|last1 = Silver|author-link1=David Silver (programmer)|first2 = Aja|last2 = Huang|author-link2=Aja Huang|first3 = Chris J.|last3 = Maddison|first4 = Arthur|last4 = Guez|first5 = Laurent|last5 = Sifre|first6 = George van den|last6 = Driessche|first7 = Julian|last7 = Schrittwieser|first8 = Ioannis|last8 = Antonoglou|first9 = Veda|last9 = Panneershelvam|first10= Marc|last10= Lanctot|first11= Sander|last11= Dieleman|first12=Dominik|last12= Grewe|first13= John|last13= Nham|first14= Nal|last14= Kalchbrenner|first15= Ilya|last15= Sutskever|author-link15=Ilya Sutskever|first16= Timothy|last16= Lillicrap|first17= Madeleine|last17= Leach|first18= Koray|last18= Kavukcuoglu|first19= Thore|last19= Graepel|first20= Demis |last20=Hassabis|author-link20=Demis Hassabis|date= 28 January 2016|bibcode = 2016Natur.529..484S|s2cid = 515925}}{{closed access}}</ref>
<ref name="levine2016">{{Cite journal |last1=Levine |first1=Sergey |last2=Finn |first2=Chelsea |author-link2=Chelsea Finn |last3=Darrell |first3=Trevor |last4=Abbeel |first4=Pieter |date=January 2016 |title=End-to-end training of deep visuomotor policies |url=https://www.jmlr.org/papers/volume17/15-389/15-389.pdf |journal=JMLR |volume=17 |arxiv=1504.00702}}</ref>
<ref name="openaihand">{{Cite web|title=OpenAI - Solving Rubik's Cube With A Robot Hand|url=https://openai.com/blog/solving-rubiks-cube/|website=OpenAI|date=5 January 2021 }}</ref>
<ref name="openaihandarxiv">{{Cite conference|title= Solving Rubik's Cube with a Robot Hand |last1=OpenAI |display-authors=etal|date=2019|arxiv=1910.07113 }}</ref>
<ref name="deepmindcooling">{{Cite web|title=DeepMind AI Reduces Google Data Centre Cooling Bill by 40% |url=https://deepmind.com/blog/article/deepmind-ai-reduces-google-data-centre-cooling-bill-40|website=DeepMind|date=14 May 2024 }}</ref>
<ref name="neurips2021ml4ad">{{Cite web|title=Machine Learning for Autonomous Driving Workshop @ NeurIPS 2021|url=https://ml4ad.github.io/|website=NeurIPS 2021|date=December 2021}}</ref>
<ref name="williams1992">{{Cite journal|last1=Williams|first1=Ronald J|journal=Machine Learning|pages=229–256|title = Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning|date=1992|volume=8|issue=3–4|doi=10.1007/BF00992696|s2cid=2332513|doi-access=free}}</ref>
<ref name="schulman2017ppo">{{Cite conference|title=Proximal Policy Optimization Algorithms |last1=Schulman|first1=John|last2=Wolski|first2=Filip|last3=Dhariwal|first3=Prafulla|last4=Radford|first4=Alec|last5=Klimov|first5=Oleg|date=2017|arxiv=1707.06347}}</ref>
<ref name="schulman2015trpo">{{Cite conference|title=Trust Region Policy Optimization |last1=Schulman|first1=John|last2=Levine|first2=Sergey|last3=Moritz|first3=Philipp|last4=Jordan|first4=Michael|last5=Abbeel|first5=Pieter|date=2015|arxiv=1502.05477|conference=International Conference on Machine Learning (ICML)}}</ref>
<ref name="lillicrap2015ddpg">{{Cite conference|title=Continuous control with deep reinforcement learning |last1=Lillicrap|first1=Timothy |last2=Hunt|first2=Jonathan |last3=Pritzel|first3=Alexander |last4=Heess|first4=Nicolas |last5=Erez|first5=Tom |last6=Tassa|first6=Yuval |last7=Silver|first7=David |last8=Wierstra|first8=Daan |conference=International Conference on Learning Representations (ICLR)|date=2016|arxiv=1509.02971}}</ref>
<ref name="mnih2016a3c">{{Cite conference|title=Asynchronous Methods for Deep Reinforcement Learning |last1=Mnih|first1=Volodymyr |last2=Puigdomenech Badia|first2=Adria |last3=Mirzi|first3=Mehdi |last4=Graves|first4=Alex |last5=Harley|first5=Tim |last6=Lillicrap|first6=Timothy |last7=Silver|first7=David |last8=Kavukcuoglu|first8=Koray |conference=International Conference on Machine Learning (ICML)|date=2016|arxiv=1602.01783}}</ref>
<ref name="haarnoja2018sac">{{Cite conference|title=Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor |last1=Haarnoja|first1=Tuomas |last2=Zhou|first2=Aurick |last3=Levine|first3=Sergey |last4=Abbeel|first4=Pieter |conference=International Conference on Machine Learning (ICML)|date=2018|arxiv=1801.01290}}</ref>
<ref name="andrychowicz2017her">{{Cite conference| last1=Andrychowicz|first1=Marcin| last2=Wolski|first2=Filip| last3=Ray|first3=Alex| last4=Schneider|first4=Jonas |last5=Fong|first5=Rachel |last6=Welinder|first6=Peter |last7=McGrew|first7=Bob|last8=Tobin|first8=Josh|last9=Abbeel|first9=Pieter|last10=Zaremba|first10=Wojciech|date=2018|title=Hindsight Experience Replay|arxiv=1707.01495|conference=Advances in Neural Information Processing Systems (NeurIPS)}}</ref>
<ref name="schaul2015uva">{{Cite conference| title=Universal Value Function Approximators|last1=Schaul|first1=Tom |last2=Horgan|first2=Daniel |last3=Gregor|first3=Karol |last4=Silver|first4=David |conference=International Conference on Machine Learning (ICML) |date=2015| url=http://proceedings.mlr.press/v37/schaul15.html}}</ref>
<ref name="muzero">{{cite journal |last1=Schrittwieser |first1=Julian |last2=Antonoglou |first2=Ioannis |last3=Hubert |first3=Thomas |last4=Simonyan |first4=Karen |last5=Sifre |first5=Laurent |last6=Schmitt |first6=Simon |last7=Guez |first7=Arthur |last8=Lockhart |first8=Edward |last9=Hassabis |first9=Demis |last10=Graepel |first10=Thore |last11=Lillicrap |first11=Timothy |last12=Silver |first12=David |title=Mastering Atari, Go, chess and shogi by planning with a learned model |journal=Nature |date=23 December 2020 |volume=588 |issue=7839 |pages=604–609 |doi=10.1038/s41586-020-03051-4 |pmid=33361790 |url=https://www.nature.com/articles/s41586-020-03051-4|arxiv=1911.08265 |bibcode=2020Natur.588..604S |s2cid=208158225 }}</ref>
<ref name="loonrl">{{cite journal |last1=Bellemare |first1=Marc |last2=Candido |first2=Salvatore |last3=Castro |first3=Pablo |last4=Gong |first4=Jun |last5=Machado |first5=Marlos |last6=Moitra |first6=Subhodeep |last7=Ponda |first7=Sameera |last8=Wang |first8=Ziyu |title=Autonomous navigation of stratospheric balloons using reinforcement learning |journal=Nature |date=2 December 2020 |volume=588 |issue=7836 |pages=77–82 |doi=10.1038/s41586-020-2939-8 |pmid=33268863 |bibcode=2020Natur.588...77B |s2cid=227260253 |url=https://www.nature.com/articles/s41586-020-2939-8|url-access=subscription }}</ref>
<ref name="deepirl">{{cite arXiv| last1=Wulfmeier|first1=Markus|last2=Ondruska|first2=Peter|last3=Posner|first3=Ingmar|date=2015|title= Maximum Entropy Deep Inverse Reinforcement Learning |class=cs.LG|eprint=1507.04888}}</ref>
</references>
 
[[Category:Machine learning algorithms]]
[[Category:Reinforcement learning]]
 
[[Category:Deep learning]]
 
[[Category:Wikipedia Student Program]]