Deep reinforcement learning (DRL) is a machine learning method that combines principles from reinforcement learning and deep learning to obtain the benefits of both.
Deep reinforcement learning has a wide range of applications, including video games, computer science, healthcare, and finance. Games in particular have been very influential in the development of reinforcement learning algorithms because they require an algorithm to take in a huge amount of input data (e.g. every pixel rendered to the screen in a video game) and decide what action to take in order to reach a goal.
Overview
Reinforcement Learning
Reinforcement learning is a process in which an agent learns to perform an action through trial and error.[https://arxiv.org/abs/2001.00119] In this process, the agent receives a reward indicating whether its previous action was good or bad and aims to optimize its behavior based on this reward.
Deep Learning
Deep learning is a form of machine learning that uses an artificial neural network to transform a set of inputs into a set of outputs.
Deep Reinforcement Learning
Deep reinforcement learning combines reinforcement learning's technique of rewarding an agent based on its actions with deep learning's use of a neural network to process raw input data.
Applications
Deep reinforcement learning has been used for a variety of applications in the past, some of which include:
- The AlphaZero algorithm, developed by DeepMind, which has achieved superhuman performance in many games.
- Image enhancement models such as GANs and U-Net, which have attained much higher performance on tasks such as super-resolution and segmentation compared to previous methods[1]
Training
In order to produce a functional agent, the algorithm must be trained toward a specific goal. There are different techniques used to train agents, each with its own benefits.
Basics of training using DRL
Q-Learning
Q-learning is a model-free learning algorithm that analyzes the current situation (state) and produces an action the agent should take.
Q-learning attempts to determine the optimal action for a given state. The quality of an action, or Q value, can be loosely defined by a function that takes in a state "s" and an action "a" and outputs the perceived quality of that action:
Q(s,a)
The training process of Q-learning involves exploring different actions and recording a table of Q values, one for each state-action pair. Once the agent is sufficiently trained, the table should provide a good representation of the quality of each action in a given state.
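A minimal sketch of this tabular process is given below. The environment interface (`reset`, `step`, `actions`), the learning rate `alpha`, discount factor `gamma`, and exploration rate `epsilon` are illustrative assumptions rather than details from the original text:

```python
import random
from collections import defaultdict

def train_tabular_q(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: explore actions and record Q(s, a) values in a table."""
    q_table = defaultdict(float)  # maps (state, action) pairs to estimated quality

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Occasionally pick a random action so new state-action pairs get visited.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q_table[(state, a)])

            next_state, reward, done = env.step(action)

            # Nudge Q(s, a) toward the reward plus the discounted best future value.
            best_next = max(q_table[(next_state, a)] for a in env.actions)
            q_table[(state, action)] += alpha * (
                reward + gamma * best_next - q_table[(state, action)]
            )
            state = next_state

    return q_table
```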
Deep Q-Learning
Deep Q-learning takes the principles of standard Q-learning but approximates the Q values using an artificial neural network. In many applications there is far too much input data to account for (e.g. the millions of pixels on a computer screen), which would make the standard process of tabulating quality values for every state and action prohibitively slow. By using a neural network to process the data and predict a Q value for each available action, the algorithm can be much more efficient.
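As a rough illustration of the idea, the sketch below (using PyTorch; the layer sizes, state dimension, and action count are arbitrary assumptions) replaces the Q table with a small network that maps a state to one predicted Q value per available action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a) for every action at once from a raw state vector."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),  # one Q value per available action
        )

    def forward(self, state):
        return self.layers(state)

# The agent picks the action whose predicted Q value is highest.
net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)          # placeholder state instead of a real observation
action = net(state).argmax(dim=1)  # index of the best-looking action
```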
Challenges
There are many factors that cause problems when training with reinforcement learning, some of which are listed below:
Exploration-exploitation dilemma
The exploration-exploitation dilemma is the problem of deciding whether to pursue actions that are already known to yield success or to explore other pathways in order to discover greater success. There are two main learning policies used to address this problem: greedy and epsilon-greedy.
In the greedy learning policy, the agent always chooses the action that maximizes the Q value:
a = argmax_a Q(s,a)
https://search-proquest-com.proxy.library.ucsb.edu:9443/docview/1136383063?accountid=14522
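A short sketch of the two selection rules is given below; the Q values and the epsilon value are made-up placeholders for illustration:

```python
import random

def greedy_action(q_values):
    """Greedy policy: always take the action with the highest Q value."""
    return max(q_values, key=q_values.get)

def epsilon_greedy_action(q_values, epsilon=0.1):
    """Epsilon-greedy policy: occasionally pick a random action to keep exploring."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return greedy_action(q_values)

q_values = {"left": 0.2, "right": 0.7}   # placeholder Q estimates for one state
print(greedy_action(q_values))            # always "right"
print(epsilon_greedy_action(q_values))    # usually "right", sometimes "left"
```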
Frequency of rewards
When training reinforcement learning algorithms, agents are rewarded based on their behavior. Variation in how frequently, and on what occasions, the agent is rewarded can have a large impact on the speed and quality of training.
When the goal is too difficult for the learning algorithm to complete, the agent may never reach it and will never be rewarded. Additionally, if a reward is only received at the end of a task, the algorithm has no way to differentiate between good and bad behavior during the task. For example, if an algorithm is learning to play Pong and makes many correct moves but ultimately loses the point and receives a negative reward, there is no way to determine which movements of the paddle were good and which were not, because the reward is too sparse. (https://arxiv.org/abs/2001.00119)
Optimizations
Reward Shaping
Reward shaping is the process of giving an agent intermediate rewards, customized to fit the task, while it learns. For example, an agent attempting to learn the game Atari Breakout may get a positive reward every time it successfully hits the ball and breaks a brick, instead of only when it completes a level. This reduces the time it takes the agent to learn the task because it has to do less guessing. However, this method reduces the generalizability of the algorithm, because the reward triggers need to be tweaked for each individual circumstance, so it is not an optimal solution. (https://arxiv.org/abs/1903.02020)
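A hedged sketch of the Breakout example, assuming a gym-style environment whose info dictionary reports how many bricks were broken on each step (the `bricks_broken` field and the bonus size are invented for illustration):

```python
class ShapedRewardEnv:
    """Wraps an environment and adds an intermediate reward for each broken brick."""
    def __init__(self, env, brick_bonus=0.1):
        self.env = env
        self.brick_bonus = brick_bonus

    def reset(self):
        return self.env.reset()

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        # Reward shaping: pay out a small bonus per brick instead of waiting
        # for the sparse end-of-level reward.
        reward += self.brick_bonus * info.get("bricks_broken", 0)
        return state, reward, done, info
```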
Curiosity Driven Exploration
The idea behind curiosity driven exploration is to "modify the loss function (or even the network architecture) by adding terms to incentivize exploration"(https://arxiv.org/abs/1910.10840).
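One common way to add such a term is an intrinsic reward proportional to how badly a learned forward model predicted the next state, so that novel states look more rewarding. The sketch below is only illustrative: the toy forward model, the weighting `beta`, and the tensor shapes are assumptions, and in practice the forward model would be trained alongside the agent rather than left fixed:

```python
import torch
import torch.nn as nn

# Toy forward model predicting the next state from the current state and action.
forward_model = nn.Linear(4 + 1, 4)

def curiosity_reward(state, action, next_state, extrinsic_reward, beta=0.01):
    """Adds an exploration bonus equal to the forward model's prediction error."""
    predicted_next = forward_model(torch.cat([state, action], dim=-1))
    prediction_error = (predicted_next - next_state).pow(2).mean()
    # Poorly predicted (novel) states yield a larger bonus, encouraging exploration.
    return extrinsic_reward + beta * prediction_error.item()
```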
Hindsight Experience Replay
Hindsight experience replay is a training method that involves storing and learning from previous failed attempts to complete a task, rather than discarding them with just a negative reward. While a failed attempt may not have reached the intended goal, it can serve as a lesson for how to achieve the unintended result. (https://arxiv.org/abs/1707.01495)
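A simplified sketch of the relabeling idea, assuming goal-conditioned transitions stored as dictionaries and a reward of 1 only when the achieved state matches the goal (all names and the reward scheme are illustrative assumptions):

```python
def hindsight_relabel(episode):
    """Turn a failed episode into a useful one by pretending its final state was the goal."""
    achieved_goal = episode[-1]["next_state"]
    relabeled = []
    for transition in episode:
        new = dict(transition)
        new["goal"] = achieved_goal
        # Reward the agent for reaching the goal it actually achieved.
        new["reward"] = 1.0 if new["next_state"] == achieved_goal else 0.0
        relabeled.append(new)
    return relabeled
```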
Generalization
One thing deep reinforcement learning excels at is generalization, or the ability to use one machine learning model for multiple tasks.
When using reinforcement learning, the model must be aware of its environment, which is usually described manually. When reinforcement learning is combined with deep learning, which is very good at extracting features from raw data (e.g. pixels or raw image files), the algorithm gets the benefits of reinforcement learning without being told what its environment looks like. With this layer of abstraction, deep reinforcement learning algorithms can become generalized, and the same model can be used for different tasks. Automatic feature extraction can also provide much better accuracy than a human doing the same job.[2]
https://arxiv.org/abs/1810.12282
References
- ^ Dong, Hao; Ding, Zihan; Zhang, Shanghang (2020). Deep Reinforcement Learning: Fundamentals, Research and Applications. Singapore: Springer. ISBN 978-981-15-4095-0. OCLC 1163522253.
- ^ "https://ucsb-primo.hosted.exlibrisgroup.com/primo-explore/fulldisplay?docid=TN_proquest2074058918&vid=UCSB&search_scope=default_scope&tab=default_tab&lang=en_US&context=PC". ucsb-primo.hosted.exlibrisgroup.com. Retrieved 2020-10-22.