'''Deep reinforcement learning (DRL)''' is a [[machine learning]] method that combines principles from [[reinforcement learning]] and [[deep learning]] to obtain the benefits of both.
 
Deep reinforcement learning has a large diversity of applications, including but not limited to video games, computer science, healthcare, and finance. Deep reinforcement learning algorithms are able to take in a huge amount of input data (e.g. every pixel rendered to the screen in a video game) and decide what actions need to be taken in order to reach a goal.<ref name=":1" />
 
== Overview ==
 
=== Reinforcement Learning ===
[[File:Markov diagram v2.svg|alt=Diagram explaining the loop recurring in reinforcement learning algorithms|thumb|Diagram of the loop recurring in reinforcement learning algorithms]][[Reinforcement learning]] is a process in which an agent learns to perform an action through trial and error. In this process, the agent receives a reward indicating whether its previous action was good or bad and aims to optimize its behavior based on this reward.<ref>{{Cite journal|last1=Parisi|first1=Simone|last2=Tateo|first2=Davide|last3=Hensel|first3=Maximilian|last4=D'Eramo|first4=Carlo|last5=Peters|first5=Jan|last6=Pajarinen|first6=Joni|date=2019-12-31|title=Long-Term Visitation Value for Deep Exploration in Sparse-Reward Reinforcement Learning|journal=Algorithms |volume=15 |issue=3 |page=81 |doi=10.3390/a15030081 |doi-access=free |arxiv=2001.00119 }}</ref>
 
 
=== Deep Learning ===
[[File:Neural network example.svg|thumb|241x241px|Depiction of a basic artificial neural network]]
[[Deep learning|Deep learning]] is a form of machine learning that transforms a set of inputs into a set of outputs via an [[artificial neural network]].
 
=== Deep Reinforcement Learning ===
Deep reinforcement learning combines reinforcement learning's technique of rewarding actions with deep learning's use of a neural network to process data.
 
== Applications ==
Deep reinforcement learning has been used for a variety of applications in the past, some of which include:
 
* The [[AlphaZero]] algorithm, developed by [[DeepMind]], which has achieved super-human performance in many games.<ref>{{Cite web|title=DeepMind - What if solving one problem could unlock solutions to thousands more?|url=https://deepmind.com/|access-date=2020-11-16|website=Deepmind}}</ref>
* Image enhancement models such as [[Generative adversarial network|GANs]] and U-Net, which have attained much higher performance than previous methods such as [[Super-resolution imaging|super-resolution]] and segmentation.<ref name=":1">{{Cite book|url=https://www.worldcat.org/oclc/1163522253|title=Deep reinforcement learning fundamentals, research and applications|date=2020|publisher=Springer|others=Dong, Hao., Ding, Zihan., Zhang, Shanghang.|isbn=978-981-15-4095-0|___location=Singapore|oclc=1163522253}}</ref>
* Procedural level generation in video games<ref>https://asmedigitalcollection.asme.org/computingengineering/article-abstract/20/5/051005/1074423/Deep-Reinforcement-Learning-for-Procedural-Content?redirectedFrom=fulltext</ref>
 
== Training ==
In order to have a functional agent, the algorithm must be trained with a certain goal. There are different techniques used to train agents, each with its own benefits.
 
=== Q-Learning ===
[[Deep Q-learning|Deep Q]] networks are learning algorithms without a specified model that analyze a situation and produce an action the agent should take.
 
[[Q-learning]] attempts to determine the optimal action given a specific state. The way this method determines the Q value, or quality of an action, can be loosely defined by a function taking in a state "s" and an action "a" and outputting the perceived quality of that action:
 
 
<math>Q(s,a)</math>
 
 
The training process of Q-learning involves exploring different actions and recording a table of q values that correspond to states and actions. Once the agent is sufficiently trained, the table should provide an accurate representation of the quality of actions given their state.<ref>{{Cite web|last=Violante|first=Andre|date=2019-07-01|title=Simple Reinforcement Learning: Q-learning|url=https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56|access-date=2020-11-16|website=Medium|language=en}}</ref>
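The minimal Python sketch below illustrates this training loop. The environment object <code>env</code>, its <code>reset()</code>/<code>step()</code> methods, and its <code>actions</code> list are hypothetical placeholders, and the update shown is the standard tabular Q-learning rule rather than a detail taken from the sources above.

<syntaxhighlight lang="python">
import random
from collections import defaultdict

def train_q_table(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch; `env` is an assumed environment with reset()/step()."""
    q = defaultdict(float)  # maps (state, action) pairs to estimated quality

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy choice between exploration and exploitation.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)

            # Standard tabular Q-learning update toward the bootstrapped target.
            best_next = max(q[(next_state, a)] for a in env.actions)
            target = reward if done else reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])

            state = next_state
    return q
</syntaxhighlight>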
=== Deep Q-Learning ===
[[Deep Q-learning]] takes the principles of standard Q-learning but approximates the q values using an artificial neural network. In many applications, there is too much input data to account for (e.g. the millions of pixels in a computer screen), which would make the standard process of determining the q values for each state and action take a large amount of time. By using a neural network to process the data and predict a q value for each available action, the algorithm can be much faster and, subsequently, process more data.<ref>{{Cite arXiv|last1=Ong|first1=Hao Yi|last2=Chavez|first2=Kevin|last3=Hong|first3=Augustus|date=2015-10-15|title=Distributed Deep Q-Learning|class=cs.LG |eprint=1508.04186 }}</ref>
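A hedged sketch of this idea is shown below, assuming the PyTorch library. The network size, the separate target network, and the batch format are common deep Q-learning conventions used here for illustration, not details taken from the cited paper.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a raw observation vector to one estimated q value per action."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def td_loss(q_net, target_net, batch, gamma=0.99):
    """One deep Q-learning loss step on a batch of (s, a, r, s', done) tensors."""
    obs, actions, rewards, next_obs, dones = batch
    # Q values of the actions that were actually taken.
    q_values = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from a slowly updated copy of the network.
        next_q = target_net(next_obs).max(dim=1).values
        targets = rewards + gamma * next_q * (1 - dones)
    return nn.functional.mse_loss(q_values, targets)
</syntaxhighlight>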
 
=== Challenges ===
There are many factors that cause problems in training using the reinforcement learning method, some of which are listed below:
 
==== '''Exploration Exploitation Dilemma''' ====
The exploration exploitation dilemma is the problem of deciding whether to pursue actions that are already known to yield success or explore other actions in order to discover greater success.
 
In the greedy learning policy, the agent chooses actions that have the greatest q value for the given state: <math>a=\arg\max_a Q(s,a)</math> With this solution, the agent may get stuck in a local maximum and not discover possible greater success because it focuses only on maximizing the q value given its current knowledge.
 
In the epsilon-greedy method of training, before determining each action the agent decides whether to prioritize exploration, taking an action with an uncertain outcome for the purpose of gaining more knowledge, or exploitation, picking an action that maximizes the q value. At every iteration, a random number between zero and one is selected. If this value falls below the specified value of epsilon, the agent chooses a random action to prioritize exploration; otherwise it selects the action that maximizes the q value. Higher values of epsilon will result in a greater amount of exploration.<ref name=":0">{{Cite journal|last1=Voytenko|first1=S. V.|last2=Galazyuk|first2=A. V.|date=February 2007|title=Intracellular recording reveals temporal integration in inferior colliculus neurons of awake bats|url=https://pubmed.ncbi.nlm.nih.gov/17135472|journal=Journal of Neurophysiology|volume=97|issue=2|pages=1368–1378|doi=10.1152/jn.00976.2006|issn=0022-3077|pmid=17135472}}</ref>
 
 
<math>a = \begin{cases} \operatorname{rand}(a_n) & \operatorname{rand}(0,1)\leq\epsilon \\ \arg\max_a Q(s,a) & \text{otherwise} \end{cases}</math>
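A direct Python translation of this selection rule might look like the sketch below; the q-value table <code>q</code> and the <code>actions</code> list are assumed placeholders rather than sourced interfaces.

<syntaxhighlight lang="python">
import random

def epsilon_greedy(q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the best-known one."""
    if random.random() <= epsilon:
        return random.choice(actions)                      # exploration
    return max(actions, key=lambda a: q[(state, a)])       # exploitation
</syntaxhighlight>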
 
 
Accounting for the agent's increasing competence over time can be done by using a [[Boltzmann distribution]] learning policy. This works by reducing the amount of exploration over the duration of the training period.
 
 
<math>P(a|s) = \frac{e^{Q(s,a)/T}}{\sum_b e^{Q(s,b)/T}}</math>
 
<math>T_{new}=e^{-dj}\,T_{max}+1</math>
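A Python sketch of this policy is shown below, under the assumption that <math>T_{max}</math> is the starting temperature, <math>j</math> indexes the training iteration, and <math>d</math> sets the decay rate; these readings of the formula, and the default constants, are illustrative rather than sourced.

<syntaxhighlight lang="python">
import numpy as np

def boltzmann_action(q_values, temperature):
    """Sample an action with probability proportional to exp(Q/T)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                         # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)

def decayed_temperature(j, t_max=10.0, d=0.01):
    """Temperature that decays exponentially from t_max toward 1 as training progresses."""
    return np.exp(-d * j) * t_max + 1
</syntaxhighlight>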
 
 
Another learning method is [[simulated annealing]]. In this method, as before, the agent decides either to explore an unknown action or to choose the action with the greatest q value, based on the equation below:
 
 
<math>a = \begin{cases} \operatorname{rand}(a_n) & \xi\leq e^{\left(Q(s,\operatorname{rand}(a_n))-\max_a Q(s,a)\right)/T} \\ \arg\max_a Q(s,a) & \text{otherwise} \end{cases}</math>
 
In this method, as the value of T decreases, the agent becomes more likely to pursue outcomes it already knows to be beneficial.<ref name=":0" />
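A sketch of this acceptance rule in Python is given below, again with a hypothetical q-value table and action list; <math>\xi</math> is drawn uniformly at random each time an action is chosen.

<syntaxhighlight lang="python">
import math
import random

def annealed_action(q, state, actions, temperature):
    """Accept a random action only if its q value is close enough to the best, given T."""
    candidate = random.choice(actions)
    best_value = max(q[(state, a)] for a in actions)
    accept_prob = math.exp((q[(state, candidate)] - best_value) / temperature)
    if random.random() <= accept_prob:
        return candidate                                   # exploration
    return max(actions, key=lambda a: q[(state, a)])       # exploitation
</syntaxhighlight>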
 
==== '''Frequency of Rewards''' ====
In training reinforcement learning algorithms, agents are rewarded based on their behavior. Variation in the frequency of rewards and in the occasions on which the agent is rewarded can have a large impact on the speed and quality of the outcome of training.
 
When the goal is too difficult for the learning algorithm to complete, the agent may never reach the goal and will never be rewarded. Additionally, if a reward is received only at the end of a task, the algorithm has no way to differentiate between good and bad behavior during the task.<ref>{{Cite journal|last1=Parisi|first1=Simone|last2=Tateo|first2=Davide|last3=Hensel|first3=Maximilian|last4=D'Eramo|first4=Carlo|last5=Peters|first5=Jan|last6=Pajarinen|first6=Joni|date=2019-12-31|title=Long-Term Visitation Value for Deep Exploration in Sparse-Reward Reinforcement Learning|journal=Algorithms |volume=15 |issue=3 |page=81 |doi=10.3390/a15030081 |doi-access=free |arxiv=2001.00119 }}</ref> For example, if an algorithm is attempting to learn how to play pong and it makes many correct moves but ultimately loses the point and is rewarded negatively, there is no way to determine which movements of the paddle were good and which were not, because the reward is too sparse.
 
==== '''[[Bias–variance tradeoff|Bias–Variance Tradeoff]]''' ====
When training a machine learning model, there is a tradeoff between how well the model fits the training data and how well it generalizes to the actual data of a problem. This is known as the bias-variance tradeoff. Bias refers to how simple the model is; a model with high bias will fit most data poorly because it cannot reflect the complexity of the data. Variance, by contrast, describes how closely the model fits the training data; a model with high variance will overfit the training data and fail to generalize because it is too specific to the training set. This tradeoff means it is important to balance bias and variance to find a model that represents the data as simply as possible, so that it can generalize beyond the training data without losing the complexity of the data.<ref>{{Cite web|last=Singh|first=Seema|date=2018-10-09|title=Understanding the Bias-Variance Tradeoff|url=https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229|access-date=2020-11-16|website=Medium|language=en}}</ref>
 
=== Optimizations ===
 
==== '''Reward Shaping''' ====
Reward shaping is the process of giving an agent intermediate rewards, customized to fit the task it is attempting to complete, while it learns. For example, an agent attempting to learn the game [[Atari Breakout]] may get a positive reward every time it successfully hits the ball and breaks a brick, instead of only when it completes a level. This will reduce the time it takes an agent to learn a task because it will have to do less guessing. However, using this method reduces the ability to generalize the algorithm to other applications, because the rewards would need to be tweaked for each individual circumstance, making it not an optimal solution.<ref>{{Citation|last=Wiewiora|first=Eric|title=Reward Shaping|date=2010|encyclopedia=Encyclopedia of Machine Learning|pages=863–865|editor-last=Sammut|editor-first=Claude|place=Boston, MA|publisher=Springer US|language=en|doi=10.1007/978-0-387-30164-8_731|isbn=978-0-387-30164-8|access-date=2020-11-16|editor2-last=Webb|editor2-first=Geoffrey I.}}</ref>
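As an illustration only, reward shaping is often implemented by wrapping the environment so that the reward signal is augmented. In the sketch below, the <code>info</code> dictionary and its <code>brick_broken</code> flag are hypothetical and not part of any real Breakout interface.

<syntaxhighlight lang="python">
class ShapedRewardEnv:
    """Wraps an environment and adds small intermediate rewards for progress."""
    def __init__(self, env, brick_bonus=0.1):
        self.env = env
        self.brick_bonus = brick_bonus

    def reset(self):
        return self.env.reset()

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        # Hypothetical event flag: reward each broken brick, not just the final score.
        if info.get("brick_broken", False):
            reward += self.brick_bonus
        return state, reward, done, info
</syntaxhighlight>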
 
==== '''Curiosity Driven Exploration''' ====
The idea behind curiosity-driven exploration is giving the agent a motive to explore unknown outcomes in order to find the best solutions. This is done by "modify[ing] the loss function (or even the network architecture) by adding terms to incentivize exploration".<ref>{{Cite book|last1=Reizinger|first1=Patrik|last2=Szemenyei|first2=Márton|date=2019-10-23|title=ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)|chapter=Attention-Based Curiosity-Driven Exploration in Deep Reinforcement Learning |pages=3542–3546 |doi=10.1109/ICASSP40776.2020.9054546 |arxiv=1910.10840 |isbn=978-1-5090-6631-5 }}</ref> The result is models that have a smaller chance of getting stuck in a local maximum of achievement.
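One common way to add such a term, sketched below with PyTorch, is to train a forward model and use its prediction error as an intrinsic reward; the layer sizes and the one-hot action encoding are illustrative assumptions, not details from the cited paper.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next observation; a large prediction error marks novel situations."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, obs_dim),
        )

    def curiosity_bonus(self, obs, action_onehot, next_obs):
        predicted = self.net(torch.cat([obs, action_onehot], dim=-1))
        return ((predicted - next_obs) ** 2).mean(dim=-1)

# The bonus is then added to the task reward, e.g.:
#     total_reward = extrinsic_reward + beta * model.curiosity_bonus(s, a, s_next)
</syntaxhighlight>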
 
==== '''Hindsight Experience Replay''' ====
Hindsight experience replay is a training method that involves storing and learning from previous failed attempts to complete a task, beyond just assigning a negative reward. While a failed attempt may not have reached the intended goal, it can serve as a lesson for how to achieve the unintended result.<ref>{{Cite arXiv|last1=Andrychowicz|first1=Marcin|last2=Wolski|first2=Filip|last3=Ray|first3=Alex|last4=Schneider|first4=Jonas|last5=Fong|first5=Rachel|last6=Welinder|first6=Peter|last7=McGrew|first7=Bob|last8=Tobin|first8=Josh|last9=Abbeel|first9=Pieter|last10=Zaremba|first10=Wojciech|date=2018-02-23|title=Hindsight Experience Replay|class=cs.LG |eprint=1707.01495 }}</ref>
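A minimal Python sketch of the relabeling idea follows; the goal-conditioned transition format and the binary reward are assumptions made for illustration, not the exact formulation in the cited paper.

<syntaxhighlight lang="python">
def relabel_with_hindsight(episode):
    """Treat the state actually reached at the end of a failed episode as the goal.

    `episode` is a list of (state, action, reward, next_state, goal) tuples.
    Returns extra transitions in which the achieved final state replaces the goal,
    so the failed attempt still teaches the agent how to reach *that* outcome.
    """
    achieved_goal = episode[-1][3]           # final next_state of the episode
    relabeled = []
    for state, action, _, next_state, _ in episode:
        reward = 1.0 if next_state == achieved_goal else 0.0
        relabeled.append((state, action, reward, next_state, achieved_goal))
    return relabeled
</syntaxhighlight>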
 
== Generalization ==
Deep reinforcement learning excels at generalization, or the ability to use one machine learning model for multiple tasks.
 
 
Reinforcement learning models require an indication of the state in order to function. When this state is provided by an artificial neural network, which is good at extracting features from raw data (e.g. pixels or raw image files), there is a reduced need to predefine the environment, allowing the model to be generalized to multiple applications. With this layer of abstraction, deep reinforcement learning algorithms can be designed in a way that allows them to become generalized, and the same model can be used for different tasks.<ref>{{Cite arXiv|last1=Packer|first1=Charles|last2=Gao|first2=Katelyn|last3=Kos|first3=Jernej|last4=Krähenbühl|first4=Philipp|last5=Koltun|first5=Vladlen|last6=Song|first6=Dawn|date=2019-03-15|title=Assessing Generalization in Deep Reinforcement Learning|class=cs.LG |eprint=1810.12282 }}</ref>
 
== References ==<!--- See http://en.wikipedia.org/wiki/Wikipedia:Footnotes on how to create references using <ref></ref> tags, these references will then appear here automatically -->