{{Userspace draft|source=ArticleWizard|date=October 2020}}
'''Deep reinforcement learning (DRL)''' is a [[machine learning]] method that combines principles from both [[reinforcement learning]] and [[deep learning]] to obtain the benefits of both.
== Overview ==
=== Deep reinforcement learning ===
Deep reinforcement learning combines reinforcement learning's technique of rewarding an agent based on its actions with deep learning's use of a [[neural network]] to process complex input data.
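This combination can be sketched in a few lines of Python. The sketch below is purely illustrative: the class and variable names are invented rather than taken from any particular library, and the weight-update step driven by the reward is omitted.
<syntaxhighlight lang="python">
import numpy as np

# A minimal two-layer network that maps a state vector to one
# score ("Q-value") per possible action. In deep reinforcement
# learning, a network like this replaces the lookup table used
# in classical reinforcement learning.
class QNetwork:
    def __init__(self, state_size, num_actions, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (state_size, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, num_actions))

    def q_values(self, state):
        h = np.maximum(0.0, state @ self.w1)  # ReLU hidden layer
        return h @ self.w2                    # one score per action

# The agent picks the action the network scores highest; the
# environment's reward would later be used to update the weights
# (that update step is omitted in this sketch).
net = QNetwork(state_size=4, num_actions=2)
action = int(np.argmax(net.q_values(np.zeros(4))))
</syntaxhighlight>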
== Applications ==
Deep reinforcement learning has been used in a variety of applications, including:
* [[Deep Q-learning|Deep Q]] networks: model-free learning algorithms that analyze a situation and produce an action the agent should take (a sketch of the update these networks learn follows this list).
* The [[AlphaZero]] algorithm, developed by [[DeepMind]], which has achieved superhuman performance in games such as chess, [[shogi]], and [[Go (game)|Go]].
* Image enhancement models such as [[Generative adversarial network|GANs]] and U-Net, which have attained much higher performance than previous methods on tasks such as [[Super-resolution imaging|super-resolution]] and segmentation.<ref>{{Cite book|title=Deep Reinforcement Learning: Fundamentals, Research and Applications|editor1=Hao Dong|editor2=Zihan Ding|editor3=Shanghang Zhang|date=2020|publisher=Springer|___location=Singapore|isbn=978-981-15-4095-0|oclc=1163522253|url=https://www.worldcat.org/oclc/1163522253}}</ref>
* Procedural level generation in video games.<ref>{{Cite web|title=Fix me :(|url=https://ucsb-primo.hosted.exlibrisgroup.com/primo-explore/fulldisplay?docid=TN_proquest2073529169&vid=UCSB&search_scope=default_scope&tab=default_tab&lang=en_US&context=PC|access-date=2020-10-29|website=ucsb-primo.hosted.exlibrisgroup.com|language=en}}</ref>
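As a rough illustration of the first item above, the quantity a deep Q network is trained to predict is the classical Q-learning target: the observed reward plus the discounted best score the network assigns to the next state. The sketch below uses made-up values and is not a complete implementation.
<syntaxhighlight lang="python">
import numpy as np

# Q-learning target that the deep Q network's output is trained
# to match: the reward just received plus the discounted value
# of the best action available in the next state.
def q_target(reward, next_q_values, done, gamma=0.99):
    if done:  # terminal step: no future reward to add
        return reward
    return reward + gamma * float(np.max(next_q_values))

# next_q_values would come from the network's forward pass on the
# next state; the numbers here are made up for illustration.
target = q_target(reward=1.0, next_q_values=np.array([0.3, 0.8]), done=False)
</syntaxhighlight>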
== Training ==
In order to have a functional agent, the algorithm must be trained with a certain goal.
=== Challenges ===
Some challenges that have to be overcome when training an agent include:
* '''Frequency of rewards:''' When the goal is too difficult for the learning algorithm to reach, the agent may never complete it and so is never rewarded. Additionally, if a reward is only received at the end of a task, the algorithm has no way to differentiate between good and bad behavior during the task. For example, if an agent learning to play [[Pong]] makes many correct moves but ultimately loses the point, the single negative reward gives it no way to determine which movements of the paddle were good and which were not; the reward is too sparse.<ref>https://arxiv.org/abs/2001.00119</ref> A short illustration of this problem follows this list.
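The sketch below illustrates the sparse-reward problem described above using a hypothetical Pong-like episode: every step earns zero reward and only the final step is scored, so nothing in the reward signal distinguishes good paddle movements from bad ones.
<syntaxhighlight lang="python">
import random

# Hypothetical Pong-like episode: the agent moves its paddle for
# many steps, but the only nonzero reward arrives when the point
# is finally lost.
def play_episode(num_steps=100):
    rewards = []
    for step in range(num_steps):
        action = random.choice(["up", "down", "stay"])  # paddle move
        is_last = step == num_steps - 1
        rewards.append(-1.0 if is_last else 0.0)
    return rewards

rewards = play_episode()
assert all(r == 0.0 for r in rewards[:-1])  # no signal until the end
</syntaxhighlight>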
=== Optimizations ===
* '''Reward shaping:''' Giving the agent intermediate rewards, shaped to the task, during learning. For example, an agent learning to play [[Atari Breakout]] might receive a small positive reward every time it hits the ball and breaks a brick, rather than a reward only for completing a level. While this reduces the time it takes an agent to learn a task, because its actions are more guided and less random, it also reduces the generalizability of the algorithm, since the intermediate rewards must be tweaked for each individual task.<ref>https://arxiv.org/abs/1903.02020</ref> A minimal sketch of a shaped reward follows this list.
* '''Auxiliary reward signals'''
* '''Curiosity-driven exploration'''
* '''Hindsight experience replay'''
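The following is a minimal sketch of reward shaping as described in the first item above, assuming a hypothetical Breakout-like game that reports named events; the event names and reward values are invented for illustration.
<syntaxhighlight lang="python">
# Shaped reward for a hypothetical Breakout-like game: a small
# intermediate reward for each brick broken, instead of a single
# reward on level completion.
def shaped_reward(event):
    if event == "brick_broken":
        return 0.1   # task-specific intermediate reward
    if event == "level_complete":
        return 1.0   # the original sparse reward
    return 0.0

# Hand-tuning these values to one game is what makes shaped
# rewards less generalizable across tasks.
events = ["brick_broken", "brick_broken", "level_complete"]
total = sum(shaped_reward(e) for e in events)  # 1.2
</syntaxhighlight>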
== Generalization ==