=== Exploration ===
An RL agent must balance the exploration/exploitation tradeoff: deciding whether to pursue actions already known to yield high rewards or to explore other actions that might yield higher rewards. RL agents usually collect data with some type of stochastic policy, such as a [[Boltzmann distribution]] in discrete action spaces or a [[Normal distribution|Gaussian distribution]] in continuous action spaces, which induces basic exploration behavior. Novelty-based, or curiosity-driven, exploration gives the agent an incentive to seek out unknown outcomes in order to find the best solutions. This is done by "modify[ing] the loss function (or even the network architecture) by adding terms to incentivize exploration".<ref>{{cite book|last1=Reizinger|first1=Patrik|last2=Szemenyei|first2=Márton|date=2019-10-23|title=ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)|chapter=Attention-Based Curiosity-Driven Exploration in Deep Reinforcement Learning |pages=3542–3546 |doi=10.1109/ICASSP40776.2020.9054546 |arxiv=1910.10840|isbn=978-1-5090-6631-5 |s2cid=204852215 }}</ref> Exploration can also be aided by demonstrations of successful trajectories, or by reward shaping, which gives the agent intermediate rewards tailored to the task it is attempting to complete.<ref>{{Citation|last=Wiewiora|first=Eric|title=Reward Shaping|date=2010|url=https://doi.org/10.1007/978-0-387-30164-8_731|encyclopedia=Encyclopedia of Machine Learning|pages=863–865|editor-last=Sammut|editor-first=Claude|place=Boston, MA|publisher=Springer US|language=en|doi=10.1007/978-0-387-30164-8_731|isbn=978-0-387-30164-8|access-date=2020-11-16|editor2-last=Webb|editor2-first=Geoffrey I.|url-access=subscription}}</ref>
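The stochastic policies described above can be illustrated with a minimal sketch. It assumes the agent already holds action-value estimates (for the discrete case) or a predicted mean and standard deviation (for the continuous case). The function names, the <code>temperature</code> and <code>beta</code> parameters, and the count-based bonus are illustrative assumptions, not taken from the cited sources. In particular, the count-based bonus is only a simple stand-in for a novelty incentive and is not the attention-based curiosity method of the cited paper.

```python
import math
import random


def boltzmann_probs(q_values, temperature=1.0):
    """Boltzmann (softmax) distribution over action-value estimates.

    A higher temperature flattens the distribution (more exploration);
    a lower temperature concentrates it on the highest-valued action.
    """
    top = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete(q_values, temperature=1.0, rng=random):
    """Sample an action index from the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def sample_continuous(mean, std, rng=random):
    """Sample a scalar continuous action from a Gaussian policy."""
    return rng.gauss(mean, std)


def count_bonus(visit_counts, state, beta=0.1):
    """Hypothetical novelty bonus: beta / sqrt(N(state)).

    Updates the visit count in place; rarely visited states earn a
    larger bonus, which could be added to the environment reward.
    """
    visit_counts[state] = visit_counts.get(state, 0) + 1
    return beta / math.sqrt(visit_counts[state])
```

In this sketch, lowering <code>temperature</code> or <code>std</code> shifts the agent toward exploitation, while the bonus returned by <code>count_bonus</code> shrinks as a state is revisited, so the incentive to explore fades for familiar states.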
=== Off-policy reinforcement learning ===
<ref name="AlphaGo">{{Cite journal|title = Mastering the game of Go with deep neural networks and tree search|journal = [[Nature (journal)|Nature]]| issn= 0028-0836|pages = 484–489|volume = 529|issue = 7587|doi = 10.1038/nature16961|pmid = 26819042|first1 = David|last1 = Silver|author-link1=David Silver (programmer)|first2 = Aja|last2 = Huang|author-link2=Aja Huang|first3 = Chris J.|last3 = Maddison|first4 = Arthur|last4 = Guez|first5 = Laurent|last5 = Sifre|first6 = George van den|last6 = Driessche|first7 = Julian|last7 = Schrittwieser|first8 = Ioannis|last8 = Antonoglou|first9 = Veda|last9 = Panneershelvam|first10= Marc|last10= Lanctot|first11= Sander|last11= Dieleman|first12=Dominik|last12= Grewe|first13= John|last13= Nham|first14= Nal|last14= Kalchbrenner|first15= Ilya|last15= Sutskever|author-link15=Ilya Sutskever|first16= Timothy|last16= Lillicrap|first17= Madeleine|last17= Leach|first18= Koray|last18= Kavukcuoglu|first19= Thore|last19= Graepel|first20= Demis |last20=Hassabis|author-link20=Demis Hassabis|date= 28 January 2016|bibcode = 2016Natur.529..484S|s2cid = 515925}}{{closed access}}</ref>
<ref name="levine2016">{{Cite journal |last1=Levine |first1=Sergey |last2=Finn |first2=Chelsea |author-link2=Chelsea Finn |last3=Darrell |first3=Trevor |last4=Abbeel |first4=Pieter |date=January 2016 |title=End-to-end training of deep visuomotor policies |url=https://www.jmlr.org/papers/volume17/15-389/15-389.pdf |journal=JMLR |volume=17 |arxiv=1504.00702}}</ref>
<ref name="openaihand">{{Cite web|title=OpenAI - Solving Rubik's Cube With A Robot Hand|url=https://openai.com/blog/solving-rubiks-cube/|website=OpenAI|date=5 January 2021 }}</ref>
<ref name="openaihandarxiv">{{Cite conference|title= Solving Rubik's Cube with a Robot Hand |last1=OpenAI |display-authors=etal|date=2019|arxiv=1910.07113 }}</ref>
<ref name="deepmindcooling">{{Cite web|title=DeepMind AI Reduces Google Data Centre Cooling Bill by 40% |url=https://deepmind.com/blog/article/deepmind-ai-reduces-google-data-centre-cooling-bill-40|website=DeepMind|date=14 May 2024 }}</ref>
<ref name="schaul2015uva">{{Cite conference| title=Universal Value Function Approximators|last1=Schaul|first1=Tom |last2=Horgan|first2=Daniel |last3=Gregor|first3=Karol |last4=Silver|first4=David |conference=International Conference on Machine Learning (ICML) |date=2015| url=http://proceedings.mlr.press/v37/schaul15.html}}</ref>
<ref name="muzero">{{cite journal |last1=Schrittwieser |first1=Julian |last2=Antonoglou |first2=Ioannis |last3=Hubert |first3=Thomas |last4=Simonyan |first4=Karen |last5=Sifre |first5=Laurent |last6=Schmitt |first6=Simon |last7=Guez |first7=Arthur |last8=Lockhart |first8=Edward |last9=Hassabis |first9=Demis |last10=Graepel |first10=Thore |last11=Lillicrap |first11=Timothy |last12=Silver |first12=David |title=Mastering Atari, Go, chess and shogi by planning with a learned model |journal=Nature |date=23 December 2020 |volume=588 |issue=7839 |pages=604–609 |doi=10.1038/s41586-020-03051-4 |pmid=33361790 |url=https://www.nature.com/articles/s41586-020-03051-4|arxiv=1911.08265 |bibcode=2020Natur.588..604S |s2cid=208158225 }}</ref>
<ref name="loonrl">{{cite journal |last1=Bellemare |first1=Marc |last2=Candido |first2=Salvatore |last3=Castro |first3=Pablo |last4=Gong |first4=Jun |last5=Machado |first5=Marlos |last6=Moitra |first6=Subhodeep |last7=Ponda |first7=Sameera |last8=Wang |first8=Ziyu |title=Autonomous navigation of stratospheric balloons using reinforcement learning |journal=Nature |date=2 December 2020 |volume=588 |issue=7836 |pages=77–82 |doi=10.1038/s41586-020-2939-8 |pmid=33268863 |bibcode=2020Natur.588...77B |s2cid=227260253 |url=https://www.nature.com/articles/s41586-020-2939-8|url-access=subscription }}</ref>
<ref name="deepirl">{{cite arXiv| last1=Wulfmeier|first1=Markus|last2=Ondruska|first2=Peter|last3=Posner|first3=Ingmar|date=2015|title= Maximum Entropy Deep Inverse Reinforcement Learning |class=cs.LG|eprint=1507.04888}}</ref>
</references>