=== Reinforcement Learning ===
[[File:Markov diagram v2.svg|alt=Diagram explaining the loop recurring in reinforcement learning algorithms|thumb|Diagram of the loop recurring in reinforcement learning algorithms]][[Reinforcement learning]] is a process in which an agent learns to perform actions through trial and error.
[https://arxiv.org/abs/2001.00119]
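A minimal sketch of this interaction loop, assuming a hypothetical environment object with <code>reset</code> and <code>step</code> methods in the style of common reinforcement learning toolkits:

<syntaxhighlight lang="python">
# Sketch of the agent-environment loop; `env` and `choose_action` are
# hypothetical stand-ins, not part of any specific library.
def run_episode(env, choose_action):
    state = env.reset()                          # observe the initial state
    done = False
    total_reward = 0.0
    while not done:
        action = choose_action(state)            # the agent selects an action
        state, reward, done = env.step(action)   # the environment returns a new state and a reward
        total_reward += reward                   # the reward is the trial-and-error signal
    return total_reward
</syntaxhighlight>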
=== Deep Learning ===
== Training ==
In order to have a functional agent, the algorithm must be trained toward a specific goal. There are different techniques used to train agents, each with its own benefits.
==== Q-Learning ====
[[Q-learning]] attempts to determine the optimal action given a specific state. The way this method determines the Q value, or quality of an action, can be loosely defined by a function taking in a state <math>s</math> and an action <math>a</math> and outputting the perceived quality of that action:

<math>Q(s,a)</math>
The training process of Q-learning involves exploring different actions and updating a table of q values that correspond to states and actions. Once the agent is sufficiently trained, the table should provide a good representation of the quality of actions given their state.
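An illustrative sketch of such a tabular q-value update (assuming standard Q-learning hyperparameters such as a learning rate and discount factor, which are not specified above):

<syntaxhighlight lang="python">
# Illustrative tabular Q-learning update; alpha (learning rate) and gamma
# (discount factor) are assumed hyperparameters used here for demonstration only.
from collections import defaultdict

q_table = defaultdict(float)  # maps (state, action) pairs to q values

def update_q(state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    # Best q value obtainable from the next state over the available actions.
    best_next = max(q_table[(next_state, a)] for a in actions)
    # Move the stored q value toward the observed reward plus the discounted future value.
    q_table[(state, action)] += alpha * (reward + gamma * best_next - q_table[(state, action)])
</syntaxhighlight>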
==== Exploration-exploitation dilemma ====
The exploration-exploitation dilemma is the problem of deciding whether to pursue actions that are already known to yield success (exploitation) or to try other actions whose outcomes are uncertain, in the hope of discovering greater success (exploration).
In the greedy learning policy, the agent always chooses the action that maximizes the q value for its current state:
<math>a = \operatorname{argmax}_a Q(s,a)</math>
With this policy, the agent may get stuck in a local maximum of success and never discover possibly greater success, because it focuses only on maximizing the q value given its current knowledge of the quality of actions.
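A sketch of greedy action selection, reusing the hypothetical <code>q_table</code> from the Q-learning sketch above:

<syntaxhighlight lang="python">
# Greedy policy: always pick the action with the highest stored q value.
def greedy_action(state, actions):
    return max(actions, key=lambda a: q_table[(state, a)])
</syntaxhighlight>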
In the epsilon-greedy method of training, before each action the agent decides whether to prioritize exploration, taking an action with an uncertain outcome in order to gain new knowledge, or exploitation, picking the action that maximizes the q value. Each time an action is to be chosen, a random number between zero and one is drawn. If this value is less than or equal to the specified value of epsilon, the agent chooses a random action to prioritize exploration; otherwise it selects the action that maximizes the q value.
<math>a = \begin{cases} \operatorname{rand}(a_n) & \operatorname{rand}(0,1) \leq \epsilon \\ \operatorname{argmax}_a Q(s,a) & \text{otherwise} \end{cases}</math>
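A sketch of epsilon-greedy selection consistent with the expression above, again reusing the hypothetical <code>q_table</code>:

<syntaxhighlight lang="python">
import random

# Epsilon-greedy policy: with probability epsilon take a random action (explore),
# otherwise take the action with the highest stored q value (exploit).
def epsilon_greedy_action(state, actions, epsilon=0.1):
    if random.random() <= epsilon:
        return random.choice(list(actions))                   # explore
    return max(actions, key=lambda a: q_table[(state, a)])   # exploit
</syntaxhighlight>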
https://search-proquest-com.proxy.library.ucsb.edu:9443/docview/1136383063?accountid=14522