where <math>\gamma</math> is the discount factor satisfying <math>0 \le \gamma \le 1</math>, which is usually close to <math>1</math> (for example, <math>\gamma = 1/(1+r)</math> for some discount rate <math>r</math>). A lower discount factor motivates the decision maker to favor taking actions early rather than postponing them indefinitely.
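As a minimal illustrative sketch (the reward values and discount factors below are hypothetical, chosen only to show the effect of <math>\gamma</math>), the discounted return of a finite reward sequence can be computed as follows:

<syntaxhighlight lang="python">
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence (hypothetical example values)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 10.0]          # a single reward received after three steps
print(discounted_return(rewards, 0.99))  # ~9.70: with gamma close to 1 the delayed reward keeps nearly full value
print(discounted_return(rewards, 0.50))  # 1.25: with a low gamma the same delayed reward counts far less
</syntaxhighlight>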
Another possible, but closely related, objective that is commonly used is the <math>H</math>-step return. Instead of applying a discount factor <math>\gamma</math>, the agent considers only the first <math>H</math> steps of the process, with each reward given equal weight:
:<math>E\left[\sum^{H-1}_{t=0} {R_{a_t} (s_t, s_{t+1})}\right],</math>
where the actions are chosen by the policy, <math>a_t = \pi(s_t)</math>, the expectation is taken over the transitions <math>s_{t+1} \sim P_{a_t}(s_t, \cdot)</math>, and <math>H</math> is the time horizon. Compared with the discounted objective, this formulation is more commonly used in [[Learning Theory]].
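The expectation in the <math>H</math>-step return can be approximated by simulation. The following is a minimal sketch, assuming hypothetical functions <code>sample_next_state(s, a)</code>, which draws <math>s_{t+1} \sim P_{a}(s, \cdot)</math>, and <code>reward(s, a, s_next)</code>, which returns <math>R_{a}(s, s')</math>; neither name is defined in the article:

<syntaxhighlight lang="python">
def h_step_return(policy, sample_next_state, reward, s0, H, num_episodes=1000):
    """Monte Carlo estimate of E[sum_{t=0}^{H-1} R_{a_t}(s_t, s_{t+1})]
    when actions are chosen as a_t = policy(s_t), starting from state s0."""
    total = 0.0
    for _ in range(num_episodes):
        s, episode_return = s0, 0.0
        for _ in range(H):
            a = policy(s)                           # a_t = pi(s_t)
            s_next = sample_next_state(s, a)        # s_{t+1} ~ P_{a_t}(s_t, .)
            episode_return += reward(s, a, s_next)  # every step has equal weight
            s = s_next
        total += episode_return
    return total / num_episodes
</syntaxhighlight>

Weighting the <math>t</math>-th term by <math>\gamma^t</math> instead of <math>1</math> would give a truncated Monte Carlo estimate of the discounted objective above.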
A policy that maximizes the function above is called an ''<dfn>optimal policy</dfn>'' and is usually denoted <math>\pi^*</math>. A particular MDP may have multiple distinct optimal policies. Because of the Markov property, it can be shown that the optimal policy is a function of the current state, as assumed above.