where <math>\gamma</math> is the discount factor satisfying <math>0 \le \gamma \le 1</math>, which is usually close to <math>1</math> (for example, <math>\gamma = 1/(1+r)</math> for some discount rate <math>r</math>). A lower discount factor motivates the decision maker to favor taking actions early rather than postponing them indefinitely.
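As a minimal illustrative sketch (the reward values and discount factors below are hypothetical, chosen only to show the effect of <math>\gamma</math>), the discounted return of a finite reward sequence can be computed as follows:

<syntaxhighlight lang="python">
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence (hypothetical example values)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 10.0]          # a single reward received after three steps
print(discounted_return(rewards, 0.99))  # ~9.70: with gamma close to 1 the delayed reward keeps nearly full value
print(discounted_return(rewards, 0.50))  # 1.25: with a low gamma the same delayed reward counts far less
</syntaxhighlight>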
Another possible, but closely related, objective that is commonly used is the <math>H</math>-step return. Instead of applying a discount factor <math>\gamma</math>, the agent considers only the first <math>H</math> steps of the process, with each reward given equal weight:
:<math>E\left[\sum^{H-1}_{t=0} {R_{a_t} (s_t, s_{t+1})}\right],</math>
where the actions are chosen by the policy, <math>a_t = \pi(s_t)</math>, the expectation is taken over the transitions <math>s_{t+1} \sim P_{a_t}(s_t, \cdot)</math>, and <math>H</math> is the time horizon. Compared with the discounted objective, this formulation is more commonly used in [[Learning Theory]].
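The expectation in the <math>H</math>-step return can be approximated by simulation. The following is a minimal sketch, assuming hypothetical functions <code>sample_next_state(s, a)</code>, which draws <math>s_{t+1} \sim P_{a}(s, \cdot)</math>, and <code>reward(s, a, s_next)</code>, which returns <math>R_{a}(s, s')</math>; neither name is defined in the article:

<syntaxhighlight lang="python">
def h_step_return(policy, sample_next_state, reward, s0, H, num_episodes=1000):
    """Monte Carlo estimate of E[sum_{t=0}^{H-1} R_{a_t}(s_t, s_{t+1})]
    when actions are chosen as a_t = policy(s_t), starting from state s0."""
    total = 0.0
    for _ in range(num_episodes):
        s, episode_return = s0, 0.0
        for _ in range(H):
            a = policy(s)                           # a_t = pi(s_t)
            s_next = sample_next_state(s, a)        # s_{t+1} ~ P_{a_t}(s_t, .)
            episode_return += reward(s, a, s_next)  # every step has equal weight
            s = s_next
        total += episode_return
    return total / num_episodes
</syntaxhighlight>

Weighting the <math>t</math>-th term by <math>\gamma^t</math> instead of <math>1</math> would give a truncated Monte Carlo estimate of the discounted objective above.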
A policy that maximizes the function above is called an ''<dfn>optimal policy</dfn>'' and is usually denoted <math>\pi^*</math>. A particular MDP may have multiple distinct optimal policies. Because of the Markov property, it can be shown that the optimal policy is a function of the current state, as assumed above.