:<math>E\left[\sum^{\infty}_{t=0} {\gamma^t R_{a_t} (s_t, s_{t+1})}\right]</math> (where the actions are given by the policy, <math>a_t = \pi(s_t)</math>, and the expectation is taken over <math>s_{t+1} \sim P_{a_t}(s_t,s_{t+1})</math>)
where <math>\gamma</math> is the discount factor satisfying <math>0 \le \gamma \le 1</math>, which is usually close to <math>1</math> (for example, <math>\gamma = 1/(1+r)</math> for some discount rate <math>r</math>). A lower discount factor motivates the decision maker to favor taking actions early, rather than postponing them indefinitely.
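As a minimal illustration (not part of the formal definition), the following Python sketch computes the discounted return of a finite reward sequence and shows how a lower <math>\gamma</math> shrinks the contribution of later rewards; the function name <code>discounted_return</code> and the sample rewards are hypothetical.

<syntaxhighlight lang="python">
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# The same reward stream is worth less under a lower discount factor:
rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
</syntaxhighlight>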
Another possible, but closely related, objective that is commonly used is the <math>H</math>-step return. This time, instead of using a discount factor <math>\gamma</math>, the agent is interested only in the first <math>H</math> steps of the process, with each reward having the same weight; it can be written as shown below.
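In the notation of the discounted objective above, this corresponds to

:<math>E\left[\sum^{H-1}_{t=0} {R_{a_t} (s_t, s_{t+1})}\right]</math>

with the actions again chosen according to the policy, <math>a_t = \pi(s_t)</math>.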