Markov decision process: Difference between revisions

top: better describe the effect of the discount factor, briefly
Line 26:
:<math>E\left[\sum^{\infty}_{t=0} {\gamma^t R_{a_t} (s_t, s_{t+1})}\right]</math> (where we choose <math>a_t = \pi(s_t)</math>, i.e. actions given by the policy, and the expectation is taken over <math>s_{t+1} \sim P_{a_t}(s_t, s_{t+1})</math>)
 
where <math>\gamma</math> is the discount factor satisfying <math>0 \le \gamma \le 1</math>, which is usually close to <math>1</math> (for example, <math>\gamma = 1/(1+r)</math> for some discount rate <math>r</math>). A lower discount factor makes the decision maker more short-sighted, in that it comparatively disregards the effect that following its current policy has at times lying further in the future.
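For instance, with <math>\gamma = 0.9</math> a reward received at step <math>t = 10</math> is weighted by <math>\gamma^{10} \approx 0.35</math>, whereas with <math>\gamma = 0.5</math> the same reward is weighted by <math>\gamma^{10} \approx 0.001</math>, so rewards that far in the future contribute almost nothing to the objective.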
 
Another possible, but closely related, objective that is commonly used is the <math>H</math>-step return. In this case, instead of using a discount factor <math>\gamma</math>, the agent is interested only in the first <math>H</math> steps of the process, with each reward having the same weight; the corresponding formula is given below.
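In the notation above, the <math>H</math>-step return under a policy <math>\pi</math> (again with <math>a_t = \pi(s_t)</math>) can be written as

:<math>E\left[\sum^{H-1}_{t=0} R_{a_t} (s_t, s_{t+1})\right],</math>

which corresponds to the discounted objective with <math>\gamma = 1</math> and the sum truncated after the first <math>H</math> steps.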