:<math>E\left[\sum^{\infty}_{t=0} {\gamma^t R_{a_t} (s_t, s_{t+1})}\right]</math> (where the actions are given by the policy, <math>a_t = \pi(s_t)</math>, and the expectation is taken over <math>s_{t+1} \sim P_{a_t}(s_t,s_{t+1})</math>)
where <math>\gamma</math> is the discount factor satisfying <math>0 \le \gamma \le 1</math>, which is usually close to <math>1</math> (for example, <math>\gamma = 1/(1+r)</math> for some discount rate <math>r</math>). A lower discount factor motivates the decision maker to favor taking actions early, rather than postponing them indefinitely.
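As a minimal illustration (not part of the formal definition), the following Python sketch computes the discounted return of a finite reward sequence and shows how a lower <math>\gamma</math> shrinks the contribution of later rewards; the function name <code>discounted_return</code> and the sample rewards are hypothetical.

<syntaxhighlight lang="python">
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# The same reward stream is worth less under a lower discount factor:
rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
</syntaxhighlight>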
Another possible, but closely related, objective that is commonly used is the <math>H</math>-step return. This time, instead of using a discount factor <math>\gamma</math>, the agent is interested only in the first <math>H</math> steps of the process, with each reward having the same weight; it can be written as shown below.
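In the notation of the discounted objective above, this corresponds to

:<math>E\left[\sum^{H-1}_{t=0} {R_{a_t} (s_t, s_{t+1})}\right]</math>

with the actions again chosen according to the policy, <math>a_t = \pi(s_t)</math>.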