'''Markov decision process''' ('''MDP'''), also called a [[Stochastic dynamic programming|stochastic dynamic program]] or stochastic control problem, is a model for [[sequential decision making]] when [[Outcome (probability)|outcomes]] are uncertain.<ref>{{Cite book |last=Puterman |first=Martin L. |title=Markov decision processes: discrete stochastic dynamic programming |date=1994 |publisher=Wiley |isbn=978-0-471-61977-2 |series=Wiley series in probability and mathematical statistics. Applied probability and statistics section |___location=New York}}</ref>
Originating from [[operations research]] in the 1950s, MDPs have since found applications in many fields, including [[ecology]], [[economics]], [[healthcare]], [[telecommunications]] and [[reinforcement learning]].
The name comes from its connection to [[Markov chain|Markov chains]], a concept developed by the Russian mathematician [[Andrey Markov]]. The "Markov" in "Markov decision process" refers to the underlying structure of [[Transition system|state transitions]] that still follow the [[Markov property]]. The process is called a "decision process" because it involves making decisions that influence these state transitions, extending the concept of a Markov chain into the realm of decision-making under uncertainty.
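In a finite MDP, the model is specified by the states, the actions available in each state, the probability of each successor state given a state and action, and the reward received. The following minimal sketch (the two-state example and all names in it are illustrative, not taken from the cited sources) shows one way such a model can be written down:
<syntaxhighlight lang="python">
# Illustrative two-state MDP: in each state the agent may "wait" or "repair".
# transitions[state][action] is a list of (probability, next_state, reward) triples.
# The probabilities for each state-action pair sum to 1, and the successor state
# depends only on the current state and action (the Markov property).
transitions = {
    "working": {
        "wait":   [(0.9, "working", 1.0), (0.1, "broken", 1.0)],
        "repair": [(1.0, "working", 0.5)],
    },
    "broken": {
        "wait":   [(1.0, "broken", 0.0)],
        "repair": [(0.6, "working", -1.0), (0.4, "broken", -1.0)],
    },
}
</syntaxhighlight>
A policy assigns an action to each state; fixing a policy turns the model into an ordinary Markov chain on the states.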
Algorithms for finding optimal policies with time complexity polynomial in the size of the problem representation exist for finite MDPs.<ref>{{cite journal |last1=Papadimitriou |first1=Christos H. |last2=Tsitsiklis |first2=John N. |year=1987 |title=The Complexity of Markov Decision Processes |journal=Mathematics of Operations Research |volume=12 |issue=3 |pages=441–450 |doi=10.1287/moor.12.3.441 |hdl=1721.1/2893 |hdl-access=free
}}</ref> However, due to the [[curse of dimensionality]], the size of the problem representation is often exponential in the number of state and action variables, limiting exact solution techniques to problems that have a compact representation. In practice, online planning techniques such as [[Monte Carlo tree search]] can find useful solutions in larger problems, and, in theory, it is possible to construct online planning algorithms that can find an arbitrarily near-optimal policy with no computational complexity dependence on the size of the state space.<ref>{{cite journal |last1=Kearns |first1=Michael |last2=Mansour |first2=Yishay |last3=Ng |first3=Andrew |date=November 2002 |title=A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes |journal=Machine Learning |volume=49 |issue=2–3 |pages=193–208}}</ref>
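The sparse sampling algorithm of Kearns, Mansour and Ng estimates the value of the current state by drawing a fixed number of sampled successors per action from a generative model and recursing to a fixed depth, so the amount of work depends on the sampling width and horizon but not on the number of states. A minimal sketch is given below (the function and parameter names, and the toy generative model at the end, are illustrative assumptions rather than the authors' implementation):
<syntaxhighlight lang="python">
import math
import random

def sparse_sampling(sim, state, actions, depth, width, gamma):
    """Return (estimated optimal value, greedy action) for `state`.

    `sim(state, action)` is assumed to be a generative model returning a
    sampled (next_state, reward) pair; `width` samples are drawn per action
    and the recursion is truncated after `depth` steps.  The number of
    simulator calls grows roughly as (len(actions) * width) ** depth, which
    is large but independent of the size of the state space.
    """
    if depth == 0:
        return 0.0, None
    best_value, best_action = -math.inf, None
    for a in actions:
        total = 0.0
        for _ in range(width):
            next_state, reward = sim(state, a)
            future, _ = sparse_sampling(sim, next_state, actions,
                                        depth - 1, width, gamma)
            total += reward + gamma * future
        if total / width > best_value:
            best_value, best_action = total / width, a
    return best_value, best_action

# Toy generative model (illustrative only): action 0 pushes toward state 0,
# action 1 pushes toward state 1, and reward is earned for reaching state 0.
def sim(state, action):
    next_state = action if random.random() < 0.8 else 1 - action
    return next_state, 1.0 if next_state == 0 else 0.0

value, action = sparse_sampling(sim, 0, actions=[0, 1], depth=3, width=4, gamma=0.9)
</syntaxhighlight>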
==Extensions and generalizations==
====Discrete space: Linear programming formulation====
If the state space and action space are finite, we could use linear programming to find the optimal policy, which was one of the earliest approaches applied. Here we only consider the ergodic model, which means our continuous-time MDP becomes an [[Ergodicity|ergodic]] continuous-time Markov chain under a stationary [[policy]]. Under this assumption, although the decision maker can make a decision at any time in the current state, there is no benefit in taking multiple actions. It is better to take an action only at the time when the system is transitioning from the current state to another state. Under some conditions,<ref>{{Cite book |url=https://link.springer.com/book/10.1007/978-3-642-02547-1 |title=Continuous-Time Markov Decision Processes |series=Stochastic Modelling and Applied Probability |date=2009 |volume=62 |language=en |doi=10.1007/978-3-642-02547-1 |isbn=978-3-642-02546-4}}</ref> if our optimal value function <math>V^*</math> is independent of state <math>i</math>, we will have the following inequality:
:<math>g\geq R(i,a)+\sum_{j\in S}q(j\mid i,a)h(j) \quad \forall i \in S \text{ and } a \in A(i)</math>
If there exists a function <math>h</math> satisfying the above inequality for some <math>g</math>, then <math>\bar V^*</math> will be the smallest such <math>g</math>. In order to find <math>\bar V^*</math>, we could use the following linear programming model, with decision variables <math>g</math> and <math>h(j),\ j\in S</math>:
:<math>\begin{align}
\text{Minimize}\quad & g\\
\text{subject to}\quad & g-\sum_{j\in S}q(j\mid i,a)h(j)\geq R(i,a) \quad \forall i\in S,\ a\in A(i)
\end{align}</math>
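For a small finite instance, this linear program can be solved directly with an off-the-shelf solver. The sketch below (the two-state transition rates and reward rates, and the use of SciPy's <code>linprog</code>, are illustrative assumptions rather than part of the cited treatment) minimizes <math>g</math> over <math>(g,h)</math> subject to the constraints above:
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import linprog

states = [0, 1]
actions = [0, 1]

# q[i][a][j] is the transition rate q(j | i, a); each row of the generator sums to zero.
q = {
    0: {0: np.array([-1.0, 1.0]), 1: np.array([-3.0, 3.0])},
    1: {0: np.array([2.0, -2.0]), 1: np.array([0.5, -0.5])},
}
# R[i][a] is the reward rate for choosing action a in state i.
R = {0: {0: 5.0, 1: 10.0}, 1: {0: -1.0, 1: 1.0}}

# Decision variables x = (g, h(0), h(1)).  Minimize g subject to
#   g - sum_j q(j|i,a) h(j) >= R(i,a)  for every state-action pair,
# written for linprog as  -g + sum_j q(j|i,a) h(j) <= -R(i,a).
c = np.array([1.0, 0.0, 0.0])
A_ub, b_ub = [], []
for i in states:
    for a in actions:
        A_ub.append(np.concatenate(([-1.0], q[i][a])))
        b_ub.append(-R[i][a])

# Only differences h(j) - h(i) enter the constraints, so pin h(1) = 0.
bounds = [(None, None), (None, None), (0.0, 0.0)]
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
g_opt, h = res.x[0], res.x[1:]
print("optimal average reward g:", g_opt, "bias function h:", h)
</syntaxhighlight>
Pinning one component of <math>h</math> removes the additive degree of freedom, since each row of the generator sums to zero and only differences <math>h(j)-h(i)</math> enter the constraints.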