Markov decision process: Difference between revisions

Line 2:
'''Markov decision process''' ('''MDP'''), also called a [[Stochastic dynamic programming|stochastic dynamic program]] or stochastic control problem, is a model for [[sequential decision making]] when [[Outcome (probability)|outcomes]] are uncertain.<ref>{{Cite book |last=Puterman |first=Martin L. |title=Markov decision processes: discrete stochastic dynamic programming |date=1994 |publisher=Wiley |isbn=978-0-471-61977-2 |series=Wiley series in probability and mathematical statistics. Applied probability and statistics section |___location=New York}}</ref>
 
Originating from [[operations research]] in the 1950s,<ref>{{Cite book |last1=Schneider |first1=S. |last2=Wagner |first2=D. H. |date=1957-02-26 |chapter=Error detection in redundant systems |title=Papers presented at the February 26-28, 1957, western joint computer conference: Techniques for reliability |series=IRE-AIEE-ACM '57 (Western) |chapter-url=https://dl.acm.org/doi/10.1145/1455567.1455587 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=115–121 |doi=10.1145/1455567.1455587 |isbn=978-1-4503-7861-1}}</ref><ref>{{Cite journal |last=Bellman |first=Richard |date=1958-09-01 |title=Dynamic programming and stochastic control processes |url=https://linkinghub.elsevier.com/retrieve/pii/S0019995858800030 |journal=Information and Control |volume=1 |issue=3 |pages=228–239 |doi=10.1016/S0019-9958(58)80003-0 |issn=0019-9958}}</ref> MDPs have since gained recognition in a variety of fields, including [[ecology]], [[economics]], [[Health care|healthcare]], [[telecommunications]] and [[reinforcement learning]].<ref name=":0">{{Cite book |last1=Sutton |first1=Richard S. |title=Reinforcement learning: an introduction |last2=Barto |first2=Andrew G. |date=2018 |publisher=The MIT Press |isbn=978-0-262-03924-6 |edition=2nd |series=Adaptive computation and machine learning series |___location=Cambridge, Massachusetts}}</ref> Reinforcement learning utilizes the MDP framework to model the interaction between a learning agent and its environment. In this framework, the interaction is characterized by states, actions, and rewards. The MDP framework is designed to provide a simplified representation of key elements of [[artificial intelligence]] challenges. These elements encompass the understanding of [[Causality|cause and effect]], the management of uncertainty and nondeterminism, and the pursuit of explicit goals.<ref name=":0" />
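The framework can be made concrete with a small data structure. The following is a minimal illustrative sketch of a finite MDP written as a tuple of states, actions, transition probabilities, rewards, and a discount factor; the field names and the toy two-state example are assumptions chosen for exposition, not drawn from the sources above.
<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class FiniteMDP:
    """Illustrative finite MDP: states, actions, transitions, rewards, discount."""
    states: List[str]
    actions: List[str]
    # transitions[(s, a)] maps each next state s' to Pr(s' | s, a)
    transitions: Dict[Tuple[str, str], Dict[str, float]]
    # rewards[(s, a)] is the expected immediate reward for taking a in s
    rewards: Dict[Tuple[str, str], float]
    gamma: float  # discount factor, 0 <= gamma < 1

# A made-up two-state, two-action example.
toy = FiniteMDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    transitions={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "move"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s0": 0.9, "s1": 0.1},
    },
    rewards={("s0", "stay"): 0.0, ("s0", "move"): 1.0,
             ("s1", "stay"): 2.0, ("s1", "move"): 0.0},
    gamma=0.9,
)
</syntaxhighlight>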
 
The name comes from its connection to [[Markov chain|Markov chains]], a concept developed by the Russian mathematician [[Andrey Markov]]. The "Markov" in "Markov decision process" refers to the underlying structure of [[Transition system|state transitions]] that still follow the [[Markov property]]. The process is called a "decision process" because it involves making decisions that influence these state transitions, extending the concept of a Markov chain into the realm of decision-making under uncertainty.
Line 105:
| volume=12
| issue=3
| pages=441–450 | doi=10.1287/moor.12.3.441
| access-date=November 2, 2023| hdl=1721.1/2893
| hdl-access=free
}}</ref> However, due to the [[curse of dimensionality]], the size of the problem representation is often exponential in the number of state and action variables, limiting exact solution techniques to problems that have a compact representation. In practice, online planning techniques such as [[Monte Carlo tree search]] can find useful solutions in larger problems, and, in theory, it is possible to construct online planning algorithms that can find an arbitrarily near-optimal policy with no computational complexity dependence on the size of the state space.<ref>{{cite journal|last1=Kearns|first1=Michael|last2=Mansour|first2=Yishay|last3=Ng|first3=Andrew|date=November 2002|title=A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes|url=https://link.springer.com/article/10.1023/A:1017932429737|journal=Machine Learning|volume=49|issue=2/3 |pages=193–208 |doi=10.1023/A:1017932429737|access-date=November 2, 2023|doi-access=free}}</ref>
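As an illustration of such online planning, the following is a minimal sketch of sparse-sampling look-ahead in the spirit of Kearns, Mansour and Ng (2002), not the authors' implementation: the planner only queries a generative model, so its cost depends on the sampling width and horizon rather than on the number of states. The two-state generative model, action set, discount factor, and parameter values are illustrative assumptions.
<syntaxhighlight lang="python">
import random

ACTIONS = ["stay", "move"]
GAMMA = 0.9

def simulate(state, action):
    """Toy generative model: returns a sampled (next_state, reward) pair."""
    if action == "move":
        next_state = "s1" if random.random() < 0.8 else state
    else:
        next_state = state
    reward = 1.0 if next_state == "s1" else 0.0
    return next_state, reward

def sampled_value(state, horizon, width):
    """Estimate the optimal value of `state` from a sampled look-ahead tree."""
    if horizon == 0:
        return 0.0
    return max(sampled_q(state, a, horizon, width) for a in ACTIONS)

def sampled_q(state, action, horizon, width):
    """Average `width` one-step samples, each followed by a shallower estimate."""
    total = 0.0
    for _ in range(width):
        next_state, reward = simulate(state, action)
        total += reward + GAMMA * sampled_value(next_state, horizon - 1, width)
    return total / width

def plan(state, horizon=3, width=4):
    """Return the action with the highest sampled look-ahead value."""
    return max(ACTIONS, key=lambda a: sampled_q(state, a, horizon, width))

print(plan("s0"))  # expected to prefer "move" in this toy model
</syntaxhighlight>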
 
==Extensions and generalizations==
Line 148:
 
====Discrete space: Linear programming formulation====
If the state space and action space are finite, we could use linear programming to find the optimal policy, which was one of the earliest approaches applied. Here we only consider the ergodic model, which means our continuous-time MDP becomes an [[Ergodicity|ergodic]] continuous-time Markov chain under a stationary [[policy]]. Under this assumption, although the decision maker can make a decision at any time in the current state, there is no benefit in taking multiple actions. It is better to take an action only at the time when the system is transitioning from the current state to another state. Under some conditions,<ref>{{Cite book |url=https://link.springer.com/book/10.1007/978-3-642-02547-1 |title=Continuous-Time Markov Decision Processes |series=Stochastic Modelling and Applied Probability |date=2009 |volume=62 |language=en |doi=10.1007/978-3-642-02547-1|isbn=978-3-642-02546-4 }}</ref> if our optimal value function <math>V^*</math> is independent of state <math>i</math>, we will have the following inequality:
:<math>g\geq R(i,a)+\sum_{j\in S}q(j\mid i,a)h(j) \quad \forall i \in S \text{ and } a \in A(i)</math>
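As a numerical illustration (not part of the cited formulation), the smallest such <math>g</math> can be found by treating <math>g</math> and <math>h</math> as decision variables of a linear program; the toy transition rates, rewards, and the use of <code>scipy.optimize.linprog</code> below are assumptions made for the sketch.
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import linprog

# Find the smallest g (and a bias function h) satisfying
#   g >= R(i, a) + sum_j q(j | i, a) * h(j)   for all states i and actions a,
# for a made-up continuous-time MDP.  Decision variables: x = [g, h(0), ..., h(n-1)].
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Transition-rate matrices q[a][i, j]: off-diagonal rates >= 0, rows sum to 0.
q = rng.uniform(0.1, 1.0, size=(n_actions, n_states, n_states))
for a in range(n_actions):
    np.fill_diagonal(q[a], 0.0)
    np.fill_diagonal(q[a], -q[a].sum(axis=1))

R = rng.uniform(0.0, 5.0, size=(n_states, n_actions))  # expected reward rates

# linprog solves min c^T x subject to A_ub @ x <= b_ub, so rewrite each
# constraint as  -g + sum_j q(j|i,a) h(j) <= -R(i,a).
c = np.zeros(1 + n_states)
c[0] = 1.0  # minimize g
A_ub, b_ub = [], []
for i in range(n_states):
    for a in range(n_actions):
        A_ub.append(np.concatenate(([-1.0], q[a][i, :])))
        b_ub.append(-R[i, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * (1 + n_states))
print("smallest feasible g:", res.x[0])
</syntaxhighlight>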
If there exists a function <math>h</math> satisfying the above inequality, then <math>\bar V^*</math> will be the smallest such <math>g</math>. In order to find <math>\bar V^*</math>, we could use the following linear programming model: