Stochastic approximation: Difference between revisions

Content deleted Content added
Citation bot (talk | contribs)
Alter: url. URLs might have been internationalized/anonymized. Add: s2cid, author pars. 1-1. Removed parameters. Some additions/deletions were actually parameter name changes. | You can use this bot yourself. Report bugs here. | Suggested by AManWithNoPlan | All pages linked from cached copy of User:AManWithNoPlan/sandbox2 | via #UCB_webform_linked 237/1890
Monkbot (talk | contribs)
m Task 18 (cosmetic): eval 20 templates: del empty params (17×); hyphenate params (3×);
Line 10:
 
==Robbins–Monro algorithm==
The Robbins–Monro algorithm, introduced in 1951 by [[Herbert Robbins]] and Sutton Monro,<ref name="rm">{{Cite journal | last1 = Robbins | first1 = H. | authorlinkauthor-link = Herbert Robbins| last2 = Monro | first2 = S. | doi = 10.1214/aoms/1177729586 | title = A Stochastic Approximation Method | journal = The Annals of Mathematical Statistics | volume = 22 | issue = 3 | pages = 400 | year = 1951 | pmid = | pmc = | doi-access = free }}</ref> presented a methodology for solving a root finding problem, where the function is represented as an expected value. Assume that we have a function <math display="inline">M(\theta)</math>, and a constant <math display="inline">\alpha</math>, such that the equation <math display="inline">M(\theta) = \alpha</math> has a unique root at <math display="inline">\theta^*</math>. It is assumed that while we cannot directly observe the function <math display="inline">M(\theta)</math>, we can instead obtain measurements of the random variable <math display="inline">N(\theta)</math> where <math display="inline">\operatorname E[N(\theta)] = M(\theta)</math>. The structure of the algorithm is to then generate iterates of the form:
 
::<math>\theta_{n+1}=\theta_n - a_n(N(\theta_n) - \alpha)</math>
Line 24:
 
===Complexity results===
#If <math display="inline">f(\theta)</math> is twice continuously differentiable, and strongly convex, and the minimizer of <math display="inline">f(\theta)</math> belongs to the interior of <math display="inline">\Theta</math>, then the Robbins–Monro algorithm will achieve the asymptotically optimal convergence rate, with respect to the objective function, being <math display="inline">\operatorname E[f(\theta_n) - f^*] = O(1/n)</math>, where <math display="inline">f^*</math> is the minimal value of <math display="inline">f(\theta)</math> over <math display="inline">\theta \in \Theta</math>.<ref name="jsacks">{{Cite journal | last1 = Sacks | first1 = J. | title = Asymptotic Distribution of Stochastic Approximation Procedures | doi = 10.1214/aoms/1177706619 | journal = The Annals of Mathematical Statistics | volume = 29 | issue = 2 | pages = 373–405 | year = 1958 | jstor = 2237335| pmid = | pmc = | doi-access = free }}</ref><ref name="NJLS">{{Cite journal | last1 = Nemirovski | first1 = A. | authorlink1author-link1 = Arkadi Nemirovski| last2 = Juditsky | first2 = A. | last3 = Lan | first3 = G. | last4 = Shapiro | first4 = A. | title = Robust Stochastic Approximation Approach to Stochastic Programming | doi = 10.1137/070704277 | journal = SIAM Journal on Optimization | volume = 19 | issue = 4 | pages = 1574 | year = 2009 | pmid = | pmc = }}</ref>
# Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin<ref name="NYcomp">Problem Complexity and Method Efficiency in Optimization, A. Nemirovski and D. Yudin, ''Wiley -Intersci. Ser. Discrete Math'' '''15''' ''John Wiley'' ''New York'' (1983) .</ref> have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is <math display="inline">O(1/\sqrt{n})</math>. They have also proven that this rate cannot be improved.
 
Line 30:
While the Robbins–Monro algorithm is theoretically able to achieve <math display="inline"> O(1/n)</math> under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly upon implementation. This is primarily due to the fact that the algorithm is very sensitive to the choice of the step size sequence, and the supposed asymptotically optimal step size policy can be quite harmful in the beginning.<ref name="NJLS" /><ref name="jcsbook">[https://books.google.com/books?id=f66OIvvkKnAC&printsec=frontcover#v=onepage&q=%22Robbins-Monro%22&f=false Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control], J.C. Spall, ''John Wiley'' ''Hoboken, NJ'', (2003).</ref>
 
Chung<ref>{{Cite journal|last=Chung|first=K. L.|date=1954-09-01|title=On a Stochastic Approximation Method|journal=The Annals of Mathematical Statistics|language=EN|volume=25|issue=3|pages=463–483|doi=10.1214/aoms/1177728716|issn=0003-4851|doi-access=free}}</ref>(1954) and Fabian<ref>{{Cite journal|last=Fabian|first=Vaclav|date=1968-08-01|title=On Asymptotic Normality in Stochastic Approximation|journal=The Annals of Mathematical Statistics|language=EN|volume=39|issue=4|pages=1327–1332|doi=10.1214/aoms/1177698258|issn=0003-4851|doi-access=free}}</ref>(1968) showed that we would achieve optimal convergence rate <math display="inline">O(1/\sqrt{n})</math> with <math display="inline">a_n=\bigtriangledown^2f(\theta^*)^{-1}/n</math> (or <math display="inline">a_n=\frac{1}{(nM'(\theta^*))}</math>). Lai and Robbins<ref>{{Cite journal|last1=Lai|first1=T. L.|last2=Robbins|first2=Herbert|date=1979-11-01|title=Adaptive Design and Stochastic Approximation|journal=The Annals of Statistics|language=EN|volume=7|issue=6|pages=1196–1221|doi=10.1214/aos/1176344840|issn=0090-5364|doi-access=free}}</ref><ref>{{Cite journal|last1=Lai|first1=Tze Leung|last2=Robbins|first2=Herbert|date=1981-09-01|title=Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes|journal=Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete|language=en|volume=56|issue=3|pages=329–360|doi=10.1007/BF00536178|s2cid=122109044|issn=0044-3719}}</ref> designed adaptive procedures to estimate <math display="inline">M'(\theta^*)</math> such that <math display="inline">\theta_n</math> has minimal asymptotic variance. However the application of such optimal methods requires much a priori information which is hard to obtain in most situations. To overcome this shortfall, Polyak<ref>{{Cite journal|last=Polyak|first=B T|date=1990-01-01|title=New stochastic approximation type procedures. (In Russian.)|url=https://www.researchgate.net/publication/236736759|volume=7|issue=7}}</ref>(1991) and Ruppert<ref>{{Cite journal|last=Ruppert|first=D.|title=Efficient estimators from a slowly converging robbins-monro process|url=https://www.researchgate.net/publication/242608650}}</ref>(1988) independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky<ref name="pj">{{Cite journal | last1 = Polyak | first1 = B. T. | last2 = Juditsky | first2 = A. B. | doi = 10.1137/0330046 | title = Acceleration of Stochastic Approximation by Averaging | journal = SIAM Journal on Control and Optimization | volume = 30 | issue = 4 | pages = 838 | year = 1992 | pmid = | pmc = }}</ref> also presented a method of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer steps, and averaging of the iterates. The algorithm would have the following structure:<math display="block"> \theta_{n+1} - \theta_n = a_n(\alpha - N(\theta_n)), \qquad \bar{\theta}_n = \frac{1}{n} \sum^{n-1}_{i=0} \theta_i </math>The convergence of <math> \bar{\theta}_n </math> to the unique root <math>\theta^*</math> relies on the condition that the step sequence <math>\{a_n\}</math> decreases sufficiently slowly. That is
 
'''''A1)''''' ''<math display="block"> a_n \rightarrow 0, \qquad \frac{a_n - a_{n+1}}{a_n} = o(a_n)</math>
Line 70:
The following result gives sufficient conditions on <math>\theta_n
 
</math> for the algorithm to converge:<ref>{{Cite book|title=Numerical Methods for stochastic Processes|last1=Bouleau|first1=N.|last2=Lepingle|first2=D.|publisher=John Wiley|year=1994|isbn=9780471546412|___location=New York|pages=|url=https://books.google.com/books?id=9MLL2RN40asC}}</ref>
 
C1) <math>\varepsilon_n \geq 0, \forall\; n\geq 0. </math>
Line 98:
 
==Kiefer–Wolfowitz algorithm==
The Kiefer–Wolfowitz algorithm was introduced in 1952 by [[Jacob Wolfowitz]] and [[Jack_Kiefer_(statistician)|Jack Kiefer]],<ref name = "KW">{{Cite journal | last1 = Kiefer | first1 = J. | last2 = Wolfowitz | first2 = J. | doi = 10.1214/aoms/1177729392 | title = Stochastic Estimation of the Maximum of a Regression Function | journal = The Annals of Mathematical Statistics | volume = 23 | issue = 3 | pages = 462 | year = 1952 | pmid = | pmc = | doi-access = free }}</ref> and was motivated by the publication of the Robbins–Monro algorithm. However, the algorithm was presented as a method which would stochastically estimate the maximum of a function. Let <math>M(x) </math> be a function which has a maximum at the point <math>\theta </math>. It is assumed that <math>M(x)</math> is unknown; however, certain observations <math>N(x)</math>, where <math>\operatorname E[N(x)] = M(x)</math>, can be made at any point <math>x</math>. The structure of the algorithm follows a gradient-like method, with the iterates being generated as follows:
::<math> x_{n+1} = x_n + a_n \bigg(\frac{N(x_n + c_n) - N(x_n -c_n)}{2 c_n} \bigg) </math>
where <math>N(x_n+c_n)</math> and <math>N(x_n-c_n)</math> are independent, and the gradient of <math>M(x)</math> is approximated using finite differences. The sequence <math>\{c_n\}</math> specifies the sequence of finite difference widths used for the gradient approximation, while the sequence <math>\{a_n\}</math> specifies a sequence of positive step sizes taken along that direction. Kiefer and Wolfowitz proved that, if <math>M(x)</math> satisfied certain regularity conditions, then <math>x_n</math> will converge to <math>\theta</math> in probability as <math>n\to\infty
Line 119:
===Subsequent developments and important issues===
#The Kiefer Wolfowitz algorithm requires that for each gradient computation, at least <math>d+1</math> different parameter values must be simulated for every iteration of the algorithm, where <math>d </math> is the dimension of the search space. This means that when <math>d</math> is large, the Kiefer–Wolfowitz algorithm will require substantial computational effort per iteration, leading to slow convergence.
## To address this problem, Spall proposed the use of [[Simultaneous perturbation stochastic approximation|simultaneous perturbations]] to estimate the gradient. This method would require only two simulations per iteration, regardless of the dimension <math>d</math>.<ref name = "Jsp">{{Cite journal | last1 = Spall | first1 = J. C. | title = Adaptive stochastic approximation by the simultaneous perturbation method | doi = 10.1109/TAC.2000.880982 | journal = IEEE Transactions on Automatic Control | volume = 45 | issue = 10 | pages = 1839–1853 | year = 2000 | pmid = | pmc = }}</ref>
#In the conditions required for convergence, the ability to specify a predetermined compact set that fulfills strong convexity (or concavity) and contains the unique solution can be difficult to find. With respect to real world applications, if the ___domain is quite large, these assumptions can be fairly restrictive and highly unrealistic.
 
==Further developments==
An extensive theoretical literature has grown up around these algorithms, concerning conditions for convergence, rates of convergence, multivariate and other generalizations, proper choice of step size, possible noise models, and so on.<ref name="kushneryin">{{Cite book | last1 = Kushner | first1 = H. J. | authorlinkauthor-link=Harold J. Kushner| last2 = Yin | first2 = G. G. | doi = 10.1007/978-1-4899-2696-8 | title = Stochastic Approximation Algorithms and Applications | year = 1997 | isbn = 978-1-4899-2698-2 | pmid = | pmc = }}</ref><ref>''Stochastic Approximation and Recursive Estimation'', Mikhail Borisovich Nevel'son and Rafail Zalmanovich Has'minskiĭ, translated by Israel Program for Scientific Translations and B. Silver, Providence, RI: American Mathematical Society, 1973, 1976. {{isbn|0-8218-1597-0}}.</ref> These methods are also applied in [[control theory]], in which case the unknown function which we wish to optimize or find the zero of may vary in time. In this case, the step size <math>a_n</math> should not converge to zero but should be chosen so as to track the function.<ref name="kushneryin"/><sup>, 2nd ed., chapter 3</sup>
 
[[C. Johan Masreliez]] and [[R. Douglas Martin]] were the first to apply
stochastic approximation to [[Robust statistics|robust]] [[estimation]].<ref>{{Cite journal | last1 = Martin | first1 = R. | last2 = Masreliez | first2 = C. | doi = 10.1109/TIT.1975.1055386 | title = Robust estimation via stochastic approximation | journal = IEEE Transactions on Information Theory | volume = 21 | issue = 3 | pages = 263 | year = 1975 | pmid = | pmc = }}</ref>
 
The main tool for analyzing stochastic approximations algorithms (including the Robbins–Monro and the Kiefer–Wolfowitz algorithms) is a theorem by [[Aryeh Dvoretzky]] published in the proceedings of the third Berkeley symposium on mathematical statistics and probability, 1956.<ref>{{Cite journal|last=Dvoretzky|first=Aryeh|date=1956-01-01|title=On Stochastic Approximation|url=http://projecteuclid.org/euclid.bsmsp/1200501645|language=EN|publisher=The Regents of the University of California}}</ref>