Content deleted Content added
→See also: already linked in the article |
some copy-editing |
||
Line 2:
'''Stochastic approximation''' methods are a family of [[Iterative methods|iterative methods]] typically used for [[Root-finding|root-finding]] problems or for [[optimization problem|optimization]] problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating [[Extremum|extreme values]] of functions which cannot be computed directly, but only estimated via noisy observations.
For instance in engineering many optimization problems are often of this type when you do not have a mathematical model of the system (which can be too complex) but still would like to optimize its behavior by adjusting certain parameters. For this purpose, you can do experiments or run simulations to evaluate the performance of the system at given values of the parameters. Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.<ref>{{cite web|last1=Le Ny|first1=Jerome|title=Introduction to Stochastic Approximation Algorithms|url=http://www.professeurs.polymtl.ca/jerome.le-ny/teaching/DP_fall09/notes/lec11_SA.pdf|website=Polytechnique Montreal|publisher=Teaching Notes|accessdate=16 November 2016}}</ref>
In a nutshell, stochastic approximation algorithms deal with a function of the form <math display="inline"> f(\theta) = \operatorname E_{\xi} [F(\theta,\xi)] </math>
which is the expected value of a function depending on a [[random variable]] <math display="inline">\xi </math>. The goal is to recover properties of such a function <math display="inline">f</math> without evaluating it directly. Instead, stochastic approximation algorithms use random samples of <math display="inline">F(\theta,\xi)</math> to efficiently approximate properties of <math display="inline">f</math> such as zeros or extrema.
The earliest, and prototypical, algorithms of this kind are the '''
==Robbins–Monro algorithm==
Line 24:
===Complexity results===
#If <math display="inline">f(\theta)</math> is twice continuously differentiable, and strongly convex, and the minimizer of <math display="inline">f(\theta)</math> belongs to the interior of <math display="inline">\Theta</math>, then the
# Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin <ref name="NYcomp">Problem Complexity and Method Efficiency in Optimization, A. Nemirovski and D. Yudin, ''Wiley -Intersci. Ser. Discrete Math'' '''15''' ''John Wiley'' ''New York'' (1983) .</ref> have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is <math display="inline">O(1/\sqrt{n})</math>. They have also proven that this rate cannot be improved.
===Subsequent developments and
While the
Chung<ref>{{Cite journal|last=Chung|first=K. L.|date=1954-09-01|title=On a Stochastic Approximation Method|journal=The Annals of Mathematical Statistics|language=EN|volume=25|issue=3|pages=463–483|doi=10.1214/aoms/1177728716|issn=0003-4851}}</ref>(1954) and Fabian<ref>{{Cite journal|last=Fabian|first=Vaclav|date=1968-08-01|title=On Asymptotic Normality in Stochastic Approximation|journal=The Annals of Mathematical Statistics|language=EN|volume=39|issue=4|pages=1327–1332|doi=10.1214/aoms/1177698258|issn=0003-4851}}</ref>(1968) showed that we would achieve optimal convergence rate <math display="inline">O(1/\sqrt{n})</math> with <math display="inline">a_n=\bigtriangledown^2f(\theta^*)^{-1}/n</math> (or <math display="inline">a_n=\frac{1}{(nM'(\theta^*))}</math>). Lai and Robbins<ref>{{Cite journal|last=Lai|first=T. L.|last2=Robbins|first2=Herbert|date=1979-11-01|title=Adaptive Design and Stochastic Approximation|journal=The Annals of Statistics|language=EN|volume=7|issue=6|pages=1196–1221|doi=10.1214/aos/1176344840|issn=0090-5364}}</ref><ref>{{Cite journal|last=Lai|first=Tze Leung|last2=Robbins|first2=Herbert|date=1981-09-01|title=Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes|journal=Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete|language=en|volume=56|issue=3|pages=329–360|doi=10.1007/BF00536178|issn=0044-3719}}</ref> designed adaptive procedures to estimate <math display="inline">M'(\theta^*)</math> such that <math display="inline">\theta_n</math> has minimal asymptotic variance. However the application of such optimal methods requires much a priori information which is hard to obtain in most situations. To overcome this shortfall, Polyak<ref>{{Cite journal|last=Polyak|first=B T|date=1990-01-01|title=New stochastic approximation type procedures. (In Russian.)|url=https://www.researchgate.net/publication/236736759|volume=7|issue=7}}</ref>(1991) and Ruppert<ref>{{Cite journal|last=Ruppert|first=D.|title=Efficient estimators from a slowly converging robbins-monro process|url=https://www.researchgate.net/publication/242608650}}</ref>(1988) independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky<ref name="pj">{{Cite journal | last1 = Polyak | first1 = B. T. | last2 = Juditsky | first2 = A. B. | doi = 10.1137/0330046 | title = Acceleration of Stochastic Approximation by Averaging | journal = SIAM Journal on Control and Optimization | volume = 30 | issue = 4 | pages = 838 | year = 1992 | pmid = | pmc = }}</ref> also presented a method of accelerating
'''''A1)''''' ''<math display="block"> a_n \rightarrow 0, \qquad \frac{a_n - a_{n+1}}{a_n} = o(a_n)</math>
Therefore, the sequence <math display="inline">a_n = n^{-\alpha}</math> with <math display="inline">0 < \alpha < 1</math> satisfies this restriction, but <math display="inline">\alpha = 1</math> does not, hence the longer steps. Under the assumptions outlined in the
A more general result is given in Chapter 11 of Kushner and Yin<ref>{{Cite book|url=https://www.springer.com/us/book/9780387008943|title=Stochastic Approximation and Recursive Algorithms and {{!}} Harold Kushner {{!}} Springer|website=www.springer.com|access-date=2016-05-16|isbn=9780387008943|last1=Kushner|first1=Harold|last2=George Yin|first2=G.|date=2003-07-17}}</ref> by defining interpolated time <math display="inline">t_n=\sum_{i=0}^{n-1}a_i</math>, interpolated process <math display="inline">\theta^n(\cdot)</math> and interpolated normalized process <math display="inline">U^n(\cdot)</math> as
Line 53:
Suppose we want to solve the following stochastic optimization problem
<math display="block">g(\theta^*) = \min_{\theta\in\Theta}\operatorname{E}[Q(\theta,X)],</math>where <math display="inline">g(\theta) = \operatorname{E}[Q(\theta,X)]</math> is differentiable and convex, then this problem is equivalent to find the root <math>\theta^*</math> of <math>\nabla g(\theta) = 0</math>. Here <math>Q(\theta,X)</math> can be interpreted as some "observed" cost as a function of the chosen <math>\theta</math> and random effects <math>X</math>. In practice, it might be hard to get an analytical form of <math>\nabla g(\theta)</math>,
</math> , in which the conditional expectation of <math>X_n
Line 118:
Suppose <math>Q(\theta, X) = f(\theta) + \theta^T X</math>, where <math>f</math> is differentiable and <math>X\in \mathbb{R}^p</math> is a random variable independent of <math>\theta</math>. Then <math>g(\theta)=\operatorname{E}[Q(\theta,X)] = f(\theta)+\theta^T\operatorname{E}X</math> depends on the mean of <math>X</math>, and the stochastic gradient method would be appropriate in this problem. We can choose <math>H(\theta, X) = \frac{\partial}{\partial\theta}Q(\theta,X) = \frac{\partial}{\partial\theta}f(\theta) + X.</math>
==
The
::<math> x_{n+1} = x_n + a_n \bigg(\frac{N(x_n + c_n) - N(x_n -c_n)}{2 c_n} \bigg) </math>
where <math>N(x_n+c_n)</math> and <math>N(x_n-c_n)</math> are independent, and the gradient of <math>M(x)</math> is approximated using finite differences. The sequence <math>\{c_n\}</math> specifies the sequence of finite difference widths used for the gradient approximation, while the sequence <math>\{a_n\}</math> specifies a sequence of positive step sizes taken along that direction. Kiefer and Wolfowitz proved that, if <math>M(x)</math> satisfied certain regularity conditions, then <math>x_n</math> will converge to <math>\theta</math> in probability as <math>n\to\infty
Line 139:
===Subsequent developments and important issues===
#The Kiefer Wolfowitz algorithm requires that for each gradient computation, at least <math>d+1</math> different parameter values must be simulated for every iteration of the algorithm, where <math>d </math> is the dimension of the search space. This means that when <math>d</math> is large, the
## To address this problem, Spall proposed the use of [[Simultaneous perturbation stochastic approximation|simultaneous perturbations]] to estimate the gradient. This method would require only two simulations per iteration, regardless of the dimension <math>d</math>.<ref name = "Jsp">{{Cite journal | last1 = Spall | first1 = J. C. | title = Adaptive stochastic approximation by the simultaneous perturbation method | doi = 10.1109/TAC.2000.880982 | journal = IEEE Transactions on Automatic Control | volume = 45 | issue = 10 | pages = 1839–1853 | year = 2000 | pmid = | pmc = }}</ref>
#In the conditions required for convergence, the ability to specify a predetermined compact set that fulfills strong convexity (or concavity) and contains the unique solution can be difficult to find. With respect to real world applications, if the ___domain is quite large, these assumptions can be fairly restrictive and highly unrealistic.
Line 149:
stochastic approximation to [[Robust statistics|robust]] [[estimation]].<ref>{{Cite journal | last1 = Martin | first1 = R. | last2 = Masreliez | first2 = C. | doi = 10.1109/TIT.1975.1055386 | title = Robust estimation via stochastic approximation | journal = IEEE Transactions on Information Theory | volume = 21 | issue = 3 | pages = 263 | year = 1975 | pmid = | pmc = }}</ref>
The main tool for analyzing stochastic approximations algorithms (including the
==See also==
|