Content deleted Content added
→See also: already linked in the article |
|||
(48 intermediate revisions by 26 users not shown) | |||
Line 1:
{{Short description|Family of iterative methods}}
'''Stochastic approximation''' methods are a family of [[
For instance in engineering many optimization problems are often of this type when you do not have a mathematical model of the system (which can be too complex) but still would like to optimize its behavior by adjusting certain parameters. For this purpose, you can do experiments or run simulations to evaluate the performance of the system at given values of the parameters. Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.<ref>{{cite web|last1=Le Ny|first1=Jerome|title=Introduction to Stochastic Approximation Algorithms|url=http://www.professeurs.polymtl.ca/jerome.le-ny/teaching/DP_fall09/notes/lec11_SA.pdf|website=Polytechnique Montreal|publisher=Teaching Notes|accessdate=16 November 2016}}</ref>. These methods are used often in statistics and machine learning that typically need to handle noisy measurements of empirical data. They are also related to [[stochastic optimization]] methods and algorithms.▼
In a nutshell, stochastic approximation algorithms deal with a function of the form <math display="inline"> f(\theta) = \operatorname E_{\xi} [F(\theta,\xi)] </math>
which is the [[expected value]] of a function depending on a [[random variable]] <math display="inline">\xi </math>. The goal is to recover properties of such a function <math display="inline">f</math> without evaluating it directly. Instead, stochastic approximation algorithms use random samples of <math display="inline">F(\theta,\xi)</math> to efficiently approximate properties of <math display="inline">f</math> such as zeros or extrema.
Recently, stochastic approximations have found extensive applications in the fields of statistics and machine learning, especially in settings with [[big data]]. These applications range from [[stochastic optimization]] methods and algorithms, to online forms of the [[Expectation–maximization algorithm| EM algorithm]], reinforcement learning via [[Temporal difference learning|temporal differences]], and [[deep learning]], and others.<ref name=":1">{{cite journal |last1=Toulis |first1=Panos |first2=Edoardo |last2=Airoldi|title=Scalable estimation strategies based on stochastic approximations: classical results and new insights |journal=Statistics and Computing |volume=25 |issue=4 |year=2015 |pages=781–795|doi=10.1007/s11222-015-9560-y|pmid=26139959 |pmc=4484776 }}</ref>
▲
The earliest, and prototypical, algorithms of this kind are the '''
==Robbins–Monro algorithm==
The Robbins–Monro algorithm, introduced in 1951 by [[Herbert Robbins]] and [[John U. Monro#Personal life|Sutton Monro]],<ref name="rm">{{Cite journal | last1 = Robbins | first1 = H. |
Here, <math>a_1, a_2, \dots</math> is a sequence of positive step sizes. [[Herbert Robbins|Robbins]] and Monro proved
* <math display="inline">N(\theta)</math> is uniformly bounded,
* <math display="inline">M(\theta)</math> is nondecreasing,
* <math display="inline">M'(\theta^*)</math> exists and is positive, and
* The sequence <math display="inline">a_n</math> satisfies the following requirements:
<math display="block">\qquad \sum^{\infty}_{n=0}a_n = \infty \quad \mbox{ and } \quad \sum^{\infty}_{n=0}a^2_n < \infty \quad </math>
A particular sequence of steps which satisfy these conditions, and was suggested by Robbins–Monro, have the form: <math display="inline">a_n=a/n</math>, for <math display="inline"> a > 0 </math>. Other series, such as <math>a_n = \frac{1}{n \ln n}, \frac{1}{n \ln n \ln\ln n}, \dots</math> are possible but in order to average out the noise in <math display="inline">N(\theta)</math>, the above condition must be met. === Example ===
Consider the problem of estimating the mean <math>\theta^*</math> of a probability distribution from a stream of independent samples <math>X_1, X_2, \dots</math>.
Let <math>N(\theta) := \theta - X</math>, then the unique solution to <math display="inline">\operatorname E[N(\theta)] = 0</math> is the desired mean <math>\theta^*</math>. The RM algorithm gives us<math display="block">\theta_{n+1}=\theta_n - a_n(\theta_n - X_n) </math>This is equivalent to [[stochastic gradient descent]] with loss function <math>L(\theta) = \frac 12 \|X - \theta\|^2 </math>. It is also equivalent to a weighted average:<math display="block">\theta_{n+1}=(1-a_n)\theta_n + a_n X_n </math>In general, if there exists some function <math>L</math> such that <math>\nabla L(\theta) = N(\theta) - \alpha </math>, then the Robbins–Monro algorithm is equivalent to stochastic gradient descent with loss function <math>L(\theta)</math>. However, the RM algorithm does not require <math>L</math> to exist in order to converge.
===Complexity results===
#If <math display="inline">f(\theta)</math> is twice continuously differentiable, and strongly convex, and the minimizer of <math display="inline">f(\theta)</math> belongs to the interior of <math display="inline">\Theta</math>, then the
# Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin
===Subsequent developments and
While the
Chung (1954)<ref>{{Cite journal|last=Chung|first=K. L.|date=1954-09-01|title=On a Stochastic Approximation Method|journal=The Annals of Mathematical Statistics|language=EN|volume=25|issue=3|pages=463–483|doi=10.1214/aoms/1177728716|issn=0003-4851|doi-access=free}}</ref>
'''''A1)'''''
Therefore, the sequence <math display="inline">a_n = n^{-\alpha}</math> with <math display="inline">0 < \alpha < 1</math> satisfies this restriction, but <math display="inline">\alpha = 1</math> does not, hence the longer steps. Under the assumptions outlined in the
A more general result is given in Chapter 11 of Kushner and Yin<ref>{{Cite book|url=https://www.springer.com/us/book/9780387008943|title=Stochastic Approximation and Recursive Algorithms and {{!}} Harold Kushner {{!}} Springer|
<math display="block">\theta^n(t)=\theta_{n+i},\quad U^n(t)=(\theta_{n+i}-\theta^*)/\sqrt{a_{n+i}}\quad\mbox{for}\quad t\in[t_{n+i}-t_n,t_{n+i+1}-t_n),i\ge0</math>Let the iterate average be <math>\Theta_n=\frac{a_n}{t}\sum_{i=n}^{n+t/a_n-1}\theta_i</math> and the associate normalized error to be <math>\hat{U}^n(t)=\frac{\sqrt{a_n}}{t}\sum_{i=n}^{n+t/a_n-1}(\theta_i-\theta^*)</math>.
Line 42 ⟶ 48:
With assumption '''A1)''' and the following '''A2)'''
'''''A2)''''' ''There is a Hurwitz matrix <math display="inline">A</math> and a symmetric and positive-definite matrix <math display="inline">\Sigma</math> such that <math display="inline">\{U^n(\cdot)\}</math> converges weakly to <math display="inline">U(\cdot)</math>, where <math display="inline">U(\cdot)</math> is the
satisfied, and define ''<math display="inline">\bar{V}=(A^{-1})'\Sigma(A')^{-1}</math>''. Then for each ''<math display="inline">t</math>'',
''<math display="block">\hat{U}^n(t)\stackrel{\mathcal{D}}{\longrightarrow}\mathcal{N}(0,V_t),\quad \
The success of the averaging idea is because of the time scale separation of the original sequence ''<math display="inline">\{\theta_n\}</math>'' and the averaged sequence ''<math display="inline">\{\Theta_n\}</math>'', with the time scale of the former one being faster.
=== Application in
Suppose we want to solve the following stochastic optimization problem<math display="block">g(\theta^*) = \min_{\theta\in\Theta}\operatorname{E}[Q(\theta,X)],</math>where <math display="inline">g(\theta) = \operatorname{E}[Q(\theta,X)]</math> is differentiable and convex, then this problem is equivalent to find the root <math>\theta^*</math> of <math>\nabla g(\theta) = 0</math>. Here <math>Q(\theta,X)</math> can be interpreted as some "observed" cost as a function of the chosen <math>\theta</math> and random effects <math>X</math>. In practice, it might be hard to get an analytical form of <math>\nabla g(\theta)</math>,
▲<math display="block">g(\theta^*) = \min_{\theta\in\Theta}\operatorname{E}[Q(\theta,X)],</math>where <math display="inline">g(\theta) = \operatorname{E}[Q(\theta,X)]</math> is differentiable and convex, then this problem is equivalent to find the root <math>\theta^*</math> of <math>\nabla g(\theta) = 0</math>. Here <math>Q(\theta,X)</math> can be interpreted as some "observed" cost as a function of the chosen <math>\theta</math> and random effects <math>X</math>. In practice, it might be hard to get an analytical form of <math>\nabla g(\theta)</math>, Robbins-Monro method manages to generate a sequence <math>(\theta_n)_{n\geq 0}</math> to approximate <math>\theta^*</math> if one can generate <math>(X_n)_{n\geq 0}
</math> , in which the conditional expectation of <math>X_n
Line 63 ⟶ 67:
Here <math>H(\theta, X)</math> is an unbiased estimator of <math>\nabla g(\theta)</math>. If <math>X</math> depends on <math>\theta</math>, there is in general no natural way of generating a random outcome <math>H(\theta, X)</math> that is an unbiased estimator of the gradient. In some special cases when either IPA or likelihood ratio methods are applicable, then one is able to obtain an unbiased gradient estimator <math>H(\theta, X)</math>. If <math>X</math> is viewed as some "fundamental" underlying random process that is generated ''independently'' of <math>\theta</math>, and under some regularization conditions for derivative-integral interchange operations so that <math>\operatorname{E}\Big[\frac{\partial}{\partial\theta}Q(\theta,X)\Big] = \nabla g(\theta)</math>, then <math>H(\theta, X) = \frac{\partial}{\partial \theta}Q(\theta, X)</math> gives the fundamental gradient unbiased estimate. However, for some applications we have to use finite-difference methods in which <math>H(\theta, X)</math> has a conditional expectation close to <math>\nabla g(\theta)</math> but not exactly equal to it.
We then define a recursion analogously to [[Newton's Method]] in the deterministic algorithm:
: <math display="block">\theta_{n+1} = \theta_n - \
==== Convergence of the
The following result gives sufficient conditions on <math>\theta_n
</math> for the algorithm to converge:<ref>{{Cite book|title=Numerical Methods for stochastic Processes|
C1) <math>\
C2) <math>\sum_{n=0}^
</math>▼
C3) <math>\sum_{n=0}^{\infty}\
C4) <math>|X_n| \leq B, \text{ for a fixed bound }
C5) <math>g(\theta)
: <math display="block">\inf_{\delta\leq |\theta - \theta^*|\leq 1/\delta}\langle\theta-\theta^*, \nabla g(\theta)\rangle > 0,\text{ for every }
▲<math display="block">\inf_{\delta\leq |\theta - \theta^*|\leq 1/\delta}\langle\theta-\theta^*, \nabla g(\theta)\rangle > 0,\text{for every}\; 0< \delta < 1.
Here are some intuitive explanations about these conditions. Suppose <math>H(\theta_n, X_{n+1})</math> is a uniformly bounded random variables. If
</math> , then<math display="block">\theta_n - \theta_0 = -\sum_{i=0}^{n-1} \
</math>is a bounded sequence, so the iteration cannot converge to <math>\theta^* </math> if the initial guess <math>\theta_0▼
</math> is too far away from <math>\theta^* </math>. As for C3) note that if <math>\theta_n </math> converges to <math>\theta^* </math> then
<math display="block">\theta_{n+1} - \theta_n = -\varepsilon_n H(\theta_n, X_{n+1}) \rightarrow 0, \text{ as } n\rightarrow \infty.</math> so we must have <math>\varepsilon_n \downarrow 0 </math> ,and the condition C3) ensures it. A natural choice would be <math>\varepsilon_n = 1/n </math>. Condition C5) is a fairly stringent condition on the shape of <math>g(\theta)</math>; it gives the search direction of the algorithm.
==== Example (where the stochastic gradient method is appropriate)<ref name="jcsbook" /> ====▼
▲ </math> converges to <math>\theta^*
Suppose <math>Q(\theta, X) = f(\theta) + \theta^T X</math>, where <math>f</math> is differentiable and <math>X\in \mathbb{R}^p</math> is a random variable independent of <math>\theta</math>. Then <math>g(\theta)=\operatorname{E}[Q(\theta,X)] = f(\theta)+\theta^T\operatorname{E}X</math> depends on the mean of <math>X</math>, and the stochastic gradient method would be appropriate in this problem. We can choose <math>H(\theta, X) = \frac{\partial}{\partial\theta}Q(\theta,X) = \frac{\partial}{\partial\theta}f(\theta) + X.</math>▼
The
Let <math>M(x) </math> be a function which has a maximum at the point <math>\theta </math>. It is assumed that <math>M(x)</math> is unknown; however, certain observations <math>N(x)</math>, where <math>\operatorname E[N(x)] = M(x)</math>, can be made at any point <math>x</math>. The structure of the algorithm follows a gradient-like method, with the iterates being generated as
▲Here are some intuitive explanations about these conditions. Suppose <math>H(\theta_n, X_{n+1})</math> is a uniformly bounded random variables. If C2) is not satisfied, i.e. <math>\sum_{n=0}^{\infty}\epsilon_n < \infty
::<math> x_{n+1} = x_n + a_n
▲ </math> , then<math display="block">\theta_n - \theta_0 = -\sum_{i=0}^{n-1}\epsilon_i H(\theta_i, X_{i+1})
where <math>N(x_n+c_n)</math> and <math>N(x_n-c_n)</math> are independent
▲ </math>is a bounded sequence, so the iteration cannot converge to <math>\theta^*
▲==== Example (where the stochastic gradient method is appropriate)<ref name="jcsbook" /> ====
▲Suppose <math>Q(\theta, X) = f(\theta) + \theta^T X</math>, where <math>f</math> is differentiable and <math>X\in \mathbb{R}^p</math> is a random variable independent of <math>\theta</math>. Then <math>g(\theta)=\operatorname{E}[Q(\theta,X)] = f(\theta)+\theta^T\operatorname{E}X</math> depends on the mean of <math>X</math>, and the stochastic gradient method would be appropriate in this problem. We can choose <math>H(\theta, X) = \frac{\partial}{\partial\theta}Q(\theta,X) = \frac{\partial}{\partial\theta}f(\theta) + X.</math>
Kiefer and Wolfowitz proved that, if <math>M(x)</math> satisfied certain regularity conditions, then <math>x_n</math> will converge to <math>\theta</math> in probability as <math>n\to\infty
▲==Kiefer-Wolfowitz algorithm==
▲The Kiefer-Wolfowitz algorithm<ref name = "KW">{{Cite journal | last1 = Kiefer | first1 = J. | last2 = Wolfowitz | first2 = J. | doi = 10.1214/aoms/1177729392 | title = Stochastic Estimation of the Maximum of a Regression Function | journal = The Annals of Mathematical Statistics | volume = 23 | issue = 3 | pages = 462 | year = 1952 | pmid = | pmc = }}</ref> was introduced in 1952 by [[Jacob Wolfowitz]] and [[Jack_Kiefer_(statistician)|Jack Kiefer]], and was motivated by the publication of the Robbins-Monro algorithm. However, the algorithm was presented as a method which would stochastically estimate the maximum of a function. Let <math>M(x) </math> be a function which has a maximum at the point <math>\theta </math>. It is assumed that <math>M(x)</math> is unknown; however, certain observations <math>N(x)</math>, where <math>\operatorname E[N(x)] = M(x)</math>, can be made at any point <math>x</math>. The structure of the algorithm follows a gradient-like method, with the iterates being generated as follows:
▲::<math> x_{n+1} = x_n + a_n \bigg(\frac{N(x_n + c_n) - N(x_n -c_n)}{2 c_n} \bigg) </math>
▲where <math>N(x_n+c_n)</math> and <math>N(x_n-c_n)</math> are independent, and the gradient of <math>M(x)</math> is approximated using finite differences. The sequence <math>\{c_n\}</math> specifies the sequence of finite difference widths used for the gradient approximation, while the sequence <math>\{a_n\}</math> specifies a sequence of positive step sizes taken along that direction. Kiefer and Wolfowitz proved that, if <math>M(x)</math> satisfied certain regularity conditions, then <math>x_n</math> will converge to <math>\theta</math> in probability as <math>n\to\infty
</math>, and later Blum<ref name=":0" /> in 1954 showed <math>x_n</math> converges to <math>\theta</math> almost surely, provided that:
* <math>\operatorname{Var}(N(x))\le S<\infty</math> for all <math>x</math>.
* The function <math>M(x)</math> has a unique point of maximum (minimum) and is strong concave (convex)
** The algorithm was first presented with the requirement that the function <math>M(\cdot)</math> maintains strong global convexity (concavity) over the entire feasible space. Given this condition is too restrictive to impose over the entire ___domain, Kiefer and Wolfowitz proposed that it is sufficient to impose the condition over a compact set <math>C_0 \subset \mathbb R^d</math> which is known to include the optimal solution.
Line 129 ⟶ 116:
** There exists <math>\beta>0</math> and <math>B>0</math> such that <math display="block">|x'-\theta|+|x''-\theta|<\beta \quad \Longrightarrow \quad |M(x')-M(x'')|<B|x'-x''|</math>
** There exists <math> \rho>0 </math> and <math> R>0 </math> such that <math display="block">|x'-x''|<\rho \quad \Longrightarrow \quad |M(x')-M(x'')|<R</math>
** For every <math> \delta>0 </math>, there exists some <math> \pi(\delta)>0 </math> such that <math display="block">|z-\theta|>\delta \quad \Longrightarrow \quad \inf_{\delta/2>\
*The selected sequences <math>\{a_n\}</math> and <math>\{c_n\}</math> must be infinite sequences of positive numbers such that
**<math>\quad c_n \rightarrow 0\quad \
**<math> \sum^\infty_{n=0} a_n =\infty </math>
**<math> \sum^\infty_{n=0} a_nc_n <\infty </math>
Line 139 ⟶ 126:
===Subsequent developments and important issues===
#The Kiefer Wolfowitz algorithm requires that for each gradient computation, at least <math>d+1</math> different parameter values must be simulated for every iteration of the algorithm, where <math>d </math> is the dimension of the search space. This means that when <math>d</math> is large, the
## To address this problem, Spall proposed the use of [[Simultaneous perturbation stochastic approximation|simultaneous perturbations]] to estimate the gradient. This method would require only two simulations per iteration, regardless of the dimension <math>d</math>.<ref name = "Jsp">{{Cite journal | last1 = Spall | first1 = J. C. | title = Adaptive stochastic approximation by the simultaneous perturbation method | doi = 10.1109/TAC.2000.880982 | journal = IEEE Transactions on Automatic Control | volume = 45 | issue = 10 | pages = 1839–1853 | year = 2000
#In the conditions required for convergence, the ability to specify a predetermined compact set that fulfills strong convexity (or concavity) and contains the unique solution can be difficult to find. With respect to real world applications, if the ___domain is quite large, these assumptions can be fairly restrictive and highly unrealistic.
==Further developments==
An extensive theoretical literature has grown up around these algorithms, concerning conditions for convergence, rates of convergence, multivariate and other generalizations, proper choice of step size, possible noise models, and so on.<ref name="kushneryin">{{Cite book | last1 = Kushner | first1 = H. J. |
[[C. Johan Masreliez]] and [[R. Douglas Martin]] were the first to apply
stochastic approximation to [[Robust statistics|robust]] [[estimation]].<ref>{{Cite journal | last1 = Martin | first1 = R. | last2 = Masreliez | first2 = C. | doi = 10.1109/TIT.1975.1055386 | title = Robust estimation via stochastic approximation | journal = IEEE Transactions on Information Theory | volume = 21 | issue = 3 | pages = 263 | year = 1975
The main tool for analyzing stochastic approximations algorithms (including the
| last = Dvoretzky | first = Aryeh | author-link = Aryeh Dvoretzky
| editor-last = Neyman | editor-first = Jerzy | editor-link = Jerzy Neyman
| contribution = On stochastic approximation
| contribution-url = https://projecteuclid.org/euclid.bsmsp/1200501645
| mr = 84911
| pages = 39–55
| publisher = University of California Press
| title = Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I
| year = 1956}}</ref>
==See also==
*[[Stochastic gradient descent]]
*[[Stochastic variance reduction]]
==References==
{{reflist}}
{{Statistics|collection}}
{{DEFAULTSORT:Stochastic Approximation}}
|