Content deleted Content added
Rchard2scout (talk | contribs) m Fix lint errors |
|||
(10 intermediate revisions by 3 users not shown) | |||
Line 11:
==Robbins–Monro algorithm==
The Robbins–Monro algorithm, introduced in 1951 by [[Herbert Robbins]] and [[John U. Monro#Personal life|Sutton Monro]],<ref name="rm">{{Cite journal | last1 = Robbins | first1 = H. | author-link = Herbert Robbins| last2 = Monro | first2 = S. | doi = 10.1214/aoms/1177729586 | title = A Stochastic Approximation Method | journal = The Annals of Mathematical Statistics | volume = 22 | issue = 3 | pages = 400 | year = 1951 | doi-access = free }}</ref> presented a methodology for solving a root finding problem, where the function is represented as an expected value. Assume that we have a function <math display="inline">M(\theta)</math>, and a constant <math display="inline">\alpha</math>, such that the equation <math display="inline">M(\theta) = \alpha</math> has a unique root at <math display="inline">\theta^*.</math>
Here, <math>a_1, a_2, \dots</math> is a sequence of positive step sizes. [[Herbert Robbins|Robbins]] and Monro proved<ref name="rm" /><sup>, Theorem 2</sup> that <math>\theta_n</math> [[convergence of random variables|converges]] in <math>L^2</math> (and hence also in probability) to <math>\theta^*</math>, and Blum<ref name=":0">{{Cite journal|last=Blum|first=Julius R.|date=1954-06-01|title=Approximation Methods which Converge with Probability one|journal=The Annals of Mathematical Statistics|language=EN|volume=25|issue=2|pages=382–386|doi=10.1214/aoms/1177728794|issn=0003-4851|doi-access=free}}</ref> later proved the convergence is actually with probability one, provided that:
Line 20:
* <math display="inline">M'(\theta^*)</math> exists and is positive, and
* The sequence <math display="inline">a_n</math> satisfies the following requirements:
<math display="block">\qquad \sum^{\infty}_{n=0}a_n = \infty \quad \mbox{ and } \quad \sum^{\infty}_{n=0}a^2_n < \infty \quad </math>
A particular sequence of steps which satisfy these conditions, and was suggested by Robbins–Monro, have the form: <math display="inline">a_n=a/n</math>, for <math display="inline"> a > 0 </math>. Other series, such as <math>a_n = \frac{1}{n \ln n}, \frac{1}{n \ln n \ln\ln n}, \dots</math> are possible but in order to average out the noise in <math display="inline">N(\theta)</math>, the above condition must be met. === Example ===
Consider the problem of estimating the mean <math>\theta^*</math> of a probability distribution from a stream of independent samples <math>X_1, X_2, \dots</math>.
Let <math>N(\theta) := \theta - X</math>, then the unique solution to <math display="inline">\operatorname E[N(\theta)] = 0</math> is the desired mean <math>\theta^*</math>. The RM algorithm gives us<math display="block">\theta_{n+1}=\theta_n - a_n(\theta_n - X_n) </math>This is equivalent to [[stochastic gradient descent]] with loss function <math>L(\theta) = \frac 12 \|X - \theta\|^2 </math>. It is also equivalent to a weighted average:<math display="block">\theta_{n+1}=(1-a_n)\theta_n + a_n X_n </math>In general, if there exists some function <math>L</math> such that <math>\nabla L(\theta) = N(\theta) - \alpha </math>, then the Robbins–Monro algorithm is equivalent to stochastic gradient descent with loss function <math>L(\theta)</math>. However, the RM algorithm does not require <math>L</math> to exist in order to converge.
===Complexity results===
Line 52 ⟶ 57:
=== Application in stochastic optimization ===
Suppose we want to solve the following stochastic optimization problem<math display="block">g(\theta^*) = \min_{\theta\in\Theta}\operatorname{E}[Q(\theta,X)],</math>where <math display="inline">g(\theta) = \operatorname{E}[Q(\theta,X)]</math> is differentiable and convex, then this problem is equivalent to find the root <math>\theta^*</math> of <math>\nabla g(\theta) = 0</math>. Here <math>Q(\theta,X)</math> can be interpreted as some "observed" cost as a function of the chosen <math>\theta</math> and random effects <math>X</math>. In practice, it might be hard to get an analytical form of <math>\nabla g(\theta)</math>, Robbins–Monro method manages to generate a sequence <math>(\theta_n)_{n\geq 0}</math> to approximate <math>\theta^*</math> if one can generate <math>(X_n)_{n\geq 0}▼
▲<math display="block">g(\theta^*) = \min_{\theta\in\Theta}\operatorname{E}[Q(\theta,X)],</math>where <math display="inline">g(\theta) = \operatorname{E}[Q(\theta,X)]</math> is differentiable and convex, then this problem is equivalent to find the root <math>\theta^*</math> of <math>\nabla g(\theta) = 0</math>. Here <math>Q(\theta,X)</math> can be interpreted as some "observed" cost as a function of the chosen <math>\theta</math> and random effects <math>X</math>. In practice, it might be hard to get an analytical form of <math>\nabla g(\theta)</math>, Robbins–Monro method manages to generate a sequence <math>(\theta_n)_{n\geq 0}</math> to approximate <math>\theta^*</math> if one can generate <math>(X_n)_{n\geq 0}
</math> , in which the conditional expectation of <math>X_n
Line 71 ⟶ 74:
The following result gives sufficient conditions on <math>\theta_n
</math> for the algorithm to converge:<ref>{{Cite book|title=Numerical Methods for stochastic Processes|last1=Bouleau|first1=N.|author-link=Nicolas Bouleau|last2=Lepingle|first2=D.|publisher=John Wiley|year=1994|isbn=9780471546412|___location=New York|url=https://books.google.com/books?id=9MLL2RN40asC}}</ref>
C1) <math>\varepsilon_n \geq 0, \forall\; n\geq 0. </math>
|