Stochastic approximation: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 04:08, 30 May 2024 edit David Eppstein (talk \| contribs) Autopatrolled, Administrators 235,660 edits →Subsequent developments and Polyak–Ruppert averaging: fix refs ← Previous edit		Latest revision as of 08:32, 27 January 2025 edit undo 134.10.78.35 (talk) →Robbins–Monro algorithm Tag: Visual edit
(11 intermediate revisions by 4 users not shown)
Line 11: ==Robbins–Monro algorithm== The Robbins–Monro algorithm, introduced in 1951 by [[Herbert Robbins]] and [[John U. Monro#Personal life\|Sutton Monro]],<ref name="rm">{{Cite journal \| last1 = Robbins \| first1 = H. \| author-link = Herbert Robbins\| last2 = Monro \| first2 = S. \| doi = 10.1214/aoms/1177729586 \| title = A Stochastic Approximation Method \| journal = The Annals of Mathematical Statistics \| volume = 22 \| issue = 3 \| pages = 400 \| year = 1951 \| doi-access = free }}</ref> presented a methodology for solving a root finding problem, where the function is represented as an expected value. Assume that we have a function <math display="inline">M(\theta)</math>, and a constant <math display="inline">\alpha</math>, such that the equation <math display="inline">M(\theta) = \alpha</math> has a unique root at <math display="inline">\theta^.</math>. It is assumed that while we cannot directly observe the function <math display="inline">M(\theta),</math>, we can instead obtain measurements of the random variable <math display="inline">N(\theta)</math> where <math display="inline">\operatorname E[N(\theta)] = M(\theta)</math>. The structure of the algorithm is to then generate iterates of the form: ::<math display="block">\theta_{n+1}=\theta_n - a_n(N(\theta_n) - \alpha)</math> Here, <math>a_1, a_2, \dots</math> is a sequence of positive step sizes. [[Herbert Robbins\|Robbins]] and Monro proved<ref name="rm" /><sup>, Theorem 2</sup> that <math>\theta_n</math> [[convergence of random variables\|converges]] in <math>L^2</math> (and hence also in probability) to <math>\theta^</math>, and Blum<ref name=":0">{{Cite journal\|last=Blum\|first=Julius R.\|date=1954-06-01\|title=Approximation Methods which Converge with Probability one\|journal=The Annals of Mathematical Statistics\|language=EN\|volume=25\|issue=2\|pages=382–386\|doi=10.1214/aoms/1177728794\|issn=0003-4851\|doi-access=free}}</ref> later proved the convergence is actually with probability one, provided that: Line 20: * <math display="inline">M'(\theta^)</math> exists and is positive, and The sequence <math display="inline">a_n</math> satisfies the following requirements: ~~:: <math>\qquad \sum^{\infty}_{n=0}a_n = \infty \quad \mbox{ and } \quad \sum^{\infty}_{n=0}a^2_n < \infty \quad </math>~~ <math display="block">\qquad \sum^{\infty}_{n=0}a_n = \infty \quad \mbox{ and } \quad \sum^{\infty}_{n=0}a^2_n < \infty \quad </math> A particular sequence of steps which satisfy these conditions, and was suggested by Robbins–Monro, have the form: <math display="inline">a_n=a/n</math>, for <math display="inline"> a > 0 </math>. Other series, such as <math>a_n = \frac{1}{n \ln n}, \frac{1}{n \ln n \ln\ln n}, \dots</math> are possible but in order to average out the noise in <math display="inline">N(\theta)</math>, the above condition must be met. === Example === Consider the problem of estimating the mean <math>\theta^</math> of a probability distribution from a stream of independent samples <math>X_1, X_2, \dots</math>. Let <math>N(\theta) := \theta - X</math>, then the unique solution to <math display="inline">\operatorname E[N(\theta)] = 0</math> is the desired mean <math>\theta^</math>. The RM algorithm gives us<math display="block">\theta_{n+1}=\theta_n - a_n(\theta_n - X_n) </math>This is equivalent to [[stochastic gradient descent]] with loss function <math>L(\theta) = \frac 12 \\|X - \theta\\|^2 </math>. It is also equivalent to a weighted average:<math display="block">\theta_{n+1}=(1-a_n)\theta_n + a_n X_n </math>In general, if there exists some function <math>L</math> such that <math>\nabla L(\theta) = N(\theta) - \alpha </math>, then the Robbins–Monro algorithm is equivalent to stochastic gradient descent with loss function <math>L(\theta)</math>. However, the RM algorithm does not require <math>L</math> to exist in order to converge. ===Complexity results=== Line 33 ⟶ 38: Chung (1954)<ref>{{Cite journal\|last=Chung\|first=K. L.\|date=1954-09-01\|title=On a Stochastic Approximation Method\|journal=The Annals of Mathematical Statistics\|language=EN\|volume=25\|issue=3\|pages=463–483\|doi=10.1214/aoms/1177728716\|issn=0003-4851\|doi-access=free}}</ref> and Fabian (1968)<ref>{{Cite journal\|last=Fabian\|first=Vaclav\|date=1968-08-01\|title=On Asymptotic Normality in Stochastic Approximation\|journal=The Annals of Mathematical Statistics\|language=EN\|volume=39\|issue=4\|pages=1327–1332\|doi=10.1214/aoms/1177698258\|issn=0003-4851\|doi-access=free}}</ref> showed that we would achieve optimal convergence rate <math display="inline">O(1/\sqrt{n})</math> with <math display="inline">a_n=\bigtriangledown^2f(\theta^)^{-1}/n</math> (or <math display="inline">a_n=\frac{1}{(nM'(\theta^))}</math>). Lai and Robbins<ref>{{Cite journal\|last1=Lai\|first1=T. L.\|last2=Robbins\|first2=Herbert\|date=1979-11-01\|title=Adaptive Design and Stochastic Approximation\|journal=The Annals of Statistics\|language=EN\|volume=7\|issue=6\|pages=1196–1221\|doi=10.1214/aos/1176344840\|issn=0090-5364\|doi-access=free}}</ref><ref>{{Cite journal\|last1=Lai\|first1=Tze Leung\|last2=Robbins\|first2=Herbert\|date=1981-09-01\|title=Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes\|journal=Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete\|language=en\|volume=56\|issue=3\|pages=329–360\|doi=10.1007/BF00536178\|s2cid=122109044\|issn=0044-3719\|doi-access=free}}</ref> designed adaptive procedures to estimate <math display="inline">M'(\theta^)</math> such that <math display="inline">\theta_n</math> has minimal asymptotic variance. However the application of such optimal methods requires much a priori information which is hard to obtain in most situations. To overcome this shortfall, Polyak (1991)<ref>{{Cite journal\|last=Polyak\|first=B T\|date=1991\|title=New stochastic approximation type procedures. (In Russian.)\|url=https://www.researchgate.net/publication/236736759\|journal=Automation and Remote Control\|volume=7\|issue=7}}</ref> and Ruppert (1988)<ref>{{Cite report\|last=Ruppert\|first=David\|title=Efficient estimators from a slowly converging robbins-monro process\|url=https://www.researchgate.net/publication/242608650\|type=Technical Report 781\|publisher=Cornell University School of Operations Research and Industrial Engineering\|year=1988}}</ref> independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky<ref name="pj">{{Cite journal \| last1 = Polyak \| first1 = B. T. \| last2 = Juditsky \| first2 = A. B. \| doi = 10.1137/0330046 \| title = Acceleration of Stochastic Approximation by Averaging \| journal = SIAM Journal on Control and Optimization \| volume = 30 \| issue = 4 \| pages = 838 \| year = 1992 }}</ref> also presented a method of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer steps, and averaging of the iterates. The algorithm would have the following structure:<math display="block"> \theta_{n+1} - \theta_n = a_n(\alpha - N(\theta_n)), \qquad \bar{\theta}_n = \frac{1}{n} \sum^{n-1}_{i=0} \theta_i </math>The convergence of <math> \bar{\theta}_n </math> to the unique root <math>\theta^</math> relies on the condition that the step sequence <math>\{a_n\}</math> decreases sufficiently slowly. That is '''''A1)''''' ''<math display="block"> a_n \rightarrow 0, \qquad \frac{a_n - a_{n+1}}{a_n} = o(a_n)</math> Therefore, the sequence <math display="inline">a_n = n^{-\alpha}</math> with <math display="inline">0 < \alpha < 1</math> satisfies this restriction, but <math display="inline">\alpha = 1</math> does not, hence the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting modification will result in the same asymptotically optimal convergence rate <math display="inline">O(1/\sqrt{n})</math> yet with a more robust step size policy.<ref name="pj" /> Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin<ref name="NY">On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions, A. Nemirovski and D. Yudin, ''Dokl. Akad. Nauk SSR'' '''2939''', (1978 (Russian)), Soviet Math. Dokl. '''19''' (1978 (English)).</ref> for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems. These algorithms were observed to attain the nonasymptotic rate <math display="inline">O(1/\sqrt{n})</math>. Line 43 ⟶ 48: With assumption '''A1)''' and the following '''A2)''' '''''A2)''''' ''There is a Hurwitz matrix <math display="inline">A</math> and a symmetric and positive-definite matrix <math display="inline">\Sigma</math> such that <math display="inline">\{U^n(\cdot)\}</math> converges weakly to <math display="inline">U(\cdot)</math>, where <math display="inline">U(\cdot)</math> is the statisolution to'' <math display="block">dU = AU \, dt +\Sigma^{1/2} \, dw</math>where <math display="inline">w(\cdot)</math> is a standard Wiener process.'' satisfied, and define ''<math display="inline">\bar{V}=(A^{-1})'\Sigma(A')^{-1}</math>''. Then for each ''<math display="inline">t</math>'', Line 52 ⟶ 57: === Application in stochastic optimization === Suppose we want to solve the following stochastic optimization problem<math display="block">g(\theta^) = \min_{\theta\in\Theta}\operatorname{E}[Q(\theta,X)],</math>where <math display="inline">g(\theta) = \operatorname{E}[Q(\theta,X)]</math> is differentiable and convex, then this problem is equivalent to find the root <math>\theta^</math> of <math>\nabla g(\theta) = 0</math>. Here <math>Q(\theta,X)</math> can be interpreted as some "observed" cost as a function of the chosen <math>\theta</math> and random effects <math>X</math>. In practice, it might be hard to get an analytical form of <math>\nabla g(\theta)</math>, Robbins–Monro method manages to generate a sequence <math>(\theta_n)_{n\geq 0}</math> to approximate <math>\theta^</math> if one can generate <math>(X_n)_{n\geq 0}▼ ~~Suppose we want to solve the following stochastic optimization problem~~ ▲<math display="block">g(\theta^) = \min_{\theta\in\Theta}\operatorname{E}[Q(\theta,X)],</math>where <math display="inline">g(\theta) = \operatorname{E}[Q(\theta,X)]</math> is differentiable and convex, then this problem is equivalent to find the root <math>\theta^</math> of <math>\nabla g(\theta) = 0</math>. Here <math>Q(\theta,X)</math> can be interpreted as some "observed" cost as a function of the chosen <math>\theta</math> and random effects <math>X</math>. In practice, it might be hard to get an analytical form of <math>\nabla g(\theta)</math>, Robbins–Monro method manages to generate a sequence <math>(\theta_n)_{n\geq 0}</math> to approximate <math>\theta^</math> if one can generate <math>(X_n)_{n\geq 0} </math> , in which the conditional expectation of <math>X_n Line 71 ⟶ 74: The following result gives sufficient conditions on <math>\theta_n </math> for the algorithm to converge:<ref>{{Cite book\|title=Numerical Methods for stochastic Processes\|last1=Bouleau\|first1=N.\|author-link=Nicolas Bouleau\|last2=Lepingle\|first2=D.\|publisher=John Wiley\|year=1994\|isbn=9780471546412\|___location=New York\|url=https://books.google.com/books?id=9MLL2RN40asC}}</ref> C1) <math>\varepsilon_n \geq 0, \forall\; n\geq 0. </math>