Revision as of 14:49, 15 February 2021 edit Ritalinowitz (talk \| contribs) 1 edit m →Robbins–Monro algorithm ← Previous edit		Revision as of 00:34, 31 March 2021 edit undo 174.255.132.225 (talk) No edit summary Tags: Reverted Visual edit Mobile edit Mobile web edit Next edit →
Line 58: </math> given <math>\theta_n </math> is exactly <math>\nabla g(\theta_n)</math>, i.e. <math>X_n</math> is simulated from a conditional distribution defined by ~~<math display="block">\operatorname{E}[H(\theta,X)\|\theta = \theta_n] = \nabla g(\theta_n).</math>~~ Here <math>H(\theta, X)</math> is an unbiased estimator of <math>\nabla g(\theta)</math>. If <math>X</math> depends on <math>\theta</math>, there is in general no natural way of generating a random outcome <math>H(\theta, X)</math> that is an unbiased estimator of the gradient. In some special cases when either IPA or likelihood ratio methods are applicable, then one is able to obtain an unbiased gradient estimator <math>H(\theta, X)</math>. If <math>X</math> is viewed as some "fundamental" underlying random process that is generated ''independently'' of <math>\theta</math>, and under some regularization conditions for derivative-integral interchange operations so that <math>\operatorname{E}\Big[\frac{\partial}{\partial\theta}Q(\theta,X)\Big] = \nabla g(\theta)</math>, then <math>H(\theta, X) = \frac{\partial}{\partial \theta}Q(\theta, X)</math> gives the fundamental gradient unbiased estimate. However, for some applications we have to use finite-difference methods in which <math>H(\theta, X)</math> has a conditional expectation close to <math>\nabla g(\theta)</math> but not exactly equal to it. Line 94 ⟶ 92: <math display="block">\theta_{n+1} - \theta_n = -\varepsilon_n H(\theta_n, X_{n+1}) \rightarrow 0, \text{ as } n\rightarrow \infty.</math> so we must have <math>\varepsilon_n \downarrow 0 </math> ，and the condition C3) ensures it. A natural choice would be <math>\varepsilon_n = 1/n </math>. Condition C5) is a fairly stringent condition on the shape of <math>g(\theta)</math>; it gives the search direction of the algorithm. ==== Example (where the stochastic gradient method is appropriate)~~<ref name="jcsbook" />~~ ==== Suppose <math>Q(\theta, X) = f(\theta) + \theta^T X</math>, where <math>f</math> is differentiable and <math>X\in \mathbb{R}^p</math> is a random variable independent of <math>\theta</math>. Then <math>g(\theta)=\operatorname{E}[Q(\theta,X)] = f(\theta)+\theta^T\operatorname{E}X</math> depends on the mean of <math>X</math>~~, and the stochastic gradient method would be appropriate in this problem. We can choose~~ <math>H(\theta, X) = \frac{\partial}{\partial\theta}Q(\theta,X) = \frac{\partial}{\partial\theta}f(\theta) + X.</math> ==Kiefer–Wolfowitz algorithm==

Stochastic approximation: Difference between revisions