==Extensions and variants==
Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge{{citation needed|date=October 2017}}; setting it too low makes it slow to converge{{citation needed|date=October 2017}}. A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function {{mvar|η<sub>t</sub>}} of the iteration number {{mvar|t}}, giving a ''learning rate schedule'', so that the first iterations cause large changes in the parameters, while later iterations perform only fine-tuning. Such schedules have been known since the work of MacQueen on [[K-means clustering|{{mvar|k}}-means clustering]].<ref>Cited by {{cite conference |last1=Darken |first1=Christian |first2=John |last2=Moody |title=Fast adaptive k-means clustering: some empirical results |year=1990 |conference=Int'l Joint Conf. on Neural Networks (IJCNN) |publisher=IEEE|url=http://ieeexplore.ieee.org/abstract/document/5726679/}}</ref>
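A minimal sketch of such a schedule, assuming a simple decay of the form {{mvar|η<sub>t</sub>}} = {{mvar|η<sub>0</sub>}}/(1 + ''c·t'') applied to a hypothetical least-squares problem (the data and the values of <code>eta0</code> and <code>decay</code> are illustrative assumptions, not prescribed by any reference):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least-squares data: y = X w_true + noise (illustrative only)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
eta0, decay = 0.1, 1e-2                 # initial learning rate and decay constant (assumed values)

for t in range(10_000):
    eta_t = eta0 / (1.0 + decay * t)    # decreasing learning rate schedule
    i = rng.integers(len(y))            # sample one example uniformly at random
    grad_i = -(y[i] - X[i] @ w) * X[i]  # gradient of (y_i - x_i'w)^2 / 2 with respect to w
    w -= eta_t * grad_i                 # SGD step with the scheduled step size
</syntaxhighlight>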
===Implicit updates (ISGD)===
As mentioned earlier, classical stochastic gradient descent is generally sensitive to the learning rate {{mvar|η}}. Fast convergence requires large learning rates, but this may induce numerical instability. The problem can be largely solved by considering ''implicit updates'', whereby the stochastic gradient is evaluated at the next iterate rather than the current one:
:<math>w^{new} := w^{old} - \eta \nabla Q_i(w^{new}).</math>
This update is implicit since <math>w^{new}</math> appears on both sides of the equation. As an example,
consider multidimensional least squares with features <math>x_1, \ldots, x_n \in\mathbb{R}^p</math> and observations
<math>y_1, \ldots, y_n\in\mathbb{R}</math>. We wish to solve:
:<math>\min_w \sum_{j=1}^n (y_j - x_j'w)^2.</math>
Note that each <math>x_j</math> could have 1 as its first element in order to include an intercept. Classical stochastic gradient descent proceeds as follows:
:<math>w^{new} = w^{old} + \eta (y_i - x_i'w^{old}) x_i</math>
where <math>i</math> is uniformly sampled between 1 and <math>n</math>. Although this procedure converges under relatively mild theoretical assumptions, in practice it can be quite unstable. In particular, when <math>\eta</math> is misspecified so that <math>I - \eta x_i x_i'</math> has large absolute eigenvalues with high probability, the procedure may diverge numerically within a few iterations. In contrast, ''implicit stochastic gradient descent'' (shortened as ISGD) can be solved in closed form as:
:<math>w^{new} = w^{old} + \frac{\eta}{1 + \eta ||x_i||^2} (y_i - x_i'w^{old}) x_i.</math>
This procedure remains numerically stable for virtually all <math>\eta</math>, as the learning rate is now normalized. The comparison between classical and implicit stochastic gradient descent in the least-squares problem closely parallels the comparison between the [[Least mean squares filter|least mean squares (LMS)]] and [[Least mean squares filter#Normalized_least_mean_squares_filter_(NLMS)|normalized least mean squares (NLMS)]] filters.
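As an illustration of this difference, the following minimal sketch runs both update rules on a hypothetical least-squares problem (the data, the value of <code>eta</code>, and the iteration count are illustrative assumptions; with large-scale features, <math>\eta \|x_i\|^2</math> is large, so the explicit update typically overflows while the normalized implicit update stays stable):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least-squares data with large-scale features, so that
# eta * ||x_i||^2 is large and classical SGD becomes unstable (illustrative only).
n, p = 500, 10
X = 5.0 * rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)

eta = 0.05                      # deliberately large learning rate (assumed value)
w_sgd = np.zeros(p)             # classical (explicit) SGD iterate
w_isgd = np.zeros(p)            # implicit SGD (ISGD) iterate

for _ in range(5000):
    i = rng.integers(n)
    xi, yi = X[i], y[i]
    # Classical SGD: w_new = w_old + eta * (y_i - x_i'w_old) * x_i
    w_sgd = w_sgd + eta * (yi - xi @ w_sgd) * xi
    # ISGD closed form: the step is normalized by 1 + eta * ||x_i||^2
    w_isgd = w_isgd + eta / (1.0 + eta * (xi @ xi)) * (yi - xi @ w_isgd) * xi

print("classical SGD error:", np.linalg.norm(w_sgd - w_true))   # diverges (inf/nan) in this setup
print("implicit SGD error:", np.linalg.norm(w_isgd - w_true))   # remains small
</syntaxhighlight>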
Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models. Specifically, suppose that <math>Q_i(w)</math> depends on <math>w</math> only through a linear combination with the features <math>x_i</math>, so that we can write <math>\nabla_w Q_i(w) = -q(x_i'w) x_i</math>, where the real-valued function <math>q()</math> may depend on <math>x_i, y_i</math> but not on <math>w</math>. Least squares obeys this rule, as do [[logistic regression]] and most [[Generalized linear model|generalized linear models]]. For instance, in least squares <math>q(x_i'w) = y_i - x_i'w</math>; in logistic regression <math>q(x_i'w) = y_i - S(x_i'w)</math>, where <math>S(u) = e^u/(1+e^u)</math> is the [[logistic function]]; and in [[Poisson regression]], <math>q(x_i'w) = y_i - e^{x_i'w}</math>.
In such settings, ISGD is simply implemented as follows:
:<math>w^{new} = w^{old} + \xi x_i,\quad\xi = \eta q(x_i'w^{old} + \xi ||x_i||^2).</math>
The scaling factor <math>\xi\in\mathbb{R}</math> can be found through the [[bisection method]]: in most regular models, such as the aforementioned generalized linear models, the function <math>q()</math> is decreasing, so the search bounds for <math>\xi</math> are <math>[\min(0, b_i), \max(0, b_i)]</math>, where <math>b_i = \eta q(x_i'w^{old})</math>.
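A minimal sketch of this bisection search, using the logistic-regression <math>q()</math> as an example (the helper names <code>isgd_step</code> and <code>q_logistic</code>, the data, and all numerical settings are illustrative assumptions, not a standard implementation):

<syntaxhighlight lang="python">
import numpy as np

def isgd_step(w, x_i, y_i, eta, q, tol=1e-10, max_iter=100):
    """One implicit SGD step for a model whose stochastic gradient is -q(x_i'w) x_i.

    The scaling factor xi solves xi = eta * q(x_i'w_old + xi * ||x_i||^2, y_i);
    when q is decreasing, it is bracketed by [min(0, b_i), max(0, b_i)]
    with b_i = eta * q(x_i'w_old, y_i).
    """
    norm2 = x_i @ x_i
    b = eta * q(x_i @ w, y_i)
    lo, hi = min(0.0, b), max(0.0, b)
    xi = 0.5 * (lo + hi)
    for _ in range(max_iter):
        f = xi - eta * q(x_i @ w + xi * norm2, y_i)  # f is increasing in xi when q is decreasing
        if abs(f) < tol:
            break
        if f < 0:
            lo = xi   # root lies to the right of xi
        else:
            hi = xi   # root lies to the left of xi
        xi = 0.5 * (lo + hi)
    return w + xi * x_i

def q_logistic(u, y_i):
    """q for logistic regression: y_i - S(u), with S the logistic function."""
    return y_i - 1.0 / (1.0 + np.exp(-u))

# Illustrative usage on hypothetical logistic-regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_star = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ w_star))).astype(float)

w = np.zeros(3)
for i in range(200):
    w = isgd_step(w, X[i], y[i], eta=0.5, q=q_logistic)
</syntaxhighlight>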
===Momentum===