Stochastic gradient descent: Difference between revisions

Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge{{citation needed|date=October 2017}}; setting it too low makes it slow to converge{{citation needed|date=October 2017}}. A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function {{mvar|η<sub>t</sub>}} of the iteration number {{mvar|t}}, giving a ''learning rate schedule'', so that the first iterations cause large changes in the parameters, while the later ones perform only fine-tuning. Such schedules have been known since the work of MacQueen on [[K-means clustering|{{mvar|k}}-means clustering]].<ref>Cited by {{cite conference |last1=Darken |first1=Christian |first2=John |last2=Moody |title=Fast adaptive k-means clustering: some empirical results |year=1990 |conference=Int'l Joint Conf. on Neural Networks (IJCNN) |publisher=IEEE|url=http://ieeexplore.ieee.org/abstract/document/5726679/}}</ref>
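A learning rate schedule of this kind can be sketched as follows; this is an illustrative example (the problem, data, and parameter names {{mvar|η<sub>0</sub>}} and ''decay'' are our own choices, not from the article), using the common schedule {{mvar|η<sub>t</sub>}} = {{mvar|η<sub>0</sub>}}/(1 + ''decay''·{{mvar|t}}):

```python
import random

# Illustrative sketch: plain SGD on a one-dimensional least-squares problem
# with a decreasing learning-rate schedule eta_t = eta0 / (1 + decay * t).
# Early steps use a large rate (big parameter changes); later steps use a
# small rate (fine-tuning), as described in the text.
def sgd_with_schedule(xs, ys, eta0=0.5, decay=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    w = 0.0
    t = 0  # global iteration counter used by the schedule
    for _ in range(epochs):
        order = list(range(len(xs)))
        rng.shuffle(order)  # visit samples in random order (the "stochastic" part)
        for i in order:
            eta_t = eta0 / (1.0 + decay * t)      # learning rate schedule
            grad = (w * xs[i] - ys[i]) * xs[i]    # gradient of (w*x - y)^2 / 2
            w -= eta_t * grad
            t += 1
    return w

# Noiseless data generated from y = 2x, so w should approach 2.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [2 * x for x in xs]
w = sgd_with_schedule(xs, ys)
```

With a fixed large rate the iterates can oscillate; the decaying schedule damps the oscillation so the final iterates settle near the minimizer.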
 
===Implicit updates (ISGD)===
As mentioned earlier, classical stochastic gradient descent is generally sensitive to the learning rate {{mvar|η}}. Fast convergence requires large learning rates, but this may induce numerical instability. The problem can be largely solved by considering ''implicit updates'' whereby the stochastic gradient is evaluated at the next iterate rather than the current one:
:<math>w^{new} := w^{old} - \eta \nabla Q_i(w^{new}).</math>
and thus the search bounds for <math>\xi</math> are
<math>[\min(0, b_i), \max(0, b_i)]</math>, where <math>b_i = \eta q(x_i'w^{old})</math>.
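When no closed form is available, the root can be found by bisection over these bounds. The sketch below is our own reconstruction under the assumption (consistent with the bounds above) that the gradient factors as ∇{{mvar|Q<sub>i</sub>}}({{mvar|w}}) = {{mvar|q}}({{mvar|x<sub>i</sub>′w}}){{mvar|x<sub>i</sub>}} with {{mvar|q}} nondecreasing, so that the implicit update {{mvar|w<sup>new</sup>}} = {{mvar|w<sup>old</sup>}} − {{mvar|ξx<sub>i</sub>}} reduces to the scalar equation {{mvar|ξ}} = {{mvar|η q}}({{mvar|x<sub>i</sub>′w<sup>old</sup>}} − {{mvar|ξ}}‖{{mvar|x<sub>i</sub>}}‖²):

```python
import math

# Sketch (our own reconstruction) of one ISGD step solved by bisection.
# Assumption: grad Q_i(w) = q(x_i . w) * x_i with q nondecreasing
# (e.g. logistic loss with label y_i: q(s) = sigmoid(s) - y_i).
# Then the implicit update w_new = w_old - xi * x_i reduces to finding the
# root of  f(xi) = xi - eta * q(x_i . w_old - xi * ||x_i||^2),
# which lies in [min(0, b_i), max(0, b_i)] with b_i = eta * q(x_i . w_old).
def isgd_step_bisect(w, x, q, eta, tol=1e-12, iters=100):
    s = sum(wj * xj for wj, xj in zip(w, x))  # x_i . w_old
    norm2 = sum(xj * xj for xj in x)          # ||x_i||^2
    b = eta * q(s)
    lo, hi = min(0.0, b), max(0.0, b)         # search bounds from the text
    f = lambda xi: xi - eta * q(s - xi * norm2)  # increasing since q is nondecreasing
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    xi = 0.5 * (lo + hi)
    return [wj - xi * xj for wj, xj in zip(w, x)]

# Logistic-loss example with label y_i = 1 at w_old = 0:
q = lambda s: 1.0 / (1.0 + math.exp(-s)) - 1.0
w = isgd_step_bisect([0.0, 0.0], [1.0, 1.0], q, eta=1.0)
```

Because {{mvar|f}} is increasing and changes sign on the bracket, the bisection converges unconditionally; no learning-rate tuning is needed for stability of the step itself.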
 
 
 
===Momentum===