{{Machine learning}}
[[File:Gradient Descent in 2D.webm|thumb|right|Gradient Descent in 2D]]
Gradient descent is a method for unconstrained [[mathematical optimization]]. It is a first-order [[Iterative method|iterative]] algorithm for minimizing a differentiable [[multivariate function]]. It is particularly useful in machine learning for minimizing the cost or loss function.<ref name="auto">{{cite web | url=https://keepnotes.com/document/stanford-university/319-gradient-descent-intuition | title=Gradient descent intuition | Stanford University - KeepNotes }}</ref> Gradient descent should not be confused with [[Local search (optimization)|local search]] algorithms, although both are [[Iterative method|iterative methods]] for [[Global optimization|optimization]].
The idea is to take repeated steps in the opposite direction of the [[gradient]] (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a trajectory that maximizes that function; the procedure is then known as ''gradient ascent''.
Gradient descent is generally attributed to [[Augustin-Louis Cauchy]], who first suggested it in 1847.<ref>{{cite journal |first=C. |last=Lemaréchal |author-link=Claude Lemaréchal |title=Cauchy and the Gradient Method |journal=Doc Math Extra |pages=251–254 |year=2012 |url=https://www.math.uni-bielefeld.de/documenta/vol-ismp/40_lemarechal-claude.pdf }}</ref> [[Jacques Hadamard]] independently proposed a similar method in 1907.<ref>{{Cite journal|last=Hadamard|first=Jacques|date=1908|title=Mémoire sur le problème d'analyse relatif à l'équilibre des plaques élastiques encastrées|journal=Mémoires présentés par divers savants éstrangers à l'Académie des Sciences de l'Institut de France|volume=33}}</ref><ref>{{cite journal |last1=Courant |first1=R. |title=Variational methods for the solution of problems of equilibrium and vibrations |journal=Bulletin of the American Mathematical Society |date=1943 |volume=49 |issue=1 |pages=1–23 |doi=10.1090/S0002-9904-1943-07818-4 |doi-access=free }}</ref> Its convergence properties for non-linear optimization problems were first studied by [[Haskell Curry]] in 1944,<ref>{{cite journal |first=Haskell B. |last=Curry |title=The Method of Steepest Descent for Non-linear Minimization Problems |journal=Quart. Appl. Math. |volume=2 |year=1944 |issue=3 |pages=258–261 |doi=10.1090/qam/10667 |doi-access=free }}</ref> with the method becoming increasingly well-studied and used in the following decades.<ref name="BP" /><ref name="AK82" />
A simple extension of gradient descent, [[stochastic gradient descent]], serves as the most basic algorithm used for training most [[deep neural network|deep network]]s today.
==Description==
[[File:Gradient descent.svg|thumb|350px|Illustration of gradient descent on a series of [[level set]]s]]
Gradient descent is based on the observation that if the [[multi-variable function]] <math>f(\mathbf{x})</math> is [[Differentiable function|defined and differentiable]] in a neighborhood of a point <math>\mathbf{a}</math>, then <math>f(\mathbf{x})</math> decreases ''fastest'' if one goes from <math>\mathbf{a}</math> in the direction of the negative gradient of <math>f</math> at <math>\mathbf{a}</math>, that is, <math>-\nabla f(\mathbf{a})</math>. It follows that, if
:<math> \mathbf{a}_{n+1} = \mathbf{a}_n-\eta\,\nabla f(\mathbf{a}_n)</math>
for a small enough step size or [[learning rate]] <math>\eta \in \R_{+}</math>, then <math>f(\mathbf{a}_n)\geq f(\mathbf{a}_{n+1})</math>. In other words, the term <math>\eta\,\nabla f(\mathbf{a})</math> is subtracted from <math>\mathbf{a}</math> because we want to move against the gradient, toward the local minimum. With this observation in mind, one starts with a guess <math>\mathbf{x}_0</math> for a local minimum of <math>f</math>, and considers the sequence <math>\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots</math> such that
:<math>\mathbf{x}_{n+1}=\mathbf{x}_n-\eta_n \nabla f(\mathbf{x}_n),\ n \ge 0.</math>
We have a [[Monotonic function|monotonic]] sequence
:<math>f(\mathbf{x}_0)\ge f(\mathbf{x}_1)\ge f(\mathbf{x}_2)\ge \cdots,</math>
so the sequence <math>(\mathbf{x}_n)</math> converges to the desired local minimum. Note that the value of the ''step size'' <math>\eta</math> is allowed to change at every iteration.
With certain assumptions on the function <math>f</math> (for example, <math>f</math> [[Convex function|convex]] and <math>\nabla f</math> [[Lipschitz continuity|Lipschitz]]) and particular choices of <math>\eta</math>, convergence to a local minimum can be guaranteed. Possible choices include the step size
:<math>\eta_{n} = \frac{\left | \left (\mathbf{x}_{n} - \mathbf{x}_{n-1}\right )^\mathsf{T} \left [\nabla f (\mathbf{x}_{n}) - \nabla f (\mathbf{x}_{n-1}) \right ] \right |}{\left \|\nabla f(\mathbf{x}_{n}) - \nabla f(\mathbf{x}_{n-1}) \right \|^2}</math>
as in the [[Barzilai-Borwein method]],<ref>{{cite journal |last1=Barzilai |first1=Jonathan |last2=Borwein |first2=Jonathan M. |year=1988 |title=Two-Point Step Size Gradient Methods |journal=IMA Journal of Numerical Analysis |volume=8 |issue=1 |pages=141–148 |doi=10.1093/imanum/8.1.141}}</ref><ref>{{cite book |last=Fletcher |first=R. |title=Optimization and Control with Applications |publisher=Springer |year=2005 |isbn=0-387-24254-6 |editor-last=Qi |editor-first=L. |series=Applied Optimization |volume=96 |___location=Boston |pages=235–256 |chapter=On the Barzilai–Borwein Method |editor2-last=Teo |editor2-first=K. |editor3-last=Yang |editor3-first=X.}}</ref> or a sequence <math> \eta_n</math> satisfying the [[Wolfe conditions]] (which can be found by using [[line search]]). When the function <math>f</math> is [[Convex function|convex]], all local minima are also global minima, so in this case gradient descent can converge to the global solution.
This process is illustrated in the adjacent picture. Here, <math>f</math> is assumed to be defined on the plane, and that its graph has a [[bowl (vessel)|bowl]] shape. The blue curves are the [[contour line]]s, that is, the regions on which the value of <math>f</math> is constant. A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the (negative) gradient at a point is [[Orthogonality|orthogonal]] to the contour line going through that point. We see that gradient ''descent'' leads us to the bottom of the bowl, that is, to the point where the value of the function <math>f</math> is minimal.
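The iteration above can be sketched in a few lines of Python. The stopping rule on the gradient norm, the fixed step size, and the quadratic example below are illustrative choices, not part of the method itself:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x_{n+1} = x_n - eta * grad_f(x_n) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # near-zero gradient: stationary point
            break
        x = x - eta * g
    return x

# Illustrative example: f(x, y) = (x - 3)^2 + 2*(y + 1)^2, minimized at (3, -1).
grad = lambda v: np.array([2 * (v[0] - 3), 4 * (v[1] + 1)])
print(gradient_descent(grad, [0.0, 0.0]))  # close to [3, -1]
```

Because the example objective is convex, the iterates converge to its unique global minimum for this step size.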
===An analogy for understanding gradient descent===
[[File:Okanogan-Wenatchee National Forest, morning fog shrouds trees (37171636495).jpg|thumb|Fog in the mountains]]
The basic intuition behind gradient descent can be illustrated by a hypothetical scenario. People are stuck in the mountains and are trying to get down (i.e., trying to find the global minimum). There is heavy fog such that visibility is extremely low. Therefore, the path down the mountain is not visible, so they must use local information to find the minimum. They can use the method of gradient descent, which involves looking at the steepness of the hill at their current position, then proceeding in the direction with the steepest descent (i.e., downhill). Using this method, they would eventually find their way down the mountain or possibly get stuck in some hole (i.e., local minimum or [[saddle point]]), like a mountain lake. However, assume also that the steepness of the hill is not immediately obvious with simple observation, but rather it requires a sophisticated instrument to measure, which they must stop to use. Since measurements take time, they should minimize their use of the instrument if they want to get down before sunset. The difficulty then is choosing the frequency at which they should measure the steepness of the hill so as not to go off track.
In this analogy, the persons represent the algorithm, and the path taken down the mountain represents the sequence of parameter settings that the algorithm will explore. The steepness of the hill represents the [[slope]] of the function at that point. The instrument used to measure steepness is [[Derivative|differentiation]]. The direction they choose to travel in aligns with the [[gradient]] of the function at that point. The amount of time they travel before taking another measurement is the step size.
===Choosing the step size and descent direction===
Since using a step size <math>\eta</math> that is too small would slow convergence, and a <math>\eta</math> too large would lead to overshoot and divergence, finding a good setting of <math>\eta</math> is an important practical problem. Using a direction that deviates from the steepest descent direction may seem counter-intuitive, but the idea is that the smaller slope may be compensated for by being sustained over a much longer distance.
To reason about this mathematically, consider a direction <math> \mathbf{p}_n</math> and step size <math> \eta_n</math> and consider the more general update:
:<math> \mathbf{a}_{n+1} = \mathbf{a}_n-\eta_n\,\mathbf{p}_n</math>.
Finding good settings of <math> \mathbf{p}_n</math> and <math> \eta_n</math> requires some thought. First of all, we would like the update direction to point downhill. Mathematically, letting <math>\theta_n</math> denote the angle between <math>-\nabla f(\mathbf{a}_n)</math> and <math>\mathbf{p}_n</math>, this requires that <math>\cos \theta_n > 0.</math> To say more, we need more information about the objective function that we are optimising. Under the fairly weak assumption that <math>f</math> is continuously differentiable, we may prove that:
{{NumBlk|:|<math>f(\mathbf{a}_{n+1}) \leq f(\mathbf{a}_n) - \eta_n \|\nabla f(\mathbf{a}_n)\|_2 \|\mathbf{p}_n\|_2 \left[\cos \theta_n - \max_{t\in[0,1]} \frac{\|\nabla f(\mathbf{a}_n - t \eta_n \mathbf{p}_n) - \nabla f(\mathbf{a}_n)\|_2}{\|\nabla f(\mathbf{a}_n)\|_2} \right]</math>|{{EquationRef|1}}}}
This inequality implies that the amount by which we can be sure the function <math>f</math> is decreased depends on a trade off between the two terms in square brackets. The first term in square brackets measures the angle between the descent direction and the negative gradient. The second term measures how quickly the gradient changes along the descent direction.
In principle inequality ({{EquationNote|1}}) could be optimized over <math> \mathbf{p}_n</math> and <math> \eta_n</math> to choose an optimal step size and direction. The problem is that evaluating the second term in square brackets requires evaluating <math>\nabla f(\mathbf{a}_n - t \eta_n \mathbf{p}_n)</math>, and extra gradient evaluations are generally expensive and undesirable. Some ways around this problem are:
* Forgo the benefits of a clever descent direction by setting <math>\mathbf{p}_n = \nabla f(\mathbf{a}_n)</math>, and use [[line search]] to find a suitable step-size <math>\eta_n</math>, such as one that satisfies the [[Wolfe conditions]]. A more economic way of choosing learning rates is [[backtracking line search]] with the Armijo condition.
* Assuming that <math>f</math> is twice-differentiable, use its Hessian <math>\nabla^2 f</math> to estimate <math>\|\nabla f(\mathbf{a}_n - t \eta_n \mathbf{p}_n) - \nabla f(\mathbf{a}_n)\|_2 \leq t \eta_n \left\|\nabla^2 f(\mathbf{a}_n)\right\| \left\|\mathbf{p}_n\right\|_2.</math> Then choose <math>\eta_n</math> and <math>\mathbf{p}_n</math> by optimising inequality ({{EquationNote|1}}).
* Assuming that <math>\nabla f</math> is [[Lipschitz continuity|Lipschitz]], use its Lipschitz constant <math>L</math> to bound <math>\|\nabla f(\mathbf{a}_n - t \eta_n \mathbf{p}_n) - \nabla f(\mathbf{a}_n)\|_2 \leq L t \eta_n \|\mathbf{p}_n\|_2.</math> Then choose <math>\eta_n</math> and <math>\mathbf{p}_n</math> by optimising inequality ({{EquationNote|1}}).
* Build a custom model of <math> \max_{t\in[0,1]} \frac{\|\nabla f(\mathbf{a}_n - t \eta_n \mathbf{p}_n) - \nabla f(\mathbf{a}_n)\|_2}{\|\nabla f(\mathbf{a}_n)\|_2}</math> for <math>f</math>. Then choose <math>\eta_n</math> and <math>\mathbf{p}_n</math> by optimising inequality ({{EquationNote|1}}).
* Under stronger assumptions on the function <math>f</math> such as [[Convex function|convexity]], more [[#Fast gradient methods|advanced techniques]] may be possible.
Usually by following one of the recipes above, [[convergent series|convergence]] to a local minimum can be guaranteed. When the function <math>f</math> is [[Convex function|convex]], all local minima are also global minima, so in this case gradient descent can converge to the global solution.
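As a concrete instance of the first recipe, the following sketch implements gradient descent with backtracking line search under the Armijo sufficient-decrease condition. The acceptance constant, shrink factor, and the test function are conventional illustrative choices, not prescribed by the method:

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, eta0=1.0, shrink=0.5, c=1e-4,
                    tol=1e-8, max_iter=1000):
    """Gradient descent with p_n = grad f(a_n); each step size is found by
    backtracking from eta0 until the Armijo sufficient-decrease condition
        f(x - eta*g) <= f(x) - c * eta * ||g||^2
    is satisfied."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        eta = eta0
        while f(x - eta * g) > f(x) - c * eta * (g @ g):
            eta *= shrink  # step too ambitious: shrink and retry
        x = x - eta * g
    return x

# Illustrative test function: f(x, y) = x^2 + 4 y^2.
f = lambda v: v[0]**2 + 4 * v[1]**2
grad = lambda v: np.array([2 * v[0], 8 * v[1]])
print(backtracking_gd(f, grad, [2.0, 1.0]))  # close to [0, 0]
```

Each accepted step decreases the objective by at least the Armijo amount, which is what makes the convergence guarantees of the recipes above go through.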
==Solution of a linear system==
[[File:Steepest descent.png|thumb|380px|The steepest descent algorithm applied to the [[Wiener filter]]<ref>Haykin, Simon S. Adaptive filter theory. Pearson Education India, 2008. - p. 108-142, 217-242</ref>]]
Gradient descent can be used to solve a [[system of linear equations]]
:<math>\mathbf{A}\mathbf{x}-\mathbf{b}=0</math>
reformulated as a quadratic minimization problem.
If the system matrix <math>\mathbf{A}</math> is real [[Symmetric matrix|symmetric]] and [[positive-definite matrix|positive-definite]], an objective function is defined as the quadratic function, with minimization of
:<math>F(\mathbf{x})=\mathbf{x}^\mathsf{T} \mathbf{A}\mathbf{x}-2\mathbf{x}^\mathsf{T}\mathbf{b},</math>
so that
:<math>\nabla F(\mathbf{x})=2(\mathbf{A}\mathbf{x}-\mathbf{b}).</math>
For a general real matrix <math>\mathbf{A}</math>, [[linear least squares]] define
:<math>F(\mathbf{x})=\left\|\mathbf{A}\mathbf{x}-\mathbf{b}\right\|^2.</math>
In traditional linear least squares for real <math>\mathbf{A}</math> and <math>\mathbf{b}</math> the [[Euclidean norm]] is used, in which case
:<math>\nabla F(\mathbf{x})=2\mathbf{A}^\mathsf{T}(\mathbf{A}\mathbf{x}-\mathbf{b}).</math>
The [[line search]] minimization, finding the locally optimal step size <math>\eta</math> on every iteration, can be performed analytically for quadratic functions, and explicit formulas for the locally optimal <math>\eta</math> are known.<ref name="BP">{{cite book |first=Boris T. |last=Polyak |author-link=Boris Polyak |title=Introduction to Optimization |___location=New York |publisher=Optimization Software |year=1987 }}</ref>
For example, for real [[Symmetric matrix|symmetric]] and [[positive-definite matrix|positive-definite]] matrix <math>\mathbf{A}</math>, a simple algorithm can be as follows,<ref name="BP" />
:<math>\begin{align}
& \text{repeat in the loop:} \\
& \qquad \mathbf{r} := \mathbf{b} - \mathbf{A x} \\
& \qquad \eta := {\mathbf{r}^\mathsf{T} \mathbf{r}}/{\mathbf{r}^\mathsf{T} \mathbf{A r}} \\
& \qquad \mathbf{x} := \mathbf{x} + \eta \mathbf{r} \\
& \qquad \hbox{if } \mathbf{r}^\mathsf{T} \mathbf{r} \text{ is sufficiently small, then exit loop} \\
& \text{end repeat loop} \\
& \text{return } \mathbf{x} \text{ as the result}
\end{align}</math>
To avoid multiplying by <math>\mathbf{A}</math> twice per iteration,
we note that <math>\mathbf{x} := \mathbf{x} + \eta \mathbf{r}</math> implies <math>\mathbf{r} := \mathbf{r} - \eta \mathbf{A r}</math>, which gives the traditional algorithm,
:<math>\begin{align}
& \mathbf{r} := \mathbf{b} - \mathbf{A x} \\
& \text{repeat in the loop:} \\
& \qquad \eta := {\mathbf{r}^\mathsf{T} \mathbf{r}}/{\mathbf{r}^\mathsf{T} \mathbf{A r}} \\
& \qquad \mathbf{x} := \mathbf{x} + \eta \mathbf{r} \\
& \qquad \hbox{if } \mathbf{r}^\mathsf{T} \mathbf{r} \text{ is sufficiently small, then exit loop} \\
& \qquad \mathbf{r} := \mathbf{r} - \eta \mathbf{A r} \\
& \text{end repeat loop} \\
& \text{return } \mathbf{x} \text{ as the result}
\end{align}</math>
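The pseudocode above translates directly into a short Python sketch; the function name, tolerances, and the example system (the matrix from the adjacent figure) are illustrative:

```python
import numpy as np

def steepest_descent_spd(A, b, x0=None, tol=1e-10, max_iter=10_000):
    """Solve A x = b for symmetric positive-definite A, mirroring the
    pseudocode above: one matrix-vector product per iteration."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                      # initial residual
    for _ in range(max_iter):
        if r @ r < tol:                # residual sufficiently small
            break
        Ar = A @ r                     # the single matrix-vector product
        eta = (r @ r) / (r @ Ar)       # exact line search for the quadratic
        x = x + eta * r
        r = r - eta * Ar               # update residual without recomputing b - A x
    return x

A = np.array([[2.0, 2.0], [2.0, 3.0]])  # matrix from the figure below
b = np.array([1.0, 1.0])
print(steepest_descent_spd(A, b))       # approaches A^{-1} b = [0.5, 0.0]
```

Because the step size comes from an exact line search, consecutive residuals are orthogonal, producing the zig-zag path discussed below.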
The method is rarely used for solving linear equations, with the [[conjugate gradient method]] being one of the most popular alternatives. The number of gradient descent iterations is commonly proportional to the spectral [[condition number]] <math>\kappa(\mathbf{A})</math> of the system matrix <math>\mathbf{A}</math> (the ratio of the maximum to minimum [[eigenvalues]] of {{nowrap|<math>\mathbf{A}^\mathsf{T}\mathbf{A}</math>}}), while the convergence of the conjugate gradient method is typically determined by a square root of the condition number, i.e., is much faster. Both methods can benefit from [[preconditioning]], where gradient descent may require fewer assumptions on the preconditioner.
===Geometric behavior and residual orthogonality===
In steepest descent applied to solving <math> \mathbf{A x} = \mathbf{b} </math>, where <math> \mathbf{A} </math> is symmetric positive-definite, the residual vectors <math> \mathbf{r}_k = \mathbf{b} - \mathbf{A}\mathbf{x}_k </math> are orthogonal across iterations:
:<math>
\langle \mathbf{r}_{k+1}, \mathbf{r}_k\rangle = 0.
</math>
Because each step is taken in the steepest direction, steepest-descent steps alternate between directions aligned with the extreme axes of the elongated level sets. When <math>\kappa(\mathbf{A})</math> is large, this produces a characteristic zig–zag path. The poor conditioning of <math> \mathbf{A} </math> is the primary cause of the slow convergence, and orthogonality of successive residuals reinforces this alternation.
[[File:Steepest descent convergence path for A = 2 2, 2 3.png|thumb|Convergence path of the steepest descent method for <math>\mathbf{A} = \begin{bmatrix}2 & 2\\ 2 & 3\end{bmatrix}</math>]]
As shown in the image on the right, steepest descent converges slowly due to the high condition number of <math> \mathbf{A} </math>, and the orthogonality of residuals forces each new direction to undo the overshoot from the previous step. The result is a path that zigzags toward the solution. This inefficiency is one reason conjugate gradient or preconditioning methods are preferred.<ref>{{Cite book | author1=Holmes, M. | title=Introduction to Scientific Computing and Data Analysis, 2nd Ed | year=2023 | publisher=Springer | isbn=978-3-031-22429-4 }}</ref>
==Solution of a non-linear system==
Gradient descent can also be used to solve a system of [[nonlinear equation]]s. Below is an example that shows how to use the gradient descent to solve for three unknown variables, ''x''<sub>1</sub>, ''x''<sub>2</sub>, and ''x''<sub>3</sub>. This example shows one iteration of the gradient descent.
One might now define the objective function
:<math>\begin{align}F(\mathbf{x}) &= \frac{1}{2} G^\mathsf{T}(\mathbf{x}) G(\mathbf{x}) = \frac{1}{2} \left[ \left(3x_1-\cos(x_2x_3)-\frac{3}{2}\right)^2 + \left(4x_1^2-625x_2^2+2x_2-1\right)^2 \right. \\
&{}\qquad\left. {}+ \left(\exp(-x_1x_2) + 20x_3 + \frac{10\pi-3}{3} \right)^2 \right],\end{align}</math>
We know that
:<math>\mathbf{x}^{(1)}=\mathbf{0}-\eta_0 \nabla F(\mathbf{0}) = \mathbf{0}-\eta_0 J_G(\mathbf{0})^\mathsf{T} G(\mathbf{0}),</math>
where the [[Jacobian matrix]] <math>J_G</math> is given by
Thus
:<math>\mathbf{x}^{(1)}= \mathbf{0}-\eta_0 \begin{bmatrix}
-7.5\\
-2\\
209.44
\end{bmatrix}</math>
and
:<math>F(\mathbf{0}) = 0.5 \left( (-2.5)^2 + (-1)^2 + (10.472)^2 \right) = 58.456.</math>
[[File:Gradient Descent Example Nonlinear Equations.gif|thumb|right|350px|An animation showing the first 83 iterations of gradient descent applied to this example. Surfaces are [[isosurface]]s of <math>F\left(\mathbf{x}^{(n)}\right)</math> at the current guess <math>\mathbf{x}^{(n)}</math>, and arrows show the direction of descent. Due to a small and constant step size, the convergence is slow.]]
Now, a suitable <math>\eta_0</math> must be found such that
:<math>F\left(\mathbf{x}^{(1)}\right) \le F\left(\mathbf{x}^{(0)}\right) = F(\mathbf{0}).</math>
This can be done with any of a variety of [[line search]] algorithms. One might also simply guess <math>\eta_0=0.001,</math> which gives
:<math> \mathbf{x}^{(1)}=\begin{bmatrix}
0.0075 \\
0.002 \\
-0.20944 \\
\end{bmatrix}</math>
Evaluating the objective function at this value, yields
:<math>F\left(\mathbf{x}^{(1)}\right) = 0.5 \left( (-2.48)^2 + (-1.00)^2 + (6.28)^2 \right) = 23.306.</math>
The decrease from <math>F(\mathbf{0})=58.456</math> to the next step's value of
:<math>F\left(\mathbf{x}^{(1)}\right)=23.306</math>
is a sizable decrease in the objective function. Further steps would reduce its value further until an approximate solution to the system was found.
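The first iteration above can be checked numerically. The sketch below assumes the three-component system <math>G(\mathbf{x})=\mathbf{0}</math> used throughout this example and the guessed step size 0.001:

```python
import numpy as np

def G(x):
    """The example system G(x) = 0 of three nonlinear equations."""
    x1, x2, x3 = x
    return np.array([
        3 * x1 - np.cos(x2 * x3) - 1.5,
        4 * x1**2 - 625 * x2**2 + 2 * x2 - 1,
        np.exp(-x1 * x2) + 20 * x3 + (10 * np.pi - 3) / 3,
    ])

def jacobian(x):
    """Jacobian J_G of the system, differentiated by hand."""
    x1, x2, x3 = x
    return np.array([
        [3.0, x3 * np.sin(x2 * x3), x2 * np.sin(x2 * x3)],
        [8 * x1, -1250 * x2 + 2, 0.0],
        [-x2 * np.exp(-x1 * x2), -x1 * np.exp(-x1 * x2), 20.0],
    ])

def F(x):
    """Objective F(x) = (1/2) G(x)^T G(x)."""
    g = G(x)
    return 0.5 * (g @ g)

x0 = np.zeros(3)
grad = jacobian(x0).T @ G(x0)    # nabla F(0) = J_G(0)^T G(0)
x1 = x0 - 0.001 * grad           # the guessed step size from the text
print(grad)                      # approximately [-7.5, -2, 209.44]
print(F(x0), F(x1))              # approximately 58.456 and 23.306
```

The printed values reproduce the gradient and objective values computed in the text.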
==Comments==
Gradient descent works in spaces of any number of dimensions, even in infinite-dimensional ones. In the latter case, the search space is typically a [[function space]], and one calculates the [[Fréchet derivative]] of the functional to be minimized to determine the descent direction.<ref name="AK82">{{cite book |first1=G. P. |last1=Akilov |first2=L. V. |last2=Kantorovich |author-link2=Leonid Kantorovich |title=Functional Analysis |publisher=Pergamon Press |edition=2nd |isbn=0-08-023036-9 |year=1982 }}</ref>
That gradient descent works in any number of dimensions (finite number at least) can be seen as a consequence of the [[Cauchy–Schwarz inequality]], i.e. the magnitude of the inner (dot) product of two vectors of any dimension is maximized when they are [[Collinearity|collinear]]. In the case of gradient descent, that would be when the vector of independent variable adjustments is proportional to the gradient vector of partial derivatives.
The gradient descent can take many iterations to compute a local minimum with a required [[accuracy]], if the [[curvature]] in different directions is very different for the given function. For such functions, [[preconditioning]], which changes the geometry of the space to shape the function level sets like [[concentric circles]], cures the slow convergence. Constructing and applying preconditioning can be computationally expensive, however.
The gradient descent can be modified via momentums<ref>{{Cite journal |last1=Abdulkadirov |first1=Ruslan |last2=Lyakhov |first2=Pavel |last3=Nagornov |first3=Nikolay |date=January 2023 |title=Survey of Optimization Algorithms in Modern Neural Networks |journal=Mathematics |language=en |volume=11 |issue=11 |pages=2466 |doi=10.3390/math11112466 |doi-access=free |issn=2227-7390}}</ref> ([[Nesterov]], Polyak,<ref>{{Cite journal |last1=Diakonikolas |first1=Jelena |last2=Jordan |first2=Michael I. |date=January 2021 |title=Generalized Momentum-Based Methods: A Hamiltonian Perspective |url=https://epubs.siam.org/doi/10.1137/20M1322716 |journal=SIAM Journal on Optimization |language=en |volume=31 |issue=1 |pages=915–944 |doi=10.1137/20M1322716 |arxiv=1906.00436 |issn=1052-6234}}</ref> and Frank–Wolfe<ref>{{Cite journal |last=Meyer |first=Gerard G. L. |date=November 1974 |title=Accelerated Frank–Wolfe Algorithms |url=http://epubs.siam.org/doi/10.1137/0312050 |journal=SIAM Journal on Control |language=en |volume=12 |issue=4 |pages=655–663 |doi=10.1137/0312050 |issn=0036-1402|url-access=subscription }}</ref>) and heavy-ball parameters (exponential moving averages<ref>{{Citation |last1=Kingma |first1=Diederik P. |title=Adam: A Method for Stochastic Optimization |date=2017-01-29 |last2=Ba |first2=Jimmy|arxiv=1412.6980 }}</ref> and positive-negative momentum<ref>{{Cite journal |last1=Xie |first1=Zeke |last2=Yuan |first2=Li |last3=Zhu |first3=Zhanxing |last4=Sugiyama |first4=Masashi |date=2021-07-01 |title=Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization |url=https://proceedings.mlr.press/v139/xie21h.html |journal=Proceedings of the 38th International Conference on Machine Learning |language=en |publisher=PMLR |pages=11448–11458|arxiv=2103.17182 }}</ref>). The main examples of such optimizers are Adam, DiffGrad, Yogi, AdaBelief, etc.
Methods based on [[Newton's method in optimization|Newton's method]] and inversion of the [[Hessian matrix|Hessian]] using [[conjugate gradient]] techniques can be better alternatives.<ref>{{cite book |first1=W. H. |last1=Press |author-link1 = William H. Press |first2=S. A. |last2=Teukolsky |author-link2 = Saul Teukolsky |first3=W. T. |last3=Vetterling |first4=B. P. |last4=Flannery |author-link4 = Brian P. Flannery |title=Numerical Recipes in C: The Art of Scientific Computing |url=https://archive.org/details/numericalrecipes00pres_0 |url-access=registration |edition=2nd |publisher=[[Cambridge University Press]] |___location=New York |year=1992 |isbn=0-521-43108-5 }}</ref><ref>{{cite book |first=T. |last=Strutz |title=Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond |edition=2nd |publisher=Springer Vieweg |year=2016 |isbn=978-3-658-11455-8 }}</ref> Generally, such methods converge in fewer iterations, but the cost of each iteration is higher. An example is the [[Broyden–Fletcher–Goldfarb–Shanno algorithm|BFGS method]] which consists in calculating on every step a matrix by which the gradient vector is multiplied to go into a "better" direction, combined with a more sophisticated [[line search]] algorithm, to find the "best" value of <math>\eta.</math> For extremely large problems, where the computer-memory issues dominate, a limited-memory method such as [[Limited-memory BFGS|L-BFGS]] should be used instead of BFGS or the steepest descent.
While it is sometimes possible to substitute gradient descent for a [[Local search (optimization)|local search]] algorithm, gradient descent is not in the same family: although it is an [[iterative method]] for [[Global optimization|local optimization]], it relies on an [[loss function|objective function’s gradient]] rather than an explicit exploration of a [[Feasible region|solution space]].
Gradient descent can be viewed as applying [[Euler's method]] for solving [[ordinary differential equations]] <math>x'(t)=-\nabla f(x(t))</math> to a [[gradient flow]]. In turn, this equation may be derived as an optimal controller<ref>{{cite journal |last1=Ross |first1=I.M. |title=An optimal control theory for nonlinear optimization |journal=Journal of Computational and Applied Mathematics |date=July 2019 |volume=354 |pages=39–51 |doi=10.1016/j.cam.2018.12.044 |s2cid=127649426 |doi-access=free }}</ref> for the control system <math>x'(t) = u(t)</math> with <math>u(t)</math> given in feedback form <math>u(t) = -\nabla f(x(t))</math>.
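Under this view, one explicit Euler step on the gradient flow with step size <math>h</math> coincides exactly with one gradient descent update with learning rate <math>h</math>. A minimal sketch (the test function here is an illustrative choice):

```python
import numpy as np

def euler_gradient_flow(grad_f, x0, h=0.1, n_steps=100):
    """Explicit Euler on x'(t) = -grad f(x(t)); each Euler step with step
    size h coincides with one gradient descent update with learning rate h."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x + h * (-grad_f(x))  # Euler step == gradient descent step
    return x

grad = lambda v: 2 * v  # gradient of f(x) = ||x||^2
print(euler_gradient_flow(grad, np.array([1.0, -2.0])))  # close to [0, 0]
```

Shrinking <math>h</math> makes the iterates track the continuous gradient flow more closely, at the cost of more steps.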
==Modifications==
Gradient descent can converge to a local minimum and slow down in a neighborhood of a [[saddle point]]. Even for unconstrained quadratic minimization, gradient descent develops a zig-zag pattern of subsequent iterates as iterations progress, resulting in slow convergence. Multiple modifications of gradient descent have been proposed to address these deficiencies.
===Fast gradient methods===
[[Yurii Nesterov]] has proposed<ref>{{cite book |first=Yurii |last=Nesterov |author-link=Yurii Nesterov |title=Introductory Lectures on Convex Optimization : A Basic Course |publisher=Springer |year=2004 |isbn=1-4020-7553-7 }}</ref> a simple modification that enables faster convergence for convex problems and has been since further generalized. For unconstrained smooth problems, the method is called the [[fast gradient method]] (FGM) or the [[accelerated gradient method]] (AGM). Specifically, if the differentiable function <math>f</math> is convex and <math>\nabla f</math> is [[Lipschitz continuity|Lipschitz]], and it is not assumed that <math>f</math> is [[Strongly convex function|strongly convex]], then the error in the objective value generated at each step <math>k</math> by the gradient descent method will be bounded by <math display="inline">\mathcal{O}\left(k^{-1}\right)</math>. Using the Nesterov acceleration technique, the error decreases at <math display="inline">\mathcal{O}\left(k^{-2}\right)</math>, which is optimal for first-order methods on this problem class.
For constrained or non-smooth problems, Nesterov's FGM is called the [[fast proximal gradient method]] (FPGM), an acceleration of the [[proximal gradient method]].
===Momentum or ''heavy ball'' method===
Trying to break the zig-zag pattern of gradient descent, the ''momentum or heavy ball method'' uses a momentum term in analogy to a heavy ball sliding on the surface of values of the function being minimized,<ref name="BP" /> or to mass movement in [[Newtonian dynamics]] through a [[viscous]] medium in a [[conservative force]] field.<ref>{{cite journal|last1=Qian |first1=Ning |title=On the momentum term in gradient descent learning algorithms |journal=[[Neural Networks (journal)|Neural Networks]] |date=January 1999 |volume=12 |issue=1 |pages=145–151 |doi=10.1016/S0893-6080(98)00116-6 |pmid=12662723 |citeseerx=10.1.1.57.5612 |s2cid=2783597 }}</ref> Gradient descent with momentum remembers the solution update at each iteration, and determines the next update as a [[linear combination]] of the gradient and the previous update. For unconstrained quadratic minimization, a theoretical convergence rate bound of the heavy ball method is asymptotically the same as that for the optimal [[conjugate gradient method]].<ref name="BP" />
This technique is used in [[Stochastic gradient descent#Momentum|stochastic gradient descent]] and as an extension to the [[backpropagation]] algorithms used to train [[artificial neural network]]s.<ref>{{cite web|title=Momentum and Learning Rate Adaptation|url=http://www.willamette.edu/~gorr/classes/cs449/momrate.html|publisher=[[Willamette University]]|access-date=17 October 2014}}</ref><ref>{{cite web|author1=Geoffrey Hinton|author-link=Geoffrey Hinton|author2=Nitish Srivastava|author3=Kevin Swersky|title=The momentum method|url=https://www.coursera.org/lecture/neural-networks/the-momentum-method-Oya9a|website=[[Coursera]]|access-date=2 October 2018}} Part of a lecture series for the [[Coursera]] online course [https://www.coursera.org/learn/neural-networks Neural Networks for Machine Learning] {{Webarchive|url=https://web.archive.org/web/20161231174321/https://www.coursera.org/learn/neural-networks |date=2016-12-31 }}.</ref> Stochastic gradient descent adds a stochastic property to the update direction by estimating the gradient from a randomly selected subset of the training data rather than the full dataset; the momentum term then smooths these noisy updates across iterations.
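A minimal sketch of the heavy-ball update; the momentum coefficient 0.9, the step size, and the ill-conditioned quadratic below are illustrative choices:

```python
import numpy as np

def momentum_gd(grad_f, x0, eta=0.01, beta=0.9, n_iter=500):
    """Heavy-ball method: the next update is a linear combination of the
    current gradient and the previous update (the 'velocity' v)."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iter):
        v = beta * v - eta * grad_f(x)  # accumulate momentum
        x = x + v
    return x

# Ill-conditioned quadratic f(x, y) = x^2 + 25 y^2, where plain
# gradient descent with the same step size zig-zags.
grad = lambda z: np.array([2 * z[0], 50 * z[1]])
print(momentum_gd(grad, [1.0, 1.0]))  # close to [0, 0]
```

The accumulated velocity damps oscillation across the narrow axis of the level sets while building speed along the shallow one, which is the behavior the heavy-ball analogy describes.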
==Extensions==
Gradient descent can be extended to handle [[Constraint (mathematics)|constraints]] by including a [[Projection (linear algebra)|projection]] onto the set of constraints. This method is only feasible when the projection is efficiently computable on a computer. Under suitable assumptions, this method converges. This method is a specific case of the [[forward–backward algorithm]] for monotone inclusions (which includes [[convex programming]] and [[variational inequalities]]).
Gradient descent is a special case of [[mirror descent]] using the squared Euclidean distance as the given [[Bregman divergence]].<ref>{{cite web | url=https://tlienart.github.io/posts/2018/10/27-mirror-descent-algorithm/ | title=Mirror descent algorithm }}</ref>
==Theoretical properties==
The properties of gradient descent depend on the properties of the objective function and the variant of gradient descent used (for example, if a [[line search]] step is used). The assumptions made affect the convergence rate, and other properties, that can be proven for gradient descent.<ref name=":1">{{cite arXiv|last=Bubeck |first=Sébastien |title=Convex Optimization: Algorithms and Complexity |date=2015 |class=math.OC |eprint=1405.4980 }}</ref> For example, if the objective is assumed to be [[Strongly convex function|strongly convex]] and [[Lipschitz continuity|Lipschitz]] smooth, then gradient descent converges linearly with a fixed step size.<ref name="auto"/> Looser assumptions lead to either weaker convergence guarantees or require a more sophisticated step size selection.<ref name=":1" />
== Examples ==
* [[Yang–Mills flow]]
* [[Yang–Mills–Higgs flow]]
* [[Seiberg–Witten flow]]
==See also==
* [[Hill climbing]]
* [[Quantum annealing]]
* [[TFNP#CLS|CLS (Continuous Local Search)]]
* [[Neuroevolution]]
{{Div col end}}
{{Reflist|30em}}
==Further reading==
*{{cite book |first1=Stephen |last1=Boyd |author-link=Stephen P. Boyd |first2=Lieven |last2=Vandenberghe |chapter=Unconstrained Minimization |title=Convex Optimization |___location=New York |publisher=Cambridge University Press |year=2004 |isbn=0-521-83378-7 |chapter-url=https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf#page=471 |pages=457–520 }}
*{{cite book |first1=Edwin K. P. |last1=Chong |first2=Stanislaw H. |last2=Żak |chapter=Gradient Methods |title=An Introduction to Optimization |edition=Fourth |___location=Hoboken |publisher=Wiley |year=2013 |isbn=978-1-118-27901-4 |pages=131–160 |chapter-url=https://books.google.com/books?id=iD5s0iKXHP8C&pg=PA131 }}
*{{cite book |first=David M. |last=Himmelblau |title=Applied Nonlinear Programming |___location=New York |publisher=McGraw-Hill |year=1972 |isbn=0-07-028921-2 |chapter=Unconstrained Minimization Procedures Using Derivatives |pages=63–132 }}
==External links==
{{Commons category|Gradient descent}}
* [https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/gradient Series of Khan Academy videos discusses gradient ascent]
* [http://neuralnetworksanddeeplearning.com/chap1.html#learning_with_gradient_descent Online book teaching gradient descent in deep neural network context]
* Archived at [https://ghostarchive.org/varchive/youtube/20211211/IHZwWFHWa-w Ghostarchive]{{cbignore}} and the [https://web.archive.org/web/20171016173155/https://www.youtube.com/watch?v=IHZwWFHWa-w Wayback Machine]{{cbignore}}: {{cite web |work=3Blue1Brown |title=Gradient Descent, How Neural Networks Learn |date=October 16, 2017 |url=https://www.youtube.com/watch?v=IHZwWFHWa-w&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=2 |via=[[YouTube]] }}{{cbignore}}
{{Artificial intelligence navbox}}
{{Optimization algorithms|unconstrained}}