Gradient descent

In this analogy, the persons represent the algorithm, and the path taken down the mountain represents the sequence of parameter settings that the algorithm will explore. The steepness of the hill represents the [[slope]] of the function at that point. The instrument used to measure steepness is [[Differentiation (mathematics)|differentiation]]. The direction they choose to travel in aligns with the [[gradient]] of the function at that point. The amount of time they travel before taking another measurement is the step size.
 
=== Choosing the step size and descent direction ===
Since a step size <math>\eta</math> that is too small would slow convergence, and an <math>\eta</math> that is too large would lead to overshoot and divergence, finding a good setting of <math>\eta</math> is an important practical problem. [[Philip Wolfe (mathematician)|Philip Wolfe]] also advocated using "clever choices of the [descent] direction" in practice.<ref>{{cite journal |last1=Wolfe |first1=Philip |title=Convergence Conditions for Ascent Methods |journal=SIAM Review |date=April 1969 |volume=11 |issue=2 |pages=226–235 |doi=10.1137/1011036 }}</ref> While using a direction that deviates from the steepest descent direction may seem counter-intuitive, the idea is that the smaller slope may be compensated for by being sustained over a much longer distance.
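A common automatic way to choose the step size is a backtracking line search: start from a trial step and shrink it until a sufficient-decrease (Armijo) condition holds. The sketch below is illustrative, not the article's worked method; the objective and the constants <code>eta0</code>, <code>beta</code>, and <code>c</code> are assumptions chosen for the example.

```python
import numpy as np

def gradient_descent_backtracking(f, grad, x0, tol=1e-8, max_iter=1000,
                                  eta0=1.0, beta=0.5, c=1e-4):
    """Gradient descent with a backtracking line search.

    At each iterate, the trial step eta is shrunk until the Armijo
    (sufficient decrease) condition holds:
        f(x - eta*g) <= f(x) - c * eta * ||g||^2
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        eta = eta0
        while f(x - eta * g) > f(x) - c * eta * (g @ g):
            eta *= beta          # shrink the step until sufficient decrease
        x = x - eta * g
    return x

# Example: minimize the ill-scaled quadratic f(x) = x1^2 + 10*x2^2.
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
x_star = gradient_descent_backtracking(f, grad, [3.0, -2.0])
```

Because the step is shrunk only when needed, the method adapts to the local slope without requiring knowledge of the function's smoothness constant in advance.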
 
 
==Solution of a linear system==
 
[[File:Steepest descent.png|thumb|380px|The steepest descent algorithm applied to the [[Wiener filter]]<ref>Haykin, Simon S. Adaptive filter theory. Pearson Education India, 2008. - p. 108-142, 217-242</ref>]]
 
The method is rarely used for solving linear equations, with the [[conjugate gradient method]] being one of the most popular alternatives. The number of gradient descent iterations is commonly proportional to the spectral [[condition number]] <math>\kappa(\mathbf{A})</math> of the system matrix <math>\mathbf{A}</math> (the ratio of the maximum to minimum [[eigenvalues]] of {{nowrap|<math>\mathbf{A}^\top \mathbf{A}</math>)}}, while the convergence of the [[conjugate gradient method]] is typically determined by the square root of the condition number, i.e., it is much faster. Both methods can benefit from [[Preconditioner|preconditioning]], where gradient descent may require fewer assumptions on the preconditioner.<ref name=":0" />
 
=== Geometric behavior and residual orthogonality ===
In steepest descent applied to solving <math> \mathbf{A x} = \mathbf{b} </math>, where <math> \mathbf{A} </math> is symmetric positive-definite, the residual vectors <math> \mathbf{r}_k = \mathbf{b} - \mathbf{A}\mathbf{x}_k </math> are orthogonal across consecutive iterations, i.e. <math> \mathbf{r}_{k+1}^\top \mathbf{r}_k = 0 </math>, which produces the method's characteristic zigzag path.
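This orthogonality can be checked numerically. The sketch below is illustrative (the matrix <code>A</code> and vector <code>b</code> are made-up values, not from the article): with the exact line-search step <math>\alpha_k = \mathbf{r}_k^\top \mathbf{r}_k / (\mathbf{r}_k^\top \mathbf{A} \mathbf{r}_k)</math>, each new residual is orthogonal to the previous one.

```python
import numpy as np

# A small symmetric positive-definite system (illustrative values).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)                      # starting guess
residuals = []
for _ in range(25):
    r = b - A @ x                    # residual = steepest-descent direction
    alpha = (r @ r) / (r @ A @ r)    # exact line search along r
    x = x + alpha * r
    residuals.append(r)

# Consecutive residuals are orthogonal: r_{k+1} . r_k = 0 (up to rounding).
dots = [abs(residuals[k + 1] @ residuals[k]) for k in range(len(residuals) - 1)]
```

The exact line search makes the next residual orthogonal to the current search direction by construction, which is precisely what forces the zigzag trajectory on elongated level sets.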
 
 
==Solution of a non-linear system==
 
Gradient descent can also be used to solve a system of [[nonlinear equation]]s. Below is an example that shows how to use gradient descent to solve for three unknown variables, ''x''<sub>1</sub>, ''x''<sub>2</sub>, and ''x''<sub>3</sub>. The example shows one iteration of gradient descent.
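Schematically, a system <math>F(\mathbf{x}) = \mathbf{0}</math> is solved by minimizing <math>G(\mathbf{x}) = \tfrac{1}{2}\|F(\mathbf{x})\|^2</math>, whose gradient is <math>J_F(\mathbf{x})^\top F(\mathbf{x})</math>. The sketch below uses a hypothetical three-variable system (not the article's worked example) and a hand-picked step size, purely to show the shape of one iteration.

```python
import numpy as np

def F(x):
    """Hypothetical nonlinear system F(x) = 0 (illustrative only)."""
    x1, x2, x3 = x
    return np.array([
        x1 ** 2 + x2 - 1.0,
        x2 ** 2 + x3 - 2.0,
        x1 + x3 ** 2 - 3.0,
    ])

def J(x):
    """Jacobian matrix of F."""
    x1, x2, x3 = x
    return np.array([
        [2 * x1, 1.0, 0.0],
        [0.0, 2 * x2, 1.0],
        [1.0, 0.0, 2 * x3],
    ])

# Minimize G(x) = (1/2) * ||F(x)||^2; its gradient is J(x)^T F(x).
G = lambda x: 0.5 * F(x) @ F(x)

x = np.array([1.0, 1.0, 1.0])      # initial guess
g = J(x).T @ F(x)                  # gradient of G at x
eta = 0.05                         # small fixed step size (chosen by hand)
x_new = x - eta * g                # one gradient descent iteration
```

A single step reduces <math>G</math>; in practice the iteration is repeated, often with a line search on <math>\eta</math>, until <math>\|F(\mathbf{x})\|</math> is acceptably small.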
 
 
==Comments==
 
Gradient descent works in spaces of any number of dimensions, even in infinite-dimensional ones. In the latter case, the search space is typically a [[function space]], and one calculates the [[Fréchet derivative]] of the functional to be minimized to determine the descent direction.<ref name="AK82">{{cite book |first1=G. P. |last1=Akilov |first2=L. V. |last2=Kantorovich |author-link2=Leonid Kantorovich |title=Functional Analysis |publisher=Pergamon Press |edition=2nd |isbn=0-08-023036-9 |year=1982 }}</ref>
 
That gradient descent works in any number of dimensions (a finite number, at least) can be seen as a consequence of the [[Cauchy–Schwarz inequality]], i.e., the magnitude of the inner (dot) product of two vectors of any dimension is maximized when they are [[collinear]]. In the case of gradient descent, that occurs when the vector of independent-variable adjustments is proportional to the gradient vector of partial derivatives.
 
Gradient descent can take many iterations to compute a local minimum with a required [[accuracy]] if the [[curvature]] in different directions is very different for the given function. For such functions, [[preconditioning]], which changes the geometry of the space to shape the function level sets like [[concentric circles]], can cure the slow convergence. Constructing and applying preconditioning can be computationally expensive, however.
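The effect is easy to demonstrate on a quadratic with very different curvatures along the two axes. The sketch below is illustrative (the matrix, step sizes, and iteration counts are assumptions): plain gradient descent must use a tiny step to stay stable in the steep direction and therefore crawls along the flat one, while preconditioning with the inverse Hessian turns the level sets into circles and converges in one step.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = (1/2) x^T A x, condition number 1000.
A = np.diag([1.0, 1000.0])
grad = lambda x: A @ x

def run(P, steps, eta):
    """Preconditioned gradient descent: x <- x - eta * P * grad f(x)."""
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - eta * (P @ grad(x))
    return x

# Plain gradient descent: eta must stay below 2/1000 for stability,
# so progress along the flat direction is very slow.
x_plain = run(np.eye(2), steps=100, eta=1.0 / 1000.0)

# Preconditioning with P = A^{-1} reshapes the level sets into circles;
# a unit step then lands on the minimizer.
x_prec = run(np.linalg.inv(A), steps=1, eta=1.0)
```

In practice the exact inverse Hessian is rarely available; cheap approximations (diagonal scaling, incomplete factorizations) trade some of this speedup for lower cost per iteration.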
Gradient descent is a special case of [[mirror descent]] using the squared Euclidean distance as the given [[Bregman divergence]].<ref>{{cite web | url=https://tlienart.github.io/posts/2018/10/27-mirror-descent-algorithm/ | title=Mirror descent algorithm }}</ref>
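This reduction can be verified directly. With the mirror map <math>\psi(\mathbf{x}) = \tfrac{1}{2}\|\mathbf{x}\|^2</math>, whose Bregman divergence is the squared Euclidean distance, <math>\nabla\psi</math> is the identity, so the mirror descent update collapses to the plain gradient step. The sketch below is a minimal check under that assumption, with a made-up objective.

```python
import numpy as np

# Mirror descent update: solve  grad_psi(x_new) = grad_psi(x) - eta * grad_f(x).
# For psi(x) = (1/2)||x||^2, grad_psi is the identity map, so the update
# is exactly the plain gradient descent step.

grad_f = lambda x: 2 * x                 # gradient of f(x) = ||x||^2

def mirror_step(x, eta, grad_psi, grad_psi_inv):
    return grad_psi_inv(grad_psi(x) - eta * grad_f(x))

identity = lambda x: x                   # grad_psi and its inverse here
x = np.array([1.0, -2.0])
eta = 0.1

md = mirror_step(x, eta, identity, identity)   # mirror descent step
gd = x - eta * grad_f(x)                       # plain gradient step
```

Other mirror maps (e.g. negative entropy, which yields multiplicative updates on the simplex) give genuinely different algorithms from the same template.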
 
== Theoretical properties ==
The properties of gradient descent depend on the properties of the objective function and the variant of gradient descent used (for example, whether a [[line search]] step is used). The assumptions made affect the convergence rate and other properties that can be proven for gradient descent.<ref name=":1">{{cite arXiv|last=Bubeck |first=Sébastien |title=Convex Optimization: Algorithms and Complexity |date=2015 |class=math.OC |eprint=1405.4980 }}</ref> For example, if the objective is assumed to be [[Strongly convex function|strongly convex]] and [[Lipschitz continuity|Lipschitz smooth]], then gradient descent converges linearly with a fixed step size.<ref name="auto"/> Looser assumptions lead to either weaker convergence guarantees or require a more sophisticated step size selection.<ref name=":1" />
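The linear (geometric) convergence rate can be observed numerically. The sketch below is illustrative: a quadratic <math>f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top \mathbf{A}\mathbf{x}</math> is <math>\mu</math>-strongly convex and <math>L</math>-smooth with <math>\mu</math> and <math>L</math> equal to the smallest and largest eigenvalues of <math>\mathbf{A}</math>, and with the fixed step <math>\eta = 1/L</math> the error contracts by at least the factor <math>1 - \mu/L</math> per iteration.

```python
import numpy as np

# Quadratic f(x) = (1/2) x^T A x: mu-strongly convex and L-smooth with
# mu = 1 (smallest eigenvalue of A) and L = 10 (largest eigenvalue).
A = np.diag([1.0, 10.0])
mu, L = 1.0, 10.0
eta = 1.0 / L                            # fixed step size

x = np.array([1.0, 1.0])
errors = [np.linalg.norm(x)]             # distance to the minimizer x* = 0
for _ in range(50):
    x = x - eta * (A @ x)                # gradient step: grad f(x) = A x
    errors.append(np.linalg.norm(x))

# Linear convergence: each step contracts the error by at least 1 - mu/L = 0.9.
ratios = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
```

On a log scale the error decays along a straight line, which is the hallmark of linear convergence; looser assumptions (e.g. convexity without strong convexity) give only sublinear rates.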