Levenberg–Marquardt algorithm: Difference between revisions

Revision as of 02:32, 1 January 2022 by Sennalen (edit summary: application to neural networks)
Latest revision as of 07:50, 26 April 2024 by David Eppstein (edit summary: →Further reading: rm deadlink that just goes to the same place as the doi)
(9 intermediate revisions by 9 users not shown)
Line 1:
{{short description|Algorithm used to solve non-linear least squares problems}}
In [[mathematics]] and computing, the '''Levenberg–Marquardt algorithm''' ('''LMA''' or just '''LM'''), also known as the '''damped least-squares''' ('''DLS''') method, is used to solve [[non-linear least squares]] problems. These minimization problems arise especially in [[least squares]] [[curve fitting]]. The LMA interpolates between the [[Gauss–Newton algorithm]] (GNA) and the method of [[gradient descent]]. The LMA is more [[Robustness (computer science)|robust]] than the GNA, which means that in many cases it finds a solution even if it starts very far from the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be slower than the GNA. LMA can also be viewed as [[Gauss–Newton]] using a [[trust region]] approach.

The algorithm was first published in 1944 by [[Kenneth Levenberg]],<ref name="Levenberg"/> while working at the [[Frankford Arsenal|Frankford Army Arsenal]]. It was rediscovered in 1963 by [[Donald Marquardt]],<ref name="Marquardt"/> who worked as a [[statistician]] at [[DuPont]], and independently by Girard,<ref name="Girard"/> Wynne<ref name="Wynne"/> and Morrison.<ref name="Morrison"/>

The LMA is used in many software applications for solving generic curve-fitting problems. Because it uses Gauss–Newton steps, it often converges faster than first-order methods.<ref>{{cite journal|title=Improved Computation for Levenberg–Marquardt Training|last1=Wilamowski|first1=Bogdan|last2=Yu|first2=Hao|journal=IEEE Transactions on Neural Networks and Learning Systems|volume=21|issue=6|date=June 2010|url=https://www.eng.auburn.edu/~wilambm/pap/2010/Improved%20Computation%20for%20LM%20Training.pdf}}</ref> However, like other iterative optimization algorithms, the LMA finds only a [[local minimum]], which is not necessarily the [[global minimum]].

== The problem ==

Line 31:
&= \left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ]^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ] - 2\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ]^{\mathrm T} \mathbf J \boldsymbol\delta + \boldsymbol\delta^{\mathrm T} \mathbf J^{\mathrm T} \mathbf J\boldsymbol\delta.
\end{align}</math>

Taking the derivative of this approximation of <math>S\left (\boldsymbol\beta + \boldsymbol\delta\right )</math> with respect to {{tmath|\boldsymbol\delta}} and setting the result to zero gives

:<math>\left (\mathbf J^{\mathrm T} \mathbf J\right )\boldsymbol\delta = \mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ],</math>

where <math>\mathbf J</math> is the [[Jacobian matrix and determinant|Jacobian matrix]], whose {{tmath|i}}-th row equals <math>\mathbf J_i</math>, and where <math>\mathbf f\left (\boldsymbol\beta\right )</math> and <math>\mathbf y</math> are vectors with {{tmath|i}}-th component <math>f\left (x_i, \boldsymbol\beta\right )</math> and <math>y_i</math> respectively. The above expression for the step {{tmath|\boldsymbol\delta}} is the one used by the Gauss–Newton method. The Jacobian matrix as defined above is not (in general) a square matrix, but a rectangular matrix of size <math>m \times n</math>, where <math>m</math> is the number of data points and <math>n</math> is the number of parameters (size of the vector <math>\boldsymbol\beta</math>). The matrix multiplication <math>\left (\mathbf J^{\mathrm T} \mathbf J\right)</math> yields the required <math>n \times n</math> square matrix and the matrix-vector product on the right-hand side yields a vector of size <math>n</math>. The result is a set of <math>n</math> linear equations, which can be solved for {{tmath|\boldsymbol\delta}}.

Levenberg's contribution is to replace this equation by a "damped version":

Line 46:
The (non-negative) damping factor {{tmath|\lambda}} is adjusted at each iteration. If reduction of {{tmath|S}} is rapid, a smaller value can be used, bringing the algorithm closer to the [[Gauss–Newton algorithm]], whereas if an iteration gives insufficient reduction in the residual, {{tmath|\lambda}} can be increased, giving a step closer to the gradient-descent direction.
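
The damping update described above can be sketched in code. The following is a minimal pure-Python illustration, not a reference implementation. The model <code>a * exp(b * x)</code>, the factor of 10 used to adjust {{tmath|\lambda}}, the starting values, the tolerances and all function names are assumptions chosen for this example. It also assumes that the damped system has the form <math>\left (\mathbf J^{\mathrm T} \mathbf J + \lambda \mathbf I\right )\boldsymbol\delta = \mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ]</math>.

```python
import math

def model(x, beta):
    """Hypothetical two-parameter model f(x, beta) = a * exp(b * x)."""
    a, b = beta
    return a * math.exp(b * x)

def jacobian_row(x, beta):
    """Partial derivatives of the model with respect to a and b at one x."""
    a, b = beta
    e = math.exp(b * x)
    return e, a * x * e

def sum_sq(xs, ys, beta):
    """Sum of squared residuals S(beta)."""
    return sum((y - model(x, beta)) ** 2 for x, y in zip(xs, ys))

def lm_fit(xs, ys, beta, lam=1e-2, max_iter=200, tol=1e-12):
    """Damped least squares for two parameters (illustrative only)."""
    s = sum_sq(xs, ys, beta)
    for _ in range(max_iter):
        # Accumulate A = J^T J (symmetric 2x2) and g = J^T [y - f(beta)].
        a11 = a12 = a22 = g1 = g2 = 0.0
        for x, y in zip(xs, ys):
            j1, j2 = jacobian_row(x, beta)
            r = y - model(x, beta)
            a11 += j1 * j1
            a12 += j1 * j2
            a22 += j2 * j2
            g1 += j1 * r
            g2 += j2 * r
        # Solve (A + lam * I) delta = g with Cramer's rule.
        d11, d22 = a11 + lam, a22 + lam
        det = d11 * d22 - a12 * a12
        delta = ((g1 * d22 - a12 * g2) / det, (d11 * g2 - a12 * g1) / det)
        if math.hypot(*delta) < tol:
            break  # step length below the limit
        trial = (beta[0] + delta[0], beta[1] + delta[1])
        s_trial = sum_sq(xs, ys, trial)
        if s_trial < s:
            improvement = s - s_trial
            beta, s = trial, s_trial
            lam /= 10.0  # good step: move toward Gauss-Newton
            if improvement < tol:
                break  # reduction of S below the limit
        else:
            lam *= 10.0  # poor step: move toward gradient descent
    return beta, s

# Noise-free synthetic data generated from a = 2, b = -1 (assumed values).
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [model(x, (2.0, -1.0)) for x in xs]
beta_hat, s_hat = lm_fit(xs, ys, beta=(1.0, 0.0))
print(beta_hat, s_hat)
```

A rejected trial step leaves the parameters unchanged and only raises {{tmath|\lambda}}, so the sum of squares never increases from one accepted iterate to the next.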
Note that the [[gradient]] of {{tmath|S}} with respect to {{tmath|\boldsymbol\beta}} equals <math>-2\left (\mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ]\right )^{\mathrm T}</math>. Therefore, for large values of {{tmath|\lambda}}, the step will be taken approximately in the direction opposite to the gradient. If either the length of the calculated step {{tmath|\boldsymbol\delta}} or the reduction of the sum of squares from the latest parameter vector {{tmath|\boldsymbol\beta + \boldsymbol\delta}} falls below predefined limits, iteration stops, and the last parameter vector {{tmath|\boldsymbol\beta}} is considered to be the solution.

When the damping factor {{tmath|\lambda}} is large relative to <math>\|\mathbf J^{\mathrm T} \mathbf J\|</math>, inverting <math>\mathbf J^{\mathrm T} \mathbf J + \lambda \mathbf I</math> is not necessary, as the update is well-approximated by the small gradient step <math>\lambda^{-1} \mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ]</math>.
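
The large-{{tmath|\lambda}} behaviour can be checked numerically. The sketch below compares the exact solution of <math>\left (\mathbf J^{\mathrm T} \mathbf J + \lambda \mathbf I\right )\boldsymbol\delta = \mathbf g</math> with the scaled gradient step <math>\lambda^{-1}\mathbf g</math> for a small made-up example. The 2×2 matrix <code>A</code> (standing in for <math>\mathbf J^{\mathrm T} \mathbf J</math>), the vector <code>g</code> (standing in for <math>\mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ]</math>) and the function names are hypothetical.

```python
import math

def identity_damped_step(A, g, lam):
    """Solve (A + lam * I) delta = g for a symmetric 2x2 matrix A."""
    (a11, a12), (_, a22) = A
    d11, d22 = a11 + lam, a22 + lam
    det = d11 * d22 - a12 * a12
    return (g[0] * d22 - a12 * g[1]) / det, (d11 * g[1] - a12 * g[0]) / det

def relative_gap(A, g, lam):
    """Relative difference between the exact damped step and g / lam."""
    exact = identity_damped_step(A, g, lam)
    approx = (g[0] / lam, g[1] / lam)
    diff = math.hypot(exact[0] - approx[0], exact[1] - approx[1])
    return diff / math.hypot(*exact)

A = ((6.0, 7.5), (7.5, 13.75))  # made-up stand-in for J^T J
g = (-1.17, -4.54)              # made-up stand-in for J^T [y - f(beta)]

for lam in (1e0, 1e2, 1e4, 1e6):
    print(lam, relative_gap(A, g, lam))
```

In this example the gap shrinks roughly in proportion to <math>1/\lambda</math>, which matches the condition that {{tmath|\lambda}} be large relative to <math>\|\mathbf J^{\mathrm T} \mathbf J\|</math>.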
To make the solution scale invariant, Marquardt's algorithm solved a modified problem with each component of the gradient scaled according to the curvature. This provides larger movement along the directions where the gradient is smaller, which avoids slow convergence in the direction of small gradient.
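
The scale-invariance property can be illustrated with a small numerical check. The sketch below assumes the curvature scaling takes the form of a damping term <math>\lambda \operatorname{diag}\left (\mathbf J^{\mathrm T} \mathbf J\right )</math> in place of <math>\lambda \mathbf I</math>. The Jacobian, residuals, rescaling factor <code>c</code> and function names are made up for the example. Rescaling the second parameter by <code>c</code> divides the second Jacobian column by <code>c</code>; a scale-invariant method should then return the same step once its second component is divided by <code>c</code>.

```python
def normal_system(J, r):
    """Return the entries of A = J^T J and g = J^T r for an m-by-2 Jacobian."""
    a11 = sum(j1 * j1 for j1, _ in J)
    a12 = sum(j1 * j2 for j1, j2 in J)
    a22 = sum(j2 * j2 for _, j2 in J)
    g1 = sum(j1 * ri for (j1, _), ri in zip(J, r))
    g2 = sum(j2 * ri for (_, j2), ri in zip(J, r))
    return (a11, a12, a22), (g1, g2)

def step(J, r, lam, scaled):
    """Damped step with lam * diag(J^T J) if scaled, else lam * I."""
    (a11, a12, a22), (g1, g2) = normal_system(J, r)
    w1, w2 = (a11, a22) if scaled else (1.0, 1.0)
    d11, d22 = a11 + lam * w1, a22 + lam * w2
    det = d11 * d22 - a12 * a12
    return (g1 * d22 - a12 * g2) / det, (d11 * g2 - a12 * g1) / det

J = [(1.0, 0.0), (1.0, 0.5), (1.0, 1.0), (1.0, 1.5)]  # made-up Jacobian
r = [1.0, 0.2, -0.3, -0.6]                            # made-up residuals
lam, c = 0.5, 100.0

# Same problem with the second parameter expressed in units c times smaller.
J_rescaled = [(j1, j2 / c) for j1, j2 in J]

for scaled in (True, False):
    d = step(J, r, lam, scaled)
    d_rescaled = step(J_rescaled, r, lam, scaled)
    # Map the rescaled step back to the original units before comparing.
    mapped = (d_rescaled[0], d_rescaled[1] / c)
    print(scaled, d, mapped)
```

With the diagonal scaling the two steps agree up to rounding, while with identity damping they differ, because {{tmath|\lambda}} then carries the units of the parameters.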
Fletcher, in his 1971 paper ''A modified Marquardt subroutine for non-linear least squares'', simplified the form, replacing the identity matrix {{tmath|\mathbf I}} with the diagonal matrix consisting of the diagonal elements of {{tmath|\mathbf J^\text{T}\mathbf J}}:

:<math>\left [\mathbf J^{\mathrm T} \mathbf J + \lambda \operatorname{diag}\left (\mathbf J^{\mathrm T} \mathbf J\right )\right ] \boldsymbol\delta = \mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ].</math>

Line 94 ⟶ 96:
where <math>\alpha</math> is usually fixed to a value less than 1, with smaller values for harder problems.<ref name="Transtrum2012"/>

The addition of a geodesic acceleration term can allow a significant increase in convergence speed, and it is especially useful when the algorithm is moving through narrow canyons in the landscape of the objective function, where the allowed steps are smaller and the higher accuracy due to the second-order term gives significant improvements.<ref name="Transtrum2012"/>

==Example==

Line 202 ⟶ 204:
|number = 4
|pages = W1–W16
|bibcode = 2007Geop...72W...1P
}}
* {{cite book
| last1 = Nocedal
| first1 = Jorge

Line 217 ⟶ 218:
== External links ==
* Detailed description of the algorithm can be found in [https://numerical.recipes/book.html Numerical Recipes in C, Chapter 15.5: Nonlinear models]
* C. T. Kelley, ''Iterative Methods for Optimization'', SIAM Frontiers in Applied Mathematics, no 18, 1999, {{isbn|0-89871-433-8}}. [http://www.siam.org/books/textbooks/fr18_book.pdf Online copy]
* [https://web.archive.org/web/20140301154319/http://www3.villanova.edu/maple/misc/mtc1093.html History of the algorithm in SIAM news]

Line 225 ⟶ 226:
* H. P. Gavin, [http://people.duke.edu/~hpgavin/ce281/lm.pdf ''The Levenberg-Marquardt method for nonlinear least-squares curve-fitting problems''] ([[MATLAB]] implementation included)

{{Optimization algorithms|unconstrained}}

{{DEFAULTSORT:Levenberg-Marquardt algorithm}}