Levenberg–Marquardt algorithm

In mathematics and computing, the Levenberg–Marquardt algorithm (LMA or just LM), also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least squares curve fitting.

The LMA is used in many software applications for solving generic curve-fitting problems. However, as with many fitting algorithms, the LMA finds only a local minimum, which is not necessarily the global minimum. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far from the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be a bit slower than the GNA. The LMA can also be viewed as Gauss–Newton using a trust region approach.

The algorithm was first published in 1944 by Kenneth Levenberg,[1] while working at the Frankford Army Arsenal. It was rediscovered in 1963 by Donald Marquardt,[2] who worked as a statistician at DuPont, and independently by Girard,[3] Wynne[4] and Morrison.[5]

The problem

The primary application of the Levenberg–Marquardt algorithm is in the least-squares curve fitting problem: given a set of $m$ empirical pairs $(x_i, y_i)$ of independent and dependent variables, find the parameters $\boldsymbol\beta$ of the model curve $f(x, \boldsymbol\beta)$ so that the sum of the squares of the deviations $S(\boldsymbol\beta)$ is minimized:

$$\hat{\boldsymbol\beta} \in \operatorname{argmin}_{\boldsymbol\beta} S(\boldsymbol\beta) \equiv \operatorname{argmin}_{\boldsymbol\beta} \sum_{i=1}^{m} \left[y_i - f(x_i, \boldsymbol\beta)\right]^2,$$

which is assumed to be non-empty.
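To make these quantities concrete, the following sketch writes the objective directly in code. The exponential model, synthetic data, and parameter names are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

# Hypothetical model f(x, beta): exponential decay with parameters beta = (A, k).
def model(x, beta):
    A, k = beta
    return A * np.exp(-k * x)

# Synthetic (x_i, y_i) pairs standing in for the empirical data.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 20)
y = 2.5 * np.exp(-1.3 * x) + 0.05 * rng.normal(size=x.size)

def residuals(beta):
    """Deviations r_i = y_i - f(x_i, beta)."""
    return y - model(x, beta)

def S(beta):
    """Sum of squared deviations S(beta) = sum_i [y_i - f(x_i, beta)]^2."""
    r = residuals(beta)
    return r @ r
```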

The solution

Like other numeric minimization algorithms, the Levenberg–Marquardt algorithm is an iterative procedure. To start a minimization, the user has to provide an initial guess for the parameter vector $\boldsymbol\beta$. In cases with only one minimum, an uninformed standard guess like $\boldsymbol\beta^\mathrm{T} = (1,\ 1,\ \dots,\ 1)$ will work fine; in cases with multiple minima, the algorithm converges to the global minimum only if the initial guess is already somewhat close to the final solution.

In each iteration step, the parameter vector $\boldsymbol\beta$ is replaced by a new estimate $\boldsymbol\beta + \boldsymbol\delta$. To determine $\boldsymbol\delta$, the function $f(x_i, \boldsymbol\beta + \boldsymbol\delta)$ is approximated by its linearization:

$$f(x_i, \boldsymbol\beta + \boldsymbol\delta) \approx f(x_i, \boldsymbol\beta) + \mathbf{J}_i \boldsymbol\delta,$$

where

$$\mathbf{J}_i = \frac{\partial f(x_i, \boldsymbol\beta)}{\partial \boldsymbol\beta}$$

is the gradient (row vector in this case) of $f$ with respect to $\boldsymbol\beta$.
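When analytic derivatives of the model are unavailable, implementations commonly approximate the rows $\mathbf{J}_i$ numerically. A minimal forward-difference sketch, reusing the hypothetical model and data above:

```python
def jacobian(beta, eps=1e-8):
    """Forward-difference approximation of the m x n Jacobian J, whose i-th row
    is the gradient of f(x_i, beta) with respect to the parameters beta."""
    beta = np.asarray(beta, dtype=float)
    f0 = model(x, beta)
    J = np.empty((x.size, beta.size))
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps * max(1.0, abs(beta[j]))
        J[:, j] = (model(x, beta + step) - f0) / step[j]
    return J
```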

The sum $S(\boldsymbol\beta)$ of square deviations has its minimum at a zero gradient with respect to $\boldsymbol\beta$. The above first-order approximation of $f(x_i, \boldsymbol\beta + \boldsymbol\delta)$ gives

$$S(\boldsymbol\beta + \boldsymbol\delta) \approx \sum_{i=1}^{m} \left[y_i - f(x_i, \boldsymbol\beta) - \mathbf{J}_i \boldsymbol\delta\right]^2,$$

or in vector notation,

$$S(\boldsymbol\beta + \boldsymbol\delta) \approx \left\|\mathbf{y} - \mathbf{f}(\boldsymbol\beta) - \mathbf{J}\boldsymbol\delta\right\|^2 = \left[\mathbf{y} - \mathbf{f}(\boldsymbol\beta)\right]^\mathrm{T}\left[\mathbf{y} - \mathbf{f}(\boldsymbol\beta)\right] - 2\left[\mathbf{y} - \mathbf{f}(\boldsymbol\beta)\right]^\mathrm{T}\mathbf{J}\boldsymbol\delta + \boldsymbol\delta^\mathrm{T}\mathbf{J}^\mathrm{T}\mathbf{J}\boldsymbol\delta.$$

Taking the derivative of $S(\boldsymbol\beta + \boldsymbol\delta)$ with respect to $\boldsymbol\delta$ and setting the result to zero gives

$$\left(\mathbf{J}^\mathrm{T}\mathbf{J}\right)\boldsymbol\delta = \mathbf{J}^\mathrm{T}\left[\mathbf{y} - \mathbf{f}(\boldsymbol\beta)\right],$$

where $\mathbf{J}$ is the Jacobian matrix, whose $i$-th row equals $\mathbf{J}_i$, and where $\mathbf{f}(\boldsymbol\beta)$ and $\mathbf{y}$ are vectors with $i$-th component $f(x_i, \boldsymbol\beta)$ and $y_i$ respectively. The Jacobian matrix as defined above is not (in general) a square matrix, but a rectangular matrix of size $m \times n$, where $n$ is the number of parameters (size of the vector $\boldsymbol\beta$). The matrix multiplication $\mathbf{J}^\mathrm{T}\mathbf{J}$ yields the required $n \times n$ square matrix, and the matrix-vector product on the right-hand side yields a vector of size $n$. The result is a set of $n$ linear equations, which can be solved for $\boldsymbol\delta$.
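In code, this undamped (Gauss–Newton) update amounts to a single linear solve; a sketch reusing the hypothetical helpers above:

```python
def gauss_newton_step(beta):
    """Solve the normal equations (J^T J) delta = J^T [y - f(beta)] for delta."""
    J = jacobian(beta)
    r = residuals(beta)            # r = y - f(beta)
    return np.linalg.solve(J.T @ J, J.T @ r)
```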

Levenberg's contribution is to replace this equation by a "damped version":

$$\left(\mathbf{J}^\mathrm{T}\mathbf{J} + \lambda\mathbf{I}\right)\boldsymbol\delta = \mathbf{J}^\mathrm{T}\left[\mathbf{y} - \mathbf{f}(\boldsymbol\beta)\right],$$

where $\mathbf{I}$ is the identity matrix and $\boldsymbol\delta$ is the increment to the estimated parameter vector $\boldsymbol\beta$.

The (non-negative) damping factor $\lambda$ is adjusted at each iteration. If reduction of $S$ is rapid, a smaller value can be used, bringing the algorithm closer to the Gauss–Newton algorithm, whereas if an iteration gives insufficient reduction in the residual, $\lambda$ can be increased, giving a step closer to the gradient-descent direction. Note that the gradient of $S$ with respect to $\boldsymbol\beta$ equals $-2\left(\mathbf{J}^\mathrm{T}\left[\mathbf{y} - \mathbf{f}(\boldsymbol\beta)\right]\right)^\mathrm{T}$. Therefore, for large values of $\lambda$, the step will be taken approximately in the direction opposite to the gradient. If either the length of the calculated step $\boldsymbol\delta$ or the reduction of the sum of squares from the latest parameter vector $\boldsymbol\beta + \boldsymbol\delta$ falls below predefined limits, iteration stops, and the last parameter vector $\boldsymbol\beta$ is considered to be the solution.
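A minimal sketch of the damped iteration, reusing the hypothetical helpers above. The accept/reject rule and the factor-of-ten adjustment of $\lambda$ are common heuristics assumed here for illustration, not the only possible schedule:

```python
def levenberg_marquardt(beta0, lam=1e-3, max_iter=100, tol=1e-10):
    """Minimal damped iteration: solve (J^T J + lam*I) delta = J^T r, accept the
    step only if it lowers S, and adjust the damping factor lam accordingly."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        J = jacobian(beta)
        r = residuals(beta)
        delta = np.linalg.solve(J.T @ J + lam * np.eye(beta.size), J.T @ r)
        if np.linalg.norm(delta) < tol:
            break                     # step too small: stop
        if S(beta + delta) < S(beta):
            beta = beta + delta       # success: damp less, behave more like Gauss-Newton
            lam /= 10.0
        else:
            lam *= 10.0               # failure: damp more, lean towards gradient descent
    return beta

beta_hat = levenberg_marquardt([1.0, 1.0])   # should approach the generating parameters
```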

Levenberg's algorithm has the disadvantage that if the value of the damping factor $\lambda$ is large, the $\mathbf{J}^\mathrm{T}\mathbf{J}$ term is effectively ignored when inverting $\mathbf{J}^\mathrm{T}\mathbf{J} + \lambda\mathbf{I}$, so the curvature information is not used at all. R. Fletcher provided the insight that we can scale each component of the gradient according to the curvature, so that there is larger movement along the directions where the gradient is smaller. This avoids slow convergence in the direction of small gradient. Therefore, Fletcher in his 1971 paper A modified Marquardt subroutine for non-linear least squares replaced the identity matrix $\mathbf{I}$ with the diagonal matrix consisting of the diagonal elements of $\mathbf{J}^\mathrm{T}\mathbf{J}$, thus making the solution scale invariant:

$$\left[\mathbf{J}^\mathrm{T}\mathbf{J} + \lambda\,\operatorname{diag}\!\left(\mathbf{J}^\mathrm{T}\mathbf{J}\right)\right]\boldsymbol\delta = \mathbf{J}^\mathrm{T}\left[\mathbf{y} - \mathbf{f}(\boldsymbol\beta)\right].$$
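In code, Fletcher's modification changes only the damping term; a sketch of the scaled solve, again reusing the hypothetical helpers above:

```python
def damped_step_fletcher(beta, lam):
    """Damped step with Fletcher's scaling: lambda multiplies diag(J^T J)
    rather than the identity, so the update is invariant to parameter scaling."""
    J = jacobian(beta)
    r = residuals(beta)
    JTJ = J.T @ J
    return np.linalg.solve(JTJ + lam * np.diag(np.diag(JTJ)), J.T @ r)
```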

A similar damping factor appears in Tikhonov regularization, which is used to solve linear ill-posed problems, as well as in ridge regression, an estimation technique in statistics.

Choice of damping parameter

Various more or less heuristic arguments have been put forward for the best choice for the damping parameter $\lambda$. Theoretical arguments exist showing why some of these choices guarantee local convergence of the algorithm; however, these choices can make the global convergence of the algorithm suffer from the undesirable properties of steepest descent, in particular, very slow convergence close to the optimum.

The absolute values of any choice depend on how well-scaled the initial problem is. Marquardt recommended starting with a value $\lambda_0$ and a factor $\nu > 1$. Initially, set $\lambda = \lambda_0$ and compute the residual sum of squares $S(\boldsymbol\beta)$ after one step from the starting point, once with damping factor $\lambda = \lambda_0$ and once with $\lambda_0/\nu$. If both of these are worse than the initial point, then the damping is increased by successive multiplication by $\nu$ until a better point is found with a new damping factor of $\lambda_0\nu^k$ for some $k$.

If use of the damping factor $\lambda/\nu$ results in a reduction in the squared residual, then this is taken as the new value of $\lambda$ (and the new optimum ___location is taken as that obtained with this damping factor) and the process continues; if using $\lambda/\nu$ results in a worse residual, but using $\lambda$ results in a better residual, then $\lambda$ is left unchanged and the new optimum is taken as the value obtained with $\lambda$ as damping factor. A sketch of this schedule follows below.
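The sketch assumes the S and damped_step_fletcher helpers from the earlier snippets; the bound on the number of increases is an added safeguard, not part of the scheme as described:

```python
def marquardt_lambda_update(beta, lam, nu=2.0, max_increases=50):
    """One outer step of the schedule above: try lam/nu first, then lam; if
    neither lowers S, multiply lam by nu until an improving point is found."""
    S0 = S(beta)
    step = damped_step_fletcher(beta, lam / nu)
    if S(beta + step) < S0:           # lam/nu improves: adopt the smaller damping
        return beta + step, lam / nu
    step = damped_step_fletcher(beta, lam)
    if S(beta + step) < S0:           # only lam improves: leave lam unchanged
        return beta + step, lam
    for _ in range(max_increases):    # neither improved: keep increasing the damping
        lam *= nu
        step = damped_step_fletcher(beta, lam)
        if S(beta + step) < S0:
            return beta + step, lam
    return beta, lam                  # safeguard: no improving step found
```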

Example

 
[Figures: three fits of the example curve, captioned "Poor fit", "Better fit", and "Best fit".]

In this example we try to fit the function $y = a\cos(bX) + b\sin(aX)$ using the Levenberg–Marquardt algorithm implemented in GNU Octave as the leasqr function. The graphs show progressively better fitting for the parameters $a = 100$, $b = 102$ used in the initial curve. Only when the parameters in the last graph are chosen closest to the original do the curves fit exactly. This equation is an example of very sensitive initial conditions for the Levenberg–Marquardt algorithm. One reason for this sensitivity is the existence of multiple minima: the function $\cos(\beta x)$ has minima at parameter value $\hat\beta$ and $\hat\beta + 2n\pi$.
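For readers without Octave, a rough, self-contained Python analogue is sketched below using SciPy's curve_fit, which dispatches to a MINPACK Levenberg–Marquardt solver for unconstrained problems. The sampling grid and starting guess are illustrative assumptions, and the outcome depends strongly on that starting guess, which is exactly the sensitivity described above.

```python
import numpy as np
from scipy.optimize import curve_fit

# Model from the example: y = a*cos(b*X) + b*sin(a*X).
def model(X, a, b):
    return a * np.cos(b * X) + b * np.sin(a * X)

# Synthetic data generated with the "true" parameters a = 100, b = 102.
X = np.linspace(0.0, 10.0, 2001)
y = model(X, 100.0, 102.0)

# method="lm" selects the MINPACK Levenberg-Marquardt solver.
popt, pcov = curve_fit(model, X, y, p0=[100.4, 101.5], method="lm")
print(popt)   # whether this recovers [100, 102] depends strongly on p0
```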

References

  1. ^ Levenberg, Kenneth (1944). "A Method for the Solution of Certain Non-Linear Problems in Least Squares". Quarterly of Applied Mathematics. 2 (2): 164–168. doi:10.1090/qam/10666.
  2. ^ Marquardt, Donald (1963). "An Algorithm for Least-Squares Estimation of Nonlinear Parameters". SIAM Journal on Applied Mathematics. 11 (2): 431–441. doi:10.1137/0111030. hdl:10338.dmlcz/104299.
  3. ^ Girard, André (1958). "Excerpt from Revue d'optique théorique et instrumentale". Rev. Opt. 37: 225–241, 397–424.
  4. ^ Wynne, C. G. (1959). "Lens Designing by Electronic Digital Computer: I". Proc. Phys. Soc. Lond. 73 (5): 777–787. Bibcode:1959PPS....73..777W. doi:10.1088/0370-1328/73/5/310.
  5. ^ Morrison, David D. (1960). "Methods for nonlinear least squares problems and convergence proofs". Proceedings of the Jet Propulsion Laboratory Seminar on Tracking Programs and Orbit Determination: 1–9.
  6. ^ Kanzow, Christian; Yamashita, Nobuo; Fukushima, Masao (2004). "Levenberg–Marquardt methods with strong local convergence properties for solving nonlinear equations with convex constraints". Journal of Computational and Applied Mathematics. 172 (2): 375–397. doi:10.1016/j.cam.2004.02.013.
