Levenberg–Marquardt algorithm: Difference between revisions

The convergence behaviour is not specific to neural networks. The restriction to a local minimum may be helpful to casual readers, but it is a general limitation of most such algorithms.
The solution: Credit Marquardt with the scale invariance (it is in the 1963 paper); Fletcher 1971 expresses it more cleanly.
Line 47:
The (non-negative) damping factor {{tmath|\lambda}} is adjusted at each iteration. If the reduction of {{tmath|S}} is rapid, a smaller value can be used, bringing the algorithm closer to the [[Gauss–Newton algorithm]], whereas if an iteration gives insufficient reduction in the residual, {{tmath|\lambda}} can be increased, giving a step closer to the gradient-descent direction. Note that the [[gradient]] of {{tmath|S}} with respect to {{tmath|\boldsymbol\beta}} equals <math>-2\left (\mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ]\right )^{\mathrm T}</math>. Therefore, for large values of {{tmath|\lambda}}, the step will be taken approximately in the direction opposite to the gradient. If either the length of the calculated step {{tmath|\boldsymbol\delta}} or the reduction of the sum of squares from the latest parameter vector {{tmath|\boldsymbol\beta + \boldsymbol\delta}} falls below predefined limits, iteration stops, and the last parameter vector {{tmath|\boldsymbol\beta}} is considered to be the solution.
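A minimal sketch of this adaptive scheme, in Python (the helpers <code>f</code> and <code>jac</code>, the update factor <code>nu</code>, and the starting {{tmath|\lambda}} and tolerance are illustrative assumptions, not values the algorithm prescribes):

<syntaxhighlight lang="python">
import numpy as np

def levenberg_marquardt(f, jac, y, beta, lam=1e-2, nu=10.0, tol=1e-8, max_iter=100):
    # f(beta): model predictions; jac(beta): Jacobian of f; y: observations.
    # lam, nu, tol and max_iter are illustrative choices, not prescribed values.
    for _ in range(max_iter):
        r = y - f(beta)                      # residual vector
        J = jac(beta)
        # Levenberg's damped normal equations: (J^T J + lambda I) delta = J^T r
        A = J.T @ J + lam * np.eye(J.shape[1])
        delta = np.linalg.solve(A, J.T @ r)
        if np.sum((y - f(beta + delta)) ** 2) < np.sum(r ** 2):
            beta = beta + delta              # sum of squares decreased: accept the
            lam /= nu                        # step and move closer to Gauss-Newton
        else:
            lam *= nu                        # insufficient reduction: reject the step
                                             # and move toward gradient descent
        if np.linalg.norm(delta) < tol:      # step-length stopping criterion
            break
    return beta
</syntaxhighlight>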
 
Levenberg's algorithm has the disadvantage that if the value of the damping factor {{tmath|\lambda}} is large, the term {{tmath|\lambda\mathbf I}} dominates {{tmath|\mathbf J^\text{T}\mathbf J}}, so the curvature information gained by computing and inverting {{tmath|\mathbf J^\text{T}\mathbf J + \lambda\mathbf I}} is not used at all: the step degenerates into a short one along the steepest-descent direction.
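This can be made precise: when {{tmath|\lambda}} is much larger than the entries of {{tmath|\mathbf J^{\mathrm T}\mathbf J}}, the damped normal equations reduce to

:<math>\boldsymbol\delta = \left (\mathbf J^{\mathrm T}\mathbf J + \lambda\mathbf I\right )^{-1}\mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ] \approx \frac{1}{\lambda}\mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ],</math>

a short steepest-descent step in which the computed matrix {{tmath|\mathbf J^{\mathrm T}\mathbf J}} plays essentially no role.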

To make the solution scale invariant, Marquardt's algorithm solved a modified problem with each component of the gradient scaled according to the curvature. This provides larger movement along the directions where the gradient is smaller, which avoids slow convergence in the direction of small gradient. Therefore, Fletcher in his 1971 paper ''A modified Marquardt subroutine for non-linear least squares'' simplified the form, replacing the identity matrix {{tmath|\mathbf I}} with the diagonal matrix consisting of the diagonal elements of {{tmath|\mathbf J^\text{T}\mathbf J}}:
 
:<math>\left [\mathbf J^{\mathrm T} \mathbf J + \lambda \operatorname{diag}\left (\mathbf J^{\mathrm T} \mathbf J\right )\right ] \boldsymbol\delta = \mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ].</math>
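A sketch of a single step with this scaling, in Python (the function name and the small floor <code>eps</code> on the diagonal are assumptions added for illustration; Fletcher's paper does not prescribe them):

<syntaxhighlight lang="python">
import numpy as np

def lm_step_fletcher(J, r, lam, eps=1e-12):
    # Solve [J^T J + lambda diag(J^T J)] delta = J^T r  (Fletcher's scaling).
    JTJ = J.T @ J
    # Floor the diagonal at eps so the damped matrix stays positive definite
    # even if a column of J is exactly zero (an illustrative safeguard).
    D = np.diag(np.maximum(np.diag(JTJ), eps))
    return np.linalg.solve(JTJ + lam * D, J.T @ r)
</syntaxhighlight>

Because each entry of {{tmath|\operatorname{diag}\left (\mathbf J^{\mathrm T}\mathbf J\right )}} scales with the squared magnitude of the corresponding column of {{tmath|\mathbf J}}, rescaling a parameter rescales its damping term by the same factor, so the computed step is unchanged in the original units.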