The convergence behaviour is not specific to neural networks. Noting the restriction to a local minimum may be helpful to general readers, but it is a limitation shared by most such algorithms, not something particular to this one.
→The solution: Credit Marquardt with the scale invariance (it is in the 1963 paper); Fletcher (1971) expresses it more cleanly.
Line 47:
The (non-negative) damping factor {{tmath|\lambda}} is adjusted at each iteration. If the reduction of {{tmath|S}} is rapid, a smaller value can be used, bringing the algorithm closer to the [[Gauss–Newton algorithm]], whereas if an iteration gives insufficient reduction in the residual, {{tmath|\lambda}} can be increased, giving a step closer to the gradient-descent direction. Note that the [[gradient]] of {{tmath|S}} with respect to {{tmath|\boldsymbol\beta}} equals <math>-2\left (\mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ]\right )^{\mathrm T}</math>. Therefore, for large values of {{tmath|\lambda}}, the step will be taken approximately in the direction opposite to the gradient. If either the length of the calculated step {{tmath|\boldsymbol\delta}} or the reduction of the sum of squares from the latest parameter vector {{tmath|\boldsymbol\beta + \boldsymbol\delta}} falls below predefined limits, iteration stops, and the last parameter vector {{tmath|\boldsymbol\beta}} is considered to be the solution.
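The damping update and the stopping tests can be written compactly in code. The following is a minimal sketch in Python/NumPy under stated assumptions, not a definitive implementation: the function name <code>levenberg_marquardt</code>, the growth factor <code>nu</code> and the tolerances <code>step_tol</code> and <code>s_tol</code> are illustrative choices, and the model <code>f</code> and its Jacobian <code>jac</code> are assumed to be supplied by the caller.

<syntaxhighlight lang="python">
import numpy as np

def levenberg_marquardt(f, jac, y, beta, lam=1e-3, nu=2.0,
                        step_tol=1e-8, s_tol=1e-12, max_iter=100):
    """Sketch of a Levenberg-style iteration with an adaptive damping factor lam.

    f(beta)   -> model predictions, same shape as y (assumed supplied)
    jac(beta) -> Jacobian of f at beta, shape (len(y), len(beta)) (assumed supplied)
    """
    S = np.sum((y - f(beta)) ** 2)               # current sum of squared residuals
    for _ in range(max_iter):
        r = y - f(beta)                          # residual vector y - f(beta)
        J = jac(beta)
        # Damped normal equations: (J^T J + lam I) delta = J^T [y - f(beta)]
        A = J.T @ J + lam * np.eye(beta.size)
        delta = np.linalg.solve(A, J.T @ r)
        S_new = np.sum((y - f(beta + delta)) ** 2)
        if S_new < S:
            # Reduction achieved: accept the step and decrease lam,
            # moving the method closer to Gauss-Newton.
            beta, reduction, S = beta + delta, S - S_new, S_new
            lam /= nu
            # Stop when the step length or the reduction in S falls below the limits.
            if np.linalg.norm(delta) < step_tol or reduction < s_tol:
                break
        else:
            # Insufficient reduction: increase lam, pushing the step toward
            # the (shortened) gradient-descent direction, and retry.
            lam *= nu
    return beta
</syntaxhighlight>

A common design choice, reflected in the sketch, is to multiply or divide {{tmath|\lambda}} by a fixed factor such as 2 or 10 after each rejected or accepted step respectively.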
Levenberg's algorithm has the disadvantage that if the value of the damping factor {{tmath|\lambda}} is large, the term {{tmath|\lambda\mathbf I}} dominates {{tmath|\mathbf J^\text{T}\mathbf J + \lambda\mathbf I}}, so the curvature information contained in {{tmath|\mathbf J^\text{T}\mathbf J}} is effectively not used at all: the iteration reduces to gradient descent with a small step, and convergence is slow along directions in which the gradient is small. Marquardt (1963) addressed this by scaling each component of the gradient according to the curvature, giving larger movement along the directions where the gradient is smaller; Fletcher (1971) expressed this more cleanly by replacing the identity matrix with the diagonal matrix consisting of the diagonal elements of {{tmath|\mathbf J^\text{T}\mathbf J}}:
:<math>\left [\mathbf J^{\mathrm T} \mathbf J + \lambda \operatorname{diag}\left (\mathbf J^{\mathrm T} \mathbf J\right )\right ] \boldsymbol\delta = \mathbf J^{\mathrm T}\left [\mathbf y - \mathbf f\left (\boldsymbol\beta\right )\right ].</math>
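In code, the only change relative to the damped step sketched above is the matrix added to {{tmath|\mathbf J^\text{T}\mathbf J}}. A minimal sketch follows, assuming the residual vector {{tmath|\mathbf y - \mathbf f(\boldsymbol\beta)}} and the Jacobian {{tmath|\mathbf J}} have already been computed; the function name <code>marquardt_step</code> is illustrative.

<syntaxhighlight lang="python">
import numpy as np

def marquardt_step(J, r, lam):
    """Solve [J^T J + lam * diag(J^T J)] delta = J^T r for the step delta.

    Scaling by diag(J^T J) instead of the identity means that, even for
    large lam, each parameter's step is scaled by the local curvature,
    so movement is larger along directions where the gradient is small.
    """
    JTJ = J.T @ J
    A = JTJ + lam * np.diag(np.diag(JTJ))    # diag(J^T J) replaces I
    return np.linalg.solve(A, J.T @ r)
</syntaxhighlight>

This diagonal scaling also makes the step invariant under rescaling of individual parameters, since rescaling a parameter rescales the corresponding diagonal element of {{tmath|\mathbf J^\text{T}\mathbf J}} consistently.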