The data sequence has very high-frequency components. The sparse sampling plan shown in the plots necessarily introduces a huge aliasing error that no inversion algorithm can overcome. The way to correct this apparently poor performance of L-M is to sample the data sequence much more densely, e.g. with a 0.001 step size; L-M will then find the correct answer easily under a very wide range of initial conditions. It is really not L-M that is at fault here. --[[User:Xzhou.lumi|Xzhou.lumi]] ([[User talk:Xzhou.lumi|talk]]) 22:32, 2 July 2008 (UTC)
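For anyone who wants to check this numerically, here is a minimal sketch; the one-parameter sine model, the starting guess, and SciPy's <code>least_squares</code> with <code>method='lm'</code> are illustrative choices, not taken from the plots:
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import least_squares

# Hypothetical high-frequency model standing in for the plotted data:
# y = sin(a * x) with a large frequency a, so sparse sampling aliases badly.
a_true = 50.0

def residuals(a, x, y):
    return np.sin(a * x) - y

# Dense sampling with a 0.001 step removes the aliasing.
x_dense = np.arange(0.0, 1.0, 0.001)
y_dense = np.sin(a_true * x_dense)

# Levenberg-Marquardt recovers the frequency from a nearby starting guess.
fit = least_squares(residuals, x0=[48.0], args=(x_dense, y_dense), method='lm')
print(fit.x)  # close to 50.0
</syntaxhighlight>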
== Room for clarification ==
I think the article could do with some clarification; I certainly could use some. Most optimization algorithms I have looked at focus on searching in R<sup>n</sup> for the minimum of some scalar-valued function on R<sup>n</sup>. In this article, that function is clearly <math>S(\boldsymbol{\beta})</math>, but the article doesn't emphasize it as much as it emphasizes <math>f(x_i,\boldsymbol{\beta})</math>.
As I try to fully understand L–M, I keep seeing it as a [[trust region]] algorithm in which we do a second-order Taylor expansion of <math>S(\boldsymbol{\beta}+\boldsymbol{\delta})</math> about <math>\boldsymbol{\delta}=\mathbf{0}</math>. In that case we would have
: <math>S(\boldsymbol{\beta}+\boldsymbol{\delta}) \approx S(\boldsymbol{\beta}) + \operatorname{grad}(S) \cdot \boldsymbol{\delta} + \tfrac{1}{2}\,\boldsymbol{\delta}^\mathrm{T} \operatorname{Hessian}(S)\, \boldsymbol{\delta}</math>
It looks like L–M uses <math>\mathbf{J}^\mathrm{T}\mathbf{J}</math> for the Hessian. Is this the exact Hessian or just an approximation? (It [[Gauss–Newton algorithm#Derivation from Newton's method|looks like an approximation]], but I can't picture the difference between the true Hessian and the approximation.)
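Writing out the second derivatives of <math>S</math> myself (this is just my own attempt, not something from the article), with <math>r_i = y_i - f(x_i,\boldsymbol\beta)</math> I get
: <math>\operatorname{Hessian}(S) = 2\left(\mathbf{J}^\mathrm{T}\mathbf{J} - \textstyle\sum_i r_i \nabla^2 f(x_i,\boldsymbol\beta)\right),</math>
so <math>\mathbf{J}^\mathrm{T}\mathbf{J}</math> seems to be an approximation that drops the term involving the residuals and the second derivatives of <math>f</math> (and the factor of 2, which cancels against the one in the gradient).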
Now, the gradient of ''S'' with respect to <math>\boldsymbol\beta</math> is given in this article as
: <math>\operatorname{grad}(S) = -2\left(\mathbf{J}^\mathrm{T} [\mathbf{y} - \mathbf{f}(\boldsymbol\beta)]\right)^\mathrm{T}</math>.
With some hand waving, I can see this giving the following equation for the minimum of ''S'':
:<math>\left(\mathbf{J}^\mathrm{T}\mathbf{J}\right)\boldsymbol\delta = \mathbf{J}^\mathrm{T}\left[\mathbf{y} - \mathbf{f}(\boldsymbol\beta)\right]</math>
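Spelling out that hand waving: setting the gradient of the quadratic model with respect to <math>\boldsymbol\delta</math> to zero gives
: <math>\mathbf{0} = \operatorname{grad}(S)^\mathrm{T} + \operatorname{Hessian}(S)\,\boldsymbol\delta \approx -2\,\mathbf{J}^\mathrm{T}[\mathbf{y} - \mathbf{f}(\boldsymbol\beta)] + 2\,\mathbf{J}^\mathrm{T}\mathbf{J}\,\boldsymbol\delta,</math>
and dividing through by 2 gives the equation above.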
Then it looks like the damping parameter basically defines the radius of the trust region, except that it isn't a hard-edged trust region: the <math>\lambda \mathbf{I}</math> term modifies the cost function to keep the local minimum from being too far away (basically adding a paraboloid, scaled by λ, to ''S''). Is that about right?
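Here is a bare-bones sketch of how I picture the λ update working (my own toy code with made-up update factors, not taken from the article or any particular implementation):
<syntaxhighlight lang="python">
import numpy as np

def lm_fit(residual, jacobian, beta, lam=1e-3, iters=50):
    """Toy Levenberg-Marquardt loop: solve the damped normal equations and
    grow or shrink lambda depending on whether the step reduced the cost."""
    for _ in range(iters):
        r = residual(beta)                        # r = y - f(beta)
        J = jacobian(beta)                        # J[i, j] = d f_i / d beta_j
        A = J.T @ J + lam * np.eye(len(beta))     # the lambda*I damping term
        delta = np.linalg.solve(A, J.T @ r)
        if np.sum(residual(beta + delta)**2) < np.sum(r**2):
            beta, lam = beta + delta, lam / 3.0   # good step: trust the model more
        else:
            lam *= 3.0                            # bad step: damp harder next time
    return beta
</syntaxhighlight>
As λ → 0 this reduces to the Gauss–Newton step, and as λ → ∞ the step shrinks toward a short steepest-descent step, which is the soft trust-region behaviour I described above.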
Finally, I don't quite see the intuition for using <math>\operatorname{diag}(\mathbf{J}^\mathrm{T}\mathbf{J})</math> instead of the identity.
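(To be concrete, the scaled variant I have in mind is
: <math>\left(\mathbf{J}^\mathrm{T}\mathbf{J} + \lambda \operatorname{diag}(\mathbf{J}^\mathrm{T}\mathbf{J})\right)\boldsymbol\delta = \mathbf{J}^\mathrm{T}[\mathbf{y} - \mathbf{f}(\boldsymbol\beta)],</math>
where each component of the step gets a damping proportional to the corresponding diagonal entry of <math>\mathbf{J}^\mathrm{T}\mathbf{J}</math>.)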
Any answers/thoughts/corrections would be appreciated. Thanks. [[User:BenFrantzDale|—Ben FrantzDale]] ([[User talk:BenFrantzDale|talk]]) 22:04, 7 December 2008 (UTC)