{{Short description|Mathematical relation assigning a probability event to a cost}}
In [[mathematical optimization]] and [[decision theory]], a '''loss function''' or '''cost function''' (sometimes also called an error function) is a function that maps an [[event (probability theory)|event]] or values of one or more variables onto a [[real number]] intuitively representing some "cost" associated with the event. An [[optimization problem]] seeks to minimize a loss function.
In statistics, a loss function is typically used for [[parameter estimation]], and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as [[Pierre-Simon Laplace|Laplace]], was reintroduced in statistics by [[Abraham Wald]] in the middle of the 20th century.<ref>{{cite book |first=A. |last=Wald |title=Statistical Decision Functions |via=APA Psycnet |publisher=Wiley |year=1950 |url=https://psycnet.apa.org/record/1951-01400-000}}</ref> In the context of [[economics]], for example, this is usually [[economic cost]] or [[Regret (decision theory)|regret]]. In [[Statistical classification|classification]], it is the penalty for an incorrect classification of an example. In [[actuarial science]], it is used in an insurance context to model benefits paid over premiums, particularly since the works of [[Harald Cramér]] in the 1920s.<ref>{{cite book |last=Cramér |first=H. |year=1930 |title=On the mathematical theory of risk}}</ref>
[[File:Comparison of loss functions.png|thumb|Comparison of common loss functions ([[Mean absolute error|MAE]], SMAE, [[Huber loss]], and log-cosh loss) used for regression]]
==Examples==
===Regret===
{{main|Regret (decision theory)}}
[[Leonard J. Savage]] argued that using non-Bayesian methods such as [[minimax]], the loss function should be based on the idea of ''[[regret (decision theory)|regret]]'', i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been made had the underlying circumstances been known and the decision that was in fact taken before they were known.
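For illustration, the minimax-regret criterion can be computed directly from a payoff table. The two decisions, two states of nature, and all payoff values below are hypothetical:

```python
# Hypothetical payoff table: rows are decisions, columns are states of nature.
payoffs = {
    "bonds":  {"boom": 2, "bust": 2},
    "stocks": {"boom": 5, "bust": -3},
}
states = ["boom", "bust"]

# Regret of a decision in a state = best achievable payoff in that state
# minus the payoff of the decision actually taken.
best = {s: max(p[s] for p in payoffs.values()) for s in states}
regret = {d: {s: best[s] - p[s] for s in states} for d, p in payoffs.items()}

# Minimax regret: choose the decision whose worst-case regret is smallest.
choice = min(payoffs, key=lambda d: max(regret[d].values()))
```

Here "stocks" has a worst-case regret of 5 (in the "bust" state) against 3 for "bonds", so minimax regret selects "bonds".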
===Quadratic loss function===
Many common [[statistic]]s, including [[t-test]]s, [[Regression analysis|regression]] models, [[design of experiments]], and much else, use [[least squares]] methods applied using [[linear regression]] theory, which is based on the quadratic loss function.
The quadratic loss function is also used in [[Linear-quadratic regulator|linear-quadratic optimal control problems]]. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a [[quadratic form]] in the deviations of the variables of interest from their desired values; this approach is [[closed-form expression|tractable]] because it results in linear [[first-order condition]]s. In the context of [[stochastic control]], the expected value of the quadratic form is used. Because errors are squared, the quadratic loss gives outliers far more weight than typical observations, so alternatives such as the [[Huber loss|Huber]], log-cosh, and SMAE losses are preferred when the data contain many large outliers.
[[File:Fitting a straight line to a data with outliers.png|thumb|Effect of using different loss functions when the data has outliers]]
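The different treatment of outliers can be seen by evaluating the quadratic and Huber losses on a few residuals. The residual values and the Huber threshold δ = 1 below are arbitrary choices for illustration:

```python
def quadratic_loss(r):
    # Squared error: grows quadratically in the residual.
    return r ** 2

def huber_loss(r, delta=1.0):
    # Quadratic near zero, linear beyond delta: down-weights outliers.
    if abs(r) <= delta:
        return 0.5 * r ** 2
    return delta * (abs(r) - 0.5 * delta)

residuals = [0.5, 1.0, 10.0]   # the last residual is an outlier
quad = [quadratic_loss(r) for r in residuals]
hub = [huber_loss(r) for r in residuals]
```

For the outlier residual of 10, the quadratic loss is 100 while the Huber loss is only 9.5, so a fit minimizing Huber loss is pulled far less toward the outlier.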
===0-1 loss function===
In statistics and decision theory, a frequently used loss function is the '''0-1 loss function''' <math display="block">L(\hat{y}, y) = [\hat{y} \ne y],</math> using [[Iverson bracket]] notation, i.e. it evaluates to 1 when <math>\hat{y} \ne y</math>, and 0 otherwise.
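A minimal sketch of the 0-1 loss, with made-up class labels:

```python
def zero_one_loss(y_hat, y):
    # Iverson bracket [y_hat != y]: 1 for a misclassification, 0 otherwise.
    return int(y_hat != y)

predictions = ["spam", "ham", "spam"]
truth = ["spam", "spam", "spam"]
# Total 0-1 loss over the set is simply the number of misclassifications.
errors = sum(zero_one_loss(p, t) for p, t in zip(predictions, truth))
```

Summing the 0-1 loss over a dataset counts misclassifications, so its average is the error rate of the classifier.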
==Constructing loss and objective functions==
{{See also|Scoring rule}}
In many applications, objective functions, including loss functions as a particular case, are determined by the problem formulation. In other situations, the decision maker's preference must be elicited and represented by a scalar-valued function (also called a [[utility]] function) in a form suitable for optimization, a problem that [[Ragnar Frisch]] highlighted in his [[Nobel Prize]] lecture.<ref>{{cite book| first=Ragnar|last=Frisch|date=1969 |title=The Nobel Prize Lecture|chapter=From utopian theory to practical applications: the case of econometrics|url=https://www.nobelprize.org/prizes/economic-sciences/1969/frisch/lecture/|access-date=15 February 2021}}</ref>
The existing methods for constructing objective functions are collected in the proceedings of two dedicated conferences.<ref name="TangianGruber1997">{{Cite book
|last1=Tangian |first1=Andranik |last2=Gruber |first2=Josef |date=1997
|series= Lecture Notes in Economics and Mathematical Systems |volume=510
|publisher=Springer |___location=Berlin|isbn= 978-3-540-42669-1 |doi= 10.1007/978-3-642-56038-5 }}</ref>
In particular, [[Andranik Tangian]] showed that the most commonly used objective functions, quadratic and additive, are determined by a few [[Principle of indifference|indifference]] points. He used this property in the models for constructing these objective functions from either [[ordinal utility|ordinal]] or [[cardinal utility|cardinal]] data that were elicited through computer-assisted interviews with decision makers.<ref name="Tangian2002">{{Cite journal|last=Tangian |first=Andranik |year=2002|title= Constructing a quasi-concave quadratic objective function from interviewing a decision maker|journal= European Journal of Operational Research |volume=141 |issue=3 |pages=608–640 |doi=10.1016/S0377-2217(01)00185-0 |s2cid= 39623350 }}</ref><ref name="Tangian2004additiveUtility">{{Cite journal|last=Tangian |first=Andranik |year=2004|title= A model for ordinally constructing additive objective functions|journal= European Journal of Operational Research |volume=159 |issue=2 |pages=476–512|doi = 10.1016/S0377-2217(03)00413-2 | s2cid= 31019036 }}</ref>
Among other things, he constructed objective functions to optimally distribute budgets for 16 Westphalian universities<ref name="Tangian2004universityBudgets">{{Cite journal |last=Tangian |first=Andranik |year=2004 |title= Redistribution of university budgets with respect to the status quo |journal= European Journal of Operational Research |volume=157 |issue=2 |pages=409–428|doi = 10.1016/S0377-2217(03)00271-6 }}</ref>
and the European subsidies for equalizing unemployment rates among 271 German regions.<ref name="Tangian2008RegionalEnemployment">{{Cite journal|last=Tangian |first=Andranik |year=2008
|title= Multi-criteria optimization of regional employment policy: A simulation analysis for Germany |journal= Review of Urban and Regional Development |volume=20 |issue=2|pages=103–122 |url= https://onlinelibrary.wiley.com/doi/10.1111/j.1467-940X.2008.00144.x |doi = 10.1111/j.1467-940X.2008.00144.x |url-access=subscription }}</ref>
==Expected loss==
===Statistics===
Both [[Frequentist probability|frequentist]] and [[Bayesian probability|Bayesian]] statistical theory involve making a decision based on the [[expected value]] of the loss function; however, this quantity is defined differently under the two paradigms.
====Frequentist expected loss====
:<math>\rho(\pi^*,a) = \int_\Theta \int _{\bold X} L(\theta, a(\bold x)) \, \mathrm{d} P(\bold x \vert \theta) \,\mathrm{d} \pi^* (\theta)= \int_{\bold X} \int_\Theta L(\theta,a(\bold x))\,\mathrm{d} \pi^*(\theta\vert \bold x)\,\mathrm{d}M(\bold x)</math>
where <math>M(\bold x)</math> is known as the ''predictive likelihood'', in which θ has been "integrated out", {{pi}}<sup>*</sup>&nbsp;(θ&nbsp;|&nbsp;'''x''') is the posterior distribution, and the order of integration has been changed. One should then choose the action ''a''<sup>*</sup> that minimises this expected loss, which is referred to as the ''Bayes risk''.
In the latter equation, the inner integral <math>\int_\Theta L(\theta, a(\bold x))\,\mathrm{d}\pi^*(\theta\vert \bold x)</math> is known as the ''posterior risk'', and minimising it with respect to decision ''a'' also minimises the overall Bayes risk. This optimal decision, ''a''<sup>*</sup>, is known as the ''Bayes (decision) rule'': it minimises the average loss over all possible states of nature.
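As a small numerical sketch (the two states of nature and their posterior probabilities are invented), minimising the posterior risk under squared-error loss over a grid of actions recovers the posterior mean:

```python
# Hypothetical posterior over two states of nature after observing x.
posterior = {0.0: 0.3, 1.0: 0.7}

def posterior_risk(a):
    # Expected squared-error loss of action a under the posterior.
    return sum(p * (theta - a) ** 2 for theta, p in posterior.items())

# Minimise the posterior risk over a grid of candidate actions.
grid = [i / 100 for i in range(101)]
bayes_action = min(grid, key=posterior_risk)

# Under squared-error loss the Bayes rule is the posterior mean.
posterior_mean = sum(theta * p for theta, p in posterior.items())
```

The grid search lands exactly on the posterior mean 0.7, illustrating that for squared-error loss the Bayes rule is the mean of the posterior distribution.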
====Examples in statistics====
* For a scalar parameter ''θ'', a decision function whose output <math>\hat\theta</math> is an estimate of ''θ'', and a quadratic loss function ([[squared error loss]]) <math display="block"> L(\theta,\hat\theta)=(\theta-\hat\theta)^2,</math> the risk function becomes the [[mean squared error]] of the estimate, <math display="block">R(\theta,\hat\theta)= \operatorname{E}_\theta \left [ (\theta-\hat\theta)^2 \right ].</math> In the Bayesian setting, the [[estimator]] minimising the expected squared error loss is the mean of the [[posterior distribution]].
* In [[density estimation]], the unknown parameter is [[probability density function|probability density]] itself. The loss function is typically chosen to be a [[Norm (mathematics)|norm]] in an appropriate [[function space]]. For example, for [[L2 norm|''L''<sup>2</sup> norm]], <math display="block">L(f,\hat f) = \|f-\hat f\|_2^2\,,</math> the risk function becomes the [[mean integrated squared error]] <math display="block">R(f,\hat f)=\operatorname{E} \left ( \|f-\hat f\|^2 \right ).\,</math>
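The first example can be checked numerically: for the sample mean of ''n'' draws from a normal distribution with unit variance, the risk under squared error loss is 1/''n''. A Monte Carlo sketch, where the true mean, sample size, and replication count are arbitrary choices:

```python
import random

random.seed(0)

# Monte Carlo estimate of the risk (MSE) of the sample mean
# as an estimator of the mean of a N(mu, 1) distribution.
mu, n, reps = 2.0, 25, 2000
sq_errors = []
for _ in range(reps):
    sample = [random.gauss(mu, 1.0) for _ in range(n)]
    theta_hat = sum(sample) / n          # the estimator: the sample mean
    sq_errors.append((theta_hat - mu) ** 2)

mse = sum(sq_errors) / reps              # theory predicts variance/n = 1/25 = 0.04
```

The empirical mean squared error comes out close to the theoretical risk of 0.04, as expected.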
===Economic choice under uncertainty===
Line 144 ⟶ 131:
==References==
{{reflist}}
==Further reading==