Loss function: Difference between revisions

Revision as of 12:49, 22 December 2024 by Hooman Mallahzadeh (→Constructing loss and objective functions: Linking.)
Latest revision as of 17:33, 2 September 2025 by Lexoka (m →Selecting a loss function: fixed typo)
(8 intermediate revisions by 6 users not shown)
Line 2: In [[mathematical optimization]] and [[decision theory]], a '''loss function''' or '''cost function''' (sometimes also called an error function)<ref name="ttf2001">{{cite book\|first1=Trevor \|last1=Hastie \|authorlink1= \|first2=Robert \|last2=Tibshirani \|authorlink2=Robert Tibshirani\|first3=Jerome H. \|last3=Friedman \|authorlink3=Jerome H. Friedman \|title=The Elements of Statistical Learning \|publisher=Springer \|year=2001 \|isbn=0-387-95284-5 \|page=18 \|url=https://web.stanford.edu/~hastie/ElemStatLearn/}}</ref> is a function that maps an [[event (probability theory)\|event]] or values of one or more variables onto a [[real number]] intuitively representing some "cost" associated with the event. An [[optimization problem]] seeks to minimize a loss function. An '''objective function''' is either a loss function or its opposite (in specific domains, variously called a [[reward function]], a [[profit function]], a [[utility function]], a [[fitness function]], etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy. In statistics, typically a loss function is used for [[parameter estimation]], and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as [[Pierre-Simon Laplace\|Laplace]], was reintroduced in statistics by [[Abraham Wald]] in the middle of the 20th century.<ref>{{cite book \|first=A. \|last=Wald \|title=Statistical Decision Functions \|via=APA Psycnet \|publisher=Wiley \|year=1950 \|url=https://psycnet.apa.org/record/1951-01400-000}}</ref> In the context of [[economics]], for example, this is usually [[economic cost]] or [[Regret (decision theory)\|regret]]. In [[Statistical classification\|classification]], it is the penalty for an incorrect classification of an example. 
In [[actuarial science]], it is used in an insurance context to model benefits paid over premiums, particularly since the works of [[Harald Cramér]] in the 1920s.<ref>{{cite book \|last=Cramér \|first=H. \|year=1930 \|title=On the mathematical theory of risk \|publisher=Centraltryckeriet }}</ref> In [[optimal control]], the loss is the penalty for failing to achieve a desired value. In [[financial risk management]], the function is mapped to a monetary loss.

[[File:Comparison of loss functions.png\|thumb\|Comparison of common loss functions ([[Mean absolute error\|MAE]], SMAE, [[Huber loss]], and log-cosh loss) used for regression]]

==Examples==

Line 19:
Many common [[statistic]]s, including [[t-test]]s, [[Regression analysis\|regression]] models, [[design of experiments]], and much else, use [[least squares]] methods applied using [[linear regression]] theory, which is based on the quadratic loss function. The quadratic loss function is also used in [[Linear-quadratic regulator\|linear-quadratic optimal control problems]]. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a [[quadratic form]] in the deviations of the variables of interest from their desired values; this approach is [[closed-form expression\|tractable]] because it results in linear [[first-order condition]]s. In the context of [[stochastic control]], the expected value of the quadratic form is used. Because it squares each deviation, the quadratic loss weights outliers far more heavily than typical observations, so alternatives such as the [[Huber loss\|Huber]], log-cosh and SMAE losses are used when the data contain many large outliers.
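The sensitivity of the quadratic loss to outliers can be illustrated with a minimal sketch. The residual values and the Huber threshold `delta=1.0` below are assumptions chosen for illustration and do not come from the article. The log-cosh loss is computed through an algebraically equivalent form that avoids overflow for large residuals.

```python
import math

def squared_loss(r):
    """Quadratic loss of a single residual r."""
    return r * r

def absolute_loss(r):
    """Absolute-error loss of a single residual r."""
    return abs(r)

def huber_loss(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it.
    delta=1.0 is an illustrative choice, not a recommended default."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)

def log_cosh_loss(r):
    """log(cosh(r)), written as |r| + log1p(exp(-2|r|)) - log(2)
    so that large residuals do not overflow cosh."""
    a = abs(r)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

# Hypothetical residuals: three small errors and one large outlier.
residuals = [0.1, -0.2, 0.3, 10.0]

for name, loss in [("squared", squared_loss), ("absolute", absolute_loss),
                   ("huber", huber_loss), ("log-cosh", log_cosh_loss)]:
    per_point = [loss(r) for r in residuals]
    print(f"{name:9s} outlier loss = {per_point[-1]:8.3f}  total = {sum(per_point):8.3f}")
```

Under these assumed values the outlier contributes 100 to the quadratic total but less than 10 under the Huber and log-cosh losses, both of which grow only linearly for large residuals.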
[[File:Fitting a straight line to a data with outliers.png\|thumb\|Effect of using different loss functions when the data has outliers.]]

===0-1 loss function===

Line 41:
\|series= Lecture Notes in Economics and Mathematical Systems \|volume=510 \|publisher=Springer \|location=Berlin\|isbn= 978-3-540-42669-1 \|doi= 10.1007/978-3-642-56038-5 }}</ref> In particular, [[Andranik Tangian]] showed that the most usable objective functions, quadratic and additive, are determined by a few [[Principle of indifference\|indifference]] points. He used this property in the models for constructing these objective functions from either [[ordinal utility\|ordinal]] or [[cardinal utility\|cardinal]] data that were elicited through computer-assisted interviews with decision makers.<ref name="Tangian2002">{{Cite journal\|last=Tangian \|first=Andranik \|year=2002\|title= Constructing a quasi-concave quadratic objective function from interviewing a decision maker\|journal= European Journal of Operational Research \|volume=141 \|issue=3 \|pages=608–640 \|doi=10.1016/S0377-2217(01)00185-0 \|s2cid= 39623350 }}</ref><ref name="Tangian2004additiveUtility">{{Cite journal\|last=Tangian \|first=Andranik \|year=2004\|title= A model for ordinally constructing additive objective functions\|journal= European Journal of Operational Research \|volume=159 \|issue=2 \|pages=476–512\|doi = 10.1016/S0377-2217(03)00413-2 \| s2cid= 31019036 }}</ref> Among other things, he constructed objective functions to optimally distribute budgets for 16 Westphalian universities<ref name="Tangian2004universityBudgets">{{Cite journal \|last=Tangian \|first=Andranik \|year=2004 \|title= Redistribution of university budgets with respect to the status quo \|journal= European Journal of Operational Research \|volume=157 \|issue=2 \|pages=409–428\|doi = 10.1016/S0377-2217(03)00271-6 }}</ref> and the European subsidies for equalizing unemployment rates among 271 German regions.<ref 
name="Tangian2008RegionalEnemployment">{{Cite journal\|last=Tangian \|first=Andranik \|year=2008 \|title= Multi-criteria optimization of regional employment policy: A simulation analysis for Germany \|journal= Review of Urban and Regional Development \|volume=20 \|issue=2\|pages=103–122 \|url= https://onlinelibrary.wiley.com/doi/10.1111/j.1467-940X.2008.00144.x \|doi = 10.1111/j.1467-940X.2008.00144.x \|url-access=subscription }}</ref>

==Expected loss==

Line 51:
===Statistics===

Both [[Frequentist probability\|frequentist]] and [[Bayesian probability\|Bayesian]] statistical theory involve making a decision based on the [[expected value]] of the loss function; however, this quantity is defined differently under the two paradigms.

====Frequentist expected loss====

Line 86:
:<math>\rho(\pi^*,a) = \int_\Theta \int _{\bold X} L(\theta, a(\bold x)) \, \mathrm{d} P(\bold x \vert \theta) \,\mathrm{d} \pi^* (\theta)= \int_{\bold X} \int_\Theta L(\theta,a(\bold x))\,\mathrm{d} \pi^*(\theta\vert \bold x)\,\mathrm{d}M(\bold x)</math>

where ''M''(x) is the marginal distribution of the data, whose density ''m''(x) is known as the ''predictive likelihood'' wherein θ has been "integrated out," {{pi}}<sup>*</sup> (θ \| x) is the posterior distribution, and the order of integration has been changed. One then should choose the action ''a<sup>*</sup>'' which minimizes this expected loss, which is referred to as ''Bayes Risk''. In the second expression, the inner integral over θ (the integrand of the outer integral over x) is known as the ''Posterior Risk'', and minimizing it with respect to decision ''a'' for each x also minimizes the overall Bayes Risk. This optimal decision, ''a<sup>*</sup>'', is known as the ''Bayes (decision) Rule'': it minimizes the average loss over all possible states of nature θ, over all possible (probability-weighted) data outcomes.
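The relationship between posterior risk and Bayes risk can be checked numerically in a small discrete setting. The following is a sketch under stated assumptions: the two states, two observations, two actions, and all probabilities and losses are hypothetical values invented for illustration, and sums replace the integrals of the formula above.

```python
# Hypothetical discrete problem (all numbers are illustrative assumptions).
prior = {"healthy": 0.9, "sick": 0.1}            # pi*(theta)
likelihood = {                                   # P(x | theta)
    "healthy": {"neg": 0.8, "pos": 0.2},
    "sick":    {"neg": 0.1, "pos": 0.9},
}
loss = {                                         # L(theta, a)
    "healthy": {"treat": 1.0, "wait": 0.0},
    "sick":    {"treat": 0.0, "wait": 10.0},
}
actions = ["treat", "wait"]
observations = ["neg", "pos"]

def marginal(x):
    """Predictive likelihood m(x): theta summed out of P(x | theta) pi*(theta)."""
    return sum(likelihood[t][x] * prior[t] for t in prior)

def posterior(x):
    """Posterior pi*(theta | x) by Bayes' theorem."""
    m = marginal(x)
    return {t: likelihood[t][x] * prior[t] / m for t in prior}

def posterior_risk(x, a):
    """Expected loss of action a under the posterior given x."""
    post = posterior(x)
    return sum(loss[t][a] * post[t] for t in post)

def bayes_rule(x):
    """Action minimizing the posterior risk for the observed x."""
    return min(actions, key=lambda a: posterior_risk(x, a))

def risk_prior_form(rule):
    """First form of the formula: sum over theta, then over x given theta."""
    return sum(prior[t] * likelihood[t][x] * loss[t][rule(x)]
               for t in prior for x in observations)

def risk_posterior_form(rule):
    """Second form: posterior risk averaged over the marginal of x."""
    return sum(marginal(x) * posterior_risk(x, rule(x)) for x in observations)

print({x: bayes_rule(x) for x in observations})
print(risk_prior_form(bayes_rule), risk_posterior_form(bayes_rule))
```

In this assumed example both forms give the same Bayes risk, and the rule obtained by minimizing the posterior risk separately for each observation does no worse than either constant rule ("always treat" or "always wait").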
One advantage of the Bayesian approach is that one need only choose the optimal action under the actual observed data to obtain a uniformly optimal one, whereas choosing the actual frequentist optimal decision rule as a function of all possible observations is a much more difficult problem. Of equal importance though, the Bayes Rule reflects consideration of loss outcomes under different states of nature, θ.

Line 96:
In economics, decision-making under uncertainty is often modelled using the [[von Neumann–Morgenstern utility function]] of the uncertain variable of interest, such as end-of-period wealth. Since the value of this variable is uncertain, so is the value of the utility function; it is the expected value of utility that is maximized.

==Decision rules==

Line 120 ⟶ 119:
The choice of a loss function is not arbitrary. It is highly constrained, and sometimes the loss function may be characterized by its desirable properties.<ref>Detailed information on mathematical principles of the loss function choice is given in Chapter 2 of the book {{cite book\|title=Robust and Non-Robust Models in Statistics\|first1=B.\|last1=Klebanov\|first2=Svetlozar T.\|last2=Rachev\|first3=Frank J.\|last3=Fabozzi\|publisher=Nova Scientific Publishers, Inc.\|location=New York\|year=2009}} (and references there).</ref> Among the choice principles are, for example, the requirement of completeness of the class of symmetric statistics in the case of [[i.i.d.]] observations, the principle of complete information, and some others. [[W. Edwards Deming]] and [[Nassim Nicholas Taleb]] argue that empirical reality, not nice mathematical properties, should be the sole basis for selecting loss functions, and real losses often are not mathematically nice and are not differentiable, continuous, symmetric, etc.
For example, a person who arrives before a plane gate closure can still make the plane, but a person who arrives after cannot, a discontinuity and asymmetry which makes arriving slightly late much more costly than arriving slightly early. In drug dosing, the cost of too little drug may be lack of efficacy, while the cost of too much may be tolerable toxicity, another example of asymmetry. Traffic, pipes, beams, ecologies, climates, etc. may tolerate increased load or stress with little noticeable change up to a point, then become backed up or break catastrophically. These situations, Deming and Taleb argue, are common in real-life problems, perhaps more common than classical smooth, continuous, symmetric, differentiable cases.<ref>{{Cite book\|title=Out of the Crisis\|last=Deming\|first=W. Edwards\|publisher=The MIT Press\|year=2000\|isbn=9780262541152}}</ref>

==See also==