
Loss function surrogates for classification are computationally feasible loss functions representing the price we pay for inaccuracy in our predictions in classification problems. [1] Specifically, if y ∈ {-1,1} represents the class label of an input vector x, we wish to find a function f(x) which best approximates the true mapping x ↦ y. (citation needed) Owing to the binary nature of the labels, it is standard practice to define loss functions for classification solely in terms of the product yf(x) of the true label and the predicted value, a quantity often called the margin. (citation needed) The selection of a loss function within this framework affects the optimal f which minimizes empirical risk, as well as the computational complexity of the learning algorithm.
Given the binary nature of classification, a natural choice of loss function (assuming equal cost for false positives and false negatives) would be the 0-1 indicator function, which takes the value 0 if the predicted classification equals the true class and the value 1 otherwise. Consequently, we could choose the loss function
V(f(x), y) = H(-yf(x)),
where H denotes the Heaviside step function. However, this loss function is non-convex and non-smooth, and solving for the optimal f is an NP-hard combinatorial optimization problem. (cite utah) As a result, we seek continuous, convex loss function surrogates which are tractable for our learning algorithms. In addition to their computational tractability, the convexity of these functions allows us to place probabilistic bounds on their estimation error relative to the true distribution. (cite uci) Some of these surrogates are described below.
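The 0-1 loss above can be sketched directly. One detail the text leaves unspecified is the value at a margin of exactly zero; the convention H(0) = 0 below is an assumption:

```python
import numpy as np

def zero_one_loss(y, fx):
    """0-1 indicator loss, H(-y * f(x)), with H the Heaviside step.

    Assumption: H(0) = 0, i.e. a margin of exactly zero counts as
    a correct prediction; the text does not fix this boundary case.
    """
    return np.heaviside(-y * fx, 0.0)

y = np.array([1, -1, 1, -1])            # true labels
fx = np.array([2.0, -0.5, -0.1, 3.0])   # predicted scores f(x)
print(zero_one_loss(y, fx))             # [0. 0. 1. 1.]
```

The loss is 0 exactly where the signs of y and f(x) agree. Because this function is flat almost everywhere and discontinuous at the boundary, it provides no useful gradient signal, which is precisely why the convex surrogates below are used instead.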
Square Loss
While more commonly used in regression, the square loss can be rewritten as a function of the margin yf(x) and utilized for classification. Defined as
V(f(x), y) = (1 - yf(x))^2,
the square loss function is both convex and smooth, and it matches the 0-1 indicator function when yf(x) = 0 and when yf(x) = 1. However, the square loss tends to penalize outliers excessively, leading to slower convergence rates than for the logistic loss or hinge loss functions. In addition, functions f which yield high values of |f(x)| for some x will perform poorly under the square loss, since high values of |yf(x)| are penalized severely, regardless of whether the signs of y and f(x) match.
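The margin values where the square loss agrees with the 0-1 indicator, and the outlier sensitivity described above, can be checked numerically:

```python
def square_loss(y, fx):
    """Square loss rewritten as a function of the margin y * f(x)."""
    return (1.0 - y * fx) ** 2

# Points where the square loss matches the 0-1 indicator:
assert square_loss(1, 0.0) == 1.0   # margin 0 -> loss 1, like the indicator
assert square_loss(1, 1.0) == 0.0   # margin 1 -> loss 0, like the indicator

# A confidently *correct* prediction with a large |f(x)| is still
# penalized heavily, illustrating the outlier sensitivity noted above:
print(square_loss(1, 5.0))          # margin 5 -> (1 - 5)^2 = 16
```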
Hinge Loss
The hinge loss function is defined as
V(f(x), y) = max(0, 1 - yf(x)).
The hinge loss provides a relatively tight, convex upper bound on the 0-1 indicator function. Specifically, the hinge loss equals the 0-1 indicator function when sgn(f(x)) = y and |f(x)| ≥ 1. In addition, this structure yields the "maximum-margin" classification of support vector machines (SVMs).
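A few margin values make the upper-bound behavior concrete:

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss: max(0, 1 - y * f(x))."""
    return np.maximum(0.0, 1.0 - y * fx)

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(hinge_loss(1, margins))   # [2.  1.  0.5 0.  0. ]
```

The loss is zero exactly once the margin reaches 1 (matching the 0-1 indicator there) and grows linearly, rather than quadratically, for misclassified points, which is why it handles outliers better than the square loss.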
While the hinge loss function is both convex and continuous, it is not smooth (that is, not differentiable) at yf(x) = 1. Consequently, the hinge loss cannot be used with gradient descent or stochastic gradient descent methods which rely on differentiability over the entire ___domain. However, the hinge loss does have a subgradient at yf(x) = 1, which permits the use of subgradient descent methods. (cite Utah)
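A subgradient step for the hinge loss can be sketched as follows. The toy data, learning rate, and regularization strength are hypothetical choices for illustration; at the kink yf(x) = 1 the code picks the subgradient 0, one valid element of the subdifferential:

```python
import numpy as np

def hinge_subgradient_step(w, x, y, lr=0.1, lam=0.01):
    """One subgradient-descent step on the L2-regularized hinge loss.

    At the kink y * w.x = 1 the hinge is non-differentiable; we choose
    the subgradient 0 there (any element of [-y*x, 0] would be valid).
    """
    margin = y * np.dot(w, x)
    if margin < 1:
        grad = -y * x + lam * w   # hinge active: subgradient is -y*x
    else:
        grad = lam * w            # flat region (and the kink): 0
    return w - lr * grad

# Hypothetical, linearly separable toy data:
X = np.array([[2.0, 1.0], [-1.0, -2.0], [1.5, 0.5], [-2.0, -1.0]])
Y = np.array([1.0, -1.0, 1.0, -1.0])

w = np.zeros(2)
rng = np.random.default_rng(0)
for _ in range(200):              # stochastic subgradient passes
    i = rng.integers(len(X))
    w = hinge_subgradient_step(w, X[i], Y[i])

print(np.sign(X @ w) == Y)        # all True once the margin is learned
```

This per-example update is essentially the Pegasos-style stochastic subgradient scheme commonly used to train linear SVMs.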
Logistic Loss
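The logistic loss is commonly taken to be V(f(x), y) = ln(1 + e^(-yf(x))) / ln 2; the division by ln 2, assumed here as a normalization so that the loss equals 1 at margin 0, is a frequent convention. Unlike the hinge loss, it is smooth everywhere, so plain gradient descent applies:

```python
import numpy as np

def logistic_loss(y, fx):
    """Logistic loss, normalized so the loss is 1 at margin y*f(x) = 0.

    The 1/ln(2) normalization is an assumed convention, not fixed by
    the draft above.
    """
    return np.log1p(np.exp(-y * fx)) / np.log(2.0)

print(logistic_loss(1, 0.0))   # 1.0 at the decision boundary
```

Using np.log1p rather than log(1 + ...) keeps the computation accurate for large positive margins, where e^(-yf(x)) is tiny.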
References
- ^ Rosasco, L.; De Vito, E.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104.