Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning, of a patient being healthy, etc. (see {{slink||Applications}}), and the logistic model has been the most commonly used model for [[binary regression]] since about 1970.{{sfn|Cramer|2002|p=10–11}} Binary variables can be generalized to [[categorical variable]]s when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and binary logistic regression generalizes to [[multinomial logistic regression]]. If the multiple categories are [[Level of measurement#Ordinal scale|ordered]], one can use [[ordinal logistic regression]] (for example the proportional odds ordinal logistic model<ref name=wal67est />). See {{slink||Extensions}} for further extensions. The logistic regression model itself simply models the probability of output in terms of input and does not perform [[statistical classification]] (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class and those below the cutoff as the other; this is a common way to make a [[binary classifier]].
Analogous linear models for binary variables with a different [[sigmoid function]] instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the [[probit model]]; see {{slink||Alternatives}}. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a ''constant'' rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the [[odds ratio]]. More abstractly, the log-odds is the [[natural parameter]] for the [[Bernoulli distribution]], and the logistic function, its inverse, is in this sense the "simplest" way to convert a real number to a probability. In particular, it maximizes entropy (minimizes added information), and in this sense makes the fewest assumptions about the data being modeled; see {{slink||Maximum entropy}}.
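The constant multiplicative effect on the odds can be checked numerically. The following sketch uses hypothetical coefficients (an intercept of −1.5 and a slope of 0.8, not taken from any fitted model) to show that a unit increase in the independent variable multiplies the odds by <math>e^{\beta_1}</math>, regardless of the starting value:

```python
import math

def sigmoid(t):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical coefficients for illustration only.
b0, b1 = -1.5, 0.8

def odds(x):
    """Odds of the outcome, p / (1 - p), at a given value of the predictor."""
    p = sigmoid(b0 + b1 * x)
    return p / (1.0 - p)

# A unit increase in x multiplies the odds by exp(b1), wherever we start:
ratio_near_zero = odds(1.0) / odds(0.0)
ratio_far_away = odds(6.0) / odds(5.0)
# Both ratios equal exp(0.8), illustrating the constant-rate property.
```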
The parameters of a logistic regression are most commonly estimated by [[maximum-likelihood estimation]] (MLE). This does not have a closed-form expression, unlike [[linear least squares (mathematics)|linear least squares]]; see {{section link||Model fitting}}. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by [[ordinary least squares]] (OLS) plays for [[Scalar (mathematics)|scalar]] responses: it is a simple, well-analyzed baseline model; see {{slink||Comparison with linear regression}} for discussion. Logistic regression as a general statistical model was originally developed and popularized primarily by [[Joseph Berkson]],{{sfn|Cramer|2002|p=8}} beginning in {{harvtxt|Berkson|1944}}, where he coined the term "logit"; see {{slink||History}}.
| issue = 7
| pages = 511–24
| last2 = Cornfield| first2 = J| last3 = Kannel| first3 = W | doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or any other party, based on age, income, sex, race, state of residence, votes in previous elections, etc.<ref name="rms" /> The technique can also be used in [[engineering]], especially for predicting the probability of failure of a given process, system or product.<ref name="strano05">{{cite journal | author = M. Strano | author2 = B.M. Colosimo | year = 2006 | title = Logistic regression analysis for experimental determination of forming limit diagrams | journal = International Journal of Machine Tools and Manufacture | volume = 46 | issue = 6 | pages = 673–682 | doi = 10.1016/j.ijmachtools.2005.07.005 }}</ref><ref name="safety">{{cite journal | last1 = Palei | first1 = S. K. | last2 = Das | first2 = S. K. | doi = 10.1016/j.ssci.2008.01.002 | title = Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach | journal = Safety Science | volume = 47 | pages = 88–96 | year = 2009 }}</ref> It is also used in [[marketing]] applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.<ref>{{cite book|title=Data Mining Techniques For Marketing, Sales and Customer Support|last= Berry |first=Michael J.A|publisher=Wiley|year=1997|page=10}}</ref> In [[economics]], it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a [[mortgage]]. [[Conditional random field]]s, an extension of logistic regression to sequential data, are used in [[natural language processing]].
Disaster planners and engineers rely on these models to predict decisions taken by householders or building occupants in small-scale and large-scale evacuations, such as building fires, wildfires, and hurricanes, among others.<ref>{{Cite journal |last1=Mesa-Arango |first1=Rodrigo |last2=Hasan |first2=Samiul |last3=Ukkusuri |first3=Satish V. |last4=Murray-Tuite |first4=Pamela |date=February 2013 |title=Household-Level Model for Hurricane Evacuation Destination Type Choice Using Hurricane Ivan Data |url=https://ascelibrary.org/doi/10.1061/%28ASCE%29NH.1527-6996.0000083 |journal=Natural Hazards Review |language=en |volume=14 |issue=1 |pages=11–20 |doi=10.1061/(ASCE)NH.1527-6996.0000083 |bibcode=2013NHRev..14...11M |issn=1527-6988|url-access=subscription }}</ref><ref>{{Cite journal |last1=Wibbenmeyer |first1=Matthew J. |last2=Hand |first2=Michael S. |last3=Calkin |first3=David E. |last4=Venn |first4=Tyron J. |last5=Thompson |first5=Matthew P. |date=June 2013 |title=Risk Preferences in Strategic Wildfire Decision Making: A Choice Experiment with U.S. Wildfire Managers |url=https://onlinelibrary.wiley.com/doi/10.1111/j.1539-6924.2012.01894.x |journal=Risk Analysis |language=en |volume=33 |issue=6 |pages=1021–1037 |doi=10.1111/j.1539-6924.2012.01894.x |pmid=23078036 |bibcode=2013RiskA..33.1021W |s2cid=45282555 |issn=0272-4332|url-access=subscription }}</ref><ref>{{Cite journal |last1=Lovreglio |first1=Ruggiero |last2=Borri |first2=Dino |last3=dell’Olio |first3=Luigi |last4=Ibeas |first4=Angel |date=2014-02-01 |title=A discrete choice model based on random utilities for exit choice in emergency evacuations |url=https://www.sciencedirect.com/science/article/pii/S0925753513002294 |journal=Safety Science |volume=62 |pages=418–426 |doi=10.1016/j.ssci.2013.10.004 |issn=0925-7535|url-access=subscription }}</ref> These models help in the development of reliable [[Emergency management|disaster management plans]] and safer design for the [[built environment]].
=== Supervised machine learning ===
The {{math|logit}} of the probability of success is then fitted to the predictors. The predicted value of the {{math|logit}} is converted back into predicted odds, via the inverse of the natural logarithm – the [[exponential function]]. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success.
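The chain from fitted logit to odds to an optional categorical prediction can be sketched as follows; the cutoff of 0.5 is an illustrative choice, not part of the model:

```python
import math

def predicted_odds(logit):
    # The exponential function inverts the natural logarithm of the odds.
    return math.exp(logit)

def predicted_probability(logit):
    odds = predicted_odds(logit)
    return odds / (1.0 + odds)

def classify(logit, cutoff=0.5):
    # Optional extra step: translate the continuous estimate into a
    # yes-or-no prediction by comparing against a chosen cutoff.
    return 1 if predicted_probability(logit) > cutoff else 0

# A fitted logit of 0 corresponds to odds of 1 (even odds), probability 0.5.
```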
==Maximum entropy==
Of all the functional forms used for estimating the probabilities of a particular categorical outcome which optimize the fit by maximizing the likelihood function (e.g. [[Probit model|probit regression]], [[Poisson regression]], etc.), the logistic regression solution is unique in that it is a [[Maximum entropy probability distribution|maximum entropy]] solution.<ref name="Mount2011">{{cite web |last=Mount |first=J. |date=2011 |title=The Equivalence of Logistic Regression and Maximum Entropy models |url=https://win-vector.com/2011/09/23/the-equivalence-of-logistic-regression-and-maximum-entropy-models/ |access-date=Feb 23, 2022 |website= |publisher= |quote=}}</ref> This is a case of a general property: an [[exponential family]] of distributions maximizes entropy, given an expected value. In the case of the logistic model, the logistic function is the [[natural parameter]] of the Bernoulli distribution (it is in "[[canonical form]]", and the logistic function is the canonical link function), while other sigmoid functions are non-canonical link functions; this underlies its mathematical elegance and ease of optimization. See {{slink|Exponential family|Maximum entropy derivation}} for details.
=== Proof ===
In order to show this, we use the method of [[Lagrange multipliers]]. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers times various constraint expressions. The general multinomial case will be considered, since the proof is not made that much simpler by considering simpler cases. Equating the derivative of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities which corresponds to those used in logistic regression.<ref name="Mount2011"/>
As in the above section on [[#Multinomial logistic regression : Many explanatory variable and many categories|multinomial logistic regression]], we will consider {{tmath|M+1}} explanatory variables denoted {{tmath|x_m}}, which include <math>x_0=1</math>. There will be a total of ''K'' data points, indexed by <math>k=\{1,2,\dots,K\}</math>, and the data points are given by <math>x_{mk}</math> and {{tmath|y_k}}. The ''x<sub>mk</sub>'' will also be represented as an {{tmath|(M+1)}}-dimensional vector <math>\boldsymbol{x}_k = \{x_{0k},x_{1k},\dots,x_{Mk}\}</math>. There will be {{tmath|N+1}} possible values of the categorical variable ''y'', ranging from 0 to ''N''.
Let ''p<sub>n</sub>('''x''')'' be the probability, given explanatory variable vector '''x''', that the outcome will be <math>y=n</math>. Define <math>p_{nk}=p_n(\boldsymbol{x}_k)</math> which is the probability that for the ''k''-th measurement, the categorical outcome is ''n''.
The Lagrangian will be expressed as a function of the probabilities ''p<sub>nk</sub>'' and will be extremized by equating the derivatives of the Lagrangian with respect to these probabilities to zero. An important point is that the probabilities are treated equally and the fact that they sum to 1 is part of the Lagrangian formulation, rather than being assumed from the beginning.
The first contribution to the Lagrangian is the [[Entropy (information theory)|entropy]]:
:<math>\mathcal{L}_{ent}=-\sum_{k=1}^K\sum_{n=0}^N p_{nk}\ln(p_{nk})</math>
The log-likelihood is:
:<math>\ell=\sum_{k=1}^K\sum_{n=0}^N \Delta(n,y_k)\ln(p_{nk})</math>
Assuming the multinomial logistic function, the derivative of the log-likelihood with respect to the beta coefficients was found to be:
:<math>\frac{\partial \ell}{\partial \beta_{nm}}=\sum_{k=1}^K ( p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math>
A very important point here is that this expression is (remarkably) not an explicit function of the beta coefficients. It is only a function of the probabilities ''p<sub>nk</sub>'' and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized and makes no reference to the functional form of ''p<sub>nk</sub>''. There are then (''M''+1)(''N''+1) fitting constraints and the fitting constraint term in the Lagrangian is then:
:<math>\mathcal{L}_{fit}=\sum_{n=0}^N\sum_{m=0}^M \lambda_{nm}\sum_{k=1}^K (p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math>
where the ''λ<sub>nm</sub>'' are the appropriate Lagrange multipliers. There are ''K'' normalization constraints which may be written:
:<math>\sum_{n=0}^N p_{nk}=1</math>
so that the normalization term in the Lagrangian is:
:<math>\mathcal{L}_{norm}=\sum_{k=1}^K \alpha_k \left(1-\sum_{n=0}^N p_{nk}\right) </math>
where the ''α<sub>k</sub>'' are the appropriate Lagrange multipliers. The Lagrangian is then the sum of the above three terms:
:<math>\mathcal{L}=\mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}</math>
Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields:
:<math>\frac{\partial \mathcal{L}}{\partial p_{n'k'}}=0=-\ln(p_{n'k'})-1+\sum_{m=0}^M (\lambda_{n'm}x_{mk'})-\alpha_{k'}</math>
Using the more condensed vector notation:
:<math>\sum_{m=0}^M \lambda_{nm}x_{mk} = \boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k</math>
and dropping the primes on the ''n'' and ''k'' indices, and then solving for <math>p_{nk}</math> yields:
:<math>p_{nk}=e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}/Z_k</math>
where:
:<math>Z_k=e^{1+\alpha_k}</math>
Imposing the normalization constraint, we can solve for the ''Z<sub>k</sub>'' and write the probabilities as:
:<math>p_{nk}=\frac{e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}}{\sum_{u=0}^N e^{\boldsymbol{\lambda}_u\cdot\boldsymbol{x}_k}}</math>
The <math>\boldsymbol{\lambda}_n</math> are not all independent. We can add any constant {{tmath|(M+1)}}-dimensional vector to each of the <math>\boldsymbol{\lambda}_n</math> without changing the value of the <math>p_{nk}</math> probabilities so that there are only ''N'' rather than {{tmath|N+1}} independent <math>\boldsymbol{\lambda}_n</math>. In the [[#Multinomial logistic regression : Many explanatory variable and many categories|multinomial logistic regression]] section above, the <math>\boldsymbol{\lambda}_0</math> was subtracted from each <math>\boldsymbol{\lambda}_n</math> which set the exponential term involving <math>\boldsymbol{\lambda}_0</math> to 1, and the beta coefficients were given by <math>\boldsymbol{\beta}_n=\boldsymbol{\lambda}_n-\boldsymbol{\lambda}_0</math>.
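This redundancy among the <math>\boldsymbol{\lambda}_n</math> can be checked numerically: adding the same constant to every score <math>\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k</math> leaves the probabilities unchanged. A minimal sketch, with arbitrary illustrative scores for {{tmath|N+1}} = 3 categories:

```python
import math

def category_probs(scores):
    """p_n = exp(s_n) / sum_u exp(s_u), the form derived for p_nk above."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores lambda_n . x_k for a single data point.
scores = [0.2, 1.1, -0.4]
shifted = [s + 5.0 for s in scores]  # add the same constant to every score

p = category_probs(scores)
q = category_probs(shifted)
# p and q are equal: the constant cancels between numerator and denominator,
# so only N of the N+1 lambda vectors are independent.
```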
===Other approaches===
In machine learning applications where logistic regression is used for binary classification, maximum-likelihood estimation is equivalent to minimizing the [[cross-entropy]] loss function: the negative log-likelihood of the Bernoulli model is exactly the cross-entropy between the observed labels and the predicted probabilities.
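This equivalence can be sketched directly: the average negative log-likelihood of a binary logistic model is the cross-entropy loss, and since no closed-form solution exists it is minimized iteratively, here by plain gradient descent on toy data (the learning rate, step count, and data are arbitrary illustrative choices):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def cross_entropy_loss(w, b, xs, ys):
    """Average negative log-likelihood, i.e. the cross-entropy loss."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

def fit(xs, ys, lr=0.5, steps=500):
    """Minimize the cross-entropy loss by gradient descent."""
    w = b = 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # per-example gradient of the loss
            gw += err * x
            gb += err
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return w, b

# Toy data: larger x makes 'success' (y = 1) more likely.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit(xs, ys)
# The fitted model assigns high probability to y = 1 at large positive x,
# and the loss is lower than at the starting point w = b = 0.
```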