Logistic regression: Difference between revisions

m Predictions: {{sfrac}}
Remove poorly sourced material about maximum entropy based on concerns at the talk page
 
Line 3:
[[File:Exam pass logistic curve.svg|thumb|400px|Example graph of a logistic regression curve fitted to data. The curve shows the estimated probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See {{slink||Example}} for worked details.]]
 
In [[statistics]], a '''logistic model''' (or '''logit model''') is a [[statistical model]] that models the [[logit|log-odds]] of an event as a [[linear function (calculus)|linear combination]] of one or more [[independent variable]]s. In [[regression analysis]], '''logistic regression'''<ref>{{cite journal|last1=Tolles|first1=Juliana|last2=Meurer|first2=William J|date=2016|title=Logistic Regression Relating Patient Characteristics to Outcomes|journal=JAMA |language=en|volume=316|issue=5|pages=533–4|issn=0098-7484|oclc=6823603312|doi=10.1001/jama.2016.7653|pmid=27483067}}</ref> (or '''logit regression''') [[estimation theory|estimates]] the parameters of a logistic model (the coefficients in the linear or nonlinear combinations). In binary logistic regression there is a single [[binary variable|binary]] [[dependent variable]], coded by an [[indicator variable]], where the two values are labeled "0" and "1", while the [[independent variable]]s can each be a binary variable (two classes, coded by an indicator variable) or a [[continuous variable]] (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling;<ref name=Hosmer/> the function that converts log-odds to probability is the [[logistic function]], hence the name. The [[unit of measurement]] for the log-odds scale is called a ''[[logit]]'', from '''''log'''istic un'''it''''', hence the alternative names. See {{slink||Background}} and {{slink||Definition}} for formal mathematics, and {{slink||Example}} for a worked example.
 
Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning, of a patient being healthy, etc. (see {{slink||Applications}}), and the logistic model has been the most commonly used model for [[binary regression]] since about 1970.{{sfn|Cramer|2002|p=10–11}} Binary variables can be generalized to [[categorical variable]]s when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and binary logistic regression can be generalized to [[multinomial logistic regression]]. If the multiple categories are [[Level of measurement#Ordinal scale|ordered]], one can use [[ordinal logistic regression]] (for example the proportional odds ordinal logistic model<ref name=wal67est />). See {{slink||Extensions}} for further extensions. The logistic regression model itself simply models the probability of output in terms of input and does not perform [[statistical classification]] (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class and those below the cutoff as the other; this is a common way to make a [[binary classifier]].
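Turning the modeled probability into a binary classifier via a cutoff can be sketched as follows (a minimal illustration in Python; the coefficient values are hypothetical, not taken from the article's example):

```python
import math

def predict_proba(x, beta0, beta1):
    """Modeled probability of the class labeled "1": the logistic
    function applied to the log-odds beta0 + beta1 * x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def classify(x, beta0, beta1, cutoff=0.5):
    """Binary classifier built on top of the model: predict "1" when
    the modeled probability exceeds the chosen cutoff."""
    return 1 if predict_proba(x, beta0, beta1) > cutoff else 0
```

The model itself only supplies `predict_proba`; the cutoff (0.5 here, but any value may be chosen) is what makes it a classifier.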
 
Analogous linear models for binary variables with a different [[sigmoid function]] instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the [[probit model]]; see {{slink||Alternatives}}. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a ''constant'' rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the [[odds ratio]]. More abstractly, the log-odds are the [[natural parameter]] for the [[Bernoulli distribution]], and the logistic function, which converts log-odds back to probability, is in this sense the "simplest" way to convert a real number to a probability. In particular, it maximizes entropy (minimizes added information), and in this sense makes the fewest assumptions about the data being modeled; see {{slink||Maximum entropy}}.
 
The parameters of a logistic regression are most commonly estimated by [[maximum-likelihood estimation]] (MLE). This does not have a closed-form expression, unlike [[linear least squares (mathematics)|linear least squares]]; see {{section link||Model fitting}}. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by [[ordinary least squares]] (OLS) plays for [[Scalar (mathematics)|scalar]] responses: it is a simple, well-analyzed baseline model; see {{slink||Comparison with linear regression}} for discussion. The logistic regression as a general statistical model was originally developed and popularized primarily by [[Joseph Berkson]],{{sfn|Cramer|2002|p=8}} beginning in {{harvtxt|Berkson|1944}}, where he coined "logit"; see {{slink||History}}.
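Because the maximum-likelihood estimates have no closed form, they are computed iteratively. The following is a minimal sketch of Newton's method (equivalently, iteratively reweighted least squares) for the one-predictor binary model, assuming a dataset that is not perfectly separable so that the MLE exists:

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Fit (beta0, beta1) by maximizing the Bernoulli log-likelihood
    with Newton's method: each step solves the 2x2 Newton system built
    from the gradient and Hessian of the log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)      # weight appearing in the Hessian
            g0 += y - p            # gradient w.r.t. beta0
            g1 += (y - p) * x      # gradient w.r.t. beta1
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += ( h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1
```

At convergence the score equations hold: the residuals <code>y - p</code> sum to zero, both unweighted and weighted by ''x''.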
Line 26:
| issue = 7
| pages = 511–24
| last2 = Cornfield| first2 = J| last3 = Kannel| first3 = W | doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or any other party, based on age, income, sex, race, state of residence, votes in previous elections, etc.<ref name="rms" /> The technique can also be used in [[engineering]], especially for predicting the probability of failure of a given process, system or product.<ref name="strano05">{{cite journal | author = M. Strano | author2 = B.M. Colosimo | year = 2006 | title = Logistic regression analysis for experimental determination of forming limit diagrams | journal = International Journal of Machine Tools and Manufacture | volume = 46 | issue = 6 | pages = 673–682 | doi = 10.1016/j.ijmachtools.2005.07.005 }}</ref><ref name="safety">{{cite journal | last1 = Palei | first1 = S. K. | last2 = Das | first2 = S. K. | doi = 10.1016/j.ssci.2008.01.002 | title = Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach | journal = Safety Science | volume = 47 | pages = 88–96 | year = 2009 }}</ref> It is also used in [[marketing]] applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.<ref>{{cite book|title=Data Mining Techniques For Marketing, Sales and Customer Support|last= Berry |first=Michael J.A|publisher=Wiley|year=1997|page=10}}</ref> In [[economics]], it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a [[mortgage]]. [[Conditional random field]]s, an extension of logistic regression to sequential data, are used in [[natural language processing]].
Disaster planners and engineers rely on these models to predict decisions taken by householders or building occupants in small-scale and large-scale evacuations, such as building fires, wildfires, and hurricanes, among others.<ref>{{Cite journal |last1=Mesa-Arango |first1=Rodrigo |last2=Hasan |first2=Samiul |last3=Ukkusuri |first3=Satish V. |last4=Murray-Tuite |first4=Pamela |date=February 2013 |title=Household-Level Model for Hurricane Evacuation Destination Type Choice Using Hurricane Ivan Data |url=https://ascelibrary.org/doi/10.1061/%28ASCE%29NH.1527-6996.0000083 |journal=Natural Hazards Review |language=en |volume=14 |issue=1 |pages=11–20 |doi=10.1061/(ASCE)NH.1527-6996.0000083 |bibcode=2013NHRev..14...11M |issn=1527-6988|url-access=subscription }}</ref><ref>{{Cite journal |last1=Wibbenmeyer |first1=Matthew J. |last2=Hand |first2=Michael S. |last3=Calkin |first3=David E. |last4=Venn |first4=Tyron J. |last5=Thompson |first5=Matthew P. |date=June 2013 |title=Risk Preferences in Strategic Wildfire Decision Making: A Choice Experiment with U.S. Wildfire Managers |url=https://onlinelibrary.wiley.com/doi/10.1111/j.1539-6924.2012.01894.x |journal=Risk Analysis |language=en |volume=33 |issue=6 |pages=1021–1037 |doi=10.1111/j.1539-6924.2012.01894.x |pmid=23078036 |bibcode=2013RiskA..33.1021W |s2cid=45282555 |issn=0272-4332|url-access=subscription }}</ref><ref>{{Cite journal |last1=Lovreglio |first1=Ruggiero |last2=Borri |first2=Dino |last3=dell’Olio |first3=Luigi |last4=Ibeas |first4=Angel |date=2014-02-01 |title=A discrete choice model based on random utilities for exit choice in emergency evacuations |url=https://www.sciencedirect.com/science/article/pii/S0925753513002294 |journal=Safety Science |volume=62 |pages=418–426 |doi=10.1016/j.ssci.2013.10.004 |issn=0925-7535|url-access=subscription }}</ref> These models help in the development of reliable [[Emergency management|disaster management plans]] and safer design for the [[built environment]].
 
=== Supervised machine learning ===
Line 65:
where <math>\beta_0 = -\mu/s</math> is known as the [[vertical intercept|intercept]] (it is the ''vertical'' intercept or ''y''-intercept of the line <math>y = \beta_0+\beta_1 x</math>), and <math>\beta_1= 1/s</math> is the inverse scale parameter or [[rate parameter]]: these are the ''y''-intercept and slope of the log-odds as a function of ''x''. Conversely, <math>\mu=-\beta_0/\beta_1</math> and <math>s=1/\beta_1</math>.
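The two parametrizations convert back and forth directly; a small sketch (the numeric values below are hypothetical, chosen only for illustration):

```python
def location_scale_to_coeffs(mu, s):
    """(mu, s) -> (beta0, beta1): y-intercept and slope of the log-odds line."""
    return -mu / s, 1.0 / s

def coeffs_to_location_scale(beta0, beta1):
    """(beta0, beta1) -> (mu, s): location and scale of the logistic curve."""
    return -beta0 / beta1, 1.0 / beta1
```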
 
Note that this model is an oversimplification, since it assumes that everyone will pass if they study long enough (the limit is 1). A more realistic model would also treat the limiting value as a parameter to be estimated.
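The remark above can be sketched as a logistic curve whose upper asymptote is itself a free parameter (a hypothetical extension for illustration, not part of the standard model):

```python
import math

def capped_logistic(x, beta0, beta1, limit=1.0):
    """Logistic curve whose upper asymptote is the free parameter
    `limit` rather than being fixed at 1."""
    return limit / (1.0 + math.exp(-(beta0 + beta1 * x)))
```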
 
===Fit===
Line 164:
! Hours (''β''<sub>1</sub>)
| 1.5 || 0.9 || 21.47 || 0.017
|}
 
Line 484:
 
These intuitions can be expressed as follows:
{{table alignment}}
 
{|class="wikitable col2right col3left"
|+Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables
|-
Line 494:
|-
! Middle-income
| moderate + || weak + || {{CNone|none}}
|-
! Low-income
| {{CNone|none|style=text-align:right;}} || strong + || {{CNone|none}}
|-
|}
Line 548:
:<math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots) .</math>
 
To prove that this is equivalent to the previous model, we start by recognizing that the above model is overspecified, in that <math>\Pr(Y_i=0)</math> and <math>\Pr(Y_i=1)</math> cannot be independently specified: rather <math>\Pr(Y_i=0) + \Pr(Y_i=1) = 1</math>, so knowing one automatically determines the other. As a result, the model is [[nonidentifiable]], in that multiple combinations of <math>\boldsymbol\beta_0</math> and <math>\boldsymbol\beta_1</math> will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector to both of them will produce the same probabilities:
 
:<math>
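This nonidentifiability can be checked numerically: for each observation, adding a constant vector to both coefficient vectors adds the same constant to every class score, which leaves the softmax probabilities unchanged. A minimal sketch with hypothetical scores:

```python
import math

def softmax(scores):
    """Softmax over a list of class scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.2, -1.3, 0.7]            # hypothetical class scores beta_c . X_i
shifted = [s + 5.0 for s in scores]  # the same constant added to every score
```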
Line 801:
The {{math|logit}} of the probability of success is then fitted to the predictors. The predicted value of the {{math|logit}} is converted back into predicted odds, via the inverse of the natural logarithm – the [[exponential function]]. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success.
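That round trip, from fitted logit back to odds and then to a categorical prediction, can be sketched as follows (the cutoff value is an arbitrary choice, as in the text):

```python
import math

def logit_to_prediction(logit_value, cutoff=0.5):
    """Convert a fitted logit to odds via the exponential function
    (the inverse of the natural logarithm), then to a probability,
    then to a yes/no prediction via the cutoff."""
    odds = math.exp(logit_value)
    prob = odds / (1.0 + odds)
    return odds, prob, prob > cutoff
```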
 
==Maximum entropy==
 
Of all the functional forms used for estimating the probabilities of a particular categorical outcome which optimize the fit by maximizing the likelihood function (e.g. [[Probit model|probit regression]], [[Poisson regression]], etc.), the logistic regression solution is unique in that it is a [[Maximum entropy probability distribution|maximum entropy]] solution.<ref name="Mount2011">{{cite web |last=Mount |first=J. |date=2011 |title=The Equivalence of Logistic Regression and Maximum Entropy models |url=https://win-vector.com/2011/09/23/the-equivalence-of-logistic-regression-and-maximum-entropy-models/ |access-date=Feb 23, 2022}}</ref> This is a case of a general property: an [[exponential family]] of distributions maximizes entropy, given an expected value. In the case of the logistic model, the log-odds are the [[natural parameter]] of the Bernoulli distribution (the distribution is then in "[[canonical form]]", and the logit is the canonical link function), while other sigmoid functions correspond to non-canonical link functions; this underlies the model's mathematical elegance and ease of optimization. See {{slink|Exponential family|Maximum entropy derivation}} for details.
 
=== Proof ===
 
To show this, we use the method of [[Lagrange multipliers]]. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers and their constraint expressions. The general multinomial case will be considered, since the proof is not made much simpler by restricting to special cases. Equating the derivative of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities which corresponds to the one used in logistic regression.<ref name="Mount2011"/>
 
As in the above section on [[#Multinomial logistic regression : Many explanatory variable and many categories|multinomial logistic regression]], we will consider {{tmath|M+1}} explanatory variables denoted {{tmath|x_m}}, which include <math>x_0=1</math>. There will be a total of ''K'' data points, indexed by <math>k=\{1,2,\dots,K\}</math>, with the data points given by <math>x_{mk}</math> and {{tmath|y_k}}. The ''x<sub>mk</sub>'' will also be represented as an {{tmath|(M+1)}}-dimensional vector <math>\boldsymbol{x}_k = \{x_{0k},x_{1k},\dots,x_{Mk}\}</math>. There will be {{tmath|N+1}} possible values of the categorical variable ''y'', ranging from 0 to ''N''.
 
Let ''p<sub>n</sub>('''x''')'' be the probability, given explanatory variable vector '''x''', that the outcome will be <math>y=n</math>. Define <math>p_{nk}=p_n(\boldsymbol{x}_k)</math> which is the probability that for the ''k''-th measurement, the categorical outcome is ''n''.
 
The Lagrangian will be expressed as a function of the probabilities ''p<sub>nk</sub>'' and will be maximized by equating its derivatives with respect to these probabilities to zero. An important point is that the probabilities are treated equally, and the fact that they sum to 1 is part of the Lagrangian formulation, rather than being assumed from the beginning.
 
The first contribution to the Lagrangian is the [[Entropy (information theory)|entropy]]:
 
:<math>\mathcal{L}_{ent}=-\sum_{k=1}^K\sum_{n=0}^N p_{nk}\ln(p_{nk})</math>
 
The log-likelihood is:
 
:<math>\ell=\sum_{k=1}^K\sum_{n=0}^N \Delta(n,y_k)\ln(p_{nk})</math>
 
Assuming the multinomial logistic function, the derivative of the log-likelihood with respect to the beta coefficients was found to be:
 
:<math>\frac{\partial \ell}{\partial \beta_{nm}}=\sum_{k=1}^K ( p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math>
 
A key point is that this expression is (remarkably) not an explicit function of the beta coefficients: it is only a function of the probabilities ''p<sub>nk</sub>'' and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized, making no reference to the functional form of ''p<sub>nk</sub>''. There are then (''M''+1)(''N''+1) fitting constraints, and the fitting constraint term in the Lagrangian is:
 
:<math>\mathcal{L}_{fit}=\sum_{n=0}^N\sum_{m=0}^M \lambda_{nm}\sum_{k=1}^K (p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math>
 
where the ''&lambda;<sub>nm</sub>'' are the appropriate Lagrange multipliers. There are ''K'' normalization constraints which may be written:
 
:<math>\sum_{n=0}^N p_{nk}=1</math>
 
so that the normalization term in the Lagrangian is:
 
:<math>\mathcal{L}_{norm}=\sum_{k=1}^K \alpha_k \left(1-\sum_{n=0}^N p_{nk}\right) </math>
 
where the ''α<sub>k</sub>'' are the appropriate Lagrange multipliers. The Lagrangian is then the sum of the above three terms:
 
:<math>\mathcal{L}=\mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}</math>
 
Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields:
 
:<math>\frac{\partial \mathcal{L}}{\partial p_{n'k'}}=0=-\ln(p_{n'k'})-1+\sum_{m=0}^M (\lambda_{n'm}x_{mk'})-\alpha_{k'}</math>
 
Using the more condensed vector notation:
 
:<math>\sum_{m=0}^M \lambda_{nm}x_{mk} = \boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k</math>
 
and dropping the primes on the ''n'' and ''k'' indices, and then solving for <math>p_{nk}</math> yields:
 
:<math>p_{nk}=e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}/Z_k</math>
 
where:
 
:<math>Z_k=e^{1+\alpha_k}</math>
 
Imposing the normalization constraint, we can solve for the ''Z<sub>k</sub>'' and write the probabilities as:
 
:<math>p_{nk}=\frac{e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}}{\sum_{u=0}^N e^{\boldsymbol{\lambda}_u\cdot\boldsymbol{x}_k}}</math>
 
The <math>\boldsymbol{\lambda}_n</math> are not all independent. We can add any constant {{tmath|(M+1)}}-dimensional vector to each of the <math>\boldsymbol{\lambda}_n</math> without changing the value of the <math>p_{nk}</math> probabilities so that there are only ''N'' rather than {{tmath|N+1}} independent <math>\boldsymbol{\lambda}_n</math>. In the [[#Multinomial logistic regression : Many explanatory variable and many categories|multinomial logistic regression]] section above, the <math>\boldsymbol{\lambda}_0</math> was subtracted from each <math>\boldsymbol{\lambda}_n</math> which set the exponential term involving <math>\boldsymbol{\lambda}_0</math> to 1, and the beta coefficients were given by <math>\boldsymbol{\beta}_n=\boldsymbol{\lambda}_n-\boldsymbol{\lambda}_0</math>.
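This identification can be checked numerically: subtracting <math>\boldsymbol{\lambda}_0</math> from every <math>\boldsymbol{\lambda}_n</math> leaves all the probabilities unchanged, and makes the first coefficient vector zero. A small sketch with hypothetical values:

```python
import math

def softmax_probs(lams, x):
    """p_n = exp(lam_n . x) / sum_u exp(lam_u . x)."""
    scores = [sum(l * xi for l, xi in zip(lam, x)) for lam in lams]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical lambda_n vectors for N+1 = 3 categories, M+1 = 2 components.
lams = [[0.5, -1.0], [1.2, 0.3], [-0.4, 0.8]]
# beta_n = lambda_n - lambda_0, so beta_0 is the zero vector.
betas = [[l - l0 for l, l0 in zip(lam, lams[0])] for lam in lams]
x = [1.0, 2.5]  # x_0 = 1 plus one explanatory variable
```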
 
===Other approaches===
 
In machine learning applications where logistic regression is used for binary classification, maximizing the likelihood is equivalent to minimizing the [[cross-entropy]] loss function.
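The equivalence is direct: the binary cross-entropy of the predicted probabilities is exactly the negative Bernoulli log-likelihood. A minimal sketch (the data and coefficients below are hypothetical):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def cross_entropy_loss(xs, ys, beta0, beta1):
    """Binary cross-entropy of the model's predicted probabilities."""
    loss = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta0 + beta1 * x)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss

def negative_log_likelihood(xs, ys, beta0, beta1):
    """Negative Bernoulli log-likelihood of the same model."""
    nll = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta0 + beta1 * x)
        nll -= math.log(p if y == 1 else 1 - p)
    return nll
```

Minimizing either quantity over the coefficients therefore yields the same maximum-likelihood estimates.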