Logistic regression: Difference between revisions

{{Redirect-distinguish|Logit model|Logit function}}
[[File:Exam pass logistic curve.svg|thumb|400px|Example graph of a logistic regression curve fitted to data. The curve shows the estimated probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See {{slink||Example}} for worked details.]]
 
In [[statistics]], a '''logistic model''' (or '''logit model''') is a [[statistical model]] that models the [[logit|log-odds]] of an event as a [[linear function (calculus)|linear combination]] of one or more [[independent variable]]s. In [[regression analysis]], '''logistic regression'''<ref>{{cite journal|last1=Tolles|first1=Juliana|last2=Meurer|first2=William J|date=2016|title=Logistic Regression Relating Patient Characteristics to Outcomes|journal=JAMA |language=en|volume=316|issue=5|pages=533–4|issn=0098-7484|oclc=6823603312|doi=10.1001/jama.2016.7653|pmid=27483067}}</ref> (or '''logit regression''') [[estimation theory|estimates]] the parameters of a logistic model (the coefficients in the linear combination). In binary logistic regression there is a single [[binary variable|binary]] [[dependent variable]], coded by an [[indicator variable]], where the two values are labeled "0" and "1", while the [[independent variable]]s can each be a binary variable (two classes, coded by an indicator variable) or a [[continuous variable]] (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling;<ref name=Hosmer/> the function that converts log-odds to probability is the [[logistic function]], hence the name. The [[unit of measurement]] for the log-odds scale is called a ''[[logit]]'', from '''''log'''istic un'''it''''', hence the alternative names. See {{slink||Background}} and {{slink||Definition}} for formal mathematics, and {{slink||Example}} for a worked example.
 
Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning, of a patient being healthy, etc. (see {{slink||Applications}}), and the logistic model has been the most commonly used model for [[binary regression]] since about 1970.{{sfn|Cramer|2002|p=10–11}} Binary variables can be generalized to [[categorical variable]]s when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and binary logistic regression can be generalized to [[multinomial logistic regression]]. If the multiple categories are [[Level of measurement#Ordinal scale|ordered]], one can use [[ordinal logistic regression]] (for example the proportional odds ordinal logistic model<ref name=wal67est />). See {{slink||Extensions}} for further extensions. The logistic regression model itself simply models the probability of output in terms of input and does not perform [[statistical classification]] (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class and those below the cutoff as the other; this is a common way to make a [[binary classifier]].
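The cutoff idea can be sketched in a few lines of Python; the coefficient values below are hypothetical (loosely in the spirit of the exam example later in the article), not estimates from data:

```python
import math

def predict_proba(x, beta0=-4.1, beta1=1.5):
    """Modeled probability of the outcome labeled "1" for input x.

    beta0 and beta1 are illustrative coefficients, not fitted here.
    """
    t = beta0 + beta1 * x                  # log-odds (logit)
    return 1.0 / (1.0 + math.exp(-t))      # logistic function: log-odds -> probability

def classify(x, cutoff=0.5):
    """Binary classifier built on top of the probability model."""
    return 1 if predict_proba(x) >= cutoff else 0

print(classify(1.0), classify(4.0))   # 0 1
```

The model itself only outputs probabilities; the cutoff is an extra choice layered on top.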
 
Analogous linear models for binary variables with a different [[sigmoid function]] instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the [[probit model]]; see {{slink||Alternatives}}. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a ''constant'' rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the [[odds ratio]]. More abstractly, the log-odds is the [[natural parameter]] for the [[Bernoulli distribution]], and in this sense the logistic function is the "simplest" way to convert a real number to a probability. In particular, it maximizes entropy (minimizes added information), and in this sense makes the fewest assumptions about the data being modeled; see {{slink||Maximum entropy}}.
 
The parameters of a logistic regression are most commonly estimated by [[maximum-likelihood estimation]] (MLE). This does not have a closed-form expression, unlike [[linear least squares (mathematics)|linear least squares]]; see {{section link||Model fitting}}. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by [[ordinary least squares]] (OLS) plays for [[Scalar (mathematics)|scalar]] responses: it is a simple, well-analyzed baseline model; see {{slink||Comparison with linear regression}} for discussion. The logistic regression as a general statistical model was originally developed and popularized primarily by [[Joseph Berkson]],{{sfn|Cramer|2002|p=8}} beginning in {{harvtxt|Berkson|1944}}, where he coined "logit"; see {{slink||History}}.
| issue = 7
| pages = 511–24
| last2 = Cornfield| first2 = J| last3 = Kannel| first3 = W | doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or any other party, based on age, income, sex, race, state of residence, votes in previous elections, etc.<ref name="rms" /> The technique can also be used in [[engineering]], especially for predicting the probability of failure of a given process, system or product.<ref name="strano05">{{cite journal | author = M. Strano | author2 = B.M. Colosimo | year = 2006 | title = Logistic regression analysis for experimental determination of forming limit diagrams | journal = International Journal of Machine Tools and Manufacture | volume = 46 | issue = 6 | pages = 673–682 | doi = 10.1016/j.ijmachtools.2005.07.005 }}</ref><ref name="safety">{{cite journal | last1 = Palei | first1 = S. K. | last2 = Das | first2 = S. K. | doi = 10.1016/j.ssci.2008.01.002 | title = Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach | journal = Safety Science | volume = 47 | pages = 88–96 | year = 2009 }}</ref> It is also used in [[marketing]] applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.<ref>{{cite book|title=Data Mining Techniques For Marketing, Sales and Customer Support|last= Berry |first=Michael J.A|publisher=Wiley|year=1997|page=10}}</ref> In [[economics]], it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a [[mortgage]]. [[Conditional random field]]s, an extension of logistic regression to sequential data, are used in [[natural language processing]].
Disaster planners and engineers rely on these models to predict decisions taken by householders or building occupants in small-scale and large-scale evacuations, such as building fires, wildfires, and hurricanes, among others.<ref>{{Cite journal |last1=Mesa-Arango |first1=Rodrigo |last2=Hasan |first2=Samiul |last3=Ukkusuri |first3=Satish V. |last4=Murray-Tuite |first4=Pamela |date=February 2013 |title=Household-Level Model for Hurricane Evacuation Destination Type Choice Using Hurricane Ivan Data |url=https://ascelibrary.org/doi/10.1061/%28ASCE%29NH.1527-6996.0000083 |journal=Natural Hazards Review |language=en |volume=14 |issue=1 |pages=11–20 |doi=10.1061/(ASCE)NH.1527-6996.0000083 |bibcode=2013NHRev..14...11M |issn=1527-6988|url-access=subscription }}</ref><ref>{{Cite journal |last1=Wibbenmeyer |first1=Matthew J. |last2=Hand |first2=Michael S. |last3=Calkin |first3=David E. |last4=Venn |first4=Tyron J. |last5=Thompson |first5=Matthew P. |date=June 2013 |title=Risk Preferences in Strategic Wildfire Decision Making: A Choice Experiment with U.S. Wildfire Managers |url=https://onlinelibrary.wiley.com/doi/10.1111/j.1539-6924.2012.01894.x |journal=Risk Analysis |language=en |volume=33 |issue=6 |pages=1021–1037 |doi=10.1111/j.1539-6924.2012.01894.x |pmid=23078036 |bibcode=2013RiskA..33.1021W |s2cid=45282555 |issn=0272-4332|url-access=subscription }}</ref><ref>{{Cite journal |last1=Lovreglio |first1=Ruggiero |last2=Borri |first2=Dino |last3=dell’Olio |first3=Luigi |last4=Ibeas |first4=Angel |date=2014-02-01 |title=A discrete choice model based on random utilities for exit choice in emergency evacuations |url=https://www.sciencedirect.com/science/article/pii/S0925753513002294 |journal=Safety Science |volume=62 |pages=418–426 |doi=10.1016/j.ssci.2013.10.004 |issn=0925-7535|url-access=subscription }}</ref> These models help in the development of reliable [[Emergency management|disaster management plans]] and safer design for the [[built environment]].
 
=== Supervised and unsupervised machine learning ===
Logistic regression is a [[supervised machine learning]] algorithm widely used for [[binary classification]] tasks, such as identifying whether an email is spam or diagnosing a disease based on patient test results. It applies the logistic (sigmoid) function to transform a linear combination of input features into a probability between 0 and 1, interpreted as the likelihood that the input belongs to one of two predefined categories. Because the S-shaped logistic function maps any real-valued number into the interval from 0 to 1, it is well suited to binary classification, and computing the probability that the dependent variable falls in a given category provides a probabilistic framework that supports informed decision-making.<ref>{{Cite web |title=Logistic Regression |url=https://www.mastersindatascience.org/learning/machine-learning-algorithms/logistic-regression/ |access-date=2024-03-16 |website=CORP-MIDS1 (MDS) |language=en-US}}</ref>
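A minimal sketch of the S-shaped mapping described above, using only the standard library (the sample inputs are chosen purely to show the behavior):

```python
import math

def sigmoid(t):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

# Large negative inputs approach 0, large positive inputs approach 1,
# and t = 0 maps exactly to 0.5 -- the properties that let the output
# be read as a class probability.
for t in (-6, -2, 0, 2, 6):
    print(t, round(sigmoid(t), 3))
```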
 
 
where <math>\beta_0 = -\mu/s</math> is known as the [[vertical intercept|intercept]] (it is the ''vertical'' intercept or ''y''-intercept of the line <math>y = \beta_0+\beta_1 x</math>), and <math>\beta_1= 1/s</math> (inverse scale parameter or [[rate parameter]]): these are the ''y''-intercept and slope of the log-odds as a function of ''x''. Conversely, <math>\mu=-\beta_0/\beta_1</math> and <math>s=1/\beta_1</math>.
 
Note that this model is an oversimplification, since it assumes that everybody will eventually pass if they study long enough (the predicted probability approaches 1 as study time grows without bound).
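The conversions between <math>(\mu, s)</math> and <math>(\beta_0, \beta_1)</math> can be checked numerically; the values of <math>\mu</math> and <math>s</math> below are illustrative:

```python
mu, s = 2.7, 0.67        # location and scale of the logistic curve (illustrative)

beta1 = 1 / s            # slope of the log-odds line (inverse scale / rate parameter)
beta0 = -mu / s          # y-intercept of the log-odds line

# Invert the relations to recover mu and s from the regression coefficients:
mu_back = -beta0 / beta1
s_back = 1 / beta1

print(round(mu_back, 10), round(s_back, 10))   # recovers 2.7 and 0.67
```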
 
===Fit===
{| class="wikitable"
|-
! rowspan="2" | Hours<br />of study<br />(''x'')
! colspan="3" | Passing exam
|-
! Log-odds (''t'') !! Odds (''e<sup>t</sup>'') !! Probability (''p'')
|- style="text-align: right;"
| 1|| −2.57 || 0.076 ≈ 1:13.1 || 0.07
| 2|| −1.07 || 0.34 ≈ 1:2.91 || 0.26
|- style="text-align: right;"
|{{tmath|\mu \approx 2.7}} || 0 || 1 || {{sfrac|1|2}} = 0.50
|- style="text-align: right;"
| 3|| 0.44 || 1.55 || 0.61
|- style="text-align:right;"
! Hours (''β''<sub>1</sub>)
| 1.5 || 0.69 || 21.47 || 0.017
|}
 
 
===Multiple explanatory variables===
If there are multiple explanatory variables, the above expression <math>\beta_0+\beta_1x</math> can be revised to <math>\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m = \beta_0+ \sum_{i=1}^m \beta_ix_i</math>. Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a [[multiple regression]] with ''m'' explanators; the parameters <math>\beta_i</math> for all <math>i = 0, 1, 2, \dots, m</math> are all estimated.
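The multiple-variable log-odds can be sketched directly; the coefficients below are hypothetical, not fitted:

```python
import math

def log_odds(x, beta):
    """beta[0] + beta[1]*x[0] + ... + beta[m]*x[m-1] for m explanatory variables."""
    assert len(beta) == len(x) + 1, "one coefficient per variable plus an intercept"
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

beta = [-1.0, 0.8, -0.5, 2.0]    # intercept and m = 3 slopes (hypothetical values)
x = [1.2, 0.4, 0.9]

t = log_odds(x, beta)            # log-odds of success
p = 1.0 / (1.0 + math.exp(-t))   # back to a probability via the logistic function
print(round(t, 2), round(p, 3))
```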
 
Again, the more traditional equations are:
:<math>t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_M x_M </math>
 
where ''t'' is the log-odds and <math>\beta_i</math> are parameters of the model. An additional generalization has been introduced in which the base of the model (''b'') is not restricted to [[Euler's number]] ''e''. In most applications, the base <math>b</math> of the logarithm is taken to be ''[[E (mathematical constant)|e]]''; however, in some cases it can be easier to communicate results by working in base 2 or base 10.
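The choice of base only rescales the log-odds; every base recovers the same probability. A quick numeric check (the probability 0.75 is arbitrary):

```python
import math

p = 0.75
odds = p / (1 - p)                 # 3.0

t_e = math.log(odds)               # natural log-odds
t_2 = math.log2(odds)              # base-2 log-odds
t_10 = math.log10(odds)            # base-10 log-odds

# Each base recovers the same probability p = b**t / (1 + b**t):
for b, t in ((math.e, t_e), (2.0, t_2), (10.0, t_10)):
    print(round(b ** t / (1 + b ** t), 9))
```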
 
For a more compact notation, we will specify the explanatory variables and the ''β'' coefficients as {{tmath|(M+1)}}-dimensional vectors:
 
These intuitions can be expressed as follows:
{{table alignment}}
 
{|class="wikitable col2right col3left"
|+Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables
|-
|-
! Middle-income
| moderate + || weak + || {{CNone|none}}
|-
! Low-income
| {{CNone|none|style=text-align:right;}} || strong + || {{CNone|none}}
|-
|}
:<math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots) .</math>
 
To prove that this is equivalent to the previous model, we start by recognizing that the above model is overspecified, in that <math>\Pr(Y_i=0)</math> and <math>\Pr(Y_i=1)</math> cannot be independently specified: rather <math>\Pr(Y_i=0) + \Pr(Y_i=1) = 1</math>, so knowing one automatically determines the other. As a result, the model is [[nonidentifiable]], in that multiple combinations of <math>\boldsymbol\beta_0</math> and <math>\boldsymbol\beta_1</math> will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector to both of them will produce the same probabilities:
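This nonidentifiability is easy to see numerically: adding the same constant to both linear scores leaves the softmax probabilities unchanged (the score values below are arbitrary):

```python
import math

def softmax(scores):
    """Exponentiate and normalize a list of real scores into probabilities."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.3, 1.1]                    # beta_0 . X_i and beta_1 . X_i (arbitrary values)
shifted = [s + 5.0 for s in scores]    # add the same constant to both

print([round(p, 6) for p in softmax(scores)])
print([round(p, 6) for p in softmax(shifted)])   # same probabilities
```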
 
:<math>
\frac{e^{(\boldsymbol\beta_1 + \mathbf{c}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0 + \mathbf{c}) \cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1 + \mathbf{c}) \cdot \mathbf{X}_i}} = \frac{e^{\mathbf{c} \cdot \mathbf{X}_i}\, e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{c} \cdot \mathbf{X}_i}\left(e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}\right)} = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = \Pr(Y_i=1)
</math>
 
==Model fitting==
{{expand section|date=October 2016}}
 
===Maximum likelihood estimation (MLE)===
 
The regression coefficients are usually estimated using [[maximum likelihood estimation]].<ref name=Menard/><ref>{{cite journal |first1=Christian |last1=Gourieroux |first2=Alain |last2=Monfort |title=Asymptotic Properties of the Maximum Likelihood Estimator in Dichotomous Logit Models |journal=Journal of Econometrics |volume=17 |issue=1 |year=1981 |pages=83–97 |doi=10.1016/0304-4076(81)90060-9 }}</ref> Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead, for example [[Newton's method]]. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.<ref name="Menard" />
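A minimal Newton's-method sketch for one explanatory variable follows; the data are made up for illustration, and real implementations use iteratively reweighted least squares with numerical safeguards:

```python
import math

# Hypothetical data: hours studied vs. pass (1) / fail (0)
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0,   0,   0,   0,   1,   0,   1,   1,   1,   1]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

b0, b1 = 0.0, 0.0                        # tentative starting solution
for _ in range(25):                      # iterate until numerically converged
    p = [sigmoid(b0 + b1 * x) for x in xs]
    # Gradient of the log-likelihood:
    g0 = sum(y - pi for y, pi in zip(ys, p))
    g1 = sum((y - pi) * x for x, y, pi in zip(xs, ys, p))
    # Negative Hessian (observed information) entries:
    w = [pi * (1 - pi) for pi in p]
    h00 = sum(w)
    h01 = sum(wi * x for wi, x in zip(w, xs))
    h11 = sum(wi * x * x for wi, x in zip(w, xs))
    det = h00 * h11 - h01 * h01
    # Newton step: add the inverse information times the gradient
    b0 += ( h11 * g0 - h01 * g1) / det
    b1 += (-h01 * g0 + h00 * g1) / det

print(round(b0, 3), round(b1, 3))        # converged coefficient estimates
```

At convergence the gradient of the log-likelihood is (numerically) zero, which is the condition the iteration is designed to reach.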
 
In some instances, the model may not reach convergence. Non-convergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, [[multicollinearity]], [[sparse matrix|sparseness]], or complete [[Separation (statistics)|separation]].
[[File:Logistic-sigmoid-vs-scaled-probit.svg|right|300px|thumb|Comparison of [[logistic function]] with a scaled inverse [[probit function]] (i.e. the [[cumulative distribution function|CDF]] of the [[normal distribution]]), comparing <math>\sigma(x)</math> vs. <math display="inline">\Phi(\sqrt{\frac{\pi}{8}}x)</math>, which makes the slopes the same at the origin. This shows the [[heavy-tailed distribution|heavier tails]] of the logistic distribution.]]
 
In a [[Bayesian statistics]] context, [[prior distribution]]s are normally placed on the regression coefficients, for example in the form of [[Gaussian distribution]]s. There is no [[conjugate prior]] of the [[likelihood function]] in logistic regression. When Bayesian inference was performed analytically, this made the [[posterior distribution]] difficult to calculate except in very low dimensions. Now, though, automatic software such as [[OpenBUGS]], [[Just another Gibbs sampler|JAGS]], [[PyMC]], [[Stan (software)|Stan]] or [[Turing.jl]] allows these posteriors to be computed using simulation, so lack of conjugacy is not a concern. However, when the sample size or the number of parameters is large, full Bayesian simulation can be slow, and people often use approximate methods such as [[variational Bayesian methods]] and [[expectation propagation]].
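Because there is no conjugate prior, even a toy posterior must be computed numerically. A brute-force sketch for a single hypothetical slope coefficient with a Gaussian prior, feasible only because the parameter is one-dimensional (all values below are made up):

```python
import math

# Hypothetical data for a slope-only logistic model Pr(y=1) = sigmoid(b * x)
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def log_posterior(b):
    lp = -0.5 * (b / 2.0) ** 2           # Gaussian prior on b, sd = 2 (assumed)
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-b * x))
        lp += math.log(p if y == 1 else 1.0 - p)
    return lp

# Posterior summarized on a grid (no conjugacy, so no closed form);
# the prior keeps the posterior well behaved on this tiny data set.
grid = [i / 100.0 for i in range(-500, 501)]
weights = [math.exp(log_posterior(b)) for b in grid]
total = sum(weights)
posterior_mean = sum(b * w for b, w in zip(grid, weights)) / total
print(round(posterior_mean, 2))
```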
 
==="Rule of ten"===
{{main|One in ten rule}}
 
A widely used rule of thumb, the "[[one in ten rule]]", states that logistic regression models give stable values for the explanatory variables if based on a minimum of about 10 events per explanatory variable (EPV), where ''event'' denotes the cases belonging to the less frequent category in the dependent variable. Thus a study designed to use <math>k</math> explanatory variables for an event (e.g. [[myocardial infarction]]) expected to occur in a proportion <math>p</math> of participants in the study will require a total of <math>10k/p</math> participants. However, there is considerable debate about the reliability of this rule, which is based on simulation studies and lacks a secure theoretical underpinning.<ref>{{cite journal|pmid=27881078|pmc=5122171|year=2016|last1=Van Smeden|first1=M.|title=No rationale for 1 variable per 10 events criterion for binary logistic regression analysis|journal=BMC Medical Research Methodology|volume=16|issue=1|page=163|last2=De Groot|first2=J. A.|last3=Moons|first3=K. G.|last4=Collins|first4=G. S.|last5=Altman|first5=D. G.|last6=Eijkemans|first6=M. J.|last7=Reitsma|first7=J. B.|doi=10.1186/s12874-016-0267-3 |doi-access=free }}</ref> According to some authors<ref>{{cite journal|last=Peduzzi|first=P|author2=Concato, J |author3=Kemper, E |author4=Holford, TR |author5=Feinstein, AR |title=A simulation study of the number of events per variable in logistic regression analysis|journal=[[Journal of Clinical Epidemiology]]|date=December 1996|volume=49|issue=12|pages=1373–9|pmid=8970487|doi=10.1016/s0895-4356(96)00236-3|doi-access=free}}</ref> the rule is overly conservative in some circumstances, with the authors stating, "If we (somewhat subjectively) regard confidence interval coverage less than 93 percent, type I error greater than 7 percent, or relative bias greater than 15 percent as problematic, our results indicate that problems are fairly frequent with 2–4 EPV, uncommon with 5–9 EPV, and still observed with 10–16 EPV.
The worst instances of each problem were not severe with 5–9 EPV and usually comparable to those with 10–16 EPV".<ref>{{cite journal|last1=Vittinghoff|first1=E.|last2=McCulloch|first2=C. E.|title=Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression|journal=American Journal of Epidemiology|date=12 January 2007|volume=165|issue=6|pages=710–718|doi=10.1093/aje/kwk052|pmid=17182981|doi-access=free}}</ref>
 
Others have found results that are not consistent with the above, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required.<ref name=plo14mod/> Also, one can argue that 96 observations are needed only to estimate the model's intercept precisely enough that the margin of error in predicted probabilities is ±0.1 with a 0.95 confidence level.<ref name=rms/>
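The <math>10k/p</math> arithmetic in miniature (the numbers are illustrative, and the rule itself is debated, as discussed above):

```python
def participants_needed(k, p, events_per_variable=10):
    """Total participants so that the expected number of events is
    events_per_variable * k, given event proportion p."""
    return events_per_variable * k / p

# e.g. k = 5 candidate variables, event expected in 20% of participants:
print(participants_needed(5, 0.20))                           # 250 participants, 50 events
print(participants_needed(5, 0.20, events_per_variable=20))   # 500 under the stricter criterion
```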
which is proportional to the square of the (uncorrected) sample standard deviation of the ''y<sub>k</sub>'' data points.
 
We can imagine a case where the ''y<sub>k</sub>'' data points are randomly assigned to the various ''x<sub>k</sub>'', and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the ''y<sub>k</sub>'' outcomes. It can be shown that the optimized error of any of these fits will never be less than the optimum error of the null model, and that the difference between these minimum errors will follow a [[chi-squared distribution]], with degrees of freedom equal to those of the proposed model minus those of the null model which, in this case, will be <math>2-1=1</math>. Using the [[chi-squared test]], we may then estimate how many of these permuted sets of ''y<sub>k</sub>'' will yield a minimum error less than or equal to the minimum error using the original ''y<sub>k</sub>'', and so we can estimate how significant an improvement is given by the inclusion of the ''x'' variable in the proposed model.
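A toy numeric check of the null-model error described above (the binary outcomes are chosen arbitrarily):

```python
ys = [0, 0, 1, 0, 1, 1]                      # hypothetical binary outcomes
ybar = sum(ys) / len(ys)                     # the null model's single fitted value

epsilon2 = sum((y - ybar) ** 2 for y in ys)  # optimized squared error of the null model
variance = epsilon2 / len(ys)                # square of the uncorrected sample SD

print(epsilon2, variance)                    # 1.5 0.25
```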
 
For logistic regression, the measure of goodness-of-fit is the likelihood function ''L'', or its logarithm, the log-likelihood ''ℓ''. The likelihood function ''L'' is analogous to the <math>\varepsilon^2</math> in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by <math>\hat{\ell}</math>.
The {{math|logit}} of the probability of success is then fitted to the predictors. The predicted value of the {{math|logit}} is converted back into predicted odds, via the inverse of the natural logarithm – the [[exponential function]]. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success.
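The back-conversion from a fitted logit to odds and a categorical prediction, in miniature (the logit value is arbitrary):

```python
import math

logit = 0.9                    # predicted log-odds from a fitted model (arbitrary value)
odds = math.exp(logit)         # inverse of the natural logarithm gives the odds

# A cutoff of 1.0 on the odds scale corresponds to probability 0.5:
prediction = "success" if odds > 1.0 else "failure"
print(round(odds, 3), prediction)
```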
 
==Maximum entropy==
 
Of all the functional forms used for estimating the probabilities of a particular categorical outcome which optimize the fit by maximizing the likelihood function (e.g. [[Probit model|probit regression]], [[Poisson regression]], etc.), the logistic regression solution is unique in that it is a [[Maximum entropy probability distribution|maximum entropy]] solution.<ref name="Mount2011">{{cite web |url=http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf |title=The Equivalence of Logistic Regression and Maximum Entropy models |last=Mount |first=J. |date=2011 |website= |publisher= |access-date=Feb 23, 2022 |quote=}}</ref> This is a case of a general property: an [[exponential family]] of distributions maximizes entropy, given an expected value. In the case of the logistic model, the log-odds is the [[natural parameter]] of the Bernoulli distribution (the model is in "[[canonical form]]", and the logit is the canonical link function), while other sigmoid functions correspond to non-canonical link functions; this underlies its mathematical elegance and ease of optimization. See {{slink|Exponential family|Maximum entropy derivation}} for details.
 
=== Proof ===
 
In order to show this, we use the method of [[Lagrange multipliers]]. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers times various constraint expressions. The general multinomial case will be considered, since the proof is not made that much simpler by considering simpler cases. Equating the derivative of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities which corresponds to those used in logistic regression.<ref name="Mount2011"/>
 
As in the above section on [[#Multinomial logistic regression : Many explanatory variable and many categories|multinomial logistic regression]], we will consider {{tmath|M+1}} explanatory variables denoted {{tmath|x_m}}, which include <math>x_0=1</math>. There will be a total of ''K'' data points, indexed by <math>k=\{1,2,\dots,K\}</math>, and the data points are given by <math>x_{mk}</math> and {{tmath|y_k}}. The ''x<sub>mk</sub>'' will also be represented as an {{tmath|(M+1)}}-dimensional vector <math>\boldsymbol{x}_k = \{x_{0k},x_{1k},\dots,x_{Mk}\}</math>. There will be {{tmath|N+1}} possible values of the categorical variable ''y'', ranging from 0 to ''N''.
 
Let ''p<sub>n</sub>('''x''')'' be the probability, given explanatory variable vector '''x''', that the outcome will be <math>y=n</math>. Define <math>p_{nk}=p_n(\boldsymbol{x}_k)</math> which is the probability that for the ''k''-th measurement, the categorical outcome is ''n''.
 
The Lagrangian will be expressed as a function of the probabilities ''p<sub>nk</sub>'' and will be maximized by equating the derivatives of the Lagrangian with respect to these probabilities to zero. An important point is that the probabilities are treated equally and the fact that they sum to 1 is part of the Lagrangian formulation, rather than being assumed from the beginning.
 
The first contribution to the Lagrangian is the [[Entropy (information theory)|entropy]]:
 
:<math>\mathcal{L}_{ent}=-\sum_{k=1}^K\sum_{n=0}^N p_{nk}\ln(p_{nk})</math>
 
The log-likelihood is:
 
:<math>\ell=\sum_{k=1}^K\sum_{n=0}^N \Delta(n,y_k)\ln(p_{nk})</math>
 
Assuming the multinomial logistic function, the derivative of the log-likelihood with respect to the beta coefficients was found to be:
 
:<math>\frac{\partial \ell}{\partial \beta_{nm}}=\sum_{k=1}^K ( p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math>
 
A very important point here is that this expression is (remarkably) not an explicit function of the beta coefficients. It is only a function of the probabilities ''p<sub>nk</sub>'' and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized and makes no reference to the functional form of ''p<sub>nk</sub>''. There are then (''M''+1)(''N''+1) fitting constraints and the fitting constraint term in the Lagrangian is then:
 
:<math>\mathcal{L}_{fit}=\sum_{n=0}^N\sum_{m=0}^M \lambda_{nm}\sum_{k=1}^K (p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math>
 
where the ''&lambda;<sub>nm</sub>'' are the appropriate Lagrange multipliers. There are ''K'' normalization constraints which may be written:
 
:<math>\sum_{n=0}^N p_{nk}=1</math>
 
so that the normalization term in the Lagrangian is:
 
:<math>\mathcal{L}_{norm}=\sum_{k=1}^K \alpha_k \left(1-\sum_{n=0}^N p_{nk}\right) </math>
 
where the ''α<sub>k</sub>'' are the appropriate Lagrange multipliers. The Lagrangian is then the sum of the above three terms:
 
:<math>\mathcal{L}=\mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}</math>
 
Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields:
 
:<math>\frac{\partial \mathcal{L}}{\partial p_{n'k'}}=0=-\ln(p_{n'k'})-1+\sum_{m=0}^M (\lambda_{n'm}x_{mk'})-\alpha_{k'}</math>
 
Using the more condensed vector notation:
 
:<math>\sum_{m=0}^M \lambda_{nm}x_{mk} = \boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k</math>
 
and dropping the primes on the ''n'' and ''k'' indices, and then solving for <math>p_{nk}</math> yields:
 
:<math>p_{nk}=e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}/Z_k</math>
 
where:
 
:<math>Z_k=e^{1+\alpha_k}</math>
 
Imposing the normalization constraint, we can solve for the ''Z<sub>k</sub>'' and write the probabilities as:
 
:<math>p_{nk}=\frac{e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}}{\sum_{u=0}^N e^{\boldsymbol{\lambda}_u\cdot\boldsymbol{x}_k}}</math>
 
The <math>\boldsymbol{\lambda}_n</math> are not all independent. We can add any constant {{tmath|(M+1)}}-dimensional vector to each of the <math>\boldsymbol{\lambda}_n</math> without changing the value of the <math>p_{nk}</math> probabilities so that there are only ''N'' rather than {{tmath|N+1}} independent <math>\boldsymbol{\lambda}_n</math>. In the [[#Multinomial logistic regression : Many explanatory variable and many categories|multinomial logistic regression]] section above, the <math>\boldsymbol{\lambda}_0</math> was subtracted from each <math>\boldsymbol{\lambda}_n</math> which set the exponential term involving <math>\boldsymbol{\lambda}_0</math> to 1, and the beta coefficients were given by <math>\boldsymbol{\beta}_n=\boldsymbol{\lambda}_n-\boldsymbol{\lambda}_0</math>.
 
===Other approaches===
 
In machine learning applications where logistic regression is used for binary classification, the MLE minimises the [[cross-entropy]] loss function.
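This equivalence can be checked term by term: the negative log-likelihood of a Bernoulli model and the cross-entropy loss are the same expression (the labels and probabilities below are arbitrary):

```python
import math

ys = [1, 0, 1, 1]               # observed labels (arbitrary)
ps = [0.9, 0.2, 0.7, 0.6]       # model's predicted probabilities for y = 1 (arbitrary)

neg_log_likelihood = -sum(math.log(p if y == 1 else 1 - p)
                          for y, p in zip(ys, ps))
cross_entropy = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                     for y, p in zip(ys, ps))

print(abs(neg_log_likelihood - cross_entropy) < 1e-12)   # True
```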
|last=Cox|first=David R.
|author-link=David Cox (statistician)
|title=The regression analysis of binary sequences (with discussion)|journal=J R Stat Soc B|date=1958|volume=20|issue=2|pages=215–242|doi=10.1111/j.2517-6161.1958.tb00292.x
|jstor=2983890}}
* {{cite book
|author-link=David Cox (statistician)