{{Redirect-distinguish|Logit model|Logit function}}
[[File:Exam pass logistic curve.svg|thumb|400px|Example graph of a logistic regression curve fitted to data. The curve shows the estimated probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See {{slink||Example}} for worked details.]]
In [[statistics]], a '''logistic model''' (or '''logit model''') is a [[statistical model]] that models the [[log-odds]] of an event as a [[linear combination]] of one or more [[Dependent and independent variables|independent variables]]; in [[regression analysis]], '''logistic regression''' (or '''logit regression''') estimates the parameters of a logistic model from data.
Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning, of a patient being healthy, etc. (see {{slink||Applications}}), and the logistic model has been the most commonly used model for [[binary regression]] since about 1970.{{sfn|Cramer|2002|p=10–11}} Binary variables can be generalized to [[categorical variable]]s when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and the binary logistic regression generalized to [[multinomial logistic regression]]. If the multiple categories are [[Level of measurement#Ordinal scale|ordered]], one can use the [[ordinal logistic regression]] (for example the proportional odds ordinal logistic model<ref name=wal67est />). See {{slink||Extensions}} for further extensions. The logistic regression model itself simply models probability of output in terms of input and does not perform [[statistical classification]] (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class, below the cutoff as the other; this is a common way to make a [[binary classifier]].
Analogous linear models for binary variables with a different [[sigmoid function]] instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the [[probit model]]; see {{slink||Alternatives}}. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a ''constant'' rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the [[odds ratio]]. More abstractly, the log-odds is the [[natural parameter]] for the [[Bernoulli distribution]], and the logistic function, its inverse, is in this sense the "simplest" way to convert a real number to a probability.
The parameters of a logistic regression are most commonly estimated by [[maximum-likelihood estimation]] (MLE). This does not have a closed-form expression, unlike [[linear least squares (mathematics)|linear least squares]]; see {{section link||Model fitting}}. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by [[ordinary least squares]] (OLS) plays for [[Scalar (mathematics)|scalar]] responses: it is a simple, well-analyzed baseline model; see {{slink||Comparison with linear regression}} for discussion. The logistic regression as a general statistical model was originally developed and popularized primarily by [[Joseph Berkson]],{{sfn|Cramer|2002|p=8}} beginning in {{harvtxt|Berkson|1944}}, where he coined "logit"; see {{slink||History}}.
==Applications==
Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences. For example, the Trauma and Injury Severity Score ([[TRISS]]), which is widely used to predict mortality in injured patients, was originally developed by Boyd ''{{Abbr|et al.|''et alia'', with others - usually other authors}}'' using logistic regression.<ref>{{cite journal| last1 = Boyd | first1 = C. R.| last2 = Tolson | first2 = M. A.| last3 = Copes | first3 = W. S.| title = Evaluating trauma care: The TRISS method. Trauma Score and the Injury Severity Score| journal = The Journal of Trauma| volume = 27 | issue = 4| pages = 370–378| year = 1987 | pmid = 3106646 | doi= 10.1097/00005373-198704000-00005| doi-access = free}}</ref> Many other medical scales used to assess severity of a patient have been developed using logistic regression.<ref>{{cite journal |pmid= 11268952 |year= 2001|last1= Kologlu |first1= M.|title=Validation of MPI and PIA II in two different groups of patients with secondary peritonitis |journal=Hepato-Gastroenterology |volume= 48 |issue=37 |pages= 147–51 |last2=Elker|first2=D. |last3= Altun |first3= H. |last4= Sayek |first4= I.}}</ref><ref>{{cite journal |pmid= 11129812 |year= 2000 |last1= Biondo |first1= S. |title= Prognostic factors for mortality in left colonic peritonitis: A new scoring system |journal= Journal of the American College of Surgeons|volume= 191 |issue= 6 |pages= 635–42 |last2= Ramos|first2=E.|last3=Deiros |first3= M. |last4=Ragué|first4=J. M.|last5=De Oca |first5= J. |last6= Moreno |first6=P.|last7=Farran|first7=L.|last8= Jaurrieta |first8= E. |doi= 10.1016/S1072-7515(00)00758-4}}</ref><ref>{{cite journal|pmid=7587228 |year= 1995 |last1=Marshall |first1= J. C.|title=Multiple organ dysfunction score: A reliable descriptor of a complex clinical outcome|journal=Critical Care Medicine|volume= 23 |issue= 10|pages= 1638–52 |last2= Cook|first2=D. J.|last3=Christou|first3=N. V. |last4= Bernard |first4= G. R. 
|last5=Sprung|first5=C. L.|last6=Sibbald|first6=W. J.|doi= 10.1097/00003246-199510000-00007}}</ref><ref>{{cite journal|pmid=8254858|year=1993 |last1= Le Gall |first1= J. R.|title=A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study|journal=JAMA|volume=270|issue= 24 |pages= 2957–63 |last2= Lemeshow |first2=S.|last3=Saulnier|first3=F.|doi= 10.1001/jama.1993.03510240069035}}</ref> Logistic regression may be used to predict the risk of developing a given disease (e.g. [[Diabetes mellitus|diabetes]]; [[Coronary artery disease|coronary heart disease]]), based on observed characteristics of the patient (age, sex, [[body mass index]], results of various [[blood test]]s, etc.).<ref name = "Freedman09">{{cite book |author=David A. Freedman |year=2009|title=Statistical Models: Theory and Practice |publisher=[[Cambridge University Press]]|page=128|author-link=David A. Freedman}}</ref><ref>{{cite journal | pmid = 6028270
| year = 1967
| last1 = Truett
| issue = 7
| pages = 511–24
| last2 = Cornfield| first2 = J| last3 = Kannel| first3 = W | doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or any other party, based on age, income, sex, race, state of residence, votes in previous elections, etc.
=== Supervised machine learning ===
Logistic regression is a [[supervised machine learning]] algorithm widely used for [[binary classification]] tasks, such as identifying whether an email is spam or diagnosing a disease by assessing the presence or absence of specific conditions based on patient test results. It uses the logistic (sigmoid) function to transform a linear combination of input features into a probability value between 0 and 1, which indicates the likelihood that a given input belongs to one of two predefined categories. With its distinctive S-shaped curve, the logistic function maps any real-valued number to a value within the 0-to-1 interval, which makes it well suited to binary classification. By estimating the probability that the dependent variable falls into a specific category, logistic regression provides a probabilistic framework that supports informed decision-making.<ref>{{Cite web |title=Logistic Regression |url=https://www.mastersindatascience.org/learning/machine-learning-algorithms/logistic-regression/ |access-date=2024-03-16 |website=CORP-MIDS1 (MDS) |language=en-US}}</ref>
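Concretely, the mapping from a linear combination of features to a probability can be sketched in a few lines of Python (the coefficients and feature values below are invented for illustration, not taken from any fitted model):

```python
import math

def sigmoid(t: float) -> float:
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_proba(x: list[float], beta: list[float]) -> float:
    """Probability of the positive class for features x, where
    beta = [intercept, b1, ..., bm] are hypothetical coefficients."""
    t = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return sigmoid(t)

# Linear predictor t = -1.0 + 0.8*2.0 + 1.2*0.5 = 1.2
p = predict_proba([2.0, 0.5], [-1.0, 0.8, 1.2])
```

A cutoff (commonly 0.5) on `p` then turns the probability into a spam/not-spam style decision.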
==Example==
where <math>\beta_0 = -\mu/s</math> is known as the [[vertical intercept|intercept]] (it is the ''vertical'' intercept or ''y''-intercept of the line <math>y = \beta_0+\beta_1 x</math>), and <math>\beta_1= 1/s</math> (inverse scale parameter or [[rate parameter]]): these are the ''y''-intercept and slope of the log-odds as a function of ''x''. Conversely, <math>\mu=-\beta_0/\beta_1</math> and <math>s=1/\beta_1</math>.
Note that this model is an oversimplification, since it assumes that everybody will eventually pass if they study long enough (the estimated probability approaches 1 as hours of study increase without bound).
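The conversion between the <math>(\beta_0, \beta_1)</math> parametrization and the <math>(\mu, s)</math> parametrization can be checked numerically; the sketch below uses the approximate fitted values of this example (<math>\beta_0 \approx -4.1</math>, <math>\beta_1 \approx 1.5</math>, so that <math>\mu \approx 2.7</math>):

```python
# Recover the midpoint mu and scale s of the logistic curve from the
# intercept/slope parametrization (beta0, beta1), and convert back.
beta0, beta1 = -4.1, 1.5   # approximate fitted values from this example

mu = -beta0 / beta1        # x-value at which the probability is 0.5
s = 1 / beta1              # scale (steepness) parameter

# Round-trip check: converting back recovers beta0 and beta1.
assert abs((-mu / s) - beta0) < 1e-9
assert abs((1 / s) - beta1) < 1e-9
```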
===Fit===
{| class="wikitable"
|-
! rowspan="2" | Hours<br />of study<br />(''x'')
! colspan="3" | Passing exam
|-
! Log-odds (''t'') !! Odds (''e<sup>t</sup>'') !! Probability (''p'')
|- style="text-align: right;"
| 1|| −2.57 || 0.076 ≈ 1:13.1 || 0.07
| 2|| −1.07 || 0.34 ≈ 1:2.91 || 0.26
|- style="text-align: right;"
|{{tmath|\mu \approx 2.7}} || 0 || 1 || 0.50
|- style="text-align: right;"
| 3|| 0.44 || 1.55 || 0.61
|- style="text-align:right;"
! Hours (''β''<sub>1</sub>)
| 1.5 || 0.6 || 2.4 || 0.017
|}
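The log-odds → odds → probability conversions in the table above can be verified directly; the Python sketch below uses the table's own rounded values:

```python
import math

def from_log_odds(t: float) -> tuple[float, float]:
    """Convert log-odds t into (odds, probability)."""
    odds = math.exp(t)          # odds = e^t
    return odds, odds / (1 + odds)

# One hour of study, per the table: log-odds -2.57.
odds, p = from_log_odds(-2.57)  # roughly odds 0.076, probability 0.07
```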
===Multiple explanatory variables===
If there are multiple explanatory variables, the above expression <math>\beta_0+\beta_1x</math> can be revised to <math>\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m = \beta_0+ \sum_{i=1}^m \beta_ix_i</math>. Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a [[multiple regression]] with ''m'' explanators; the parameters <math>\beta_j</math> for all <math>j = 0, 1, 2, \dots, m</math> are all estimated.
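A minimal sketch of this multi-variable linear predictor (all coefficients hypothetical):

```python
def log_odds(x: list[float], beta: list[float]) -> float:
    """Linear predictor beta0 + sum_i beta_i * x_i for m explanators.
    beta has length m + 1: beta[0] is the intercept."""
    assert len(beta) == len(x) + 1
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

# m = 3 explanatory variables: t = 0.5 + 1.0*1.0 + (-2.0)*0.0 + 0.25*2.0
t = log_odds([1.0, 0.0, 2.0], [0.5, 1.0, -2.0, 0.25])
```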
Again, the more traditional equations are:
==Definition==
As in linear regression, the outcome variables ''Y''<sub>''i''</sub> are assumed to depend on the explanatory variables ''x''<sub>1,''i''</sub> ... ''x''<sub>''m,i''</sub>.
::<math>
\begin{align}
Y_i\mid x_{1,i},\ldots,x_{m,i} \ & \sim \operatorname{Bernoulli}(p_i) \\[5pt]
\operatorname{\mathbb E}[Y_i\mid x_{1,i},\ldots,x_{m,i}] &= p_i \\[5pt]
\Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &=
\begin{cases}
p_i & \text{if }y=1 \\
1-p_i & \text{if }y=0
\end{cases}
\\[5pt]
\Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &= p_i^y (1-p_i)^{(1-y)}
\end{align}
</math>
:<math>t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_M x_M </math>
where ''t'' is the log-odds and <math>\beta_i</math> are parameters of the model. An additional generalization has been introduced in which the base of the model (''b'') is not restricted to the [[E (mathematical constant)|Euler number]] ''e''.
For a more compact notation, we will specify the explanatory variables and the ''β'' coefficients as {{tmath|(M+1)}}-dimensional vectors:
Then ''Y''<sub>''i''</sub> can be viewed as an indicator for whether this latent variable is positive:
: <math> Y_i = \begin{cases} 1 & \text{if }Y_i^\ast > 0 \ \text{ i.e. } {- \varepsilon_i} < \boldsymbol\beta \cdot \mathbf{X}_i, \\
0 &\text{otherwise.} \end{cases} </math>
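The latent-variable formulation can be checked by simulation: drawing the error term from a standard [[logistic distribution]] and thresholding the latent variable at zero reproduces the logistic probability. A Python sketch (the value of <math>\boldsymbol\beta \cdot \mathbf{X}_i</math> is arbitrary):

```python
import math
import random

random.seed(0)

def logistic_noise() -> float:
    """Draw from the standard logistic distribution via its inverse CDF."""
    u = random.random()
    return math.log(u / (1 - u))

bx = 0.7                       # beta . X for some fixed X (arbitrary)
n = 100_000
# Y = 1 exactly when the latent variable Y* = beta.X + eps is positive.
hits = sum(1 for _ in range(n) if bx + logistic_noise() > 0)

empirical = hits / n
theoretical = 1 / (1 + math.exp(-bx))   # logistic probability
```

With a large sample the empirical frequency agrees with the logistic function of <math>\boldsymbol\beta \cdot \mathbf{X}_i</math> to within simulation error.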
These intuitions can be expressed as follows:
{{table alignment}}
{|class="wikitable col2right col3left"
|+Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables
|-
|-
! Middle-income
| moderate + || weak + || {{CNone|none}}
|-
! Low-income
| {{CNone|none|style=text-align:right;}} || strong + || {{CNone|none}}
|-
|}
:<math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots) .</math>
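A numerically stable softmax sketch in Python; with two classes it reduces to the ordinary logistic function:

```python
import math

def softmax_prob(c: int, scores: list[float]) -> float:
    """P(Y = c) under a softmax of the per-class linear scores beta_k . X."""
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[c] / sum(exps)

# Two classes reduce to the ordinary logistic (sigmoid) function:
p = softmax_prob(1, [0.0, 1.2])
assert abs(p - 1 / (1 + math.exp(-1.2))) < 1e-12
```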
==Model fitting==
===Maximum likelihood estimation (MLE)===
The regression coefficients are usually estimated using [[maximum likelihood estimation]].<ref name=Menard/><ref>{{cite journal |first1=Christian |last1=Gourieroux |first2=Alain |last2=Monfort |title=Asymptotic Properties of the Maximum Likelihood Estimator in Dichotomous Logit Models |journal=Journal of Econometrics |volume=17 |issue=1 |year=1981 |pages=83–97 |doi=10.1016/0304-4076(81)90060-9 }}</ref> Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead; for example, [[Newton's method]]. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.
In some instances, the model may not reach convergence. Non-convergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, [[multicollinearity]], [[sparse matrix|sparseness]], or complete [[Separation (statistics)|separation]].
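The iterative process can be sketched with a minimal Newton's-method fit of the one-variable model. The data below reproduce the 20-student hours-studied/pass example used in this article (values as in the full example table; treat the listing here as illustrative):

```python
import math

# Hours studied and pass (1) / fail (0) outcomes for 20 students.
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

b0, b1 = 0.0, 0.0                       # tentative starting solution
for _ in range(25):
    g0 = g1 = 0.0                       # gradient of the log-likelihood
    h00 = h01 = h11 = 0.0               # negative Hessian (Fisher information)
    for x, y in zip(hours, passed):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        g0 += y - p
        g1 += (y - p) * x
        w = p * (1.0 - p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det   # Newton step: beta += H^-1 g
    b1 += (h00 * g1 - h01 * g0) / det
# b0 and b1 converge to roughly -4.1 and 1.5, matching the example fit.
```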
[[File:Logistic-sigmoid-vs-scaled-probit.svg|right|300px|thumb|Comparison of [[logistic function]] with a scaled inverse [[probit function]] (i.e. the [[cumulative distribution function|CDF]] of the [[normal distribution]]), comparing <math>\sigma(x)</math> vs. <math display="inline">\Phi(\sqrt{\frac{\pi}{8}}x)</math>, which makes the slopes the same at the origin. This shows the [[heavy-tailed distribution|heavier tails]] of the logistic distribution.]]
In a [[Bayesian statistics]] context, [[prior distribution]]s are normally placed on the regression coefficients, for example in the form of [[Gaussian distribution]]s. There is no [[conjugate prior]] of the [[likelihood function]] in logistic regression. When Bayesian inference was performed analytically, this made the [[posterior distribution]] difficult to calculate except in very low dimensions. Now, though, automatic software such as [[OpenBUGS]], [[Just another Gibbs sampler|JAGS]], [[PyMC]], or [[Stan (software)|Stan]] allows these posteriors to be computed using simulation, so lack of conjugacy is not a concern.
==="Rule of ten"===
{{main|One in ten rule}}
Others have found results that are not consistent with the above, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required.<ref name=plo14mod/> Also, one can argue that 96 observations are needed only to estimate the model's intercept precisely enough that the margin of error in predicted probabilities is ±0.1 with a 0.95 confidence level.<ref name=rms/>
Linear regression and logistic regression have many similarities. For example, in simple linear regression, a set of ''K'' data points (''x<sub>k</sub>'', ''y<sub>k</sub>'') are fitted to a proposed model function of the form <math>y=b_0+b_1 x</math>. The fit is obtained by choosing the ''b'' parameters which minimize the sum of the squares of the residuals (the squared error term) for each data point:
:<math>\varepsilon^2=\sum_{k=1}^K (b_0+b_1 x_k-y_k)^2 </math>
The minimum value which constitutes the fit will be denoted by <math>\hat{\varepsilon}^2</math>.
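As a concrete sketch (toy data, invented purely for illustration), the squared-error criterion and its closed-form least-squares minimizer:

```python
def sse(b0: float, b1: float, xs: list[float], ys: list[float]) -> float:
    """Sum of squared residuals for the linear model y = b0 + b1*x."""
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]              # toy data, illustration only

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
# Closed-form least-squares solution for slope and intercept.
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = ybar - b1 * xbar

# Perturbing the optimum can only increase the error:
assert sse(b0, b1, xs, ys) <= sse(b0 + 0.1, b1, xs, ys)
```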
The idea of a [[null model]] may be introduced, in which it is assumed that the ''x'' variable is of no use in predicting the ''y<sub>k</sub>'' outcomes: the data points are fitted to a null model function of the form ''y'' = ''b''<sub>0</sub> with a squared error term:
:<math>\varepsilon_\varphi^2=\sum_{k=1}^K (b_0-y_k)^2 </math>
The fitting process consists of choosing a value of ''b''<sub>0</sub> which minimizes this error; it is minimized by <math>b_0=\overline{y}</math>, the mean of the ''y<sub>k</sub>'' values, giving an optimized error of:
:<math>\hat{\varepsilon}_\varphi^2=\sum_{k=1}^K (\overline{y}-y_k)^2 </math>
which is proportional to the square of the (uncorrected) sample standard deviation of the ''y<sub>k</sub>'' data points.
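This can be verified numerically (toy values, illustration only):

```python
# The null model y = b0 is optimized by b0 = mean(y), and its minimum
# squared error equals K times the (uncorrected) sample variance.
ys = [0.07, 0.26, 0.61, 0.80, 0.90]    # toy outcome values
K = len(ys)
ybar = sum(ys) / K

err_at_mean = sum((ybar - y) ** 2 for y in ys)
variance = sum((y - ybar) ** 2 for y in ys) / K   # uncorrected
assert abs(err_at_mean - K * variance) < 1e-12

# Any other constant does worse than the mean:
assert err_at_mean <= sum((ybar + 0.1 - y) ** 2 for y in ys)
```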
We can imagine a case where the ''y<sub>k</sub>'' data points are randomly assigned to the various ''x<sub>k</sub>'', and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the ''y<sub>k</sub>'' outcomes. It can be shown that the optimized error of any of these fits will never be less than the optimum error of the null model, and that the difference between these minimum errors will approximately follow a [[chi-squared distribution]], with degrees of freedom equal to those of the proposed model minus those of the null model, which, in this case, is <math>2-1=1</math>. Using the [[chi-squared test]], we may then estimate how many of these permuted sets of ''y<sub>k</sub>'' will yield a smaller minimum error than that of the original ''y<sub>k</sub>'', and so estimate how significant an improvement in fit is obtained by including the ''x'' variable in the proposed model.
For logistic regression, the measure of goodness-of-fit is the likelihood function ''L'', or its logarithm, the log-likelihood ''ℓ''. The likelihood function ''L'' is analogous to the <math>\varepsilon^2</math> in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihoods of the proposed and null models by <math>\hat{\ell}</math> and <math>\hat{\ell}_\varphi</math>, respectively.
In the case of simple binary logistic regression, the set of ''K'' data points are fitted in a probabilistic sense to a function of the form:
which will always be positive or zero. The reason for this choice is that not only is the deviance a good measure of the goodness of fit, it is also approximately chi-squared distributed, with the approximation improving as the number of data points (''K'') increases, becoming exactly chi-squared distributed in the limit of an infinite number of data points. As in the case of linear regression, we may use this fact to estimate the probability that a random set of data points will give a better fit than the fit obtained by the proposed model, and so obtain an estimate of how significantly the model is improved by including the ''x<sub>k</sub>'' data points in the proposed model.
For the simple model of student test scores described above, the maximum value of the log-likelihood of the null model is <math>\hat{\ell}_\varphi= -13.8629\ldots</math> The maximum value of the log-likelihood for the simple model is <math>\hat{\ell}=-8.0299\ldots</math>, so that the deviance is <math>D = 2(\hat{\ell}-\hat{\ell}_\varphi)=11.6661\ldots</math>
Using the [[chi-squared test]] of significance, the integral of the [[chi-squared distribution]] with one degree of freedom from 11.6661... to infinity is equal to 0.00063649... This effectively means that about 6 out of 10,000 fits to random ''y<sub>k</sub>'' can be expected to have a better fit (smaller deviance) than the given ''y<sub>k</sub>'', and so we can conclude that the inclusion of the ''x'' variable and data in the proposed model is a very significant improvement over the null model.
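For one degree of freedom the chi-squared tail probability has a closed form, <math>\Pr(X^2 > d) = \operatorname{erfc}(\sqrt{d/2})</math>, so the quoted value can be checked with the Python standard library alone:

```python
import math

def chi2_sf_1df(d: float) -> float:
    """Survival function of the chi-squared distribution with 1 d.o.f."""
    return math.erfc(math.sqrt(d / 2))

p_value = chi2_sf_1df(11.6661)   # tail probability of the deviance above
```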
The {{math|logit}} of the probability of success is then fitted to the predictors. The predicted value of the {{math|logit}} is converted back into predicted odds, via the inverse of the natural logarithm – the [[exponential function]]. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success.
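A sketch of the cutoff rule on the odds scale (the log-odds values are taken from the worked example above):

```python
import math

def classify(logit: float, cutoff_odds: float = 1.0) -> int:
    """Turn a fitted log-odds value into a yes/no prediction by comparing
    the predicted odds to a chosen cutoff (1.0 corresponds to p = 0.5)."""
    odds = math.exp(logit)
    return 1 if odds > cutoff_odds else 0

assert classify(0.44) == 1     # odds ~1.55 exceed the cutoff of 1
assert classify(-2.57) == 0    # odds ~0.076 fall below the cutoff
```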
==Machine learning==
In machine learning applications where logistic regression is used for binary classification, the MLE minimises the [[cross-entropy]] loss function.
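This equivalence is easy to verify numerically: the cross-entropy of predicted probabilities against 0/1 labels equals the negative Bernoulli log-likelihood (Python sketch with arbitrary toy values):

```python
import math

def cross_entropy(ys: list[int], ps: list[float]) -> float:
    """Cross-entropy loss of predicted probabilities ps against labels ys."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, ps))

def neg_log_likelihood(ys: list[int], ps: list[float]) -> float:
    """Negative log of the Bernoulli likelihood prod p^y (1-p)^(1-y)."""
    return -sum(math.log(p ** y * (1 - p) ** (1 - y))
                for y, p in zip(ys, ps))

ys = [1, 0, 1]          # toy labels
ps = [0.9, 0.2, 0.6]    # toy predicted probabilities
assert abs(cross_entropy(ys, ps) - neg_log_likelihood(ys, ps)) < 1e-9
```

Minimizing the cross-entropy is therefore the same optimization as maximizing the likelihood.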
In the 1930s, the [[probit model]] was developed and systematized by [[Chester Ittner Bliss]], who coined the term "probit" in {{harvtxt|Bliss|1934}}, and by [[John Gaddum]] in {{harvtxt|Gaddum|1933}}, and the model fit by [[maximum likelihood estimation]] by [[Ronald A. Fisher]] in {{harvtxt|Fisher|1935}}, as an addendum to Bliss's work. The probit model was principally used in [[bioassay]], and had been preceded by earlier work dating to 1860; see {{slink|Probit model|History}}. The probit model influenced the subsequent development of the logit model and these models competed with each other.{{sfn|Cramer|2002|p=7–9}}
The logistic model was likely first used as an alternative to the probit model in bioassay by [[Edwin Bidwell Wilson]] and his student [[Jane Worcester]] in {{harvtxt|Wilson|Worcester|1943}}.{{sfn|Cramer|2002|p=9}} However, the development of the logistic model as a general alternative to the probit model was principally due to the work of [[Joseph Berkson]] over many decades, beginning in {{harvtxt|Berkson|1944}}, where he coined "logit", by analogy with "probit", and continuing through {{harvtxt|Berkson|1951}} and following years.<ref>{{harvnb|Cramer|2002|p=8|ps=, "As far as I can see the introduction of the logistics as an alternative to the normal probability function is the work of a single person, Joseph Berkson (1899–1982), ..."}}</ref> The logit model was initially dismissed as inferior to the probit model, but "gradually achieved an equal footing with the probit".
Various refinements occurred during that time, notably by [[David Cox (statistician)|David Cox]], as in {{harvtxt|Cox|1958}}.<ref name=wal67est>{{cite journal|last1=Walker|first1=SH|last2=Duncan|first2=DB|title=Estimation of the probability of an event as a function of several independent variables|journal=Biometrika|date=1967|volume=54|issue=1/2|pages=167–178|doi=10.2307/2333860|jstor=2333860}}</ref>
|last=Cox|first=David R.
|author-link=David Cox (statistician)
|title=The regression analysis of binary sequences (with discussion)|journal=J R Stat Soc B|date=1958|volume=20|issue=2|pages=215–242|doi=10.1111/j.2517-6161.1958.tb00292.x
|jstor=2983890}}
* {{cite book
|author-link=David Cox (statistician)
|