{{Redirect-distinguish|Logit model|Logit function}}
[[File:Exam pass logistic curve.svg|thumb|400px|Example graph of a logistic regression curve fitted to data. The curve shows the estimated probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See {{slink||Example}} for worked details.]]
In [[statistics]], a '''logistic model''' (or '''logit model''') is a [[statistical model]] that models the [[log-odds]] of an event as a [[linear combination]] of one or more [[independent variable]]s; '''logistic regression''' estimates the parameters of such a model.
Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning, of a patient being healthy, etc. (see {{slink||Applications}}), and the logistic model has been the most commonly used model for [[binary regression]] since about 1970.{{sfn|Cramer|2002|p=10–11}} Binary variables can be generalized to [[categorical variable]]s when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and the binary logistic regression generalized to [[multinomial logistic regression]]. If the multiple categories are [[Level of measurement#Ordinal scale|ordered]], one can use the [[ordinal logistic regression]] (for example the proportional odds ordinal logistic model<ref name=wal67est />). See {{slink||Extensions}} for further extensions. The logistic regression model itself simply models probability of output in terms of input and does not perform [[statistical classification]] (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class, below the cutoff as the other; this is a common way to make a [[binary classifier]].
Analogous linear models for binary variables with a different [[sigmoid function]] instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the [[probit model]]; see {{slink||Alternatives}}. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a ''constant'' rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the [[odds ratio]]. More abstractly, the logistic function is the [[natural parameter]] for the [[Bernoulli distribution]], and in this sense is the "simplest" way to convert a real number to a probability.
The parameters of a logistic regression are most commonly estimated by [[maximum-likelihood estimation]] (MLE). This does not have a closed-form expression, unlike [[linear least squares (mathematics)|linear least squares]]; see {{section link||Model fitting}}. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by [[ordinary least squares]] (OLS) plays for [[Scalar (mathematics)|scalar]] responses: it is a simple, well-analyzed baseline model; see {{slink||Comparison with linear regression}} for discussion. The logistic regression as a general statistical model was originally developed and popularized primarily by [[Joseph Berkson]],{{sfn|Cramer|2002|p=8}} beginning in {{harvtxt|Berkson|1944}}, where he coined "logit"; see {{slink||History}}.
| issue = 7
| pages = 511–24
| last2 = Cornfield| first2 = J| last3 = Kannel| first3 = W | doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress, Communist Party of Nepal, or another party.
=== Supervised machine learning ===
where <math>\beta_0 = -\mu/s</math> is known as the [[vertical intercept|intercept]] (it is the ''vertical'' intercept or ''y''-intercept of the line <math>y = \beta_0+\beta_1 x</math>), and <math>\beta_1= 1/s</math> (inverse scale parameter or [[rate parameter]]): these are the ''y''-intercept and slope of the log-odds as a function of ''x''. Conversely, <math>\mu=-\beta_0/\beta_1</math> and <math>s=1/\beta_1</math>.
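The relationship between the location–scale parameters {{tmath|(\mu, s)}} and the coefficients {{tmath|(\beta_0, \beta_1)}} can be checked numerically; the following is a minimal Python sketch using illustrative values, not the article's fitted example:

```python
import math

def logistic(t):
    """Standard logistic function: maps log-odds t to a probability."""
    return 1.0 / (1.0 + math.exp(-t))

# Illustrative location and scale of the logistic curve (hypothetical values).
mu, s = 2.0, 0.5

# Intercept and slope of the log-odds line, as defined in the text.
beta0 = -mu / s       # y-intercept of the log-odds line
beta1 = 1.0 / s       # slope (inverse scale / rate parameter)

# Conversely, mu and s are recovered from the coefficients.
mu_back = -beta0 / beta1
s_back = 1.0 / beta1

# At x = mu the log-odds are zero, so the probability is exactly 1/2.
p_at_mu = logistic(beta0 + beta1 * mu)
```

At <math>x = \mu</math> the log-odds vanish, so the model assigns probability exactly one half, which is what makes <math>\mu</math> the curve's midpoint.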
===Fit===
{| class="wikitable"
|-
! rowspan="2" | Hours<br />of study<br />(''x'')
! colspan="3" | Passing exam
|-
! Log-odds (''t'') !! Odds (''e<sup>t</sup>'') !! Probability (''p'')
|- style="text-align: right;"
| 1|| −2.57 || 0.076 ≈ 1:13.1 || 0.07
|- style="text-align: right;"
| 2|| −1.07 || 0.34 ≈ 1:2.91 || 0.26
|- style="text-align: right;"
|{{tmath|\mu \approx 2.7}} || 0 || 1 || 0.50
|- style="text-align: right;"
| 3|| 0.44 || 1.55 || 0.61
|}

{| class="wikitable"
|- style="text-align:right;"
! Hours (''β''<sub>1</sub>)
| 1.5 || 0.
|}
===Multiple explanatory variables===
If there are multiple explanatory variables, the above expression <math>\beta_0+\beta_1x</math> can be revised to <math>\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m = \beta_0+ \sum_{i=1}^m \beta_ix_i</math>. Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a [[multiple regression]] with ''m'' explanators; the parameters <math>\beta_j</math> for all <math>j = 0, 1, 2, \dots, m</math> are all estimated.
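As a sketch, the multiple-predictor form of the model can be evaluated in a few lines of Python (the function and variable names here are illustrative, not standard):

```python
import math

def predict_proba(beta0, betas, xs):
    """Probability of success given intercept beta0, coefficient list betas,
    and explanatory-variable values xs (hypothetical inputs)."""
    # Linear predictor: the log-odds t = beta0 + sum_i beta_i * x_i.
    t = beta0 + sum(b * x for b, x in zip(betas, xs))
    # Logistic function converts log-odds to a probability.
    return 1.0 / (1.0 + math.exp(-t))

# With these illustrative values the log-odds are -1 + 0.5*2 + 2*0 = 0,
# so the predicted probability is exactly 0.5.
p = predict_proba(-1.0, [0.5, 2.0], [2.0, 0.0])
```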
Again, the more traditional equations are:
:<math>t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_M x_M </math>
where ''t'' is the log-odds and <math>\beta_i</math> are parameters of the model. An additional generalization has been introduced in which the base of the model (''b'') is not restricted to the [[e (mathematical constant)|Euler number]] {{mvar|e}}.
For a more compact notation, we will specify the explanatory variables and the ''β'' coefficients as {{tmath|(M+1)}}-dimensional vectors:
These intuitions can be expressed as follows:
{{table alignment}}
{|class="wikitable col2right col3left"
|+Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables
|-
|-
! Middle-income
| moderate + || weak + || {{CNone|none}}
|-
! Low-income
| {{CNone|none|style=text-align:right;}} || strong + || {{CNone|none}}
|-
|}
:<math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots) .</math>
where the [[softmax function]] is
:<math>\operatorname{softmax}(c, x_1, \dots, x_K) = \frac{e^{x_c}}{\sum_{h=1}^K e^{x_h}} .</math>
==Model fitting==
===Maximum likelihood estimation (MLE)===
The regression coefficients are usually estimated using [[maximum likelihood estimation]].<ref name=Menard/><ref>{{cite journal |first1=Christian |last1=Gourieroux |first2=Alain |last2=Monfort |title=Asymptotic Properties of the Maximum Likelihood Estimator in Dichotomous Logit Models |journal=Journal of Econometrics |volume=17 |issue=1 |year=1981 |pages=83–97 |doi=10.1016/0304-4076(81)90060-9 }}</ref> Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead; for example, [[Newton's method]].
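Because no closed form exists, the likelihood is maximized iteratively. Below is a minimal gradient-ascent sketch on a tiny synthetic dataset; statistical software typically uses faster schemes such as Newton's method, and the data here are purely illustrative:

```python
import math

# Tiny synthetic dataset: predictor values and 0/1 outcomes (illustrative).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   1,   0,   1,   1,   1]

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

# Gradient ascent on the log-likelihood. The gradient of the Bernoulli
# log-likelihood with respect to (b0, b1) is the sum of the residuals
# (y - p), optionally weighted by x for the slope coefficient.
b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(10000):
    g0 = sum(y - logistic(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - logistic(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    b0 += lr * g0
    b1 += lr * g1
# At the maximum-likelihood estimate the gradient is (near) zero.
```

Since the log-likelihood is concave for logistic regression, this simple procedure reaches the unique maximizer whenever the data are not completely separated.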
In some instances, the model may not reach convergence. Non-convergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, [[multicollinearity]], [[sparse matrix|sparseness]], or complete [[Separation (statistics)|separation]].
[[File:Logistic-sigmoid-vs-scaled-probit.svg|right|300px|thumb|Comparison of [[logistic function]] with a scaled inverse [[probit function]] (i.e. the [[cumulative distribution function|CDF]] of the [[normal distribution]]), comparing <math>\sigma(x)</math> vs. <math display="inline">\Phi(\sqrt{\frac{\pi}{8}}x)</math>, which makes the slopes the same at the origin. This shows the [[heavy-tailed distribution|heavier tails]] of the logistic distribution.]]
In a [[Bayesian statistics]] context, [[prior distribution]]s are normally placed on the regression coefficients, for example in the form of [[Gaussian distribution]]s. There is no [[conjugate prior]] of the [[likelihood function]] in logistic regression. When Bayesian inference was performed analytically, this made the [[posterior distribution]] difficult to calculate except in very low dimensions. Now, though, automatic software such as [[OpenBUGS]], [[Just another Gibbs sampler|JAGS]], [[PyMC]] or [[Stan (software)|Stan]] allows these posteriors to be computed by simulation, so lack of conjugacy is not a concern.
==="Rule of ten"===
{{main|One in ten rule}}
Other researchers have found results inconsistent with this rule, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required.<ref name=plo14mod/> Moreover, it can be argued that as many as 96 observations are needed just to estimate the model's intercept precisely enough that the margin of error in predicted probabilities is ±0.1 at a 0.95 confidence level.<ref name=rms/>
which is proportional to the square of the (uncorrected) sample standard deviation of the ''y<sub>k</sub>'' data points.
We can imagine a case where the ''y<sub>k</sub>'' data points are randomly assigned to the various ''x<sub>k</sub>'', and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the ''y<sub>k</sub>'' outcomes. It can be shown that the optimized error of any of these fits will never be less than the optimum error of the null model, and that the difference between these minimum errors will follow a [[chi-squared distribution]], with degrees of freedom equal to those of the proposed model minus those of the null model, which in this case is <math>2-1=1</math>. Using the [[chi-squared test]], we may then estimate how many of these permuted sets of ''y<sub>k</sub>'' will yield a minimum error less than or equal to the minimum error obtained with the original ''y<sub>k</sub>'', and thus how significant the proposed model's improvement over the null model is.
For logistic regression, the measure of goodness-of-fit is the likelihood function ''L'', or its logarithm, the log-likelihood ''ℓ''. The likelihood function ''L'' is analogous to the <math>\varepsilon^2</math> in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by <math>\hat{\ell}</math>.
The {{math|logit}} of the probability of success is then fitted to the predictors. The predicted value of the {{math|logit}} is converted back into predicted odds, via the inverse of the natural logarithm – the [[exponential function]]. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success.
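The conversion from a fitted logit back to odds, then to a probability, and finally to a yes-or-no prediction can be sketched as follows (the 0.5 probability cutoff is a common but arbitrary choice):

```python
import math

def predict_label(logit, cutoff=0.5):
    """Convert a fitted log-odds value into a 0/1 categorical prediction.
    The cutoff on the probability scale is a modeling choice."""
    odds = math.exp(logit)      # inverse of the natural logarithm
    p = odds / (1.0 + odds)     # odds back to a probability
    return 1 if p > cutoff else 0

# Illustrative fitted logit values for three observations.
labels = [predict_label(t) for t in (-2.0, 0.3, 1.7)]
```

Note that thresholding at probability 0.5 is the same as thresholding the logit at 0, since the logistic function maps 0 to exactly one half.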
==Machine learning==
In machine learning applications where logistic regression is used for binary classification, the MLE minimises the [[cross-entropy]] loss function.
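The equivalence can be illustrated directly: the average cross-entropy below is the negative mean Bernoulli log-likelihood, so minimizing one maximizes the other (labels and predicted probabilities are illustrative):

```python
import math

def cross_entropy(y_true, p_pred):
    """Average cross-entropy between 0/1 labels and predicted probabilities.
    This equals the negative mean Bernoulli log-likelihood."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / n

# Illustrative labels and model outputs.
loss = cross_entropy([1, 0, 1], [0.9, 0.2, 0.8])
```

A perfectly confident correct prediction contributes zero loss, while confident wrong predictions are penalized without bound, mirroring the behavior of the likelihood.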
|last=Cox|first=David R.
|author-link=David Cox (statistician)
|title=The regression analysis of binary sequences (with discussion)|journal=J R Stat Soc B|date=1958|volume=20|issue=2|pages=215–242|doi=10.1111/j.2517-6161.1958.tb00292.x
|jstor=2983890}}