{{Redirect-distinguish|Logit model|Logit function}}
[[File:Exam pass logistic curve.svg|thumb|400px|Example graph of a logistic regression curve fitted to data. The curve shows the estimated probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See {{slink||Example}} for worked details.]]
In [[statistics]], the '''logistic model''' (or '''logit model''') is a [[statistical model]] that models the [[logit|log-odds]] of an event as a [[linear combination]] of one or more [[Dependent and independent variables|independent variable]]s; '''logistic regression''' estimates the parameters of such a model.
Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning or of a patient being healthy (see {{slink||Applications}}), and the logistic model has been the most commonly used model for [[binary regression]] since about 1970.{{sfn|Cramer|2002|p=10–11}} Binary variables can be generalized to [[categorical variable]]s when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and binary logistic regression generalizes to [[multinomial logistic regression]]. If the multiple categories are [[Level of measurement#Ordinal scale|ordered]], one can use [[ordinal logistic regression]] (for example the proportional odds ordinal logistic model<ref name=wal67est />). See {{slink||Extensions}} for further extensions. The logistic regression model itself simply models the probability of output in terms of input and does not perform [[statistical classification]] (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class and those below it as the other; this is a common way to make a [[binary classifier]].
Analogous linear models for binary variables with a different [[sigmoid function]] instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the [[probit model]]; see {{slink||Alternatives}}. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a ''constant'' rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the [[odds ratio]]. More abstractly, the logistic function is the [[natural parameter]] for the [[Bernoulli distribution]], and in this sense is the "simplest" way to convert a real number to a probability.
The parameters of a logistic regression are most commonly estimated by [[maximum-likelihood estimation]] (MLE). This does not have a closed-form expression, unlike [[linear least squares (mathematics)|linear least squares]]; see {{section link||Model fitting}}. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by [[ordinary least squares]] (OLS) plays for [[Scalar (mathematics)|scalar]] responses: it is a simple, well-analyzed baseline model; see {{slink||Comparison with linear regression}} for discussion. The logistic regression as a general statistical model was originally developed and popularized primarily by [[Joseph Berkson]],{{sfn|Cramer|2002|p=8}} beginning in {{harvtxt|Berkson|1944}}, where he coined "logit"; see {{slink||History}}.
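Because the maximum-likelihood estimates have no closed form, they are found numerically. A minimal illustrative sketch (not any particular statistical package's method; the data, learning rate, and step count are made up) fits a one-variable logistic model by gradient ascent on the log-likelihood:

```python
import math

def fit_logistic(xs, ys, lr=0.5, steps=20000):
    """Maximize the Bernoulli log-likelihood of p(x) = 1/(1+exp(-(b0+b1*x)))
    by plain gradient ascent; no closed-form solution exists."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # d(log-likelihood)/d(b0)
            g1 += (y - p) * x    # d(log-likelihood)/d(b1)
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical data: larger x tends to give success (not linearly separable).
b0, b1 = fit_logistic([0, 1, 2, 3, 4, 5], [0, 0, 1, 0, 1, 1])
```

Production software uses faster schemes such as Newton's method (iteratively reweighted least squares), but the objective being maximized is the same.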
| issue = 7
| pages = 511–24
| last2 = Cornfield| first2 = J| last3 = Kannel| first3 = W | doi= 10.1016/0021-9681(67)90082-3}}</ref> Another example might be to predict whether a Nepalese voter will vote Nepali Congress, Communist Party of Nepal, or another party.
=== Supervised machine learning ===
where <math>\beta_0 = -\mu/s</math> is known as the [[vertical intercept|intercept]] (it is the ''vertical'' intercept or ''y''-intercept of the line <math>y = \beta_0+\beta_1 x</math>), and <math>\beta_1= 1/s</math> is the inverse scale parameter or [[rate parameter]]: these are the ''y''-intercept and slope of the log-odds as a function of ''x''. Conversely, <math>\mu=-\beta_0/\beta_1</math> and <math>s=1/\beta_1</math>.
Note that this model is an oversimplification, since it assumes that everybody will eventually pass if they study long enough (the probability tends to 1 in the limit).
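The correspondence between the location/scale form <math>(\mu, s)</math> and the regression form <math>(\beta_0, \beta_1)</math> is a simple round trip; a quick numerical check (the values of μ and s are made up for illustration):

```python
# Convert between the location/scale form (mu, s) and the regression
# form (b0, b1) of the logistic model: b0 = -mu/s, b1 = 1/s.
mu, s = 2.7, 0.67                     # assumed values for illustration
b0, b1 = -mu / s, 1.0 / s
# Recover mu and s from the regression coefficients:
mu_back, s_back = -b0 / b1, 1.0 / b1  # mu = -b0/b1, s = 1/b1
```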
===Fit===
{| class="wikitable"
|-
! rowspan="2" | Hours<br />of study<br />(''x'')
! colspan="3" | Passing exam
|-
! Log-odds (''t'') !! Odds (''e<sup>t</sup>'') !! Probability (''p'')
|- style="text-align: right;"
| 1|| −2.57 || 0.076 ≈ 1:13.1 || 0.07
| 2|| −1.07 || 0.34 ≈ 1:2.91 || 0.26
|- style="text-align: right;"
|{{tmath|\mu \approx 2.7}} || 0 || 1 || 0.50
|- style="text-align: right;"
| 3|| 0.44 || 1.55 || 0.61
|- style="text-align:right;"
! Hours (''β''<sub>1</sub>)
| 1.5 || 0.
|}
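The table's log-odds, odds, and probability columns can be recomputed from the fitted coefficients. The sketch below uses β<sub>1</sub> ≈ 1.5046 and β<sub>0</sub> ≈ −4.0777 (consistent with the slope rounded to 1.5 and with μ ≈ 2.7 above, since β<sub>0</sub> = −μβ<sub>1</sub>):

```python
import math

# Fitted coefficients consistent with the table above
# (slope rounded to 1.5 there; intercept b0 = -mu*b1 with mu ~ 2.71).
b0, b1 = -4.0777, 1.5046

rows = []
for hours in (1, 2, 3):
    t = b0 + b1 * hours          # log-odds of passing
    odds = math.exp(t)           # odds of passing
    p = odds / (1 + odds)        # probability of passing
    rows.append((hours, round(t, 2), round(odds, 2), round(p, 2)))
```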
===Multiple explanatory variables===
If there are multiple explanatory variables, the above expression <math>\beta_0+\beta_1x</math> can be revised to <math>\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m = \beta_0+ \sum_{i=1}^m \beta_ix_i</math>. Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a [[multiple regression]] with ''m'' explanators; the parameters <math>\beta_j</math> for all <math>j = 0, 1, 2, \dots, m</math> are all estimated.
Again, the more traditional equations are:
:<math>t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_M x_M </math>
where ''t'' is the log-odds and <math>\beta_i</math> are parameters of the model. An additional generalization has been introduced in which the base of the model (''b'') is not restricted to [[e (mathematical constant)|''e'']].
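With a general base ''b'', inverting the log-odds gives <math>p = b^t/(1+b^t)</math>. A small sketch (the base, coefficients, and observation are made up for illustration):

```python
import math

# Log-odds in an arbitrary base b: t = log_b(p/(1-p)), so p = b**t / (1 + b**t).
b = 2.0                           # assumed base for illustration
beta = [-1.0, 0.5, 0.25]          # assumed coefficients b0, b1, b2
x = [1.0, 2.0]                    # one observation of the two predictors

t = beta[0] + sum(bi * xi for bi, xi in zip(beta[1:], x))
p = b**t / (1 + b**t)             # probability implied by log-odds t
```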
For a more compact notation, we will specify the explanatory variables and the ''β'' coefficients as {{tmath|(M+1)}}-dimensional vectors:
These intuitions can be expressed as follows:
{{table alignment}}
{|class="wikitable col2right col3left"
|+Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables
|-
|-
! Middle-income
| moderate + || weak + || {{CNone|none}}
|-
! Low-income
| {{CNone|none|style=text-align:right;}} || strong + || {{CNone|none}}
|-
|}
:<math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots) .</math>
where the [[softmax function]] is defined by
:<math>\operatorname{softmax}(c, x_1, \ldots, x_k) = \frac{e^{x_c}}{e^{x_1} + \cdots + e^{x_k}} .</math>
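The softmax function used here can be sketched directly; shifting by the maximum score before exponentiating is a standard trick that avoids overflow without changing the result:

```python
import math

def softmax(c, scores):
    """Return Pr(class c) given per-class linear scores beta_c . x.
    Subtracting max(scores) avoids overflow; the ratio is unchanged."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return exps[c] / sum(exps)
```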
[[File:Logistic-sigmoid-vs-scaled-probit.svg|right|300px|thumb|Comparison of [[logistic function]] with a scaled inverse [[probit function]] (i.e. the [[cumulative distribution function|CDF]] of the [[normal distribution]]), comparing <math>\sigma(x)</math> vs. <math display="inline">\Phi(\sqrt{\frac{\pi}{8}}x)</math>, which makes the slopes the same at the origin. This shows the [[heavy-tailed distribution|heavier tails]] of the logistic distribution.]]
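The closeness of the two curves in the figure can be checked numerically (a sketch using the standard-normal CDF expressed via the error function):

```python
import math

def sigma(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

k = math.sqrt(math.pi / 8)   # scaling that matches the slopes at the origin
# The two sigmoids agree closely near the origin; the logistic's heavier
# tails make them diverge slowly for larger |x|.
diffs = [abs(sigma(x) - Phi(k * x)) for x in (-2, -1, 0, 1, 2)]
```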
In a [[Bayesian statistics]] context, [[prior distribution]]s are normally placed on the regression coefficients, for example in the form of [[Gaussian distribution]]s. There is no [[conjugate prior]] of the [[likelihood function]] in logistic regression. When Bayesian inference was performed analytically, this made the [[posterior distribution]] difficult to calculate except in very low dimensions. Now, though, automatic software such as [[OpenBUGS]] and [[Just another Gibbs sampler|JAGS]] allows these posteriors to be computed by simulation, so lack of conjugacy is not a concern.
==="Rule of ten"===
which is proportional to the square of the (uncorrected) sample standard deviation of the ''y<sub>k</sub>'' data points.
We can imagine a case where the ''y<sub>k</sub>'' data points are randomly assigned to the various ''x<sub>k</sub>'', and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the ''y<sub>k</sub>'' outcomes. It can be shown that the optimized error of any of these fits will never be less than the optimum error of the null model, and that the difference between these minimum errors will follow a [[chi-squared distribution]], with degrees of freedom equal to those of the proposed model minus those of the null model, which in this case is <math>2-1=1</math>. Using the [[chi-squared test]], we may then estimate how many of these permuted sets of ''y<sub>k</sub>'' will yield a minimum error less than or equal to that obtained from the original data, and so estimate how significant an improvement the proposed model provides.
For logistic regression, the measure of goodness-of-fit is the likelihood function ''L'', or its logarithm, the log-likelihood ''ℓ''. The likelihood function ''L'' is analogous to the <math>\varepsilon^2</math> in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by <math>\hat{\ell}</math>.
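The comparison of the proposed model against the null model can be sketched numerically. Below, the outcomes and the "fitted" probabilities of the proposed model are made up for illustration; the null model assigns every observation the overall success rate:

```python
import math

def log_likelihood(ps, ys):
    """Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p)."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for p, y in zip(ps, ys))

ys = [0, 0, 1, 0, 1, 1]
# Null model: every observation gets the overall success rate.
p_null = sum(ys) / len(ys)
ll_null = log_likelihood([p_null] * len(ys), ys)
# Hypothetical fitted probabilities from a one-predictor model:
ll_fit = log_likelihood([0.1, 0.2, 0.5, 0.5, 0.8, 0.9], ys)
# Likelihood-ratio statistic, compared against chi-squared with 1 df:
lr_stat = 2 * (ll_fit - ll_null)
```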
The {{math|logit}} of the probability of success is then fitted to the predictors. The predicted value of the {{math|logit}} is converted back into predicted odds, via the inverse of the natural logarithm – the [[exponential function]]. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success.
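Converting a predicted logit back to odds and then to a yes/no call might look like the following sketch (the cutoff of odds = 1, i.e. probability 0.5, is an arbitrary modeling choice):

```python
import math

def classify(t, cutoff_odds=1.0):
    """Turn a predicted log-odds t into odds and a binary prediction.
    The cutoff (here odds of 1, i.e. probability 0.5) is a modeling choice."""
    odds = math.exp(t)   # inverse of the natural-log logit
    return odds, odds > cutoff_odds

odds, success = classify(0.44)   # e.g. the log-odds at 3 hours of study
```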
==Machine learning==
In machine learning applications where logistic regression is used for binary classification, the MLE minimizes the [[cross-entropy]] loss function.
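That minimizing cross-entropy is the same as maximizing the Bernoulli log-likelihood can be seen directly: the summed cross-entropy loss is the negated log-likelihood. A sketch with made-up predictions and outcomes:

```python
import math

def cross_entropy(ps, ys):
    """Summed binary cross-entropy loss; this is exactly the negated
    Bernoulli log-likelihood, so minimizing it maximizes the likelihood."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(ps, ys))

# Made-up predicted probabilities and observed 0/1 outcomes:
loss = cross_entropy([0.2, 0.7, 0.9], [0, 1, 1])
```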
|last=Cox|first=David R.
|author-link=David Cox (statistician)
|title=The regression analysis of binary sequences (with discussion)|journal=J R Stat Soc B|date=1958|volume=20|issue=2|pages=215–242|doi=10.1111/j.2517-6161.1958.tb00292.x
|jstor=2983890}}
* {{cite book
|author-link=David Cox (statistician)
|