In [[statistics]], '''multinomial logistic regression''' is a [[statistical classification|classification]] method that generalizes [[logistic regression]] to [[multiclass classification|multiclass problems]], i.e. with more than two possible discrete outcomes.<ref>{{cite book |last=Greene |first=William H. |author-link=William Greene (economist) |title=Econometric Analysis |edition=Seventh |___location=Boston |publisher=Pearson Education |year=2012 |isbn=978-0-273-75356-8 |pages=803–806 }}</ref> That is, it is a model that is used to predict the probabilities of the different possible outcomes of a [[categorical distribution|categorically distributed]] [[dependent variable]], given a set of [[independent variable]]s (which may be real-valued, binary-valued, categorical-valued, etc.).
Multinomial logistic regression is known by a variety of other names, including '''polytomous LR''',<ref>{{Cite journal | doi = 10.1111/j.1467-9574.1988.tb01238.x| title = Polytomous logistic regression| journal = Statistica Neerlandica| volume = 42| issue = 4| pages = 233–252| year = 1988| last1 = Engel | first1 = J.}}</ref><ref>{{cite book |title=Applied Logistic Regression Analysis |url=https://archive.org/details/appliedlogisticr00mena |url-access=limited |first=Scott |last=Menard |publisher=SAGE |year=2002 |page=[https://archive.org/details/appliedlogisticr00mena/page/n99 91]|isbn=9780761922087 }}</ref> '''multiclass LR''', '''[[Softmax activation function|softmax]] regression''', '''multinomial logit''' ('''mlogit'''), the '''maximum entropy''' ('''MaxEnt''') classifier, and the '''conditional maximum entropy model'''.<ref name="malouf">{{cite conference |first=Robert |last=Malouf |year=2002 |url=http://aclweb.org/anthology/W/W02/W02-2018.pdf |title=A comparison of algorithms for maximum entropy parameter estimation |conference=Sixth Conf. on Natural Language Learning (CoNLL) |pages=49–55}}</ref>
==Background==
==Assumptions==
The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case.
If the multinomial logit is used to model choices, it relies on the assumption of [[independence of irrelevant alternatives]] (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of ''K'' alternatives to be modeled as a set of ''K'' − 1 independent binary choices, in which one alternative is chosen as a "pivot" and the other ''K'' − 1 are compared against it, one at a time.
If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives.
==Model==
====Data points====
Specifically, it is assumed that we have a series of ''N'' observed data points. Each data point ''i'' (ranging from 1 to ''N'') consists of a set of ''M'' explanatory variables ''x''<sub>1,''i''</sub> … ''x''<sub>''M'',''i''</sub> (also known as [[independent variable]]s, predictor variables, features, etc.), and an associated [[categorical variable|categorical]] outcome ''Y''<sub>''i''</sub> (also known as the [[dependent variable]] or response variable), which can take on one of ''K'' possible values.
Some examples:
:<math>f(k,i) = \beta_{0,k} + \beta_{1,k} x_{1,i} + \beta_{2,k} x_{2,i} + \cdots + \beta_{M,k} x_{M,i},</math>
where <math>\beta_{m,k}</math> is a [[regression coefficient]] associated with the ''m''th explanatory variable and the ''k''th outcome. As explained in the [[logistic regression]] article, the regression coefficients and explanatory variables are normally grouped into vectors of size ''M'' + 1, so that the predictor function can be written more compactly:
:<math>f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i,</math>
where <math>\boldsymbol\beta_k</math> is the set of regression coefficients associated with outcome ''k'', and <math>\mathbf{x}_i</math> (a row vector) is the set of explanatory variables associated with observation ''i'', prepended by a 1 in entry 0.
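The following is a brief illustrative sketch in [[Python (programming language)|Python]] with NumPy (not part of the cited sources; the array names and numbers are hypothetical) of evaluating this linear predictor for all outcomes at once:
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical example: M = 3 explanatory variables and K = 4 possible outcomes.
M, K = 3, 4

# Explanatory variables for one observation i, with a 1 prepended in entry 0,
# so that the vector has length M + 1 and entry 0 multiplies the intercept.
x_i = np.array([1.0, 2.5, -0.3, 1.2])

# One coefficient vector beta_k of length M + 1 per outcome (made-up values).
beta = np.array([[ 0.1,  0.5, -0.2,  0.3],
                 [-0.4,  0.2,  0.7, -0.1],
                 [ 0.0, -0.3,  0.4,  0.6],
                 [ 0.2,  0.1, -0.5,  0.0]])

# Linear predictor f(k, i) = beta_k . x_i, computed for every outcome k at once.
f = beta @ x_i          # array of K scores
</syntaxhighlight>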
===As a set of independent binary regressions===
To arrive at the multinomial logit model, one can imagine, for ''K'' possible outcomes, running ''K'' − 1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and the other ''K'' − 1 outcomes are separately regressed against the pivot outcome. If outcome ''K'' (the last outcome) is chosen as the pivot, the ''K'' − 1 regression equations are:
: <math>
\begin{align}
\ln \frac{\Pr(Y_i=1)}{\Pr(Y_i=K)} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i \\
\ln \frac{\Pr(Y_i=2)}{\Pr(Y_i=K)} &= \boldsymbol\beta_2 \cdot \mathbf{X}_i \\
& \,\,\, \vdots \\
\ln \frac{\Pr(Y_i=K-1)}{\Pr(Y_i=K)} &= \boldsymbol\beta_{K-1} \cdot \mathbf{X}_i
\end{align}
</math>
This formulation is also known as the [[Compositional data#Additive logratio transform|additive log ratio transform]] commonly used in compositional data analysis.
If we exponentiate both sides and solve for the probabilities, we get:
: <math>
\begin{align}
\Pr(Y_i=1) &= {\Pr(Y_i=K)}\;e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} \\
\Pr(Y_i=2) &= {\Pr(Y_i=K)}\;e^{\boldsymbol\beta_2 \cdot \mathbf{X}_i} \\
& \,\,\, \vdots \\
\Pr(Y_i=K-1) &= {\Pr(Y_i=K)}\;e^{\boldsymbol\beta_{K-1} \cdot \mathbf{X}_i}
\end{align}
</math>
Using the fact that all ''K'' of the probabilities must sum to one, we find:
:<math>
\begin{align}
\Pr(Y_i=K) &= 1 - \sum_{k=1}^{K-1} \Pr(Y_i = k) = 1 - \sum_{k=1}^{K-1}{\Pr(Y_i=K)}\;e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \\
\Rightarrow \Pr(Y_i=K) &= \frac{1}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}
\end{align}
</math>
We can use this to find the other probabilities:
:<math>
\begin{align}
\Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}} \\
\Pr(Y_i=2) &= \frac{e^{\boldsymbol\beta_2 \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}} \\
& \,\,\, \vdots \\
\Pr(Y_i=K-1) &= \frac{e^{\boldsymbol\beta_{K-1} \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}
\end{align}
</math>
The fact that we run multiple regressions reveals why the model relies on the assumption of [[independence of irrelevant alternatives]] described above.
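The probabilities derived above can be computed directly from the ''K'' − 1 coefficient vectors. The following sketch (illustrative only; the numbers are hypothetical) treats the last outcome as the pivot:
<syntaxhighlight lang="python">
import numpy as np

def pivot_probabilities(beta, x):
    """Probabilities of K outcomes from K-1 coefficient vectors, with the last outcome as pivot.

    beta : array of shape (K-1, M+1), one row per non-pivot outcome
    x    : array of shape (M+1,), explanatory variables with a leading 1
    """
    scores = np.exp(beta @ x)                  # e^{beta_k . x} for k = 1, ..., K-1
    denom = 1.0 + scores.sum()                 # 1 + sum of exponentiated scores
    p_pivot = 1.0 / denom                      # Pr(Y = K)
    return np.append(scores / denom, p_pivot)  # Pr(Y = 1), ..., Pr(Y = K)

# Hypothetical example with K = 3 outcomes and M = 2 explanatory variables.
beta = np.array([[ 0.5, 1.0, -0.5],
                 [-0.2, 0.3,  0.8]])
x = np.array([1.0, 0.7, 1.5])                  # leading 1 for the intercept
print(pivot_probabilities(beta, x))            # the K probabilities sum to 1
</syntaxhighlight>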
===Estimating the coefficients===
The unknown parameters in each vector '''''β'''<sub>k</sub>'' are typically jointly estimated by [[maximum a posteriori]] (MAP) estimation, which is an extension of [[maximum likelihood]] using [[regularization (mathematics)|regularization]] of the weights to prevent pathological solutions (usually a squared regularizing function, which is equivalent to placing a zero-mean [[Gaussian distribution|Gaussian]] [[prior distribution]] on the weights, but other distributions are also possible). The solution is typically found using an iterative procedure such as [[generalized iterative scaling]],<ref>{{Cite journal |title=Generalized iterative scaling for log-linear models |author1=Darroch, J.N. |author2=Ratcliff, D. |name-list-style=amp |journal=The Annals of Mathematical Statistics |volume=43 |issue=5 |pages=1470–1480 |year=1972 |url=http://projecteuclid.org/download/pdf_1/euclid.aoms/1177692379 |doi=10.1214/aoms/1177692379|doi-access=free }}</ref> [[iteratively reweighted least squares]] (IRLS),<ref>{{cite book |first=Christopher M. |last=Bishop |year=2006 |title=Pattern Recognition and Machine Learning |publisher=Springer |pages=206–209}}</ref> by means of [[gradient-based optimization]] algorithms such as [[L-BFGS]],<ref name="malouf"/> or by specialized [[coordinate descent]] algorithms.<ref>{{cite journal |first1=Hsiang-Fu |last1=Yu |first2=Fang-Lan |last2=Huang |first3=Chih-Jen |last3=Lin |year=2011 |title=Dual coordinate descent methods for logistic regression and maximum entropy models |journal=Machine Learning |volume=85 |issue=1–2 |pages=41–75 |url=http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf |doi=10.1007/s10994-010-5221-8|doi-access=free }}</ref>
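As an illustrative sketch only (none of the cited algorithms; plain [[gradient descent]] on the L2-penalized negative log-likelihood, which corresponds to MAP estimation with a zero-mean Gaussian prior):
<syntaxhighlight lang="python">
import numpy as np

def fit_map(X, y, K, lam=1.0, lr=0.1, iters=1000):
    """MAP estimate for multinomial logistic regression via gradient descent.

    X   : (N, M+1) design matrix with a leading column of ones
    y   : (N,) integer labels in {0, ..., K-1}
    lam : strength of the squared (Gaussian-prior) regularizer
    """
    N, D = X.shape
    beta = np.zeros((K, D))              # one coefficient vector per outcome
    Y = np.eye(K)[y]                     # one-hot encoding of the labels, shape (N, K)
    for _ in range(iters):
        scores = X @ beta.T                              # (N, K) linear predictors
        scores -= scores.max(axis=1, keepdims=True)      # subtract row max for numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)                # softmax probabilities
        grad = (P - Y).T @ X / N + lam * beta            # gradient of penalized average NLL
        beta -= lr * grad
    return beta
</syntaxhighlight>
In practice, the iterative methods named above (generalized iterative scaling, IRLS, L-BFGS, coordinate descent) converge far faster than this naive sketch.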
===As a log-linear model===
: <math>
\begin{align}
\ln \Pr(Y_i=1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z \\
\ln \Pr(Y_i=2) &= \boldsymbol\beta_2 \cdot \mathbf{X}_i - \ln Z \\
& \,\,\, \vdots \\
\ln \Pr(Y_i=K) &= \boldsymbol\beta_K \cdot \mathbf{X}_i - \ln Z
\end{align}
</math>
: <math>
\begin{align}
\Pr(Y_i=1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} \\
\Pr(Y_i=2) &= \frac{1}{Z} e^{\boldsymbol\beta_2 \cdot \mathbf{X}_i} \\
& \,\,\, \vdots \\
\Pr(Y_i=K) &= \frac{1}{Z} e^{\boldsymbol\beta_K \cdot \mathbf{X}_i}
\end{align}
</math>
:<math>
\begin{align}
1 = \sum_{k=1}^{K} \Pr(Y_i=k) &= \sum_{k=1}^{K} \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \\
&= \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}
\end{align}
</math>
Therefore
:<math>Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.</math>
Note that this factor is "constant" in the sense that it is not a function of ''Y''<sub>''i''</sub>, which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or crucially, with respect to the unknown regression coefficients '''''β'''''<sub>''k''</sub>, which we will need to determine through some sort of [[mathematical optimization|optimization]] procedure.
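A short numerical sketch (hypothetical coefficients and data) of computing the unnormalized scores, the partition function ''Z'', and the resulting probabilities:
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical coefficient vectors (K = 3 outcomes, M = 2 variables plus intercept).
beta = np.array([[0.2,  1.0, -0.4],
                 [0.0, -0.3,  0.9],
                 [0.1,  0.5,  0.5]])
x = np.array([1.0, 2.0, -1.0])        # one observation, leading 1 for the intercept

log_unnorm = beta @ x                 # beta_k . x_i, the unnormalized log-probabilities
Z = np.exp(log_unnorm).sum()          # partition function: sum_k e^{beta_k . x_i}
probs = np.exp(log_unnorm) / Z        # Pr(Y_i = k) = e^{beta_k . x_i} / Z
print(probs, probs.sum())             # the probabilities sum to 1
</syntaxhighlight>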
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}
</math>
The following function:
:<math>\operatorname{softmax}(k, x_1, \ldots, x_n) = \frac{e^{x_k}}{\sum_{i=1}^n e^{x_i}}</math>
is referred to as the [[softmax function]]. The reason is that the effect of exponentiating the values <math>x_1, \ldots, x_n</math> is to exaggerate the differences between them. As a result, <math>\operatorname{softmax}(k, x_1, \ldots, x_n)</math> will return a value close to 0 whenever <math>x_k</math> is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a [[weighted average]] that behaves as a [[smooth function]] and approximates the [[indicator function]]
:<math>f(k) = \begin{cases}
1 & \text{if } k = \operatorname{arg\,max}(x_1, \ldots, x_n), \\
0 & \text{otherwise}.
\end{cases}
</math>
Thus, we can write the probability equations as
:<math>\Pr(Y_i=k) = \operatorname{softmax}(k, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i).</math>
The softmax function thus serves as the equivalent of the [[logistic function]] in binary logistic regression.
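The following small sketch (made-up scores) illustrates both points: softmax concentrates probability mass on the largest score as the differences grow, and with ''K'' = 2 and one score fixed at 0 it reduces to the logistic function:
<syntaxhighlight lang="python">
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())           # subtracting the maximum avoids overflow
    return e / e.sum()

print(softmax([1.0, 0.9, 0.8]))       # close scores: probabilities nearly uniform
print(softmax([10.0, 9.0, 8.0]))      # larger gaps: mass concentrates on the largest score
print(softmax([100.0, 90.0, 80.0]))   # approaches the indicator of the maximum

# With K = 2 and the second score fixed at 0, softmax reduces to the logistic function.
t = 1.3
print(softmax([t, 0.0])[0], 1.0 / (1.0 + np.exp(-t)))   # both equal 1/(1 + e^{-t})
</syntaxhighlight>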
Note that not all of the <math>\boldsymbol\beta_k</math> vectors of coefficients are uniquely [[identifiability|identifiable]]. This is due to the fact that all probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only <math>K-1</math> separately specifiable probabilities, and hence <math>K-1</math> separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector to all of the coefficient vectors, the equations are identical:
:<math>
\begin{align}
\frac{e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + \mathbf{C}) \cdot \mathbf{X}_i}} &= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \\
&= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}
\end{align}
</math>
As a result, it is conventional to set <math>\mathbf{C} = -\boldsymbol\beta_K</math> (or alternatively, one of the other coefficient vectors), so that one of the vectors becomes <math>\boldsymbol{0}</math> and all of the other vectors are transformed into the difference between themselves and the chosen vector. This is equivalent to "pivoting" around one of the ''K'' choices, and examining how much better or worse all of the other ''K'' − 1 choices are, relative to the pivot. Mathematically, we transform the coefficients as follows:
:<math>
\begin{align}
\boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K, \quad k < K \\
\boldsymbol\beta'_K &= 0.
\end{align}
</math>
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}}
</math>
Other than the prime symbols on the regression coefficients, this is exactly the same as the form of the model described above, in terms of ''K'' − 1 independent binary regressions against a pivot outcome.
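A quick numerical check (with made-up numbers) that shifting every coefficient vector by the same constant vector <math>\mathbf{C}</math> leaves the probabilities unchanged, which is why one of the vectors can be pinned to zero:
<syntaxhighlight lang="python">
import numpy as np

def probs(beta, x):
    s = np.exp(beta @ x)
    return s / s.sum()

beta = np.array([[ 0.4, -1.0,  2.0],
                 [ 1.1,  0.3, -0.5],
                 [-0.7,  0.8,  0.2]])
x = np.array([1.0, 0.5, -2.0])
C = np.array([3.0, -1.0, 0.5])        # an arbitrary constant vector

print(probs(beta, x))
print(probs(beta + C, x))             # identical: the parameterization is redundant

# Conventional identification: subtract the last vector from all of them, so beta'_K = 0.
beta_prime = beta - beta[-1]
print(probs(beta_prime, x))           # same probabilities again
</syntaxhighlight>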
===As a latent-variable model===
It is also possible to formulate multinomial logistic regression as a latent variable model, following the [[Logistic regression#Two-way latent-variable model|two-way latent variable model]] described for binary logistic regression. This formulation is common in the theory of [[discrete choice]] models, and makes it easier to compare multinomial logistic regression to the related [[multinomial probit]] model, as well as to extend it to more complex models.
Imagine that, for each data point ''i'' and possible outcome ''k'' = 1,2,...,''K'', there is a continuous [[latent variable]] ''Y''<sub>''i,k''</sub><sup>''*''</sup> (i.e. an unobserved [[random variable]]) that is distributed as follows:
: <math>
Y_{i,k}^{\ast} = \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k \,
</math>
where <math>\varepsilon_k \sim \operatorname{EV}_1(0,1),</math> i.e. a standard type-1 [[extreme value distribution]].
This latent variable can be thought of as the [[utility]] associated with data point ''i'' choosing outcome ''k'', where there is some randomness in the actual amount of utility obtained, which accounts for other unmodeled factors that go into the choice. The value of the actual variable <math>Y_i</math> is then determined in a non-random fashion from these latent variables (i.e. the randomness has been moved from the observed outcomes into the latent variables), where outcome ''k'' is chosen [[if and only if]] the associated utility (the value of <math>Y_{i,k}^{\ast}</math>) is greater than the utilities of all the other choices, i.e. if the utility associated with outcome ''k'' is the maximum of all the utilities. Since the latent variables are [[continuous variable|continuous]], the probability of two having exactly the same value is 0, so we ignore the scenario. That is:
: <math>
\begin{align}
\Pr(Y_i = 1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,2}^{\ast} \text{ and } Y_{i,1}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,1}^{\ast} > Y_{i,K}^{\ast}) \\
\Pr(Y_i = 2) &= \Pr(Y_{i,2}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,2}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,2}^{\ast} > Y_{i,K}^{\ast}) \\
& \,\,\,\vdots \\
\Pr(Y_i = K) &= \Pr(Y_{i,K}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,K}^{\ast} > Y_{i,2}^{\ast}\text{ and } \cdots \text{ and } Y_{i,K}^{\ast} > Y_{i,K-1}^{\ast})
\end{align}
</math>
Or, equivalently:
: <math>
\Pr(Y_i = k) = \Pr(\max(Y_{i,1}^{\ast},Y_{i,2}^{\ast},\ldots,Y_{i,K}^{\ast}) = Y_{i,k}^{\ast})
</math>
#In general, if <math>X \sim \operatorname{EV}_1(a,b)</math> and <math>Y \sim \operatorname{EV}_1(a,b)</math> then <math>X - Y \sim \operatorname{Logistic}(0,b).</math> That is, the difference of two [[independent identically distributed]] extreme-value-distributed variables follows the [[logistic distribution]], where the first parameter is unimportant. This is understandable since the first parameter is a [[___location parameter]], i.e. it shifts the mean by a fixed amount, and if two values are both shifted by the same amount, their difference remains the same. This means that all of the relational statements underlying the probability of a given choice involve the logistic distribution, which makes the initial choice of the extreme-value distribution, which seemed rather arbitrary, somewhat more understandable.
#The second parameter in an extreme-value or logistic distribution is a [[scale parameter]], such that if <math>X \sim \operatorname{Logistic}(0,1)</math> then <math>bX \sim \operatorname{Logistic}(0,b).</math> This means that the effect of using an error variable with an arbitrary scale parameter in place of scale 1 can be compensated simply by multiplying all regression vectors by the same scale. Together with the previous point, this shows that the use of a standard extreme-value distribution (___location 0, scale 1) for the error variables entails no loss of generality over using an arbitrary extreme-value distribution. In fact, the model is [[nonidentifiable]] (no single set of optimal coefficients) if the more general distribution is used.
#Because only differences of vectors of regression coefficients are used, adding an arbitrary constant to all coefficient vectors has no effect on the model. This means that, just as in the log-linear model, only ''K'' − 1 of the coefficient vectors are identifiable, and the last one can be set to an arbitrary value (e.g. 0).
Actually finding the values of the above probabilities is somewhat difficult, and is a problem of computing a particular [[order statistic]] (the first, i.e. maximum) of a set of values. However, it can be shown that the resulting expressions are the same as in above formulations, i.e. the two are equivalent.
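A Monte Carlo sketch of this equivalence (illustrative only, with hypothetical coefficients): adding standard [[Gumbel distribution|Gumbel]] (type-1 extreme value) noise to the deterministic utilities <math>\boldsymbol\beta_k \cdot \mathbf{X}_i</math> and choosing the maximizing outcome reproduces the softmax probabilities up to sampling error:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

beta = np.array([[ 0.5, 1.0],
                 [-0.3, 0.4],
                 [ 0.0, 0.0]])               # K = 3 coefficient vectors (last one pinned to 0)
x = np.array([1.0, 0.8])
utilities = beta @ x                         # deterministic part of the latent Y*_{i,k}

# Latent-variable simulation: add standard Gumbel noise and pick the outcome with maximal utility.
n_draws = 200_000
noise = rng.gumbel(loc=0.0, scale=1.0, size=(n_draws, 3))
choices = np.argmax(utilities + noise, axis=1)
empirical = np.bincount(choices, minlength=3) / n_draws

# Closed-form softmax probabilities for comparison.
softmax = np.exp(utilities) / np.exp(utilities).sum()
print(empirical)                             # approximately equal to softmax
print(softmax)
</syntaxhighlight>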
==Estimation of intercept==
When using multinomial logistic regression, one category of the dependent variable is chosen as the reference category. Separate [[odds ratio]]s are determined for all independent variables for each category of the dependent variable with the exception of the reference category, which is omitted from the analysis. The exponential beta coefficient represents the change in the odds of the dependent variable being in a particular category vis-a-vis the reference category, associated with a one unit change of the corresponding independent variable.
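A minimal numerical sketch (hypothetical coefficients) of this interpretation: increasing one independent variable by one unit multiplies the odds of a category relative to the reference category by the exponentiated coefficient:
<syntaxhighlight lang="python">
import numpy as np

def odds_vs_reference(beta_k, x):
    """Odds of category k versus the reference category, whose coefficients are all zero."""
    return np.exp(beta_k @ x)

beta_k = np.array([0.2, 0.6])        # intercept and one slope for category k (made-up values)
x0 = np.array([1.0, 2.0])            # leading 1 for the intercept
x1 = np.array([1.0, 3.0])            # same observation with the predictor increased by one unit

ratio = odds_vs_reference(beta_k, x1) / odds_vs_reference(beta_k, x0)
print(ratio, np.exp(beta_k[1]))      # both equal e^{0.6}, the exponentiated coefficient
</syntaxhighlight>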
== Likelihood function ==
The observed values <math>y_i \in \{1,\dots,K\}</math> for <math>i=1,\dots,n</math> of the explained variables are considered as realizations of stochastically independent, [[Categorical distribution|categorically distributed]] random variables <math>Y_1,\dots, Y_n</math>.
The [[likelihood function]] for this model is defined by
:<math>L = \prod_{i=1}^n P(Y_i=y_i) = \prod_{i=1}^n \prod_{j=1}^K P(Y_i=j)^{\delta_{j,y_i}},</math>
where the index <math>i</math> denotes the observations 1 to ''n'', the index <math>j</math> denotes the classes 1 to ''K'', and <math>\delta_{j,y_i}=\begin{cases}1, & \text{for } j=y_i \\ 0, & \text{otherwise}\end{cases}</math> is the [[Kronecker delta]].
The negative log-likelihood function is therefore the well-known cross-entropy:
:<math>-\log L = - \sum_{i=1}^n \sum_{j=1}^K \delta_{j,y_i} \log(P(Y_i=j))= - \sum_{j=1}^K\sum_{y_i=j}\log(P(Y_i=j)).</math>
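A direct sketch of this negative log-likelihood for hypothetical predicted probabilities and labels:
<syntaxhighlight lang="python">
import numpy as np

def negative_log_likelihood(P, y):
    """Cross-entropy of observed labels under predicted class probabilities.

    P : (n, K) array with P[i, j] = predicted Pr(Y_i = j)
    y : (n,) integer labels in {0, ..., K-1}
    """
    n = len(y)
    # Only the term with delta_{j, y_i} = 1 survives in the inner sum.
    return -np.log(P[np.arange(n), y]).sum()

# Hypothetical predictions for n = 3 observations and K = 3 classes.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])
print(negative_log_likelihood(P, y))   # equals -(log 0.7 + log 0.8 + log 0.4)
</syntaxhighlight>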
==Application in natural language processing==
In [[natural language processing]], multinomial LR classifiers are commonly used as an alternative to [[naive Bayes classifier]]s because they do not assume [[statistical independence]] of the random variables (commonly known as ''features'') that serve as predictors. However, learning in such a model is slower than for a naive Bayes classifier, and thus may not be appropriate given a very large number of classes to learn. In particular, learning in a naive Bayes classifier is a simple matter of counting up the number of co-occurrences of features and classes, while in a maximum entropy classifier the weights must be learned using an iterative procedure (see the section on [[#Estimating the coefficients|estimating the coefficients]] above).
==See also==