{{Short description|Regression for more than two discrete outcomes}}
In [[statistics]], '''multinomial logistic regression''' is a [[statistical classification|classification]] method that generalizes [[logistic regression]] to [[multiclass classification|multiclass problems]], i.e. with more than two possible discrete outcomes.<ref>{{cite book |last=Greene |first=William H. |author-link=William Greene (economist) |title=Econometric Analysis |edition=Seventh |___location=Boston |publisher=Pearson Education |year=2012 |isbn=978-0-273-75356-8 |pages=803–806 }}</ref> That is, it is a model that is used to predict the probabilities of the different possible outcomes of a [[categorical distribution|categorically distributed]] [[dependent variable]], given a set of [[independent variable]]s (which may be real-valued, binary-valued, categorical-valued, etc.).
 
Multinomial logistic regression is known by a variety of other names, including '''polytomous LR''',<ref>{{Cite journal | doi = 10.1111/j.1467-9574.1988.tb01238.x| title = Polytomous logistic regression| journal = Statistica Neerlandica| volume = 42| issue = 4| pages = 233–252| year = 1988| last1 = Engel | first1 = J.}}</ref><ref>{{cite book |title=Applied Logistic Regression Analysis |url=https://archive.org/details/appliedlogisticr00mena |url-access=limited |first=Scott |last=Menard |publisher=SAGE |year=2002 |page=[https://archive.org/details/appliedlogisticr00mena/page/n99 91]|isbn=9780761922087 }}</ref> '''multiclass LR''', '''[[Softmax activation function|softmax]] regression''', '''multinomial logit''' ('''mlogit'''), the '''maximum entropy''' ('''MaxEnt''') classifier, and the '''conditional maximum entropy model'''.<ref name="malouf">{{cite conference |first=Robert |last=Malouf |year=2002 |url=http://aclweb.org/anthology/W/W02/W02-2018.pdf |title=A comparison of algorithms for maximum entropy parameter estimation |conference=Sixth Conf. on Natural Language Learning (CoNLL) |pages=49–55}}</ref>
 
==Background==
 
==Assumptions==
The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case. The multinomial logistic model also assumes that the dependent variable cannot be perfectly predicted from the independent variables for any case. As with other types of regression, there is no need for the independent variables to be [[statistically independent]] from each other (unlike, for example, in a [[naive Bayes classifier]]); however, [[multicollinearity|collinearity]] is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case.<ref>{{cite book | last = Belsley | first = David | title = Conditioning diagnostics : collinearity and weak data in regression | publisher = Wiley | ___location = New York | year = 1991 | isbn = 9780471528890 }}</ref>
 
If the multinomial logit is used to model choices, it relies on the assumption of [[independence of irrelevant alternatives]] (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of ''K'' alternatives to be modeled as a set of ''K''&nbsp;−&nbsp;1 independent binary choices, in which one alternative is chosen as a "pivot" and the other ''K''&nbsp;−&nbsp;1 compared against it, one at a time. The IIA hypothesis is a core hypothesis in rational choice theory; however, numerous studies in psychology show that individuals often violate this assumption when making choices. An example of a problem case arises if choices include a car and a blue bus. Suppose the odds ratio between the two is 1 : 1. Now if the option of a red bus is introduced, a person may be indifferent between a red and a blue bus, and hence may exhibit a car : blue bus : red bus odds ratio of 1 : 0.5 : 0.5, thus maintaining a 1 : 1 ratio of car : any bus while adopting a changed car : blue bus ratio of 1 : 0.5. Here the red bus option was not in fact irrelevant, because a red bus was a [[perfect substitute]] for a blue bus.
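The IIA property can be made concrete with a small numerical sketch (the utility values below are hypothetical and not taken from any fitted model): a multinomial logit assigns choice shares by a softmax over utilities, so adding a red bus with the same utility as the blue bus pulls the car's share down to 1/3, rather than leaving it at 1/2 as intuition about perfect substitutes suggests.

<syntaxhighlight lang="python">
import numpy as np

def mnl_shares(utilities):
    """Multinomial-logit choice probabilities (softmax of the utilities)."""
    e = np.exp(utilities - np.max(utilities))  # shift by the max for numerical stability
    return e / e.sum()

# Hypothetical utilities: car and blue bus equally attractive.
print(mnl_shares(np.array([0.0, 0.0])))        # car, blue bus -> [0.5, 0.5]

# Add a red bus with the same utility: the logit predicts 1/3 each,
# instead of the intuitive car : blue bus : red bus = 0.5 : 0.25 : 0.25.
print(mnl_shares(np.array([0.0, 0.0, 0.0])))   # -> [0.333..., 0.333..., 0.333...]
</syntaxhighlight>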
 
If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives. This point is especially important to take into account if the analysis aims to predict how choices would change if one alternative were to disappear (for instance, if one political candidate withdraws from a three-candidate race). Other models like the [[nested logit]] or the [[multinomial probit]] may be used in such cases as they allow for violation of the IIA.<ref>{{cite journal |last1=Baltas |first1=G. |last2=Doyle |first2=P. |year=2001 |title=Random Utility Models in Marketing Research: A Survey |journal=[[Journal of Business Research]] |volume=51 |issue=2 |pages=115–125 |doi=10.1016/S0148-2963(99)00058-2 }}</ref>
 
==Model==
 
====Data points====
Specifically, it is assumed that we have a series of ''N'' observed data points. Each data point ''i'' (ranging from 1 to ''N'') consists of a set of ''M'' explanatory variables ''x''<sub>1,''i''</sub> ... ''x''<sub>''M'',''i''</sub> (also known as [[independent variable]]s, predictor variables, features, etc.), and an associated [[categorical variable|categorical]] outcome ''Y''<sub>''i''</sub> (also known as [[dependent variable]], response variable), which can take on one of ''K'' possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to ''K''. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of ''N'' "experiments", although an "experiment" may consist of nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so that the outcome of a new "experiment" can be correctly predicted for a new data point for which the explanatory variables, but not the outcome, are available. In the process, the model attempts to explain the relative effect of differing explanatory variables on the outcome.
 
Some examples:
As in other forms of linear regression, multinomial logistic regression uses a linear predictor function <math>f(k,i)</math> to predict the probability that observation ''i'' has outcome ''k'', of the following form:
:<math>f(k,i) = \beta_{0,k} + \beta_{1,k} x_{1,i} + \beta_{2,k} x_{2,i} + \cdots + \beta_{M,k} x_{M,i},</math>
 
where <math>\beta_{m,k}</math> is a [[regression coefficient]] associated with the ''m''th explanatory variable and the ''k''th outcome. As explained in the [[logistic regression]] article, the regression coefficients and explanatory variables are normally grouped into vectors of size ''M''&nbsp;+&nbsp;1, so that the predictor function can be written more compactly:
 
:<math>f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i,</math>
 
where <math>\boldsymbol\beta_k</math> is the set of regression coefficients associated with outcome ''k'', and <math>\mathbf{x}_i</math> (a row vector) is the set of explanatory variables associated with observation ''i'', prepended by a 1 in entry 0.
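As an illustration of the notation, the predictor values <math>f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i</math> for all outcomes and observations can be computed with a single matrix product; the shapes and random numbers below are assumptions made only for this sketch.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 5, 3, 4                     # observations, explanatory variables, outcomes
X = rng.normal(size=(N, M))
X = np.hstack([np.ones((N, 1)), X])   # prepend the constant 1 in entry 0
B = rng.normal(size=(K, M + 1))       # one coefficient vector beta_k per outcome

scores = X @ B.T                      # scores[i, k] = f(k, i) = beta_k . x_i
print(scores.shape)                   # (5, 4)
</syntaxhighlight>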
 
===As a set of independent binary regressions===
To arrive at the multinomial logit model, one can imagine, for ''K'' possible outcomes, running ''K''&nbsp;−&nbsp;1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and then the other ''K''&nbsp;−&nbsp;1 outcomes are separately regressed against the pivot outcome. If outcome ''K'' (the last outcome) is chosen as the pivot, the ''K''&nbsp;−&nbsp;1 regression equations are:
 
: <math>
\ln \frac{\Pr(Y_i=k)}{\Pr(Y_i=K)} = \boldsymbol\beta_k \cdot \mathbf{X}_i, \;\;\;\;\;\; 1 \leq k < K.
</math>
 
This formulation is also known as the [[Compositional_data#Additive_log_ratio_transform|additive log ratio]] transform commonly used in compositional data analysis. In other applications it is referred to as "relative risk".<ref>[https://www.stata.com/manuals13/rmlogit.pdf Stata Manual "mlogit — Multinomial (polytomous) logistic regression"]</ref>
Note that we have introduced separate sets of regression coefficients, one for each possible outcome.
 
If we exponentiate both sides, and solve for the probabilities, we get:
 
: <math>
\Pr(Y_i=k) = \Pr(Y_i=K)\,e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\;\;\; 1 \leq k < K.
</math>
 
Using the fact that all ''K'' of the probabilities must sum to one, we find:
 
:<math>\Pr(Y_i=K) = 1 - \sum_{k=1}^{K-1} \Pr(Y_i = k) = 1 - \sum_{k=1}^{K-1}\Pr(Y_i=K)\,e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;\;\Rightarrow\;\; \Pr(Y_i=K) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}</math>
 
We can use this to find the other probabilities:
 
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\;\;\; 1 \leq k < K.
</math>
 
The fact that we run multiple regressions reveals why the model relies on the assumption of [[independence of irrelevant alternatives]] described above.
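The pivot formulation above translates directly into code. The following is a minimal sketch (not a fitted model; the coefficients are arbitrary) that maps ''K''&nbsp;−&nbsp;1 coefficient vectors to the ''K'' class probabilities, with outcome ''K'' as the pivot.

<syntaxhighlight lang="python">
import numpy as np

def pivot_probabilities(X, B):
    """Class probabilities with outcome K as the pivot.

    X : (N, M+1) design matrix with a leading column of ones
    B : (K-1, M+1) coefficients beta_1 ... beta_{K-1}; beta_K is implicitly 0
    """
    expo = np.exp(X @ B.T)                  # e^{beta_k . x_i}, shape (N, K-1)
    denom = 1.0 + expo.sum(axis=1, keepdims=True)
    p_first = expo / denom                  # Pr(Y_i = k) for k = 1 ... K-1
    p_pivot = 1.0 / denom                   # Pr(Y_i = K)
    return np.hstack([p_first, p_pivot])

# Arbitrary illustrative numbers: K = 3 outcomes, one explanatory variable plus intercept
X = np.array([[1.0, 0.5], [1.0, -1.0]])
B = np.array([[0.2, 1.0], [-0.3, 0.4]])
print(pivot_probabilities(X, B).sum(axis=1))   # rows sum to 1
</syntaxhighlight>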
===Estimating the coefficients===
 
The unknown parameters in each vector '''''β'''<sub>k</sub>'' are typically jointly estimated by [[maximum a posteriori]] (MAP) estimation, which is an extension of [[maximum likelihood]] using [[regularization (mathematics)|regularization]] of the weights to prevent pathological solutions (usually a squared regularizing function, which is equivalent to placing a zero-mean [[Gaussian distribution|Gaussian]] [[prior distribution]] on the weights, but other distributions are also possible). The solution is typically found using an iterative procedure such as [[generalized iterative scaling]],<ref>{{Cite journal |title=Generalized iterative scaling for log-linear models |author1=Darroch, J.N. |author2=Ratcliff, D. |name-list-style=amp |journal=The Annals of Mathematical Statistics |volume=43 |issue=5 |pages=1470–1480 |year=1972 |url=http://projecteuclid.org/download/pdf_1/euclid.aoms/1177692379 |doi=10.1214/aoms/1177692379|doi-access=free }}</ref> [[iteratively reweighted least squares]] (IRLS),<ref>{{cite book |first=Christopher M. |last=Bishop |year=2006 |title=Pattern Recognition and Machine Learning |publisher=Springer |pages=206–209}}</ref> by means of [[gradient-based optimization]] algorithms such as [[L-BFGS]],<ref name="malouf"/> or by specialized [[coordinate descent]] algorithms.<ref>{{cite journal |first1=Hsiang-Fu |last1=Yu |first2=Fang-Lan |last2=Huang |first3=Chih-Jen |last3=Lin |year=2011 |title=Dual coordinate descent methods for logistic regression and maximum entropy models |journal=Machine Learning |volume=85 |issue=1–2 |pages=41–75 |url=http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf |doi=10.1007/s10994-010-5221-8|doi-access=free }}</ref>
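None of the cited algorithms is reproduced here, but the objective they optimize can be sketched: MAP estimation with a squared (Gaussian-prior) penalty amounts to minimizing the regularized negative log-likelihood, which a plain gradient-descent loop such as the following will do for small problems (step size, penalty strength and iteration count are arbitrary choices for the sketch).

<syntaxhighlight lang="python">
import numpy as np

def fit_map(X, y, K, lam=1.0, lr=0.1, iters=2000):
    """Gradient descent on the L2-regularized negative log-likelihood.

    X : (N, M+1) design matrix, y : (N,) labels coded 0 ... K-1
    """
    N, D = X.shape
    B = np.zeros((K, D))
    Y = np.eye(K)[y]                          # one-hot targets, shape (N, K)
    for _ in range(iters):
        S = X @ B.T
        S -= S.max(axis=1, keepdims=True)     # numerical stability
        P = np.exp(S)
        P /= P.sum(axis=1, keepdims=True)     # softmax probabilities
        grad = (P - Y).T @ X / N + lam * B    # gradient of NLL/N + (lam/2)*||B||^2
        B -= lr * grad
    return B

# Illustrative use on random data
rng = np.random.default_rng(1)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = rng.integers(0, 3, size=200)
print(fit_map(X, y, K=3).shape)               # (3, 3)
</syntaxhighlight>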
 
===As a log-linear model===
The formulation of binary logistic regression as a [[log-linear model]] can be extended directly to multi-way regression: the logarithm of the probability of each outcome is modeled using the linear predictor together with an additional normalization term, the logarithm of the [[Partition function (mathematics)|partition function]]:
 
: <math>
\ln \Pr(Y_i=k) = \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z, \;\;\;\;\;\; 1 \leq k \leq K.
</math>
 
Exponentiating both sides gives:
 
: <math>
\Pr(Y_i=k) = \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\;\;\; 1 \leq k \leq K.
</math>
 
Since all of the probabilities must sum to 1, we can solve for ''Z'':
 
:<math>
1 = \sum_{k=1}^{K} \Pr(Y_i=k) = \sum_{k=1}^{K} \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} = \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.
</math>
 
Therefore:
 
:<math>Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.</math>
 
Note that this factor is "constant" in the sense that it is not a function of ''Y''<sub>''i''</sub>, which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or crucially, with respect to the unknown regression coefficients '''''β'''''<sub>''k''</sub>, which we will need to determine through some sort of [[mathematical optimization|optimization]] procedure.
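In practice the normalization is usually carried out in log space, because <math>Z</math> can overflow floating-point arithmetic when the scores <math>\boldsymbol\beta_k \cdot \mathbf{X}_i</math> are large. The following sketch of the standard log-sum-exp trick is a numerical detail, not part of the model itself.

<syntaxhighlight lang="python">
import numpy as np

def log_partition(scores):
    """log Z for each observation, via the log-sum-exp trick.

    scores : (N, K) array of beta_k . x_i values
    """
    m = scores.max(axis=1, keepdims=True)      # subtract the max to avoid overflow
    return m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))

scores = np.array([[1000.0, 1001.0, 1002.0]])  # naive exp() would overflow here
log_p = scores - log_partition(scores)         # log Pr(Y_i = k)
print(np.exp(log_p))                           # approx. [[0.090, 0.245, 0.665]]
</syntaxhighlight>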
 
Substituting the expression for ''Z'' back in gives the probabilities:

:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\;\;\; 1 \leq k \leq K.
</math>
 
The following function:
 
:<math>\operatorname{softmax}(k,s_1,\ldots,s_K) = \frac{e^{s_k}}{\sum_{j=1}^K e^{s_j}}</math>
 
is referred to as the [[softmax function]]. The reason is that the effect of exponentiating the values <math>s_1,\ldots,s_K</math> is to exaggerate the differences between them. As a result, <math>\operatorname{softmax}(k,s_1,\ldots,s_K)</math> will return a value close to 0 whenever <math>s_k</math> is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a [[weighted average]] that behaves as a [[smooth function]] (which can be conveniently [[differentiation (mathematics)|differentiated]], etc.) and which approximates the [[indicator function]]
 
:<math>f(k) = \begin{cases}
1 & \text{ if } k = \arg\max_j s_j, \\
0 & \text{ otherwise}.
\end{cases}
</math>
Thus, we can write the probability equations as
 
:<math>\Pr(Y_i=k) = \operatorname{softmax}(k, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i)</math>
 
The softmax function thus serves as the equivalent of the [[logistic function]] in binary logistic regression.
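The "exaggeration" property described above is easy to see numerically; in this sketch the scores are arbitrary, and scaling them up pushes the softmax output toward the indicator of the largest score.

<syntaxhighlight lang="python">
import numpy as np

def softmax(s):
    """softmax(k, s_1, ..., s_K) for every k at once."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

s = np.array([1.0, 2.0, 4.0])
print(softmax(s))        # the largest score gets the largest share
print(softmax(10 * s))   # scaled-up scores give approximately [0, 0, 1],
                         # i.e. the indicator of the arg max
</syntaxhighlight>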
 
Note that not all of the <math>\boldsymbol\beta_k</math> vectors of coefficients are uniquely [[identifiability|identifiable]]. This is due to the fact that all probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only <math>K-1</math> separately specifiable probabilities, and hence <math>K-1</math> separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector to all of the coefficient vectors, the equations are identical:
 
:<math>
\begin{align}
\frac{e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + \mathbf{C}) \cdot \mathbf{X}_i}} &= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\
&= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \\
&= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}
\end{align}
</math>
 
As a result, it is conventional to set <math>\mathbf{C} = -\boldsymbol\beta_K</math> (or alternatively, one of the other coefficient vectors). Essentially, we set the constant so that one of the vectors becomes <math>\boldsymbol 0</math>, and all of the other vectors get transformed into the difference between those vectors and the vector we chose. This is equivalent to "pivoting" around one of the ''K'' choices, and examining how much better or worse all of the other ''K''&nbsp;−&nbsp;1 choices are, relative to the choice we are pivoting around. Mathematically, we transform the coefficients as follows:
 
:<math>
\begin{align}
\boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K, \;\;\;\; 1 \leq k < K, \\
\boldsymbol\beta'_K &= 0
\end{align}
</math>
Expressed in terms of the transformed coefficients, the probabilities become:
 
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}}, \;\;\;\;\;\; 1 \leq k \leq K
</math>
 
Other than the prime symbols on the regression coefficients, this is exactly the same as the form of the model described above, in terms of ''K''&nbsp;−&nbsp;1 independent two-way regressions.
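The invariance argument above can be checked numerically. In this sketch (random coefficients, purely illustrative), adding the same constant vector <math>\mathbf{C}</math> to every <math>\boldsymbol\beta_k</math>, or subtracting <math>\boldsymbol\beta_K</math> from each of them, leaves the predicted probabilities unchanged.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
K, D = 4, 3
B = rng.normal(size=(K, D))     # one coefficient vector per outcome
x = rng.normal(size=D)          # one observation's explanatory variables
C = rng.normal(size=D)          # an arbitrary constant vector

def probs(B, x):
    s = B @ x
    e = np.exp(s - s.max())
    return e / e.sum()

print(np.allclose(probs(B, x), probs(B + C, x)))       # True: adding C changes nothing
print(np.allclose(probs(B, x), probs(B - B[-1], x)))   # True: pivoting on outcome K is
                                                       # just a reparameterization
</syntaxhighlight>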
 
===As a latent-variable model===
It is also possible to formulate multinomial logistic regression as a latent variable model, following the [[Logistic regression#Two-way latent-variable model|two-way latent variable model]] described for binary logistic regression. This formulation is common in the theory of [[discrete choice]] models, and makes it easier to compare multinomial logistic regression to the related [[multinomial probit]] model, as well as to extend it to more complex models.
 
Imagine that, for each data point ''i'' and possible outcome ''k''&nbsp;=&nbsp;1,2,...,''K'', there is a continuous [[latent variable]] ''Y''<sub>''i,k''</sub><sup>''*''</sup> (i.e. an unobserved [[random variable]]) that is distributed as follows:
 
: <math>
Y_{i,k}^{\ast} = \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k, \;\;\;\; 1 \leq k \leq K,
</math>
 
where <math>\varepsilon_k \sim \operatorname{EV}_1(0,1),</math> i.e. a standard type-1 [[extreme value distribution]].
 
This latent variable can be thought of as the [[utility]] associated with data point ''i'' choosing outcome ''k'', where there is some randomness in the actual amount of utility obtained, which accounts for other unmodeled factors that go into the choice. The value of the actual variable <math>Y_i</math> is then determined in a non-random fashion from these latent variables (i.e. the randomness has been moved from the observed outcomes into the latent variables), where outcome ''k'' is chosen [[if and only if]] the associated utility (the value of <math>Y_{i,k}^{\ast}</math>) is greater than the utilities of all the other choices, i.e. if the utility associated with outcome ''k'' is the maximum of all the utilities. Since the latent variables are [[continuous variable|continuous]], the probability of two having exactly the same value is 0, so we ignore the scenario. That is:
 
: <math>
\begin{align}
\Pr(Y_i = 1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,2}^{\ast} \text{ and } Y_{i,1}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,1}^{\ast} > Y_{i,K}^{\ast}) \\
\Pr(Y_i = 2) &= \Pr(Y_{i,2}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,2}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,2}^{\ast} > Y_{i,K}^{\ast}) \\
& \,\,\,\vdots \\
\Pr(Y_i = K) &= \Pr(Y_{i,K}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,K}^{\ast} > Y_{i,2}^{\ast}\text{ and } \cdots \text{ and } Y_{i,K}^{\ast} > Y_{i,K-1}^{\ast})
\end{align}
</math>

Or equivalently:
 
: <math>
\Pr(Y_i = k) = \Pr(\max(Y_{i,1}^{\ast},Y_{i,2}^{\ast},\ldots,Y_{i,K}^{\ast}) = Y_{i,k}^{\ast}), \;\;\;\; 1 \leq k \leq K.
</math>
 
#In general, if <math>X \sim \operatorname{EV}_1(a,b)</math> and <math>Y \sim \operatorname{EV}_1(a,b)</math> then <math>X - Y \sim \operatorname{Logistic}(0,b).</math> That is, the difference of two [[independent identically distributed]] extreme-value-distributed variables follows the [[logistic distribution]], where the first parameter is unimportant. This is understandable since the first parameter is a [[___location parameter]], i.e. it shifts the mean by a fixed amount, and if two values are both shifted by the same amount, their difference remains the same. This means that all of the relational statements underlying the probability of a given choice involve the logistic distribution, which makes the initial choice of the extreme-value distribution, which seemed rather arbitrary, somewhat more understandable.
#The second parameter in an extreme-value or logistic distribution is a [[scale parameter]], such that if <math>X \sim \operatorname{Logistic}(0,1)</math> then <math>bX \sim \operatorname{Logistic}(0,b).</math> This means that the effect of using an error variable with an arbitrary scale parameter in place of scale 1 can be compensated simply by multiplying all regression vectors by the same scale. Together with the previous point, this shows that the use of a standard extreme-value distribution (___location 0, scale 1) for the error variables entails no loss of generality over using an arbitrary extreme-value distribution. In fact, the model is [[nonidentifiable]] (no single set of optimal coefficients) if the more general distribution is used.
#Because only differences of vectors of regression coefficients are used, adding an arbitrary constant to all coefficient vectors has no effect on the model. This means that, just as in the log-linear model, only ''K''&nbsp;−&nbsp;1 of the coefficient vectors are identifiable, and the last one can be set to an arbitrary value (e.g. 0).
 
Actually finding the values of the above probabilities is somewhat difficult, and is a problem of computing a particular [[order statistic]] (the first, i.e. maximum) of a set of values. However, it can be shown that the resulting expressions are the same as in above formulations, i.e. the two are equivalent.
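The equivalence can also be checked by simulation rather than by deriving the order statistic explicitly; the following Monte Carlo sketch (with hypothetical score values) draws standard type-1 extreme value (Gumbel) errors and compares the frequency with which each alternative has the maximal utility against the softmax probabilities.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
K, n_draws = 3, 200_000
systematic = np.array([0.5, 1.5, -0.2])          # hypothetical beta_k . x_i values

# Latent utilities with i.i.d. standard Gumbel (type-1 extreme value) errors
noise = rng.gumbel(loc=0.0, scale=1.0, size=(n_draws, K))
choices = np.argmax(systematic + noise, axis=1)  # outcome with maximal utility wins

empirical = np.bincount(choices, minlength=K) / n_draws
softmax = np.exp(systematic) / np.exp(systematic).sum()

print(np.round(empirical, 3))   # simulated choice frequencies
print(np.round(softmax, 3))     # logit probabilities; the two agree closely
</syntaxhighlight>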
==Estimation of intercept==
When using multinomial logistic regression, one category of the dependent variable is chosen as the reference category. Separate [[odds ratio]]s are determined for all independent variables for each category of the dependent variable with the exception of the reference category, which is omitted from the analysis. The exponentiated beta coefficient represents the change in the odds of the dependent variable being in a particular category vis-à-vis the reference category, associated with a one-unit change in the corresponding independent variable.
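As a worked illustration in the notation introduced above (with category ''K'' as the reference), a one-unit increase in the ''m''th independent variable multiplies the odds of category ''k'' versus the reference by <math>e^{\beta_{m,k}}</math>:

:<math>\frac{\Pr(Y_i=k \mid x_{m,i}+1)\,/\,\Pr(Y_i=K \mid x_{m,i}+1)}{\Pr(Y_i=k \mid x_{m,i})\,/\,\Pr(Y_i=K \mid x_{m,i})} = \frac{e^{\beta_{0,k} + \cdots + \beta_{m,k}(x_{m,i}+1) + \cdots + \beta_{M,k}x_{M,i}}}{e^{\beta_{0,k} + \cdots + \beta_{m,k}x_{m,i} + \cdots + \beta_{M,k}x_{M,i}}} = e^{\beta_{m,k}}.</math>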
 
== Likelihood function ==
The observed values <math>y_i \in \{1,\dots,K\}</math> for <math>i=1,\dots,n</math> of the explained variables are considered as realizations of stochastically independent, [[Categorical distribution|categorically distributed]] random variables <math>Y_1,\dots, Y_n</math>.
 
The [[likelihood function]] for this model is defined by
:<math>L = \prod_{i=1}^n P(Y_i=y_i) = \prod_{i=1}^n \prod_{j=1}^K P(Y_i=j)^{\delta_{j,y_i}},</math>
where the index <math>i</math> denotes the observations 1 to ''n'', the index <math>j</math> denotes the classes 1 to ''K'', and <math>\delta_{j,y_i}=\begin{cases}1, & \text{for } j=y_i, \\ 0, & \text{otherwise,}\end{cases}</math> is the [[Kronecker delta]].
 
The negative log-likelihood function is therefore the well-known cross-entropy:
:<math>-\log L = - \sum_{i=1}^n \sum_{j=1}^K \delta_{j,y_i} \log(P(Y_i=j)) = - \sum_{j=1}^K \sum_{i:\,y_i=j} \log(P(Y_i=j)).</math>
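As a minimal sketch (illustrative probabilities and labels only), the cross-entropy above reduces to summing the negative log of the probability assigned to each observed class, since only the terms with <math>\delta_{j,y_i}=1</math> survive.

<syntaxhighlight lang="python">
import numpy as np

def neg_log_likelihood(P, y):
    """Cross-entropy -log L for class probabilities P and observed labels y.

    P : (n, K) matrix with P[i, j] = Pr(Y_i = j+1)
    y : (n,) observed classes coded 0 ... K-1
    """
    n = P.shape[0]
    return -np.log(P[np.arange(n), y]).sum()

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3]])
y = np.array([0, 2])
print(neg_log_likelihood(P, y))   # -(log 0.7 + log 0.3) ≈ 1.561
</syntaxhighlight>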
 
==Application in natural language processing==
In [[natural language processing]], multinomial LR classifiers are commonly used as an alternative to [[naive Bayes classifier]]s because they do not assume [[statistical independence]] of the random variables (commonly known as ''features'') that serve as predictors. However, learning in such a model is slower than for a naive Bayes classifier, and thus may not be appropriate given a very large number of classes to learn. In particular, learning in a naive Bayes classifier is a simple matter of counting up the number of co-occurrences of features and classes, while in a maximum entropy classifier the weights, which are typically estimated using [[maximum a posteriori]] (MAP) estimation, must be learned using an iterative procedure; see [[#Estimating the coefficients]] above.
 
==See also==