The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case. As with other types of regression, there is no need for the independent variables to be [[statistically independent]] from each other (unlike, for example, in a [[naive Bayes classifier]]); however, [[multicollinearity|collinearity]] is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case.<ref>{{cite book | last = Belsley | first = David | title = Conditioning diagnostics : collinearity and weak data in regression | publisher = Wiley | ___location = New York | year = 1991 | isbn = 9780471528890 }}</ref>
If the multinomial logit is used to model choices, it relies on the assumption of [[independence of irrelevant alternatives]] (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of ''K'' alternatives to be modeled as a set of ''K'' − 1 independent binary choices, in which one alternative is chosen as a "pivot" and the other ''K'' − 1 are compared against it, one at a time.
If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives. This is especially important to take into account if the analysis aims to predict how choices would change if one alternative were to disappear (for instance, if one political candidate withdraws from a three-candidate race). Other models like the [[nested logit]] or the [[multinomial probit]] may be used in such cases, as they allow for violation of the IIA.<ref>{{cite journal |last1=Baltas |first1=G. |last2=Doyle |first2=P. |year=2001 |title=Random Utility Models in Marketing Research: A Survey |journal=[[Journal of Business Research]] |volume=51 |issue=2 |pages=115–125 |doi=10.1016/S0148-2963(99)00058-2 }}</ref>
====Data points====
Specifically, it is assumed that we have a series of ''N'' observed data points. Each data point ''i'' (ranging from 1 to ''N'') consists of a set of ''M'' explanatory variables ''x''<sub>1,''i''</sub> ... ''x''<sub>''M'',''i''</sub> (also known as [[independent variable]]s, predictor variables, features, etc.), and an associated [[categorical variable|categorical]] outcome ''Y''<sub>''i''</sub> (also known as [[dependent variable]], response variable), which can take on one of ''K'' possible values.
Some examples:
:<math>f(k,i) = \beta_{0,k} + \beta_{1,k} x_{1,i} + \beta_{2,k} x_{2,i} + \cdots + \beta_{M,k} x_{M,i},</math>
where <math>\beta_{m,k}</math> is a [[regression coefficient]] associated with the ''m''th explanatory variable and the ''k''th outcome. As explained in the [[logistic regression]] article, the regression coefficients and explanatory variables are normally grouped into vectors of size ''M'' + 1, so that the predictor function can be written more compactly:
:<math>f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i,</math>
where <math>\boldsymbol\beta_k</math> is the set of regression coefficients associated with outcome ''k'', and <math>\mathbf{x}_i</math> (a row vector) is the set of explanatory variables associated with observation ''i'', prepended by a 1 in entry 0.
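As an illustration, a minimal Python/NumPy sketch of this predictor function (the array names and numeric values below are arbitrary examples, not part of the model):

<syntaxhighlight lang="python">
import numpy as np

def linear_predictor(beta_k, x_i):
    """Compute f(k, i) = beta_k . x_i, where x_i has a 1 prepended for the intercept."""
    return np.dot(beta_k, x_i)

# Example with M = 2 explanatory variables, so vectors have size M + 1 = 3.
beta_k = np.array([0.5, -1.2, 0.3])        # [beta_{0,k}, beta_{1,k}, beta_{2,k}]
x_i = np.concatenate(([1.0], [2.0, 0.7]))  # explanatory variables with a 1 prepended in entry 0
print(linear_predictor(beta_k, x_i))
</syntaxhighlight>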
===As a set of independent binary regressions===
To arrive at the multinomial logit model, one can imagine, for ''K'' possible outcomes, running ''K'' − 1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and then the other ''K'' − 1 outcomes are separately regressed against the pivot outcome. If outcome ''K'' (the last outcome) is chosen as the pivot, the ''K'' − 1 regression equations are:
: <math>
\ln \frac{\Pr(Y_i=k)}{\Pr(Y_i=K)} \,=\, \boldsymbol\beta_k \cdot \mathbf{X}_i, \;\;\;\; k < K.
</math>
This formulation is also known as the [[Compositional data#Additive logratio transform|additive log ratio]] transform commonly used in compositional data analysis.
If we exponentiate both sides and solve for the probabilities, we get:
: <math>
\Pr(Y_i=k) \,=\, {\Pr(Y_i=K)}\;e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\; k < K
</math>
Using the fact that all ''K'' of the probabilities must sum to one, we find:
:<math>
\begin{align}
\Pr(Y_i=K) &= 1- \sum_{j=1}^{K-1} \Pr (Y_i = j) = 1 - \sum_{j=1}^{K-1}{\Pr(Y_i=K)}\;e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} \\
\Rightarrow \Pr(Y_i=K) &= \frac{1}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}
\end{align}
</math>
We can use this to find the other probabilities:
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\; k < K.
</math>
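These formulas can be turned directly into a small Python/NumPy sketch; the coefficient matrix <code>beta</code> (shape (''K'' − 1) × (''M'' + 1), with category ''K'' as the pivot) and the example numbers are arbitrary illustrations:

<syntaxhighlight lang="python">
import numpy as np

def pivot_probabilities(beta, x):
    """Probabilities recovered from K-1 binary regressions against pivot category K.

    beta : array of shape (K-1, M+1), one coefficient vector per non-pivot outcome
    x    : explanatory variables of length M+1, with a leading 1 for the intercept
    """
    exp_scores = np.exp(beta @ x)            # e^{beta_k . x} for k = 1, ..., K-1
    p_K = 1.0 / (1.0 + exp_scores.sum())     # Pr(Y = K)
    return np.append(exp_scores * p_K, p_K)  # Pr(Y = 1), ..., Pr(Y = K-1), Pr(Y = K)

probs = pivot_probabilities(np.array([[0.2, 1.0], [-0.5, 0.3]]), np.array([1.0, 2.0]))
print(probs, probs.sum())  # the probabilities sum to 1
</syntaxhighlight>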
: <math>
\ln \Pr(Y_i=k) = \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z, \;\;\;\; k \le K
</math>
As in the binary case, we need an extra term <math>- \ln Z</math> to ensure that the whole set of probabilities forms a [[probability distribution]], i.e. so that they all sum to one:
: <math>
\Pr(Y_i=k) = \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\; k \le K
</math>
The quantity ''Z'' is called the [[partition function (mathematics)|partition function]] for the distribution. We can compute the value of the partition function by applying the above constraint that requires all probabilities to sum to 1:
:<math>
1 = \sum_{k=1}^{K} \Pr(Y_i=k) \;=\; \sum_{k=1}^{K} \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;=\; \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.
</math>
Therefore
:<math>Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.</math>
Note that this factor is "constant" in the sense that it is not a function of ''Y''<sub>''i''</sub>, which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or crucially, with respect to the unknown regression coefficients '''''β'''''<sub>''k''</sub>, which we will need to determine through some sort of [[mathematical optimization|optimization]] procedure.
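As a sketch of how this works in practice, the partition function and the resulting probabilities can be computed as follows (the variable names are arbitrary; subtracting the maximum score before exponentiating is a common numerical-stability device that changes ''Z'' but not the probabilities):

<syntaxhighlight lang="python">
import numpy as np

def log_linear_probabilities(beta, x):
    """Pr(Y = k) = exp(beta_k . x) / Z, with Z = sum_j exp(beta_j . x).

    beta : array of shape (K, M+1); x : explanatory variables of length M+1 with a leading 1.
    """
    scores = beta @ x
    scores = scores - scores.max()  # shift for numerical stability; probabilities are unchanged
    exp_scores = np.exp(scores)
    Z = exp_scores.sum()            # partition function
    return exp_scores / Z
</syntaxhighlight>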
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\; k \le K
</math>
The following function:
:<math>\operatorname{softmax}(k, x_1, \ldots, x_n) = \frac{e^{x_k}}{\sum_{i=1}^n e^{x_i}}</math>
is referred to as the [[softmax function]]. The reason is that the effect of exponentiating the values <math>x_1, \ldots, x_n</math> is to exaggerate the differences between them. As a result, <math>\operatorname{softmax}(k, x_1, \ldots, x_n)</math> will return a value close to 0 whenever <math>x_k</math> is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a [[weighted average]] that behaves as a [[smooth function]] (which can be conveniently differentiated, etc.) and which approximates the [[indicator function]]
:<math>f(k) = \begin{cases}
1 & \text{if } k = \operatorname{arg\,max}(x_1, \ldots, x_n), \\
0 & \text{otherwise}.
\end{cases}
</math>
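This smoothing behaviour can be seen in a short Python/NumPy sketch (the input values are arbitrary examples):

<syntaxhighlight lang="python">
import numpy as np

def softmax(x):
    """Return softmax(k, x_1, ..., x_n) for all k at once, shifted for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 8.0])))  # ~[0.001, 0.002, 0.997]: close to the indicator of the maximum
print(softmax(np.array([1.0, 7.9, 8.0])))  # the maximum is not well separated, so mass is split between the top two
</syntaxhighlight>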
Thus, we can write the probability equations as
:<math>\Pr(Y_i=k) = \operatorname{softmax}(k, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i)</math>
The softmax function thus serves as the equivalent of the [[logistic function]] in binary logistic regression.
Note that not all of the <math>\boldsymbol\beta_k</math> vectors of coefficients are uniquely [[identifiability|identifiable]]. This is due to the fact that all probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only <math>K-1</math> separately specifiable probabilities, and hence <math>K-1</math> separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector to all of the coefficient vectors, the equations are identical:
:<math>
\begin{align}
\frac{e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + \mathbf{C}) \cdot \mathbf{X}_i}} &= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\
&= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \\
&= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}
\end{align}
</math>
As a result, it is conventional to set <math>\mathbf{C} = -\boldsymbol\beta_K</math> (or alternatively, one of the other coefficient vectors). Essentially, we set the constant so that one of the vectors becomes <math>\boldsymbol 0</math>, and all of the other vectors get transformed into the difference between those vectors and the vector we chose. This is equivalent to "pivoting" around one of the ''K'' choices, and examining how much better or worse all of the other ''K'' − 1 choices are, relative to the choice we are pivoting around. Mathematically, we transform the coefficients as follows:
:<math>
\begin{align}
\boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K, \;\;\; k < K \\
\boldsymbol\beta'_K &= 0.
\end{align}
</math>
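This invariance is easy to check numerically; in the following sketch (with arbitrary example coefficients), subtracting <math>\boldsymbol\beta_K</math> from every coefficient vector leaves the probabilities unchanged:

<syntaxhighlight lang="python">
import numpy as np

def softmax_probs(beta, x):
    scores = beta @ x
    e = np.exp(scores - scores.max())
    return e / e.sum()

beta = np.array([[0.2, 1.0], [-0.5, 0.3], [0.1, -0.4]])  # K = 3 coefficient vectors (arbitrary example)
x = np.array([1.0, 2.0])

beta_prime = beta - beta[-1]          # pivot on the last category, so that beta'_K = 0
print(softmax_probs(beta, x))
print(softmax_probs(beta_prime, x))   # identical probabilities
</syntaxhighlight>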
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}}, \;\;\;\; k \le K
</math>
Other than the prime symbols on the regression coefficients, this is exactly the same as the form of the model described above, in terms of ''K'' − 1 independent two-way regressions against a pivot outcome.
===As a latent-variable model===
It is also possible to formulate multinomial logistic regression as a latent variable model, following the [[Logistic regression#Two-way latent-variable model|two-way latent variable model]] described for binary logistic regression. This formulation is common in the theory of [[discrete choice]] models, and makes it easier to compare multinomial logistic regression to the related [[multinomial probit]] model, as well as to extend it to more complex models.
Imagine that, for each data point ''i'' and possible outcome ''k'' = 1,2,...,''K'', there is a continuous [[latent variable]] ''Y''<sub>''i,k''</sub><sup>''*''</sup> (i.e. an unobserved [[random variable]]) that is distributed as follows:
: <math>
Y_{i,k}^{\ast} = \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k \,,
</math>
where <math>\varepsilon_k \sim \operatorname{EV}_1(0,1),</math> i.e. a standard type-1 [[extreme value distribution]].
This latent variable can be thought of as the [[utility]] associated with data point ''i'' choosing outcome ''k'', where there is some randomness in the actual amount of utility obtained, which accounts for other unmodeled factors that go into the choice. The value of the actual variable <math>Y_i</math> is then determined in a non-random fashion from these latent variables (i.e. the randomness has been moved from the observed outcomes into the latent variables), where outcome ''k'' is chosen [[if and only if]] the associated utility (the value of <math>Y_{i,k}^{\ast}</math>) is greater than the utilities of all the other choices, i.e. if the utility associated with outcome ''k'' is the maximum of all the utilities. Since the latent variables are [[continuous variable|continuous]], the probability of two having exactly the same value is 0, so we ignore the scenario. That is:
: <math>
\begin{align}
\Pr(Y_i = 1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,2}^{\ast} \text{ and } Y_{i,1}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,1}^{\ast} > Y_{i,K}^{\ast}) \\
\Pr(Y_i = 2) &= \Pr(Y_{i,2}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,2}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,2}^{\ast} > Y_{i,K}^{\ast}) \\
& \,\,\,\vdots \\
\Pr(Y_i = K) &= \Pr(Y_{i,K}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,K}^{\ast} > Y_{i,2}^{\ast}\text{ and } \cdots \text{ and } Y_{i,K}^{\ast} > Y_{i,K-1}^{\ast}) \\
\end{align}
</math>
#In general, if <math>X \sim \operatorname{EV}_1(a,b)</math> and <math>Y \sim \operatorname{EV}_1(a,b)</math> then <math>X - Y \sim \operatorname{Logistic}(0,b).</math> That is, the difference of two [[independent identically distributed]] extreme-value-distributed variables follows the [[logistic distribution]], where the first parameter is unimportant. This is understandable since the first parameter is a [[___location parameter]], i.e. it shifts the mean by a fixed amount, and if two values are both shifted by the same amount, their difference remains the same. This means that all of the relational statements underlying the probability of a given choice involve the logistic distribution, which makes the initial choice of the extreme-value distribution, which seemed rather arbitrary, somewhat more understandable.
#The second parameter in an extreme-value or logistic distribution is a [[scale parameter]], such that if <math>X \sim \operatorname{Logistic}(0,1)</math> then <math>bX \sim \operatorname{Logistic}(0,b).</math> This means that the effect of using an error variable with an arbitrary scale parameter in place of scale 1 can be compensated simply by multiplying all regression vectors by the same scale. Together with the previous point, this shows that the use of a standard extreme-value distribution (___location 0, scale 1) for the error variables entails no loss of generality over using an arbitrary extreme-value distribution. In fact, the model is [[nonidentifiable]] (no single set of optimal coefficients) if the more general distribution is used.
#Because only differences of vectors of regression coefficients are used, adding an arbitrary constant to all coefficient vectors has no effect on the model. This means that, just as in the log-linear model, only ''K'' − 1 of the coefficient vectors are identifiable, and the last one can be set to an arbitrary value (e.g. 0).
Actually finding the values of the above probabilities is somewhat difficult, and is a problem of computing a particular [[order statistic]] (the first, i.e. maximum) of a set of values. However, it can be shown that the resulting expressions are the same as in above formulations, i.e. the two are equivalent.
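The equivalence can also be illustrated with a small Monte Carlo check: repeatedly drawing standard type-1 extreme-value (Gumbel) errors, adding them to the scores <math>\boldsymbol\beta_k \cdot \mathbf{X}_i</math>, and picking the outcome with the largest utility reproduces the softmax probabilities. The coefficients below are arbitrary examples:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([[0.2, 1.0], [-0.5, 0.3], [0.1, -0.4]])  # K = 3 coefficient vectors (arbitrary example)
x = np.array([1.0, 2.0])
scores = beta @ x

# Latent-variable formulation: utility = score + standard Gumbel noise; the maximum is chosen.
draws = scores + rng.gumbel(loc=0.0, scale=1.0, size=(100_000, len(scores)))
empirical = np.bincount(draws.argmax(axis=1), minlength=len(scores)) / len(draws)

softmax = np.exp(scores - scores.max())
softmax /= softmax.sum()
print(empirical)  # close to the softmax probabilities printed below
print(softmax)
</syntaxhighlight>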
==Estimation of intercept==
When using multinomial logistic regression, one category of the dependent variable is chosen as the reference category. Separate [[odds ratio]]s are determined for all independent variables for each category of the dependent variable with the exception of the reference category, which is omitted from the analysis. The exponential beta coefficient represents the change in the odds of the dependent variable being in a particular category vis-a-vis the reference category, associated with a one unit change of the corresponding independent variable.
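For instance, if the coefficient of a given independent variable for category ''k'' were 0.4, the exponential beta coefficient would be ''e''<sup>0.4</sup> ≈ 1.49, meaning that a one-unit increase in that variable multiplies the odds of the outcome being in category ''k'' rather than in the reference category by about 1.49, holding the other variables constant.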
== Likelihood function ==
The observed values <math>y_i \in \{1,\dots,K\}</math> for <math>i=1,\dots,n</math> of the explained variables are considered as realizations of stochastically independent, [[Categorical distribution|categorically distributed]] random variables <math>Y_1,\dots, Y_n</math>.
The [[likelihood function]] for this model is defined by
:<math>L = \prod_{i=1}^n P(Y_i=y_i) = \prod_{i=1}^n \prod_{j=1}^K P(Y_i=j)^{\delta_{j,y_i}},</math>
where the index <math>i</math> denotes the observations 1 to ''n'' and the index <math>j</math> denotes the classes 1 to ''K''. <math>\delta_{j,y_i}=\begin{cases}1, \text{ for } j=y_i \\ 0, \text{ otherwise}\end{cases}</math> is the [[Kronecker delta]].
The negative log-likelihood function is therefore the well-known cross-entropy:
:<math>-\log L = - \sum_{i=1}^n \sum_{j=1}^K \delta_{j,y_i} \log(P(Y_i=j))= - \sum_{j=1}^K\sum_{y_i=j}\log(P(Y_i=j)).</math>
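A compact Python/NumPy sketch of this negative log-likelihood, assuming a hypothetical array <code>probs</code> of shape (''n'', ''K'') holding P(Y<sub>''i''</sub> = ''j'') and integer labels <code>y</code> (indexed from 0 rather than 1):

<syntaxhighlight lang="python">
import numpy as np

def negative_log_likelihood(probs, y):
    """Cross-entropy: -sum_i log P(Y_i = y_i), where probs[i, j] = P(Y_i = j)."""
    n = len(y)
    return -np.log(probs[np.arange(n), y]).sum()

# Example with n = 3 observations and K = 2 classes (arbitrary probabilities).
probs = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])
y = np.array([0, 1, 1])
print(negative_log_likelihood(probs, y))
</syntaxhighlight>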
==Application in natural language processing==
In [[natural language processing]], multinomial LR classifiers are commonly used as an alternative to [[naive Bayes classifier]]s because they do not assume [[statistical independence]] of the random variables (commonly known as ''features'') that serve as predictors. However, learning in such a model is slower than for a naive Bayes classifier, and thus may not be appropriate given a very large number of classes to learn. In particular, learning in a naive Bayes classifier is a simple matter of counting up the number of co-occurrences of features and classes, while the weights of a multinomial LR classifier must be learned using an iterative procedure.
==See also==