Multinomial logistic regression

 
====Data points====
Specifically, it is assumed that we have a series of ''N'' observed data points. Each data point ''i'' (ranging from 1 to ''N'') consists of a set of ''M'' explanatory variables ''x''<sub>1,''i''</sub> ... ''x''<sub>''M'',''i''</sub> (also known as [[independent variable]]s, predictor variables, features, etc.), and an associated [[categorical variable|categorical]] outcome ''Y''<sub>''i''</sub> (also known as [[dependent variable]], response variable), which can take on one of ''K'' possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to ''K''. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of ''N'' "experiments" — although an "experiment" may consist of nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so that the outcome of a new "experiment" can be correctly predicted for a new data point for which the explanatory variables, but not the outcome, are available. In the process, the model attempts to explain the relative effect of differing explanatory variables on the outcome.
 
====Linear predictor====
As in other forms of [[linear regression]], multinomial logistic regression uses a linear predictor function <math>f(k,i)</math> to predict the probability that observation ''i'' has outcome ''k'', of the following form:
:<math>f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i,</math>
 
where <math>\boldsymbol\beta_k</math> is the set of regression coefficients associated with outcome ''k'', and <math>\mathbf{x}_i</math> (a row vector) is the set of explanatory variables associated with observation ''i'', prepended by a 1 in entry 0.
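The following is a minimal numerical sketch (not part of the article's derivation; the shapes, names and values are illustrative assumptions) of how the linear predictor <math>f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i</math> can be computed for all outcomes and observations at once, with a leading 1 prepended to each observation so that entry 0 of each coefficient vector acts as an intercept:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative sketch: compute f(k, i) = beta_k . x_i for every outcome k and
# observation i. N, M, K and all values below are made-up assumptions.
N, M, K = 5, 3, 4                     # observations, explanatory variables, outcomes
rng = np.random.default_rng(0)

# Prepend a 1 to each observation so entry 0 multiplies the intercept.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M))])   # shape (N, M+1)
beta = rng.normal(size=(K, M + 1))                           # one row beta_k per outcome

scores = X @ beta.T                   # scores[i, k] = beta_k . x_i, shape (N, K)
</syntaxhighlight>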
 
===As a set of independent binary regressions===
One fairly simple way to arrive at the multinomial logit model is to imagine, for ''K'' possible outcomes, running ''K''&nbsp;−&nbsp;1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and the other ''K''&nbsp;−&nbsp;1 outcomes are separately regressed against the pivot outcome. If outcome ''K'' (the last outcome) is chosen as the pivot, the ''K''&nbsp;−&nbsp;1 regression equations are:
 
: <math>
\ln \frac{\Pr(Y_i=k)}{\Pr(Y_i=K)} = \boldsymbol\beta_k \cdot \mathbf{X}_i, \;\;\;\; 1\leq k < K.
</math>
 
Exponentiating both sides and solving for the probabilities gives:
 
: <math>
\Pr(Y_i=k) = \Pr(Y_i=K)\;e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\; 1\leq k < K.
</math>
 
Using the fact that all ''K'' of the probabilities must sum to one, we find:
 
:<math>
\Pr(Y_i=K) = 1 - \sum_{j=1}^{K-1} \Pr (Y_i = j) = 1 - \sum_{j=1}^{K-1}{\Pr(Y_i=K)}\;e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} \;\;\Rightarrow\;\; \Pr(Y_i=K) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}.
</math>
 
We can use this to find the other probabilities:
 
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\; 1\leq k < K.
</math>
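A short sketch of this pivoted parameterization (illustrative only; the coefficient values and the helper name <code>pivot_probabilities</code> are assumptions, not from the article) shows how the full set of ''K'' probabilities is recovered from the ''K''&nbsp;−&nbsp;1 coefficient vectors:

<syntaxhighlight lang="python">
import numpy as np

def pivot_probabilities(beta, x):
    """beta: shape (K-1, M+1), one row per non-pivot outcome; x: one observation, shape (M+1,)."""
    exp_scores = np.exp(beta @ x)          # e^{beta_k . x} for k = 1 .. K-1
    denom = 1.0 + exp_scores.sum()
    p = np.empty(len(beta) + 1)
    p[:-1] = exp_scores / denom            # Pr(Y = k) for k < K
    p[-1] = 1.0 / denom                    # Pr(Y = K), the pivot outcome
    return p

# Hypothetical example with K = 3 outcomes and 2 explanatory variables plus intercept.
beta = np.array([[0.2, 1.0, -0.5],
                 [-0.3, 0.4, 0.8]])
x = np.array([1.0, 0.5, -1.2])             # leading 1 multiplies the intercept
p = pivot_probabilities(beta, x)
print(p, p.sum())                          # the probabilities sum to 1
</syntaxhighlight>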
 
===As a log-linear model===
The formulation of binary logistic regression as a [[log-linear model]] can be directly extended to multi-way regression. That is, we model the logarithm of the probability of seeing a given output using the linear predictor as well as an additional normalization factor, the logarithm of the partition function:
 
: <math>
\ln \Pr(Y_i=k) = \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z, \;\;\;\; 1\leq k \le K.
</math>
 
As in the binary case, we need an extra term <math>- \ln Z</math> to ensure that the whole set of probabilities forms a [[probability distribution]], i.e. so that they all sum to one:
Exponentiating both sides turns the additive term into a multiplicative factor, giving:
 
: <math>
\Pr(Y_i=k) = \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\; 1\leq k \le K.
</math>
 
The quantity ''Z'' is called the [[partition function (mathematics)|partition function]] for the distribution. We can compute the value of the partition function by applying the above constraint that requires all probabilities to sum to 1:
 
:<math>
1 = \sum_{k=1}^{K} \Pr(Y_i=k) \;=\; \sum_{k=1}^{K} \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;=\; \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.
</math>
 
Therefore:
 
:<math>Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.</math>
 
Note that this factor is "constant" in the sense that it is not a function of ''Y''<sub>''i''</sub>, which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or crucially, with respect to the unknown regression coefficients '''''β'''''<sub>''k''</sub>, which we will need to determine through some sort of [[mathematical optimization|optimization]] procedure.
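A brief sketch (with made-up scores, not taken from the article) of computing the partition function and the normalized probabilities for a single observation; subtracting the maximum score before exponentiating is a common numerical safeguard and does not change the result, since the shared factor cancels:

<syntaxhighlight lang="python">
import numpy as np

scores = np.array([2.0, -1.0, 0.5, 3.5])   # hypothetical values of beta_k . X_i for k = 1..K
shifted = scores - scores.max()            # shift for numerical stability; cancels in the ratio
Z = np.exp(shifted).sum()                  # partition function (up to the cancelled factor)
probs = np.exp(shifted) / Z
print(probs, probs.sum())                  # the probabilities sum to 1
</syntaxhighlight>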
Substituting the expression for ''Z'' back into the probability equation gives:
 
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\; 1\leq k \le K.
</math>
 
Or generally:
 
:<math>\Pr(Y_i=c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}</math>
 
The following function:
 
:<math>\operatorname{softmax}(k, s_1, \ldots, s_K) = \frac{e^{s_k}}{\sum_{j=1}^K e^{s_j}}</math>
 
is referred to as the [[softmax function]]. The reason is that the effect of exponentiating the values <math>s_1,\ldots,s_K</math> is to exaggerate the differences between them. As a result, <math>\operatorname{softmax}(k, s_1, \ldots, s_K)</math> will return a value close to 0 whenever <math>s_k</math> is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a [[weighted average]] that behaves as a [[smooth function]] (which can be conveniently [[differentiation (mathematics)|differentiated]], etc.) and which approximates the [[indicator function]]
 
:<math>f(k) = \begin{cases}
1 \;& \textrm{ if } \; k = \operatorname{arg\,max}_j s_j, \\
0 \;& \textrm{ otherwise}.
\end{cases}
</math>
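A small sketch (hypothetical scores, not from the article) of the softmax function and of how it approaches this arg-max indicator as the gaps between the scores grow:

<syntaxhighlight lang="python">
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))              # shift for numerical stability
    return e / e.sum()

s = np.array([1.0, 2.0, 0.5])
print(softmax(s))                          # smooth weights; the largest score dominates
print(softmax(10 * s))                     # widening the gaps pushes the output toward one-hot
</syntaxhighlight>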
Thus, we can write the probability equations as
 
:<math>\Pr(Y_i=k) = \operatorname{softmax}(k, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i)</math>
 
The softmax function thus serves as the equivalent of the [[logistic function]] in binary logistic regression.
 
Note that not all of the <math>\boldsymbol\beta_k</math> vectors of coefficients are uniquely [[identifiability|identifiable]]. This is due to the fact that all probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only <math>K-1</math> separately specifiable probabilities, and hence <math>K-1</math> separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector to all of the coefficient vectors, the equations are identical:
 
:<math>
\begin{align}
\frac{e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + \mathbf{C}) \cdot \mathbf{X}_i}} &= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\
&= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \\
&= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}
\end{align}
</math>
 
As a result, it is conventional to set <math>\mathbf{C} = -\boldsymbol\beta_K</math> (or alternatively, one of the other coefficient vectors). Essentially, we set the constant so that one of the vectors becomes <math>\boldsymbol 0</math>, and all of the other vectors get transformed into the difference between those vectors and the vector we chose. This is equivalent to "pivoting" around one of the ''K'' choices, and examining how much better or worse all of the other ''K''&nbsp;−&nbsp;1 choices are, relative to the choice we are pivoting around. Mathematically, we transform the coefficients as follows:
 
:<math>
\begin{align}
\boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K, \;\;\;\; 1\leq k < K, \\
\boldsymbol\beta'_K &= 0.
\end{align}
</math>
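The following sketch (random made-up coefficients; not part of the article) illustrates both points numerically: adding a constant vector <math>\mathbf{C}</math> to every coefficient vector leaves the probabilities unchanged, and pivoting by subtracting <math>\boldsymbol\beta_K</math> reproduces the same distribution with the last coefficient vector set to zero:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
K, M = 4, 3
beta = rng.normal(size=(K, M + 1))                 # hypothetical coefficient vectors
x = np.concatenate([[1.0], rng.normal(size=M)])    # one observation with leading 1

def probs(beta, x):
    s = beta @ x
    e = np.exp(s - s.max())
    return e / e.sum()

C = rng.normal(size=M + 1)
print(np.allclose(probs(beta, x), probs(beta + C, x)))     # True: the shift C is absorbed

beta_pivot = beta - beta[-1]                               # beta'_k = beta_k - beta_K
print(beta_pivot[-1])                                      # the last vector is now all zeros
print(np.allclose(probs(beta, x), probs(beta_pivot, x)))   # True: same probabilities
</syntaxhighlight>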
In terms of these transformed coefficients, the probabilities become:
 
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}}, \;\;\;\; 1\leq k \le K.
</math>
 
===As a latent-variable model===
It is also possible to formulate multinomial logistic regression as a latent-variable model, following the two-way [[latent variable]] model described for binary logistic regression. Imagine that, for each data point ''i'' and each possible outcome ''k'', there is a continuous latent variable <math>Y_{i,k}^{\ast}</math> (i.e. an unobserved [[random variable]]) that is distributed as follows:

: <math>Y_{i,k}^{\ast} = \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k, \;\;\;\; 1 \le k \le K,</math>
where <math>\varepsilon_k \sim \operatorname{EV}_1(0,1),</math> i.e. a standard type-1 [[extreme value distribution]].
 
This latent variable can be thought of as the [[utility]] associated with data point ''i'' choosing outcome ''k'', where there is some randomness in the actual amount of utility obtained, which accounts for other unmodeled factors that go into the choice. The value of the actual variable <math>Y_i</math> is then determined in a non-random fashion from these latent variables (i.e. the randomness has been moved from the observed outcomes into the latent variables), where outcome ''k'' is chosen [[if and only if]] the associated utility (the value of <math>Y_{i,k}^{\ast}</math>) is greater than the utilities of all the other choices, i.e. if the utility associated with outcome ''k'' is the maximum of all the utilities. Since the latent variables are [[continuous variable|continuous]], the probability of two having exactly the same value is 0, so we ignore the scenario. That is:
 
: <math>
\begin{align}
\Pr(Y_i = 1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,2}^{\ast} \text{ and } Y_{i,1}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,1}^{\ast} > Y_{i,K}^{\ast}) \\
\Pr(Y_i = 2) &= \Pr(Y_{i,2}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,2}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,2}^{\ast} > Y_{i,K}^{\ast}) \\
& \,\,\,\vdots \\
\Pr(Y_i = K) &= \Pr(Y_{i,K}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,K}^{\ast} > Y_{i,2}^{\ast}\text{ and } \cdots \text{ and } Y_{i,K}^{\ast} > Y_{i,K-1}^{\ast}) \\
\end{align}
</math>
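The latent-variable formulation can be checked by simulation. The following sketch (made-up coefficients; not from the article) draws Gumbel, i.e. type-1 extreme value, noise for each outcome, picks the outcome with the largest utility, and compares the empirical choice frequencies with the softmax probabilities derived earlier:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
K, M = 3, 2
beta = rng.normal(size=(K, M + 1))                  # hypothetical coefficient vectors
x = np.concatenate([[1.0], rng.normal(size=M)])     # one observation with leading 1

scores = beta @ x
softmax_probs = np.exp(scores - scores.max())
softmax_probs /= softmax_probs.sum()

n_draws = 200_000
eps = rng.gumbel(loc=0.0, scale=1.0, size=(n_draws, K))    # EV_1(0, 1) noise per outcome
choices = np.argmax(scores + eps, axis=1)                  # outcome with the largest utility
empirical = np.bincount(choices, minlength=K) / n_draws

print(softmax_probs)
print(empirical)                                           # close to the softmax probabilities
</syntaxhighlight>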
 
== Likelihood function ==
The observed values <math>y_i \in \{1,\dots,K\}</math> for <math>i=1,\dots,n</math> of the dependent variables are considered as realizations of stochastically independent, [[Categorical distribution|categorically distributed]] random variables <math>Y_1,\dots, Y_n</math>.
 
The [[likelihood function]] for this model is defined by:
:<math>L = \prod_{i=1}^n P(Y_i=y_i) = \prod_{i=1}^n \left( \prod_{j=1}^K P(Y_i=j)^{\delta_{j,y_i}} \right) ,</math>
where the index <math>i</math> denotes the observations 1 to ''n'', the index <math>j</math> denotes the classes 1 to ''K'', and <math>\delta_{j,y_i}=\begin{cases}1, & \text{ for } j=y_i, \\ 0, & \text{ otherwise}\end{cases}</math> is the [[Kronecker delta]].
 
The negative log-likelihood function is therefore the well-known cross-entropy:
:<math>-\log L = - \sum_{i=1}^n \sum_{j=1}^K \delta_{j,y_i} \log(P(Y_i=j)) = - \sum_{j=1}^K \sum_{i:\, y_i=j} \log(P(Y_i=j)).</math>
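As a sketch (hypothetical probabilities and labels, not from the article), the negative log-likelihood can be evaluated by picking out, for each observation, the predicted probability of its observed class:

<syntaxhighlight lang="python">
import numpy as np

def neg_log_likelihood(probs, y):
    """probs: array of shape (n, K) with entries P(Y_i = j); y: observed labels in 1..K."""
    return -np.sum(np.log(probs[np.arange(len(y)), y - 1]))   # sum of -log P(Y_i = y_i)

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])    # made-up model probabilities for n = 3, K = 3
y = np.array([1, 2, 3])                # observed classes
print(neg_log_likelihood(probs, y))
</syntaxhighlight>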
 
==Application in natural language processing==