Multinomial logistic regression: Difference between revisions

</math>
 
Or generally:
 
:<math>\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^K e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}</math>
 
The following function:
 
:<math>\operatorname{softmax}(k,s_1,\ldots,s_K) = \frac{e^{s_k}}{\sum_{j=1}^K e^{s_j}}</math>
 
is referred to as the [[softmax function]]. The reason is that the effect of exponentiating the values <math>s_1,\ldots,s_K</math> is to exaggerate the differences between them. As a result, <math>\operatorname{softmax}(k,s_1,\ldots,s_K)</math> will return a value close to 0 whenever <math>s_k</math> is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a [[weighted average]] that behaves as a [[smooth function]] (which can be conveniently [[differentiation (mathematics)|differentiated]], etc.) and which approximates the [[indicator function]]
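This behaviour can be illustrated with a short numerical sketch (plain Python with only the standard library; the scores <math>s_1,\ldots,s_K</math> below are arbitrary illustrative values):

```python
import math

def softmax(k, scores):
    """softmax(k, s_1, ..., s_K) = exp(s_k) / sum_j exp(s_j)."""
    total = sum(math.exp(s) for s in scores)
    return math.exp(scores[k]) / total

scores = [1.0, 2.0, 5.0]  # the third score is clearly the maximum
probs = [softmax(k, scores) for k in range(len(scores))]
# Exponentiation exaggerates the differences between the scores, so the
# result is close to the indicator of the arg max: the probability for
# the maximum score dominates, and the others are pushed toward 0.
```

Here the probabilities sum to 1 and the entry for the largest score is close to 1, matching the description above.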
 
:<math>f(k) = \begin{cases}
1 & \textrm{if } \; k = \operatorname{arg\,max}_j s_j, \\
0 & \textrm{otherwise}.
\end{cases}
</math>
Thus, we can write the probability equations as
 
:<math>\Pr(Y_i=k) = \operatorname{softmax}(k, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i)</math>
 
The softmax function thus serves as the equivalent of the [[logistic function]] in binary logistic regression.
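A minimal sketch of this probability model, using hypothetical coefficient vectors for <math>K = 3</math> classes and a two-dimensional explanatory vector (plain Python, standard library only):

```python
import math

def class_probabilities(betas, x):
    """Pr(Y = k) = softmax(k, beta_1 . x, ..., beta_K . x)."""
    # Linear scores: one dot product beta_k . x per class.
    scores = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas]
    total = sum(math.exp(s) for s in scores)
    return [math.exp(s) / total for s in scores]

# Hypothetical coefficients (not fitted to any data), K = 3 classes:
betas = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
x = [1.0, 2.0]
probs = class_probabilities(betas, x)
```

The softmax maps the <math>K</math> unbounded linear scores to a proper probability distribution, just as the logistic function maps a single score to a probability in the binary case.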
 
Note that not all of the <math>\boldsymbol\beta_k</math> vectors of coefficients are uniquely [[identifiability|identifiable]]. This is because all probabilities must sum to 1, so one of them is completely determined once all the rest are known. As a result, there are only <math>K-1</math> separately specifiable probabilities, and hence <math>K-1</math> separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector to all of the coefficient vectors, the equations are identical:
 
:<math>
\begin{align}
\frac{e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + \mathbf{C}) \cdot \mathbf{X}_i}} &= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\
&= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \\
&= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}
\end{align}
</math>
 
As a result, it is conventional to set <math>\mathbf{C} = -\boldsymbol\beta_K</math> (or alternatively, one of the other coefficient vectors). Essentially, we set the constant so that one of the vectors becomes <math>\boldsymbol 0</math>, and all of the other vectors get transformed into the difference between those vectors and the vector we chose. This is equivalent to "pivoting" around one of the ''K'' choices, and examining how much better or worse all of the other ''K''&nbsp;−&nbsp;1 choices are, relative to the choice we are pivoting around. Mathematically, we transform the coefficients as follows:
 
:<math>
\begin{align}
\boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K, \;\;\;\;1\leq k < K, \\
\boldsymbol\beta'_K &= 0.
\end{align}
</math>