Multinomial logistic regression: Difference between revisions

m As a set of independent binary regressions: Clarify that k = 0 is not included.
m As a log-linear model: Clarify that k = 0 is not a case, and prevent confusion about x symbol.
Line 100:
 
: <math>
\ln \Pr(Y_i=k) = \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z, \;\;\;\; 1\leq k \le K.
</math>
 
Line 110:
 
: <math>
\Pr(Y_i=k) = \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\; 1\leq k \le K.
</math>
 
Line 128:
 
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\; 1\leq k \le K.
</math>
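
For illustration, these probabilities can be computed directly from the linear predictors. The following is a minimal NumPy sketch (not part of the article; the coefficient matrix <code>B</code> and feature vector <code>x_i</code> are made-up illustrative data), where each row of <code>B</code> plays the role of one <math>\boldsymbol\beta_k</math>:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data: K = 3 classes, M = 4 explanatory variables for one observation.
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 4))    # row k is the coefficient vector beta_k
x_i = rng.normal(size=4)       # the feature vector X_i

scores = B @ x_i               # linear predictors beta_k . X_i, one per class
scores -= scores.max()         # shift by a constant for numerical stability; the probabilities are unchanged
probs = np.exp(scores) / np.exp(scores).sum()   # Pr(Y_i = k) for k = 1, ..., K

print(probs, probs.sum())      # K probabilities that sum to 1
</syntaxhighlight>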
 
Line 137:
The following function:
 
:<math>\operatorname{softmax}(k,s_1,\ldots,s_n) = \frac{e^{s_k}}{\sum_{i=1}^n e^{s_i}}</math>
 
is referred to as the [[softmax function]]. The reason is that the effect of exponentiating the values <math>s_1,\ldots,s_n</math> is to exaggerate the differences between them. As a result, <math>\operatorname{softmax}(k,s_1,\ldots,s_n)</math> will return a value close to 0 whenever <math>s_k</math> is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a [[weighted average]] that behaves as a [[smooth function]] (which can be conveniently [[differentiation (mathematics)|differentiated]], etc.) and which approximates the [[indicator function]]
 
:<math>f(k) = \begin{cases}
1 & \textrm{if } \; k = \operatorname{\arg\max}_i s_i, \\
0 & \textrm{otherwise}.
\end{cases}
</math>
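
As a numerical aside (illustrative values only, not from the article), a short sketch shows this behaviour: the softmax output is close to the indicator of the largest value, except when two values are nearly tied:

<syntaxhighlight lang="python">
import numpy as np

def softmax(s):
    """Softmax of a vector of values s_1, ..., s_n."""
    e = np.exp(s - np.max(s))          # shifting by the maximum avoids overflow and does not change the result
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 5.0])))    # ~ [0.02, 0.05, 0.94]: close to the indicator of the arg max
print(softmax(np.array([10., 20., 50.])))    # widening the gaps sharpens it towards [0, 0, 1]
print(softmax(np.array([1.0, 2.0, 2.1])))    # near-tie between the two largest values: no entry is close to 1
</syntaxhighlight>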
Line 153:
The softmax function thus serves as the equivalent of the [[logistic function]] in binary logistic regression.
 
Note that not all of the <math>\beta_k</math> vectors of coefficients are uniquely [[identifiability|identifiable]]. This is because all of the probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only <math>K-1</math> separately specifiable probabilities, and hence <math>K-1</math> separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector <math>\mathbf{C}</math> to all of the coefficient vectors, the equations are identical:
 
:<math>
\frac{e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + \mathbf{C}) \cdot \mathbf{X}_i}}
= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}\, e^{\mathbf{C} \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}\, e^{\mathbf{C} \cdot \mathbf{X}_i}}
= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}
= \Pr(Y_i=k).
</math>
Line 167:
:<math>
\begin{align}
\boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K, \;\;\; 1\leq k < K \\
\boldsymbol\beta'_K &= 0
\end{align}
</math>
Line 175:
 
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}}, \;\;\;\; 1\leq k \le K
</math>
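
As a numerical check of this equivalence, the following sketch (illustrative, with made-up coefficients) compares the full softmax form with the pivoted form in which <math>\boldsymbol\beta'_K = 0</math>; both yield the same probabilities:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
K, M = 4, 3
B = rng.normal(size=(K, M))     # over-parameterized coefficient vectors beta_1, ..., beta_K
x_i = rng.normal(size=M)        # feature vector X_i for one observation

# Full softmax form over all K coefficient vectors.
scores = B @ x_i
p_full = np.exp(scores) / np.exp(scores).sum()

# Pivoted form: subtract beta_K from every beta_k, so beta'_K = 0 and only K - 1 vectors remain.
B_prime = B - B[-1]                         # beta'_k = beta_k - beta_K (last row becomes zero)
scores_prime = B_prime[:-1] @ x_i           # K - 1 linear predictors
denom = 1.0 + np.exp(scores_prime).sum()
p_pivot = np.append(np.exp(scores_prime), 1.0) / denom   # last entry is Pr(Y_i = K)

print(np.allclose(p_full, p_pivot))         # True: both parameterizations give identical probabilities
</syntaxhighlight>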