Multinomial logistic regression: Difference between revisions

m As a set of independent binary regressions: Clarify that k = 0 is not included.
m As a log-linear model: Clarify that k = 0 is not a case, and prevent confusion about x symbol.
Line 100:
 
: <math>
\ln \Pr(Y_i=k) = \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z, \;\;\;\; 1\leq k \le K.
</math>
 
Line 110:
 
: <math>
\Pr(Y_i=k) = \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\; 1\leq k \le K.
</math>
 
Line 128:
 
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\; 1\leq k \le K.
</math>
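
For illustration, these probabilities can be computed directly from the linear predictors. The following is a minimal NumPy sketch (not part of the article; the coefficient matrix <code>B</code> and feature vector <code>x_i</code> are made-up illustrative data), where each row of <code>B</code> plays the role of one <math>\boldsymbol\beta_k</math>:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data: K = 3 classes, M = 4 explanatory variables for one observation.
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 4))    # row k is the coefficient vector beta_k
x_i = rng.normal(size=4)       # the feature vector X_i

scores = B @ x_i               # linear predictors beta_k . X_i, one per class
scores -= scores.max()         # shift by a constant for numerical stability; the probabilities are unchanged
probs = np.exp(scores) / np.exp(scores).sum()   # Pr(Y_i = k) for k = 1, ..., K

print(probs, probs.sum())      # K probabilities that sum to 1
</syntaxhighlight>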
 
Line 137:
The following function:
 
:<math>\operatorname{softmax}(k,s_1,\ldots,s_n) = \frac{e^{s_k}}{\sum_{i=1}^n e^{s_i}}</math>
 
is referred to as the [[softmax function]]. The reason is that the effect of exponentiating the values <math>s_1,\ldots,s_n</math> is to exaggerate the differences between them. As a result, <math>\operatorname{softmax}(k,s_1,\ldots,s_n)</math> will return a value close to 0 whenever <math>s_k</math> is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a [[weighted average]] that behaves as a [[smooth function]] (which can be conveniently [[differentiation (mathematics)|differentiated]], etc.) and which approximates the [[indicator function]]
 
:<math>f(k) = \begin{cases}
1 & \textrm{if } \; k = \operatorname{\arg\max}_i s_i, \\
0 & \textrm{otherwise}.
\end{cases}
</math>
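
As a numerical aside (illustrative values only, not from the article), a short sketch shows this behaviour: the softmax output is close to the indicator of the largest value, except when two values are nearly tied:

<syntaxhighlight lang="python">
import numpy as np

def softmax(s):
    """Softmax of a vector of values s_1, ..., s_n."""
    e = np.exp(s - np.max(s))          # shifting by the maximum avoids overflow and does not change the result
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 5.0])))    # ~ [0.02, 0.05, 0.94]: close to the indicator of the arg max
print(softmax(np.array([10., 20., 50.])))    # widening the gaps sharpens it towards [0, 0, 1]
print(softmax(np.array([1.0, 2.0, 2.1])))    # near-tie between the two largest values: no entry is close to 1
</syntaxhighlight>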
Line 153:
The softmax function thus serves as the equivalent of the [[logistic function]] in binary logistic regression.
 
Note that not all of the <math>\beta_k</math> vectors of coefficients are uniquely [[identifiability|identifiable]]. This is because all of the probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only <math>K-1</math> separately specifiable probabilities, and hence <math>K-1</math> separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector <math>\mathbf{C}</math> to all of the coefficient vectors, the equations are identical:
 
:<math>
\frac{e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + \mathbf{C}) \cdot \mathbf{X}_i}}
= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}\, e^{\mathbf{C} \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}\, e^{\mathbf{C} \cdot \mathbf{X}_i}}
= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}
= \Pr(Y_i=k).
</math>
Line 167:
:<math>
\begin{align}
\boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K, \;\;\; 1\leq k < K \\
\boldsymbol\beta'_K &= 0
\end{align}
</math>
Line 175:
 
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}}, \;\;\;\; 1\leq k \le K
</math>
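
As a numerical check of this equivalence, the following sketch (illustrative, with made-up coefficients) compares the full softmax form with the pivoted form in which <math>\boldsymbol\beta'_K = 0</math>; both yield the same probabilities:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
K, M = 4, 3
B = rng.normal(size=(K, M))     # over-parameterized coefficient vectors beta_1, ..., beta_K
x_i = rng.normal(size=M)        # feature vector X_i for one observation

# Full softmax form over all K coefficient vectors.
scores = B @ x_i
p_full = np.exp(scores) / np.exp(scores).sum()

# Pivoted form: subtract beta_K from every beta_k, so beta'_K = 0 and only K - 1 vectors remain.
B_prime = B - B[-1]                         # beta'_k = beta_k - beta_K (last row becomes zero)
scores_prime = B_prime[:-1] @ x_i           # K - 1 linear predictors
denom = 1.0 + np.exp(scores_prime).sum()
p_pivot = np.append(np.exp(scores_prime), 1.0) / denom   # last entry is Pr(Y_i = K)

print(np.allclose(p_full, p_pivot))         # True: both parameterizations give identical probabilities
</syntaxhighlight>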