 
===As a set of independent binary regressions===
To arrive at the multinomial logit model, one can imagine, for ''K'' possible outcomes, running ''K''-1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and the other ''K''-1 outcomes are separately regressed against the pivot outcome. If outcome ''K'' (the last outcome) is chosen as the pivot, the ''K''-1 regression equations are:
 
: <math>
\ln \frac{\Pr(Y_i=k)}{\Pr(Y_i=K)} \,=\, \boldsymbol\beta_k \cdot \mathbf{X}_i \;\;\;\;,\;\;k < K
</math>
 
This formulation is also known as the [[Compositional_data#Additive_logratio_transform|alr]] transform commonly used in compositional data analysis. Note that we have introduced a separate set of regression coefficients <math>\boldsymbol\beta_k</math> for each possible outcome. If we exponentiate both sides and solve for the probabilities, we get:
 
: <math>
\Pr(Y_i=k) \,=\, {\Pr(Y_i=K)}\;e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;\;\;\;,\;\;k < K
</math>
 
Using the fact that all ''K'' of the probabilities must sum to one, we find:
 
:<math>\Pr(Y_i=K) \,=\, 1- \sum_{k=1}^{K-1} \Pr (Y_i = k) \,=\, 1 - \sum_{k=1}^{K-1}{\Pr(Y_i=K)}\;e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;\;\Rightarrow\;\; \Pr(Y_i=K) \,=\, \frac{1}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}</math>.
 
We can use this to find the other probabilities:
 
:<math>
\Pr(Y_i=k) \,=\, \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \;\;\;\;,\;\;k < K
</math>
 
where the summation runs from <math>1</math> to <math>K-1</math>, or generally:

:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \;\;\;\;,\;\;k \le K
</math>

where <math>\boldsymbol\beta_K</math> is defined to be zero, so that <math>e^{\boldsymbol\beta_K \cdot \mathbf{X}_i} = 1</math> and the case <math>k = K</math> reduces to the expression for <math>\Pr(Y_i=K)</math> above. The fact that we run multiple regressions reveals why the model relies on the assumption of [[independence of irrelevant alternatives]] described above.
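
A short numerical sketch may make the pivot construction concrete. The snippet below (NumPy assumed; the coefficient and covariate values are arbitrary, chosen only for illustration) computes the ''K'' probabilities from ''K''-1 coefficient vectors measured against the pivot outcome ''K'', exactly as in the formulas above.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative setup: M explanatory variables, K outcomes.
K, M = 4, 3
rng = np.random.default_rng(0)
betas = rng.normal(size=(K - 1, M))   # beta_1, ..., beta_{K-1}; the pivot beta_K is implicitly 0
x_i = rng.normal(size=M)              # explanatory variables X_i for one observation

# K-1 log-odds against the pivot outcome K: ln Pr(Y=k)/Pr(Y=K) = beta_k . X_i
log_odds = betas @ x_i

# Invert: Pr(Y=K) = 1 / (1 + sum_k exp(beta_k . X_i)), Pr(Y=k) = Pr(Y=K) * exp(beta_k . X_i)
pr_K = 1.0 / (1.0 + np.exp(log_odds).sum())
probs = np.append(pr_K * np.exp(log_odds), pr_K)

print(probs, probs.sum())  # the K probabilities sum to 1
</syntaxhighlight>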
 
===Estimating the coefficients===
 
The unknown parameters in each vector '''''β'''<sub>k</sub>'' are typically jointly estimated by [[maximum a posteriori]] (MAP) estimation, which is an extension of [[maximum likelihood]] using [[regularization (mathematics)|regularization]] of the weights to prevent pathological solutions (usually a squared regularizing function, which is equivalent to placing a zero-mean [[Gaussian distribution|Gaussian]] [[prior distribution]] on the weights, but other distributions are also possible). The solution is typically found using an iterative procedure such as [[generalized iterative scaling]],<ref>{{Cite journal |title=Generalized iterative scaling for log-linear models |author1=Darroch, J.N. |author2=Ratcliff, D. |name-list-style=amp |journal=The Annals of Mathematical Statistics |volume=43 |issue=5 |pages=1470–1480 |year=1972 |url=http://projecteuclid.org/download/pdf_1/euclid.aoms/1177692379 |doi=10.1214/aoms/1177692379|doi-access=free }}</ref> [[iteratively reweighted least squares]] (IRLS),<ref>{{cite book |first=Christopher M. |last=Bishop |year=2006 |title=Pattern Recognition and Machine Learning |publisher=Springer |pages=206–209}}</ref> by means of [[gradient-based optimization]] algorithms such as [[L-BFGS]],<ref name="malouf"/> or by specialized [[coordinate descent]] algorithms.<ref>{{cite journal |first1=Hsiang-Fu |last1=Yu |first2=Fang-Lan |last2=Huang |first3=Chih-Jen |last3=Lin |year=2011 |title=Dual coordinate descent methods for logistic regression and maximum entropy models |journal=Machine Learning |volume=85 |issue=1–2 |pages=41–75 |url=http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf |doi=10.1007/s10994-010-5221-8|doi-access=free }}</ref>
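
As a rough illustration of the regularized-likelihood idea described above (not the specific procedures cited, such as IRLS or L-BFGS), the following sketch fits the coefficients by plain gradient descent on the L2-penalized negative log-likelihood, the penalty corresponding to a zero-mean Gaussian prior on the weights. The function name, hyperparameter values, and data are made up for illustration.

<syntaxhighlight lang="python">
import numpy as np

def fit_multinomial_logit(X, y, K, lam=0.1, lr=0.1, iters=2000):
    """Minimal sketch: gradient descent on the L2-penalized negative log-likelihood.

    X: (N, M) design matrix; y: (N,) integer outcomes in 0..K-1.
    lam: strength of the squared (Gaussian-prior) penalty on the weights.
    Returns a (K, M) coefficient matrix in the unrestricted softmax parameterization.
    """
    N, M = X.shape
    B = np.zeros((K, M))
    Y = np.eye(K)[y]                                   # one-hot targets, shape (N, K)
    for _ in range(iters):
        logits = X @ B.T                               # beta_k . X_i for every i, k
        logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)              # predicted probabilities
        grad = (P - Y).T @ X / N + lam * B             # gradient of mean NLL plus (lam/2)||B||^2
        B -= lr * grad
    return B

# Toy usage with made-up data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 4, size=200)
B_hat = fit_multinomial_logit(X, y, K=4)
</syntaxhighlight>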
 
===As a log-linear model===
 
: <math>
\ln \Pr(Y_i=k) \,=\, \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z \;\;\;\;,\;\;k \le K
</math>
 
As in the binary case, we need an extra term <math>- \ln Z</math> to ensure that the whole set of probabilities forms a [[probability distribution]], i.e. so that they all sum to one:
:<math>\sum_{k=1}^{K} \Pr(Y_i=k) = 1</math>

Exponentiating both sides, we get:
 
: <math>
\Pr(Y_i=k) \,=\, \frac{1}{Z}\, e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;\;\;\;,\;\;k \le K
</math>
 
The quantity ''Z'' is called the [[partition function (mathematics)|partition function]] for the distribution. We can compute the value of the partition function by applying the above constraint that requires all probabilities to sum to 1:
 
:<math>
1 = \sum_{k=1}^{K} \Pr(Y_i=k) \;=\; \sum_{k=1}^{K} \frac{1}{Z}\, e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;=\; \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}
</math>
 
and therefore

:<math>Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}</math>

We can use this to compute the probabilities:
 
:<math>
\Pr(Y_i=k) \,=\, \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \;\;\;\;,\;\;k \le K
</math>
 
The following function:

:<math>\operatorname{softmax}(k,x_1,\ldots,x_n) = \frac{e^{x_k}}{\sum_{i=1}^n e^{x_i}}</math>

is referred to as the [[softmax function]], so the probabilities can be written compactly as <math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i)</math>.
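
Because the exponentials in the softmax can overflow in floating point, implementations typically subtract the largest score before exponentiating; a common constant cancels between numerator and denominator, so the probabilities are unchanged. A minimal NumPy sketch (the scores are made up for illustration):

<syntaxhighlight lang="python">
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores beta_k . X_i."""
    shifted = scores - np.max(scores)   # subtract a constant; probabilities are unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, -1.0, 0.5, 1000.0])  # naive exp(1000.0) would overflow
print(softmax(scores))                       # well-defined and sums to 1
</syntaxhighlight>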
Because adding the same vector to all of the coefficient vectors <math>\boldsymbol\beta_k</math> leaves these probabilities unchanged, only the differences between coefficient vectors are identifiable. It is conventional to resolve this by choosing one outcome (here the last) as a pivot and setting

:<math>
\begin{align}
\boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K \;\;\;,\;k < K \\
\boldsymbol\beta'_K &= 0
\end{align}
</math>

With these shifted coefficients, the probabilities become:
 
:<math>
\Pr(Y_i=k) \,=\, \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}} \;\;\;\;,\;\;k \le K
</math>

which has exactly the same form as in the independent-binary-regressions formulation above, with <math>\boldsymbol\beta'_K = 0</math> playing the role of the pivot.
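
The identifiability point above is easy to check numerically: shifting every coefficient vector by the same vector (here, subtracting <math>\boldsymbol\beta_K</math> from each) leaves the softmax probabilities untouched. A small sketch with arbitrary illustrative coefficients:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
K, M = 4, 3
B = rng.normal(size=(K, M))     # arbitrary coefficient vectors beta_1, ..., beta_K
x_i = rng.normal(size=M)

def probs(B, x):
    """Softmax probabilities Pr(Y_i = k) for coefficient matrix B."""
    s = B @ x
    e = np.exp(s - s.max())
    return e / e.sum()

B_shifted = B - B[-1]           # beta'_k = beta_k - beta_K, so beta'_K = 0
print(np.allclose(probs(B, x_i), probs(B_shifted, x_i)))  # True: identical probabilities
</syntaxhighlight>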
 
===As a latent-variable model===

It is also possible to formulate multinomial logistic regression as a latent-variable model, in which each of the <math>K</math> possible outcomes of observation <math>i</math> is associated with an unobserved continuous variable <math>Y_{i,k}^{\ast}</math> (a latent "utility"):
 
: <math>
Y_{i,k}^{\ast} \,=\, \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k \;\;\;\;,\;\;k \le K
</math>
 
where the error terms <math>\varepsilon_k</math> are independently and identically distributed according to a standard type-1 [[extreme value distribution]]. The observed outcome is the one with the highest latent utility:
 
: <math>
\Pr(Y_i = k) \;=\; \Pr(\max(Y_{i,1}^{\ast},Y_{i,2}^{\ast},\ldots,Y_{i,K}^{\ast})=Y_{i,k}^{\ast}) \;\;\;\;,\;\;k \le K
</math>
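
A simulation can illustrate how this latent-utility view reproduces the softmax probabilities derived earlier: drawing the <math>\varepsilon_k</math> from a standard Gumbel (type-1 extreme value) distribution and picking the largest utility yields the multinomial logit choice probabilities. The sketch below uses arbitrary coefficients for illustration:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
K, M, draws = 4, 3, 200_000
B = rng.normal(size=(K, M))           # illustrative coefficient vectors
x_i = rng.normal(size=M)
scores = B @ x_i                      # beta_k . X_i for each outcome k

# Closed-form multinomial logit probabilities (softmax of the scores)
e = np.exp(scores - scores.max())
p_softmax = e / e.sum()

# Latent-variable simulation: utility = score + Gumbel noise, choose the maximum
eps = rng.gumbel(size=(draws, K))     # i.i.d. standard type-1 extreme value errors
choices = np.argmax(scores + eps, axis=1)
p_simulated = np.bincount(choices, minlength=K) / draws

print(np.round(p_softmax, 3))
print(np.round(p_simulated, 3))       # matches the softmax probabilities up to Monte Carlo error
</syntaxhighlight>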