 
===As a set of independent binary regressions===
To arrive at the multinomial logit model, one can imagine, for ''K'' possible outcomes, running ''K''-1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and the other ''K''-1 outcomes are separately regressed against the pivot outcome. If outcome ''K'' (the last outcome) is chosen as the pivot, the ''K''-1 regression equations are:
 
: <math>
\ln \frac{\Pr(Y_i=k)}{\Pr(Y_i=K)} \,=\, \boldsymbol\beta_k \cdot \mathbf{X}_i \;\;\;\;,\;\;k < K
</math>
 
This formulation is also known as the [[Compositional_data#Additive_logratio_transform|alr]] transform commonly used in compositional data analysis. Note that we have introduced a separate set of regression coefficients <math>\boldsymbol\beta_k</math> for each possible outcome. If we exponentiate both sides and solve for the probabilities, we get:
 
: <math>
\Pr(Y_i=k) \,=\, {\Pr(Y_i=K)}\;e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;\;\;\;,\;\;k < K
</math>
 
Using the fact that all ''K'' of the probabilities must sum to one, we find:
 
:<math>\Pr(Y_i=K) \,=\, 1- \sum_{k=1}^{K-1} \Pr (Y_i = k) \,=\, 1 - \sum_{k=1}^{K-1}{\Pr(Y_i=K)}\;e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;\;\Rightarrow\;\; \Pr(Y_i=K) \,=\, \frac{1}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}</math>.
 
We can use this to find the other probabilities:
 
:<math>
\Pr(Y_i=k) \,=\, \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \;\;\;\;,\;\;k < K
</math>
 
where the summation runs from <math>1</math> to <math>K-1</math>, or generally:

:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \;\;\;\;,\;\;k \le K
</math>

where <math>\boldsymbol\beta_K</math> is defined to be zero, so that <math>e^{\boldsymbol\beta_K \cdot \mathbf{X}_i} = 1</math> and the case <math>k = K</math> reduces to the expression for <math>\Pr(Y_i=K)</math> above. The fact that we run multiple regressions reveals why the model relies on the assumption of [[independence of irrelevant alternatives]] described above.
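
A short numerical sketch may make the pivot construction concrete. The snippet below (NumPy assumed; the coefficient and covariate values are arbitrary, chosen only for illustration) computes the ''K'' probabilities from ''K''-1 coefficient vectors measured against the pivot outcome ''K'', exactly as in the formulas above.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative setup: M explanatory variables, K outcomes.
K, M = 4, 3
rng = np.random.default_rng(0)
betas = rng.normal(size=(K - 1, M))   # beta_1, ..., beta_{K-1}; the pivot beta_K is implicitly 0
x_i = rng.normal(size=M)              # explanatory variables X_i for one observation

# K-1 log-odds against the pivot outcome K: ln Pr(Y=k)/Pr(Y=K) = beta_k . X_i
log_odds = betas @ x_i

# Invert: Pr(Y=K) = 1 / (1 + sum_k exp(beta_k . X_i)), Pr(Y=k) = Pr(Y=K) * exp(beta_k . X_i)
pr_K = 1.0 / (1.0 + np.exp(log_odds).sum())
probs = np.append(pr_K * np.exp(log_odds), pr_K)

print(probs, probs.sum())  # the K probabilities sum to 1
</syntaxhighlight>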
 
===Estimating the coefficients===
 
The unknown parameters in each vector '''''β'''<sub>k</sub>'' are typically jointly estimated by [[maximum a posteriori]] (MAP) estimation, which is an extension of [[maximum likelihood]] using [[regularization (mathematics)|regularization]] of the weights to prevent pathological solutions (usually a squared regularizing function, which is equivalent to placing a zero-mean [[Gaussian distribution|Gaussian]] [[prior distribution]] on the weights, but other distributions are also possible). The solution is typically found using an iterative procedure such as [[generalized iterative scaling]],<ref>{{Cite journal |title=Generalized iterative scaling for log-linear models |author1=Darroch, J.N. |author2=Ratcliff, D. |name-list-style=amp |journal=The Annals of Mathematical Statistics |volume=43 |issue=5 |pages=1470–1480 |year=1972 |url=http://projecteuclid.org/download/pdf_1/euclid.aoms/1177692379 |doi=10.1214/aoms/1177692379|doi-access=free }}</ref> [[iteratively reweighted least squares]] (IRLS),<ref>{{cite book |first=Christopher M. |last=Bishop |year=2006 |title=Pattern Recognition and Machine Learning |publisher=Springer |pages=206–209}}</ref> by means of [[gradient-based optimization]] algorithms such as [[L-BFGS]],<ref name="malouf"/> or by specialized [[coordinate descent]] algorithms.<ref>{{cite journal |first1=Hsiang-Fu |last1=Yu |first2=Fang-Lan |last2=Huang |first3=Chih-Jen |last3=Lin |year=2011 |title=Dual coordinate descent methods for logistic regression and maximum entropy models |journal=Machine Learning |volume=85 |issue=1–2 |pages=41–75 |url=http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf |doi=10.1007/s10994-010-5221-8|doi-access=free }}</ref>
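
As a rough illustration of the regularized-likelihood idea described above (not the specific procedures cited, such as IRLS or L-BFGS), the following sketch fits the coefficients by plain gradient descent on the L2-penalized negative log-likelihood, the penalty corresponding to a zero-mean Gaussian prior on the weights. The function name, hyperparameter values, and data are made up for illustration.

<syntaxhighlight lang="python">
import numpy as np

def fit_multinomial_logit(X, y, K, lam=0.1, lr=0.1, iters=2000):
    """Minimal sketch: gradient descent on the L2-penalized negative log-likelihood.

    X: (N, M) design matrix; y: (N,) integer outcomes in 0..K-1.
    lam: strength of the squared (Gaussian-prior) penalty on the weights.
    Returns a (K, M) coefficient matrix in the unrestricted softmax parameterization.
    """
    N, M = X.shape
    B = np.zeros((K, M))
    Y = np.eye(K)[y]                                   # one-hot targets, shape (N, K)
    for _ in range(iters):
        logits = X @ B.T                               # beta_k . X_i for every i, k
        logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)              # predicted probabilities
        grad = (P - Y).T @ X / N + lam * B             # gradient of mean NLL plus (lam/2)||B||^2
        B -= lr * grad
    return B

# Toy usage with made-up data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 4, size=200)
B_hat = fit_multinomial_logit(X, y, K=4)
</syntaxhighlight>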
 
===As a log-linear model===
 
: <math>
\ln \Pr(Y_i=k) \,=\, \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z \;\;\;\;,\;\;k \le K
</math>
 
As in the binary case, we need an extra term <math>- \ln Z</math> to ensure that the whole set of probabilities forms a [[probability distribution]], i.e. so that they all sum to one:
:<math>\sum_{k=1}^{K} \Pr(Y_i=k) = 1</math>

Exponentiating both sides, we get:
 
: <math>
\Pr(Y_i=k) \,=\, \frac{1}{Z}\, e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;\;\;\;,\;\;k \le K
</math>
 
The quantity ''Z'' is called the [[partition function (mathematics)|partition function]] for the distribution. We can compute the value of the partition function by applying the above constraint that requires all probabilities to sum to 1:
 
:<math>
1 = \sum_{k=1}^{K} \Pr(Y_i=k) \;=\; \sum_{k=1}^{K} \frac{1}{Z}\, e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;=\; \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}
</math>
 
and therefore

:<math>Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}</math>

We can use this to compute the probabilities:
 
:<math>
\Pr(Y_i=k) \,=\, \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \;\;\;\;,\;\;k \le K
</math>
 
The following function:

:<math>\operatorname{softmax}(k,x_1,\ldots,x_n) = \frac{e^{x_k}}{\sum_{i=1}^n e^{x_i}}</math>

is referred to as the [[softmax function]], so the probabilities can be written compactly as <math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i)</math>.
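
Because the exponentials in the softmax can overflow in floating point, implementations typically subtract the largest score before exponentiating; a common constant cancels between numerator and denominator, so the probabilities are unchanged. A minimal NumPy sketch (the scores are made up for illustration):

<syntaxhighlight lang="python">
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores beta_k . X_i."""
    shifted = scores - np.max(scores)   # subtract a constant; probabilities are unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, -1.0, 0.5, 1000.0])  # naive exp(1000.0) would overflow
print(softmax(scores))                       # well-defined and sums to 1
</syntaxhighlight>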
Because adding the same vector to all of the coefficient vectors <math>\boldsymbol\beta_k</math> leaves these probabilities unchanged, only the differences between coefficient vectors are identifiable. It is conventional to resolve this by choosing one outcome (here the last) as a pivot and setting

:<math>
\begin{align}
\boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K \;\;\;,\;k < K \\
\boldsymbol\beta'_K &= 0
\end{align}
</math>

With these shifted coefficients, the probabilities become:
 
:<math>
\Pr(Y_i=k) \,=\, \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}} \;\;\;\;,\;\;k \le K
</math>

which has exactly the same form as in the independent-binary-regressions formulation above, with <math>\boldsymbol\beta'_K = 0</math> playing the role of the pivot.
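
The identifiability point above is easy to check numerically: shifting every coefficient vector by the same vector (here, subtracting <math>\boldsymbol\beta_K</math> from each) leaves the softmax probabilities untouched. A small sketch with arbitrary illustrative coefficients:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
K, M = 4, 3
B = rng.normal(size=(K, M))     # arbitrary coefficient vectors beta_1, ..., beta_K
x_i = rng.normal(size=M)

def probs(B, x):
    """Softmax probabilities Pr(Y_i = k) for coefficient matrix B."""
    s = B @ x
    e = np.exp(s - s.max())
    return e / e.sum()

B_shifted = B - B[-1]           # beta'_k = beta_k - beta_K, so beta'_K = 0
print(np.allclose(probs(B, x_i), probs(B_shifted, x_i)))  # True: identical probabilities
</syntaxhighlight>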
 
===As a latent-variable model===

It is also possible to formulate multinomial logistic regression as a latent-variable model, in which each of the <math>K</math> possible outcomes of observation <math>i</math> is associated with an unobserved continuous variable <math>Y_{i,k}^{\ast}</math> (a latent "utility"):
 
: <math>
Y_{i,k}^{\ast} \,=\, \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k \;\;\;\;,\;\;k \le K
</math>
 
where the error terms <math>\varepsilon_k</math> are independently and identically distributed according to a standard type-1 [[extreme value distribution]]. The observed outcome is the one with the highest latent utility:
 
: <math>
\Pr(Y_i = k) \;=\; \Pr(\max(Y_{i,1}^{\ast},Y_{i,2}^{\ast},\ldots,Y_{i,K}^{\ast})=Y_{i,k}^{\ast}) \;\;\;\;,\;\;k \le K
</math>
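
A simulation can illustrate how this latent-utility view reproduces the softmax probabilities derived earlier: drawing the <math>\varepsilon_k</math> from a standard Gumbel (type-1 extreme value) distribution and picking the largest utility yields the multinomial logit choice probabilities. The sketch below uses arbitrary coefficients for illustration:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
K, M, draws = 4, 3, 200_000
B = rng.normal(size=(K, M))           # illustrative coefficient vectors
x_i = rng.normal(size=M)
scores = B @ x_i                      # beta_k . X_i for each outcome k

# Closed-form multinomial logit probabilities (softmax of the scores)
e = np.exp(scores - scores.max())
p_softmax = e / e.sum()

# Latent-variable simulation: utility = score + Gumbel noise, choose the maximum
eps = rng.gumbel(size=(draws, K))     # i.i.d. standard type-1 extreme value errors
choices = np.argmax(scores + eps, axis=1)
p_simulated = np.bincount(choices, minlength=K) / draws

print(np.round(p_softmax, 3))
print(np.round(p_simulated, 3))       # matches the softmax probabilities up to Monte Carlo error
</syntaxhighlight>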