As in other forms of [[linear regression]], multinomial logistic regression uses a [[linear predictor function]] <math>f(k,i)</math> to predict the probability that observation ''i'' has outcome ''k'', of the following form:
:<math>f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i,</math>
where <math>\boldsymbol\beta_k</math> is the set of regression coefficients associated with outcome ''k'', and <math>\mathbf{x}_i</math> (a row vector) is the set of explanatory variables associated with observation ''i'', prepended by a 1 in entry 0.
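As an illustration, a minimal Python/NumPy sketch of this linear predictor (the coefficient values and variable names below are hypothetical, chosen only for the example):
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical example: K = 3 outcomes and M = 2 explanatory variables.
# Row k of B holds the coefficients beta_k for outcome k;
# column 0 is the intercept, matching the 1 prepended to x_i.
B = np.array([[0.5, 1.0, -2.0],
              [0.1, 0.3,  0.7],
              [0.0, 0.0,  0.0]])

x_raw = np.array([2.0, -1.0])       # explanatory variables for observation i
x = np.concatenate(([1.0], x_raw))  # prepend a 1 in entry 0 for the intercept

f = B @ x  # f[k] = beta_k . x_i, the linear predictor for outcome k
print(f)
</syntaxhighlight>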
===As a set of independent binary regressions===
One way to arrive at the multinomial logit model is to imagine, for ''K'' possible outcomes, running ''K''−1 independent binary logistic regressions in which one outcome (here the last outcome, ''K'') is chosen as a "pivot" and the other ''K''−1 outcomes are separately regressed against it:
: <math>
\ln \frac{\Pr(Y_i=k)}{\Pr(Y_i=K)} \,=\, \boldsymbol\beta_k \cdot \mathbf{X}_i, \;\;\;\; k < K.
</math>
If we exponentiate both sides and solve for the probabilities, we get:
: <math>
\Pr(Y_i=k) \,=\, {\Pr(Y_i=K)}\;e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\; k < K.
</math>
Using the fact that all ''K'' of the probabilities must sum to one, we find
:<math>
\Pr(Y_i=K) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}.
</math>
We can use this to find the other probabilities:
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\; k < K.
</math>
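These equations can also be checked numerically; the following sketch uses hypothetical coefficients for the ''K''−1 = 2 non-pivot outcomes and recovers all ''K'' probabilities:
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical coefficients for the K-1 = 2 non-pivot outcomes,
# each regressed against the pivot outcome K.
B = np.array([[0.5, 1.0, -2.0],   # beta_1
              [0.1, 0.3,  0.7]])  # beta_2
x = np.array([1.0, 2.0, -1.0])    # observation i, with the leading 1

odds = np.exp(B @ x)                   # e^{beta_k . x_i} for k < K
p_pivot = 1.0 / (1.0 + odds.sum())     # Pr(Y_i = K)
p_other = odds / (1.0 + odds.sum())    # Pr(Y_i = k) for k < K

p = np.append(p_other, p_pivot)
print(p, p.sum())                      # the K probabilities sum to 1
</syntaxhighlight>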
===As a log-linear model===
The formulation of binary logistic regression as a [[log-linear model]] can be directly extended to multi-way regression. That is, we model the logarithm of the probability of seeing a given output using the linear predictor as well as an additional normalization factor:
: <math>
\ln \Pr(Y_i=k) = \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z.
</math>
As in the binary case, we need an extra term <math>- \ln Z</math> to ensure that the whole set of probabilities forms a [[probability distribution]], i.e. so that they all sum to one:
:<math>\sum_{k=1}^{K} \Pr(Y_i=k) = 1.</math>
The reason we add a term to normalize, rather than multiply as is usual, is that we have taken the logarithm of the probabilities. Exponentiating both sides turns the additive term into a multiplicative factor, so that
: <math>
\Pr(Y_i=k) = \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.
</math>
The quantity ''Z'' is called the [[partition function (mathematics)|partition function]] for the distribution. We can compute the value of the partition function by applying the above constraint that requires all probabilities to sum to 1:
:<math>
1 = \sum_{k=1}^{K} \Pr(Y_i=k) \;=\; \sum_{k=1}^{K} \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;=\; \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.
</math>
Therefore
:<math>Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.</math>
Note that this factor is "constant" in the sense that it is not a function of ''Y''<sub>''i''</sub>, which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or crucially, with respect to the unknown regression coefficients '''''β'''''<sub>''k''</sub>, which we will need to determine through some sort of [[mathematical optimization|optimization]] procedure.
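A small numerical sketch of this normalization, with hypothetical scores <math>\boldsymbol\beta_k \cdot \mathbf{X}_i</math> for a single observation:
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical scores f[k] = beta_k . X_i for K = 3 outcomes.
f = np.array([1.5, -0.2, 0.0])

Z = np.exp(f).sum()   # partition function Z = sum_k e^{beta_k . X_i}
p = np.exp(f) / Z     # Pr(Y_i = k) = e^{beta_k . X_i} / Z
print(p, p.sum())     # probabilities; the sum is exactly 1
</syntaxhighlight>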
With this value of ''Z'', the resulting probability equations are
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}.
</math>
The following function:
:<math>\operatorname{softmax}(k, x_1, \ldots, x_n) = \frac{e^{x_k}}{\sum_{i=1}^n e^{x_i}}</math>
is referred to as the [[softmax function]]. The reason is that the effect of exponentiating the values <math>x_1, \ldots, x_n</math> is to exaggerate the differences between them. As a result, <math>\operatorname{softmax}(k, x_1, \ldots, x_n)</math> will return a value close to 0 whenever <math>x_k</math> is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a [[weighted average]] that behaves as a [[smooth function]] (which can be conveniently [[derivative|differentiated]], etc.) and which approximates the [[indicator function]]
:<math>f(k) = \begin{cases}
1 & \text{if } k = \operatorname{arg\,max}(x_1, \ldots, x_n), \\
0 & \text{otherwise}.
\end{cases}
</math>
Thus, we can write the probability equations as
:<math>\Pr(Y_i=c) = \operatorname{softmax}(c,\, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i).</math>
The softmax function thus serves as the equivalent of the [[logistic function]] in binary logistic regression.
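For instance, when <math>K = 2</math> the softmax form reduces directly to the binary case:
:<math>\Pr(Y_i=1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_2 \cdot \mathbf{X}_i}} = \frac{1}{1 + e^{-(\boldsymbol\beta_1 - \boldsymbol\beta_2) \cdot \mathbf{X}_i}},</math>
i.e. the logistic function applied to the difference of the two linear predictors.

A minimal Python sketch of the softmax function itself (the input values are hypothetical; subtracting the maximum is a standard numerical-stability device that leaves the result unchanged, since the softmax is invariant to adding the same constant to every argument):
<syntaxhighlight lang="python">
import numpy as np

def softmax(scores):
    """Softmax probabilities for a vector of scores x_1, ..., x_n."""
    shifted = scores - scores.max()  # subtract the maximum for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical values x_1, x_2, x_3
print(softmax(scores))              # probabilities sum to 1; larger scores get larger weight
</syntaxhighlight>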
Note that not all of the <math>\boldsymbol\beta_k</math> vectors of coefficients are uniquely [[identifiability|identifiable]]. This is because all the probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only <math>K-1</math> separately specifiable probabilities, and hence <math>K-1</math> separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector <math>\mathbf{C}</math> to all of the coefficient vectors, the equations are identical:
:<math>
\begin{align}
\frac{e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + \mathbf{C}) \cdot \mathbf{X}_i}} &= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\
&= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \\
&= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}.
\end{align}
</math>
As a result, it is conventional to set <math>\mathbf{C} = -\boldsymbol\beta_K</math> (or alternatively, the negative of one of the other coefficient vectors). Essentially, we set the constant so that one of the vectors becomes <math>\mathbf{0}</math>, and all of the other vectors get transformed into the difference between those vectors and the vector we chose. This is equivalent to "pivoting" around one of the ''K'' choices, and examining how much better or worse all of the other ''K''−1 choices are, relative to the choice we are pivoting around. Mathematically, we transform the coefficients as follows:
:<math>
\begin{align}
\boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K, \;\;\; k < K, \\
\boldsymbol\beta'_K &= 0.
\end{align}
</math>
This leads to the following equations:
:<math>
\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}}.
</math>
Other than the prime symbols on the regression coefficients, this is exactly the same as the form of the model described above in terms of ''K''−1 independent two-way regressions.
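A short numerical check of the invariance and pivoting described above, using hypothetical coefficients:
<syntaxhighlight lang="python">
import numpy as np

def softmax(scores):
    shifted = scores - scores.max()
    e = np.exp(shifted)
    return e / e.sum()

# Hypothetical coefficient vectors beta_k (one per row) and one observation x.
B = np.array([[0.5, 1.0, -2.0],
              [0.1, 0.3,  0.7],
              [0.4, -0.6, 0.2]])
x = np.array([1.0, 2.0, -1.0])

C = np.array([3.0, -1.0, 0.5])  # arbitrary constant vector added to every beta_k
print(np.allclose(softmax(B @ x), softmax((B + C) @ x)))  # True: probabilities unchanged

B_prime = B - B[-1]             # pivot on the last outcome: beta'_k = beta_k - beta_K
print(np.allclose(softmax(B @ x), softmax(B_prime @ x)))  # True: same model, with beta'_K = 0
</syntaxhighlight>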
===As a latent-variable model===
It is also possible to formulate multinomial logistic regression as a latent-variable model. Imagine that, for each data point ''i'' and possible outcome <math>k = 1, 2, \ldots, K</math>, there is a continuous [[latent variable]] <math>Y_{i,k}^{\ast}</math> (i.e. an unobserved [[random variable]]) that is distributed as follows:
: <math>Y_{i,k}^{\ast} = \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k \,,</math>
where <math>\varepsilon_k \sim \operatorname{EV}_1(0,1),</math> i.e. a standard type-1 [[extreme value distribution]].
This latent variable can be thought of as the [[utility]] associated with data point ''i'' choosing outcome ''k'', where there is some randomness in the actual amount of utility obtained, which accounts for other unmodeled factors that go into the choice. The value of the actual variable <math>Y_i</math> is then determined in a non-random fashion from these latent variables (i.e. the randomness has been moved from the observed outcomes into the latent variables), where outcome ''k'' is chosen [[if and only if]] the associated utility (the value of <math>Y_{i,k}^{\ast}</math>) is greater than the utilities of all the other choices, i.e. if the utility associated with outcome ''k'' is the maximum of all the utilities. Since the latent variables are [[continuous variable|continuous]], the probability of two having exactly the same value is 0, so we ignore the scenario. That is:
: <math>
\begin{align}
\Pr(Y_i = 1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,2}^{\ast} \text{ and } Y_{i,1}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,1}^{\ast} > Y_{i,K}^{\ast}) \\
\Pr(Y_i = 2) &= \Pr(Y_{i,2}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,2}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,2}^{\ast} > Y_{i,K}^{\ast}) \\
& \,\,\,\vdots \\
\Pr(Y_i = K) &= \Pr(Y_{i,K}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,K}^{\ast} > Y_{i,2}^{\ast}\text{ and } \cdots \text{ and } Y_{i,K}^{\ast} > Y_{i,K-1}^{\ast}) \\
\end{align}
</math>
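The equivalence between this random-utility formulation and the softmax probabilities can be checked by simulation; the sketch below uses hypothetical scores and standard [[Gumbel distribution|Gumbel]] (type-1 extreme value) noise:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical deterministic utilities beta_k . X_i for K = 3 outcomes.
f = np.array([1.0, 0.5, -0.3])

# Draw latent utilities Y*_{i,k} = f[k] + eps_k with eps_k ~ Gumbel(0, 1)
# many times, and record which outcome has the maximum utility each time.
n_draws = 200_000
utilities = f + rng.gumbel(loc=0.0, scale=1.0, size=(n_draws, f.size))
chosen = utilities.argmax(axis=1)
empirical = np.bincount(chosen, minlength=f.size) / n_draws

softmax_probs = np.exp(f) / np.exp(f).sum()
print(empirical)      # Monte Carlo choice frequencies
print(softmax_probs)  # agree with the multinomial logit probabilities
</syntaxhighlight>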
== Likelihood function ==
The observed values <math>y_i \in \{1, \ldots, K\}</math> of the explained variables are considered as realizations of stochastically independent, [[categorical distribution|categorically distributed]] random variables <math>Y_1, \ldots, Y_n</math>.
The [[likelihood function]] for this model is defined by
:<math>L = \prod_{i=1}^n P(Y_i=y_i) = \prod_{i=1}^n \prod_{j=1}^K P(Y_i=j)^{\delta_{j,y_i}},</math>
where the index <math>i</math> denotes the observations 1 to ''n'' and the index <math>j</math> denotes the classes 1 to ''K''. <math>\delta_{j,y_i}=\begin{cases}1, \text{ for } j=y_i \\ 0, \text{ otherwise}\end{cases}</math> is the [[Kronecker delta]]. The negative log-likelihood function is therefore the well-known cross-entropy:
:<math>-\log L = - \sum_{i=1}^n \sum_{j=1}^K \delta_{j,y_i} \log(P(Y_i=j)) = - \sum_{j=1}^K \sum_{i:\, y_i=j} \log(P(Y_i=j)).</math>
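A direct transcription of this negative log-likelihood into code, for hypothetical fitted probabilities and observed class labels:
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical fitted probabilities P(Y_i = j): one row per observation i,
# one column per class j (each row sums to 1), and observed labels y_i.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
y = np.array([0, 1, 2])  # observed classes (0-based indices in this example)

# The Kronecker delta picks out the probability of the observed class in each row.
neg_log_L = -np.log(P[np.arange(len(y)), y]).sum()
print(neg_log_L)
</syntaxhighlight>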
==Application in natural language processing==