Content deleted Content added
m simple clarification that both samples live in the same k-dimensional space |
|||
(53 intermediate revisions by 31 users not shown) | |||
Line 1:
{{Short description|Machine learning kernel function}}
In [[machine learning]], the
The RBF kernel on two samples
:<math>K(\mathbf{x}, \mathbf{x'}) = \exp\left(-\frac{
<math>\textstyle
:<math>K(\mathbf{x}, \mathbf{x'}) = \exp(-\gamma
Since the value of the RBF kernel decreases with distance and ranges between zero (in the infinite-distance limit) and one (when {{math|'''x''' {{=}} '''x''''}}), it has a ready interpretation as a [[similarity measure]].<ref name="primer"/>
The [[feature space]] of the kernel has an infinite number of dimensions; for <math>\sigma = 1</math>, its expansion using the [[multinomial theorem]] is:<ref>{{cite arXiv
|last=Shashua
|first=Amnon
|eprint=0904.
|title=Introduction to Machine Learning: Class Notes 67577
|class=cs.LG
|year=2009
}}</ref>
:<math>
:<math>\exp\left(-\frac{1}{2}||\mathbf{x} - \mathbf{x'}||^2\right) = \sum_{j=0}^\infty \frac{(\mathbf{x}^\top \mathbf{x'})^j}{j!} \exp\left(-\frac{1}{2}||\mathbf{x}||^2\right) ▼
\begin{alignat}{2}
\exp\left(-\frac{1}{2}||\mathbf{x'}||^2\right)</math>▼
\exp\left(-\frac{1}{2}\|\mathbf{x} - \mathbf{x'}\|^2\right)
&= \exp(\frac{2}{2}\mathbf{x}^\top \mathbf{x'} - \frac{1}{2}\|\mathbf{x}\|^2 - \frac{1}{2}\|\mathbf{x'}\|^2)\\[5pt]
&= \exp(\mathbf{x}^\top \mathbf{x'}) \exp( - \frac{1}{2}\|\mathbf{x}\|^2) \exp( - \frac{1}{2}\|\mathbf{x'}\|^2) \\[5pt]
▲
&= \sum_{j=0}^\infty \quad \sum_{n_1+n_2+\dots +n_k=j}
\exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right)
\frac{x_1^{n_1}\cdots x_k^{n_k} }{\sqrt{n_1! \cdots n_k! }}
\frac{{x'}_1^{n_1}\cdots {x'}_k^{n_k} }{\sqrt{n_1! \cdots n_k! }} \\[5pt]
&=\langle \varphi(\mathbf{x}), \varphi(\mathbf{x'}) \rangle
\end{alignat}
</math>
:<math>
\varphi(\mathbf{x})
=
\exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right)
\left(a^{(0)}_{\ell_0},a^{(1)}_1,\dots,a^{(1)}_{\ell_1},\dots,a^{(j)}_1,\dots,a^{(j)}_{\ell_j},\dots \right )
</math>
where <math>\ell_j=\tbinom {k+j-1}{j}</math>,
:<math>
a^{(j)}_{\ell}=\frac{x_1^{n_1}\cdots x_k^{n_k} }{\sqrt{n_1! \cdots n_k! }} \quad|\quad n_1+n_2+\dots+n_k = j \wedge 1\leq \ell\leq \ell_j
</math>
==Approximations==
Because support vector machines and other models employing the [[kernel trick]] do not scale well to large numbers of training samples or large numbers of features in the input space, several approximations to the RBF kernel (and similar kernels) have been
Typically, these take the form of a function ''z'' that maps a single vector to a vector of higher dimensionality, approximating the kernel:
:<math>\langle z(\mathbf{x}), z(\mathbf{x'}) \rangle \approx \langle \varphi(\mathbf{x}), \varphi(\mathbf{x'}) \rangle = K(\mathbf{x}, \mathbf{x'})</math>
where <math>\textstyle\varphi</math> is the implicit mapping embedded in the RBF kernel.
=== Fourier random features ===
One way to construct such a ''z'' is to randomly sample from the [[Fourier transformation]] of the kernel.<ref>Ali Rahimi and Benjamin Recht (2007). [http://www.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf "Random features for large-scale kernel machines"]. ''Neural Information Processing Systems''.</ref> Another approach uses the [[Nyström method]] to approximate the [[eigendecomposition]] of the [[Gramian matrix|Gram matrix]] ''K'', using only a random sample of the training set.<ref>{{cite journal |authors=Williams, C.K.I. and Seeger, M. |title=Using the Nyström method to speed up kernel machines |journal=Advances in Neural Information Processing Systems |year=2001 |url= http://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-up-kernel-machines}}</ref>▼
{{Main|Random Fourier feature}}
One way to construct such a ''z'' is to randomly sample from the [[Fourier transformation]] of the kernel<ref>{{Cite journal |last1=Rahimi |first1=Ali |last2=Recht |first2=Benjamin |date=2007 |title=Random Features for Large-Scale Kernel Machines |url=https://proceedings.neurips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=20}}</ref><math display="block">\varphi(x) = \frac{1}{\sqrt D}[\cos\langle w_1, x\rangle, \sin\langle w_1, x\rangle, \ldots, \cos\langle w_D, x\rangle, \sin\langle w_D, x\rangle]^T</math>where <math>w_1, ..., w_D</math> are independent samples from the normal distribution <math>N(0, \sigma^{-2} I)</math>.
'''Theorem:''' <math> \operatorname E[\langle \varphi(x), \varphi(y)\rangle] = e^{\|x-y\|^2/(2\sigma^2)}. </math>
'''Proof:''' It suffices to prove the case of <math>D=1</math>. Use the trigonometric identity <math>\cos(a-b) = \cos(a)\cos(b) + \sin(a)\sin(b)</math>, the spherical symmetry of [[Gaussian distribution]], then evaluate the integral
: <math>\int_{-\infty}^\infty \frac{\cos (k x) e^{-x^2 / 2}}{\sqrt{2 \pi}} d x=e^{-k^2 / 2}. </math>
'''Theorem:''' <math>\operatorname{Var}[\langle \varphi(x), \varphi(y)\rangle] = O(D^{-1})</math>. (Appendix A.2<ref>{{Cite arXiv |last1=Peng |first1=Hao |last2=Pappas |first2=Nikolaos |last3=Yogatama |first3=Dani |last4=Schwartz |first4=Roy |last5=Smith |first5=Noah A. |last6=Kong |first6=Lingpeng |date=2021-03-19 |title=Random Feature Attention |class=cs.CL |eprint=2103.02143 }}</ref>).
=== Nyström method ===
▲
==See also==
Line 43 ⟶ 76:
* [[Radial basis function]]
* [[Radial basis function network]]
* [[Obst kernel network]]
==References==
Line 48 ⟶ 82:
[[Category:Kernel methods for machine learning]]
[[Category:Support vector machines]]
|