{{Short description|Machine learning kernel function}}
In [[machine learning]], the '''[[radial basis function]] kernel''', or '''RBF kernel''', is a popular [[Positive-definite kernel|kernel function]] used in various [[kernel method|kernelized]] learning algorithms. In particular, it is commonly used in [[support vector machine]] [[statistical classification|classification]].<ref name="Chang2010">{{cite journal | last1 = Chang | first1 = Yin-Wen | last2 = Hsieh | first2 = Cho-Jui | last3 = Chang | first3 = Kai-Wei | last4 = Ringgaard | first4 = Michael | last5 = Lin | first5 = Chih-Jen | year = 2010 | title = Training and testing low-degree polynomial data mappings via linear SVM | journal = Journal of Machine Learning Research | volume = 11 | pages = 1471–1490 }}</ref>
The RBF kernel on two samples <math>\mathbf{x},\mathbf{x'}\in \mathbb{R}^{k}</math>, represented as feature vectors in some ''input space'', is defined as
:<math>K(\mathbf{x}, \mathbf{x'}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x'}\|^2}{2\sigma^2}\right)</math>
where <math>\|\mathbf{x} - \mathbf{x'}\|^2</math> may be recognized as the squared [[Euclidean distance]] between the two feature vectors, and <math>\sigma</math> is a free parameter. An equivalent definition involves a parameter <math>\textstyle\gamma = \tfrac{1}{2\sigma^2}</math>:
:<math>K(\mathbf{x}, \mathbf{x'}) = \exp(-\gamma\|\mathbf{x} - \mathbf{x'}\|^2)</math>
Since the value of the RBF kernel decreases with distance and ranges between zero (in the infinite-distance limit) and one (when {{math|'''x''' {{=}} '''x''''}}), it has a ready interpretation as a [[similarity measure]].<ref name="primer"/>
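For illustration, a minimal numerical sketch (assuming NumPy; the vectors and the value of <math>\sigma</math> below are arbitrary) evaluates both parametrizations and shows that they agree when <math>\gamma = \tfrac{1}{2\sigma^2}</math>:
<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def rbf_kernel_gamma(x, y, gamma=0.5):
    """Equivalent parametrization K(x, y) = exp(-gamma ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 1.0])
sigma = 1.5

print(rbf_kernel(x, y, sigma))                            # value in (0, 1]
print(rbf_kernel_gamma(x, y, gamma=1 / (2 * sigma ** 2)))  # same value
</syntaxhighlight>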
The [[feature space]] of the kernel has an infinite number of dimensions; for <math>\sigma = 1</math>, its expansion using the [[multinomial theorem]] is:<ref>{{cite arXiv
|last=Shashua
|first=Amnon
|title=Introduction to Machine Learning: Class Notes 67577
|year=2009
|eprint=0904.3664
|class=cs.LG
}}</ref>
:<math>
\begin{alignat}{2}
\exp\left(-\frac{1}{2}\|\mathbf{x} - \mathbf{x'}\|^2\right)
&= \exp\left(\mathbf{x}^\top \mathbf{x'} - \frac{1}{2}\|\mathbf{x}\|^2 - \frac{1}{2}\|\mathbf{x'}\|^2\right)\\[5pt]
&= \exp\left(\mathbf{x}^\top \mathbf{x'}\right) \exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right) \exp\left(-\frac{1}{2}\|\mathbf{x'}\|^2\right) \\[5pt]
&= \sum_{j=0}^\infty \frac{(\mathbf{x}^\top \mathbf{x'})^j}{j!} \exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right) \exp\left(-\frac{1}{2}\|\mathbf{x'}\|^2\right)\\[5pt]
&= \sum_{j=0}^\infty \quad \sum_{n_1+n_2+\dots+n_k = j}
\exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right)
\frac{x_1^{n_1}\cdots x_k^{n_k} }{\sqrt{n_1! \cdots n_k! }}
\exp\left(-\frac{1}{2}\|\mathbf{x'}\|^2\right)
\frac{{x'}_1^{n_1}\cdots {x'}_k^{n_k} }{\sqrt{n_1! \cdots n_k! }} \\[5pt]
&= \langle \varphi(\mathbf{x}), \varphi(\mathbf{x'}) \rangle
\end{alignat}
</math>
where
:<math>
\varphi(\mathbf{x})
=
\exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right)
\left(a^{(0)}_{\ell_0},a^{(1)}_1,\dots,a^{(1)}_{\ell_1},\dots,a^{(j)}_1,\dots,a^{(j)}_{\ell_j},\dots \right )
</math>
where <math>\ell_j=\tbinom {k+j-1}{j}</math>,
:<math>
a^{(j)}_{\ell}=\frac{x_1^{n_1}\cdots x_k^{n_k} }{\sqrt{n_1! \cdots n_k! }} \quad|\quad n_1+n_2+\dots+n_k = j \wedge 1\leq \ell\leq \ell_j
</math>
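The convergence of this expansion can be checked numerically. The following sketch (illustrative only, with arbitrary vectors and truncation depths) compares partial sums of the series against the exact kernel value for <math>\sigma = 1</math>:
<syntaxhighlight lang="python">
import numpy as np
from math import factorial

x  = np.array([0.3, -0.2, 0.5])
xp = np.array([0.1, 0.4, -0.6])

# Exact RBF kernel value for sigma = 1
exact = np.exp(-0.5 * np.sum((x - xp) ** 2))

# Truncate the series sum_j (x . x')^j / j! times the two Gaussian factors
dot = x @ xp
prefactor = np.exp(-0.5 * x @ x) * np.exp(-0.5 * xp @ xp)
for J in (1, 3, 6, 10):
    partial = prefactor * sum(dot ** j / factorial(j) for j in range(J + 1))
    print(J, partial, exact)   # partial sums converge to the exact value
</syntaxhighlight>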
==Approximations==
Because support vector machines and other models employing the [[kernel trick]] do not scale well to large numbers of training samples or large numbers of features in the input space, several approximations to the RBF kernel (and similar kernels) have been introduced.<ref>Andreas Müller (2012). Kernel approximations for efficient SVMs (and other feature extraction methods).</ref>
Typically, these take the form of a function ''z'' that maps a single vector to a vector of higher dimensionality, approximating the kernel:
:<math>\langle z(\mathbf{x}), z(\mathbf{x'})\rangle \approx \langle \varphi(\mathbf{x}), \varphi(\mathbf{x'})\rangle = K(\mathbf{x}, \mathbf{x'}),</math>
where <math>\textstyle\varphi</math> is the implicit mapping embedded in the RBF kernel.
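As a concrete illustration of such a map ''z'', the sketch below uses scikit-learn's <code>RBFSampler</code> (the data and parameter values are arbitrary); inner products of the transformed vectors approximate <math>\exp(-\gamma\|\mathbf{x} - \mathbf{x'}\|^2)</math>:
<syntaxhighlight lang="python">
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(50, 5)

gamma = 0.5
z = RBFSampler(gamma=gamma, n_components=2000, random_state=0)
Z = z.fit_transform(X)            # explicit (approximate) feature map z(x)

K_exact  = rbf_kernel(X, X, gamma=gamma)
K_approx = Z @ Z.T                # <z(x), z(x')> ≈ K(x, x')
print(np.max(np.abs(K_exact - K_approx)))   # shrinks as n_components grows
</syntaxhighlight>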
=== Fourier random features ===
{{Main|Random Fourier feature}}
One way to construct such a ''z'' is to randomly sample from the [[Fourier transformation]] of the kernel<ref>{{Cite journal |last1=Rahimi |first1=Ali |last2=Recht |first2=Benjamin |date=2007 |title=Random Features for Large-Scale Kernel Machines |url=https://proceedings.neurips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=20}}</ref><math display="block">\varphi(x) = \frac{1}{\sqrt D}[\cos\langle w_1, x\rangle, \sin\langle w_1, x\rangle, \ldots, \cos\langle w_D, x\rangle, \sin\langle w_D, x\rangle]^T</math>where <math>w_1, ..., w_D</math> are independent samples from the normal distribution <math>N(0, \sigma^{-2} I)</math>.
'''Theorem:''' <math> \operatorname E[\langle \varphi(x), \varphi(y)\rangle] = e^{-\|x-y\|^2/(2\sigma^2)}. </math>
'''Proof:''' It suffices to prove the case of <math>D=1</math>. Use the trigonometric identity <math>\cos(a-b) = \cos(a)\cos(b) + \sin(a)\sin(b)</math>, the spherical symmetry of the [[Gaussian distribution]], then evaluate the integral
: <math>\int_{-\infty}^\infty \frac{\cos (k x) e^{-x^2 / 2}}{\sqrt{2 \pi}} d x=e^{-k^2 / 2}. </math>
Applying this with <math>k = \|x-y\|/\sigma</math>, since <math>\langle w_1, x-y\rangle</math> is distributed as <math>\|x-y\|/\sigma</math> times a standard normal variable, gives the claimed expectation.
'''Theorem:''' <math>\operatorname{Var}[\langle \varphi(x), \varphi(y)\rangle] = O(D^{-1})</math>. (Appendix A.2<ref>{{Cite arXiv |last1=Peng |first1=Hao |last2=Pappas |first2=Nikolaos |last3=Yogatama |first3=Dani |last4=Schwartz |first4=Roy |last5=Smith |first5=Noah A. |last6=Kong |first6=Lingpeng |date=2021-03-19 |title=Random Feature Attention |class=cs.CL |eprint=2103.02143 }}</ref>).
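A from-scratch sketch of this construction (assuming NumPy; the dimensions, <math>D</math>, and <math>\sigma</math> below are arbitrary) draws the <math>w_i</math> from <math>N(0, \sigma^{-2} I)</math>, forms the cosine and sine features, and compares the resulting inner product with the exact kernel:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
k, D, sigma = 8, 5000, 1.0

def phi(x, W):
    """Random Fourier feature map: (1/sqrt(D)) [cos<w_i, x>, sin<w_i, x>]_i."""
    proj = W @ x                                   # shape (D,)
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(len(W))

W = rng.normal(scale=1.0 / sigma, size=(D, k))     # rows w_i ~ N(0, sigma^-2 I)

x = rng.normal(size=k)
y = rng.normal(size=k)

approx = phi(x, W) @ phi(y, W)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
print(approx, exact)   # close for large D; error variance is O(1/D)
</syntaxhighlight>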
=== Nyström method ===
Another approach uses the [[Nyström method]] to approximate the [[eigendecomposition]] of the [[Gramian matrix|Gram matrix]] ''K'', using only a random sample of the training set.<ref>{{cite journal |authors=C.K.I. Williams and M. Seeger |title=Using the Nyström method to speed up kernel machines |journal=Advances in Neural Information Processing Systems |year=2001 |url=http://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-up-kernel-machines}}</ref>
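A minimal sketch of this construction (assuming NumPy; the data, <math>\gamma</math>, and the number of landmark points are arbitrary) samples ''m'' landmarks, eigendecomposes the small <math>m \times m</math> Gram matrix, and uses it to build an explicit approximate feature map:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
gamma, m = 0.5, 40

def rbf(A, B):
    """Pairwise RBF kernel matrix exp(-gamma ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

idx = rng.choice(len(X), size=m, replace=False)    # random landmark sample
L = X[idx]

K_mm = rbf(L, L)                                   # small m x m Gram matrix
vals, vecs = np.linalg.eigh(K_mm)
vals = np.maximum(vals, 1e-12)                     # guard against tiny negative eigenvalues
K_mm_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

Z = rbf(X, L) @ K_mm_inv_sqrt                      # z(x); <z(x), z(x')> ≈ K(x, x')
print(np.abs(Z @ Z.T - rbf(X, X)).max())           # approximation error; decreases as m grows
</syntaxhighlight>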
==See also==
* [[Radial basis function]]
* [[Radial basis function network]]
==References==
{{Reflist}}