Radial basis function kernel: Difference between revisions

Revision as of 04:58, 6 October 2013 by Enerjiparki (73 edits). No edit summary.
Latest revision as of 14:01, 8 August 2025 by Roberto Camilo Ortiz (1 edit). Minor edit: simple clarification that both samples live in the same k-dimensional space. Tag: Visual edit.
(73 intermediate revisions by 42 users not shown)
{{Short description|Machine learning kernel function}}
In [[machine learning]], the '''[[radial basis function]] kernel''', or '''RBF kernel''', is a popular [[Positive-definite kernel|kernel function]] used in various [[kernel method|kernelized]] learning algorithms. In particular, it is commonly used in [[support vector machine]] [[statistical classification|classification]].<ref name="Chang2010">{{cite journal | last1 = Chang | first1 = Yin-Wen | last2 = Hsieh | first2 = Cho-Jui | last3 = Chang | first3 = Kai-Wei | last4 = Ringgaard | first4 = Michael | last5 = Lin | first5 = Chih-Jen | year = 2010 | title = Training and testing low-degree polynomial data mappings via linear SVM | url = https://jmlr.org/papers/v11/chang10a.html | journal = Journal of Machine Learning Research | volume = 11 | pages = 1471–1490 }}</ref>

The RBF kernel on two samples <math>\mathbf{x},\mathbf{x'}\in \mathbb{R}^{k}</math>, represented as feature vectors in some ''input space'', is defined as<ref name="primer">Jean-Philippe Vert, Koji Tsuda, and Bernhard Schölkopf (2004). [https://cbio.ensmp.fr/~jvert/publi/04kmcbbook/kernelprimer.pdf "A primer on kernel methods".] ''Kernel Methods in Computational Biology''.</ref>

:<math>K(\mathbf{x}, \mathbf{x'}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x'}\|^2}{2\sigma^2}\right)</math>

<math>\textstyle\|\mathbf{x} - \mathbf{x'}\|^2</math> may be recognized as the [[Euclidean distance#Squared Euclidean distance|squared Euclidean distance]] between the two feature vectors. <math>\sigma</math> is a free parameter. An equivalent definition involves a parameter <math>\textstyle\gamma = \tfrac{1}{2\sigma^2}</math>:

:<math>K(\mathbf{x}, \mathbf{x'}) = \exp(-\gamma\|\mathbf{x} - \mathbf{x'}\|^2)</math>

Since the value of the RBF kernel decreases with distance and ranges between zero (in the infinite-distance limit) and one (when {{math|'''x''' {{=}} '''x''''}}), it has a ready interpretation as a [[similarity measure]].<ref name="primer"/>

The [[feature space]] of the kernel has an infinite number of dimensions; for <math>\sigma = 1</math>, its expansion using the [[multinomial theorem]] is:<ref>{{cite arXiv |last=Shashua |first=Amnon |eprint=0904.3664v1 |title=Introduction to Machine Learning: Class Notes 67577 |class=cs.LG |year=2009 }}</ref>

:<math>
\begin{alignat}{2}
\exp\left(-\frac{1}{2}\|\mathbf{x} - \mathbf{x'}\|^2\right)
&= \exp\left(\frac{2}{2}\mathbf{x}^\top \mathbf{x'} - \frac{1}{2}\|\mathbf{x}\|^2 - \frac{1}{2}\|\mathbf{x'}\|^2\right)\\[5pt]
&= \exp(\mathbf{x}^\top \mathbf{x'}) \exp\left( - \frac{1}{2}\|\mathbf{x}\|^2\right) \exp\left( - \frac{1}{2}\|\mathbf{x'}\|^2\right) \\[5pt]
&= \sum_{j=0}^\infty \frac{(\mathbf{x}^\top \mathbf{x'})^j}{j!} \exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right) \exp\left(-\frac{1}{2}\|\mathbf{x'}\|^2\right) \\[5pt]
&= \sum_{j=0}^\infty \quad \sum_{n_1+n_2+\dots +n_k=j}
\exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right)
\frac{x_1^{n_1}\cdots x_k^{n_k} }{\sqrt{n_1! \cdots n_k! }}
\exp\left(-\frac{1}{2}\|\mathbf{x'}\|^2\right)
\frac{{x'}_1^{n_1}\cdots {x'}_k^{n_k} }{\sqrt{n_1! \cdots n_k! }} \\[5pt]
&=\langle \varphi(\mathbf{x}), \varphi(\mathbf{x'}) \rangle
\end{alignat}
</math>

:<math>
\varphi(\mathbf{x})
=
\exp\left(-\frac{1}{2}\|\mathbf{x}\|^2\right)
\left(a^{(0)}_{\ell_0},a^{(1)}_1,\dots,a^{(1)}_{\ell_1},\dots,a^{(j)}_1,\dots,a^{(j)}_{\ell_j},\dots \right )
</math>

where <math>\ell_j=\tbinom {k+j-1}{j}</math>,

:<math>
a^{(j)}_{\ell}=\frac{x_1^{n_1}\cdots x_k^{n_k} }{\sqrt{n_1! \cdots n_k! }} \quad|\quad n_1+n_2+\dots+n_k = j \wedge 1\leq \ell\leq \ell_j
</math>

==Approximations==
Because support vector machines and other models employing the [[kernel trick]] do not scale well to large numbers of training samples or large numbers of features in the input space, several approximations to the RBF kernel (and similar kernels) have been introduced.<ref>Andreas Müller (2012). [https://peekaboo-vision.blogspot.de/2012/12/kernel-approximations-for-efficient.html Kernel Approximations for Efficient SVMs (and other feature extraction methods)].</ref>
Typically, these take the form of a function ''z'' that maps a single vector to a vector of higher dimensionality, approximating the kernel:

:<math>\langle z(\mathbf{x}), z(\mathbf{x'}) \rangle \approx \langle \varphi(\mathbf{x}), \varphi(\mathbf{x'}) \rangle = K(\mathbf{x}, \mathbf{x'})</math>

where <math>\textstyle\varphi</math> is the implicit mapping embedded in the RBF kernel.

=== Fourier random features ===
{{Main|Random Fourier feature}}
One way to construct such a ''z'' is to randomly sample from the [[Fourier transformation]] of the kernel<ref>{{Cite journal |last1=Rahimi |first1=Ali |last2=Recht |first2=Benjamin |date=2007 |title=Random Features for Large-Scale Kernel Machines |url=https://proceedings.neurips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=20}}</ref><math display="block">\varphi(x) = \frac{1}{\sqrt D}[\cos\langle w_1, x\rangle, \sin\langle w_1, x\rangle, \ldots, \cos\langle w_D, x\rangle, \sin\langle w_D, x\rangle]^T</math>where <math>w_1, ..., w_D</math> are independent samples from the normal distribution <math>N(0, \sigma^{-2} I)</math>.

'''Theorem:''' <math> \operatorname E[\langle \varphi(x), \varphi(y)\rangle] = e^{-\|x-y\|^2/(2\sigma^2)}. </math>

'''Proof:''' It suffices to prove the case of <math>D=1</math>. Use the trigonometric identity <math>\cos(a-b) = \cos(a)\cos(b) + \sin(a)\sin(b)</math> and the spherical symmetry of the [[Gaussian distribution]], then evaluate the integral

: <math>\int_{-\infty}^\infty \frac{\cos (k x) e^{-x^2 / 2}}{\sqrt{2 \pi}} d x=e^{-k^2 / 2}. </math>

'''Theorem:''' <math>\operatorname{Var}[\langle \varphi(x), \varphi(y)\rangle] = O(D^{-1})</math>. (Appendix A.2<ref>{{Cite arXiv |last1=Peng |first1=Hao |last2=Pappas |first2=Nikolaos |last3=Yogatama |first3=Dani |last4=Schwartz |first4=Roy |last5=Smith |first5=Noah A. |last6=Kong |first6=Lingpeng |date=2021-03-19 |title=Random Feature Attention |class=cs.CL |eprint=2103.02143 }}</ref>).

=== Nyström method ===
Another approach uses the [[Nyström method]] to approximate the [[eigendecomposition]] of the [[Gramian matrix|Gram matrix]] ''K'', using only a random sample of the training set.<ref>{{cite journal |author1=C.K.I. Williams |author2=M. Seeger |title=Using the Nyström method to speed up kernel machines |journal=Advances in Neural Information Processing Systems |year=2001 |volume=13 |url= https://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-up-kernel-machines}}</ref>

==See also==
* [[Gaussian function]]
* [[Kernel (statistics)]]
* [[Polynomial kernel]]
* [[Radial basis function]]
* [[Radial basis function network]]
* [[Obst kernel network]]

==References==
{{reflist|30em}}

[[Category:Kernel methods for machine learning]]
[[Category:Support vector machines]]
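
The kernel definition and the Fourier random feature construction given in the article can be illustrated with a minimal sketch in plain Python. The function names (<code>rbf_kernel</code>, <code>sample_frequencies</code>, <code>fourier_features</code>, <code>approx_kernel</code>), the sample vectors, and the choices of ''D'' = 2000 and ''σ'' = 1.5 are assumptions made for illustration only and do not come from the cited sources.

```python
import math
import random


def rbf_kernel(x, y, sigma=1.0):
    """Exact kernel: K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sample_frequencies(num_features, dim, sigma=1.0, seed=0):
    """Draw w_1, ..., w_D independently from N(0, sigma^-2 I).

    Each coordinate is Gaussian with standard deviation 1/sigma.
    The seed is fixed only to make this illustration repeatable.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
            for _ in range(num_features)]


def fourier_features(x, freqs):
    """phi(x) = D^(-1/2) [cos<w_1,x>, sin<w_1,x>, ..., cos<w_D,x>, sin<w_D,x>]."""
    scale = 1.0 / math.sqrt(len(freqs))
    feats = []
    for w in freqs:
        proj = sum(wi * xi for wi, xi in zip(w, x))
        feats.append(scale * math.cos(proj))
        feats.append(scale * math.sin(proj))
    return feats


def approx_kernel(x, y, freqs):
    """Inner product of the random feature maps, which estimates K(x, y)."""
    fx = fourier_features(x, freqs)
    fy = fourier_features(y, freqs)
    return sum(a * b for a, b in zip(fx, fy))


# Hypothetical sample points in R^3 (k = 3).
x = [0.5, -1.0, 0.25]
y = [0.0, 0.5, 1.0]
sigma = 1.5

freqs = sample_frequencies(num_features=2000, dim=3, sigma=sigma)
exact = rbf_kernel(x, y, sigma=sigma)
approx = approx_kernel(x, y, freqs)
print("exact:", exact, "random-feature estimate:", approx)
```

Because the variance of the estimate is ''O''(''D''<sup>−1</sup>), increasing <code>num_features</code> should tighten the agreement between the two printed values, at the cost of a longer feature vector (2''D'' entries per sample).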