{{Short description|In functional analysis, a Hilbert space}}
[[File:Different Views on RKHS.png|thumb|right|Figure illustrates related but varying approaches to viewing RKHS]]
In [[functional analysis]], a '''reproducing kernel Hilbert space''' ('''RKHS''') is a [[Hilbert space]] of functions in which point evaluation is a continuous [[linear functional]]. Specifically, a Hilbert space <math>H</math> of functions from a set <math>X</math> (to <math>\mathbb{R}</math> or <math>\mathbb{C}</math>) is an RKHS if, for each <math>x \in X</math>, there exists a function <math>K_x \in H</math> such that for all <math>f \in H</math>,
:<math>\langle f, K_x \rangle = f(x).</math>
An immediate consequence of this property is that convergence in norm implies [[uniform convergence]] on any subset of <math>X</math> on which <math>\|K_x\|</math> is bounded. However, the converse does not necessarily hold. Often the set <math>X</math> carries a topology, and <math>\|K_x\|</math> depends continuously on <math>x\in X</math>, in which case convergence in norm implies uniform convergence on compact subsets of <math>X</math>.
It is not entirely straightforward to construct natural examples of a Hilbert space which are not an RKHS in a non-trivial fashion.<ref>Alpay, D., and T. M. Mills. "A family of Hilbert spaces which are not reproducing kernel Hilbert spaces." J. Anal. Appl. 1.2 (2003): 107–111.</ref> Some examples, however, have been found.<ref> Z. Pasternak-Winiarski, "On weights which admit reproducing kernel of Bergman type", ''International Journal of Mathematics and Mathematical Sciences'', vol. 15, Issue 1, 1992. </ref><ref> T. Ł. Żynda, "On weights which admit reproducing kernel of Szegő type", ''Journal of Contemporary Mathematical Analysis'' (Armenian Academy of Sciences), 55, 2020. </ref>
An RKHS is associated with a kernel that reproduces every function in the space in the sense that for any <math>x</math> in the set on which the functions are defined, "evaluation at <math>x</math>" can be performed by taking an inner product with a function determined by the kernel. Such a ''reproducing kernel'' exists if and only if every evaluation functional is continuous.
While, formally, [[Square-integrable function|''L''<sup>2</sup> spaces]] are defined as Hilbert spaces of equivalence classes of functions, this definition can trivially be extended to a Hilbert space of functions by choosing a (total) function as a representative for each equivalence class. However, no choice of representatives can make this space an RKHS (<math>K_0</math> would need to be the non-existent Dirac delta function). Nevertheless, there are RKHSs in which the norm is an ''L''<sup>2</sup>-norm, such as the space of band-limited functions (see the example below).
The reproducing kernel was first introduced in the 1907 work of [[Stanisław Zaremba (mathematician)|Stanisław Zaremba]]{{fact|date=June 2025}} concerning [[boundary value problem]]s for [[Harmonic function|harmonic]] and [[Biharmonic equation|biharmonic functions]]. [[James Mercer (mathematician)|James Mercer]] simultaneously examined [[Positive-definite kernel|functions]] which satisfy the reproducing property in the theory of [[integral equation]]s. The idea of the reproducing kernel remained untouched for nearly twenty years until it appeared in the dissertations of [[Gábor Szegő]], [[Stefan Bergman]], and [[Salomon Bochner]]. The subject was eventually systematically developed in the early 1950s by [[Nachman Aronszajn]] and Stefan Bergman.<ref>Okutmustur</ref>
These spaces have wide applications, including [[complex analysis]], [[harmonic analysis]], and [[quantum mechanics]]. Reproducing kernel Hilbert spaces are particularly important in the field of [[statistical learning theory]] because of the celebrated [[representer theorem]] which states that every function in an RKHS that minimises an empirical risk functional can be written as a [[linear combination]] of the kernel function evaluated at the training points. This is a practically useful result as it effectively simplifies the [[empirical risk minimization]] problem from an infinite dimensional to a finite dimensional optimization problem.
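As an illustration of the representer theorem, the following minimal Python sketch fits kernel ridge regression: by the theorem, the minimiser of the regularised squared-loss empirical risk is a finite kernel expansion <math display="inline">f = \sum_i \alpha_i K(\cdot, x_i)</math>, whose coefficients solve a finite linear system. The Gaussian kernel, the synthetic data, and the regularisation strength <code>lam</code> are illustrative assumptions, not part of the theorem.

<syntaxhighlight lang="python">
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-(X[i] - Y[j])^2 / (2 sigma^2))."""
    return np.exp(-((X[:, None] - Y[None, :]) ** 2) / (2 * sigma**2))

# Illustrative one-dimensional training data (not from the article).
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=20)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(20)

# By the representer theorem the minimiser is f = sum_i alpha_i K(., x_i);
# for squared loss the coefficients solve (K + lam * n * I) alpha = y.
lam = 1e-2
n = len(x_train)
K = gaussian_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * n * np.eye(n), y_train)

# Evaluate the fitted function at new points through the kernel expansion.
x_test = np.linspace(-3, 3, 5)
print(gaussian_kernel(x_test, x_train) @ alpha)
</syntaxhighlight>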
==Definition==
Let <math>X</math> be an arbitrary [[Set (mathematics)|set]] and <math>H</math> a [[Hilbert space]] of [[real-valued function]]s on <math>X</math>, equipped with pointwise addition and pointwise scalar multiplication. The [[Cartesian closed category#Evaluation|evaluation]] functional over the Hilbert space of functions <math>H</math> is a linear functional that evaluates each function at a point <math>x</math>,
:<math> L_{x} : f \mapsto f(x) \text{ } \forall f \in H. </math>
We say that ''H'' is a '''reproducing kernel Hilbert space''' if, for all <math>x</math> in <math>X</math>, <math>L_x</math> is [[continuous function|continuous]] at every <math>f</math> in <math>H</math> or, equivalently, if <math>L_x</math> is a [[bounded operator]] on <math>H</math>, i.e. there exists some <math>M_x > 0</math> such that
{{NumBlk|:|<math> |L_x(f)| := |f(x)| \le M_x \, \|f\|_H \quad \forall f \in H. </math>|{{EquationRef|1}}}}
Although <math>M_x<\infty</math> is assumed for all <math>x \in X</math>, it might still be the case that <math display="inline">\sup_x M_x = \infty</math>.
While property ({{EquationNote|1}}) is the weakest condition that ensures both the existence of an inner product and the evaluation of every function in <math>H</math> at every point in the ___domain, it does not lend itself to easy application in practice. A more intuitive definition of the RKHS can be obtained by observing that this property guarantees that the evaluation functional can be represented by taking the inner product of <math> f </math> with a function <math> K_x </math> in <math>H</math>. This function is the so-called '''reproducing kernel''' for the Hilbert space <math>H</math> from which the RKHS takes its name. More formally, the [[Riesz representation theorem]] implies that for all <math>x</math> in <math>X</math> there exists a unique element <math> K_x </math> of <math>H</math> with the reproducing property,
{{NumBlk|:|<math> f(x) = L_{x}(f) = \langle f,\ K_x \rangle_H \quad \forall f \in H.</math>|{{EquationRef|2}}}}
Since <math> K_x </math> is itself a function defined on <math>X</math> with values in the field <math>\mathbb{R}</math> (or <math>\mathbb{C}</math> in the case of complex Hilbert spaces) and as <math> K_x </math> is in <math>H</math>, we have that
:<math> K_x(y) = L_y(K_x)= \langle K_x,\ K_y \rangle_H, </math>
where <math>K_y\in H</math> is the element in <math>H</math> associated to <math>L_y</math>.
This allows us to define the reproducing kernel of <math>H</math> as a function <math> K: X \times X \to \mathbb{R} </math> (or <math>\mathbb{C}</math> in the complex case) by
:<math> K(x,y) = \langle K_x,\ K_y \rangle_H. </math>
From this definition it is easy to see that <math> K: X \times X \to \mathbb{R} </math> (or <math>\mathbb{C}</math> in the complex case) is both symmetric (resp. conjugate symmetric) and positive definite, i.e.
:<math> \sum_{i,j =1}^n c_i c_j K(x_i, x_j)=
\sum_{i=1}^n c_i \left\langle K_{x_i} , \sum_{j=1}^n c_j K_{x_j} \right\rangle_{H} =
\left\langle \sum_{i=1}^n c_i K_{x_i} , \sum_{j=1}^n c_j K_{x_j} \right\rangle_{H} =
\left\|\sum_{i=1}^nc_iK_{x_i}\right\|_H^2 \ge 0 </math>
for any <math> n \in \mathbb{N}, x_1, \dots, x_n \in X, \text{ and } c_1, \dots, c_n \in \mathbb{R}. </math><ref>Durrett</ref> The Moore–Aronszajn theorem (see below) is a sort of converse to this: if a function <math>K</math> satisfies these conditions then there is a Hilbert space of functions on <math>X</math> for which it is a reproducing kernel.
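Positive definiteness can be illustrated numerically: any finite collection of points yields a Gram matrix <math>(K(x_i,x_j))_{ij}</math> whose quadratic forms are non-negative. A minimal Python sketch, in which the Gaussian kernel and the chosen points are illustrative assumptions:

<syntaxhighlight lang="python">
import numpy as np

def K(x, y, sigma=1.0):
    """Gaussian kernel, a standard positive definite function."""
    return np.exp(-((x - y) ** 2) / (2 * sigma**2))

# Gram matrix of the kernel over a finite set of points.
x = np.array([-1.3, 0.0, 0.4, 2.1])
G = K(x[:, None], x[None, :])

# Its eigenvalues are non-negative (up to round-off), so the quadratic
# form sum_{ij} c_i c_j K(x_i, x_j) is >= 0 for every coefficient vector.
print(np.linalg.eigvalsh(G))   # all non-negative
c = np.array([1.0, -2.0, 0.5, 1.5])
print(c @ G @ c)               # >= 0
</syntaxhighlight>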
==Examples==
The simplest example of a reproducing kernel Hilbert space is the space <math>L^2(X,\mu)</math> where <math>X</math> is a set and <math>\mu</math> is the [[counting measure]] on <math>X</math>. For <math>x\in X</math>, the reproducing kernel <math>K_x</math> is the [[indicator function]] of the one point set <math>\{x\}\subset X</math>.
The space of [[Bandlimiting|bandlimited]] [[continuous function]]s <math>H</math> is a RKHS, as we now show. Formally, fix some [[cutoff frequency]] <math> 0<a < \infty </math> and define the Hilbert space
:<math> H = \{ f \in L^2(\mathbb{R}) \mid \operatorname{supp}(F) \subset [-a,a] \} </math>
where <math> F(\omega) = \int_{-\infty}^{\infty} f(t) e^{-i\omega t} \, dt </math> is the [[Fourier transform]] of <math> f </math>.
Since this is a closed subspace of <math>L^2(\mathbb R)</math>, it is a Hilbert space. Moreover, the elements of <math>H</math> are smooth functions on <math>\mathbb R</math> that tend to zero at infinity, essentially by the [[Riemann-Lebesgue lemma]]. In fact, the elements of <math>H</math> are the restrictions to <math>\mathbb R</math> of entire [[holomorphic function]]s, by the [[Paley–Wiener theorem]].
From the [[Fourier inversion theorem]], we have
:<math> f(x) = \frac{1}{2 \pi} \int_{-a}^{a} F(\omega) e^{i \omega x} \, d\omega. </math>
It then follows by the [[Cauchy–Schwarz inequality]] and [[Plancherel theorem|Plancherel's theorem]] that, for all <math>x</math>,
:<math> |f(x)| \le
\frac{1}{2 \pi} \sqrt{ 2a\int_{-a}^{a} |F(\omega)|^2 \, d\omega }
=\frac{1}{2\pi} \sqrt{2a \cdot 2\pi} \, \|f\|_{L^2}
= \sqrt{\frac{a}{\pi}} \|f\|_{L^2}. </math>
This inequality shows that the evaluation functional is bounded, proving that <math> H </math> is indeed a RKHS.
The kernel function <math>K_x</math> in this case is given by
:<math>K_x(y) = \frac{a}{\pi} \operatorname{sinc}\left ( \frac{a}{\pi} (y-x) \right )=\frac{\sin(a(y-x))}{\pi(y-x)}.</math>
To see this, we first note that the Fourier transform of <math>K_x</math> defined above is given by
:<math>\int_{-\infty}^{\infty} K_x(y) e^{-i \omega y} \, dy =
\begin{cases}
e^{-i \omega x} &\text{if } \omega \in [-a, a], \\
0 &\text{otherwise},
\end{cases}
</math>
which is a consequence of the [[Fourier transform#Basic properties|time-shifting property of the Fourier transform]]. Consequently, using [[Plancherel's theorem]], we have
:<math> \langle f, K_x\rangle_{L^2} = \int_{-\infty}^{\infty} f(y) \overline{K_x(y)} \, dy
= \frac{1}{2\pi} \int_{-a}^{a} F(\omega) e^{i \omega x} \, d\omega = f(x). </math>
Thus we obtain the reproducing property of the kernel.
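This reproducing property can also be checked numerically. The Python sketch below approximates <math display="inline">\langle K_0, K_x \rangle_{L^2}</math> with a Riemann sum and compares it with <math>K_0(x)</math>; the cutoff <math>a=\pi</math>, the evaluation point, and the quadrature grid are illustrative assumptions (the slowly decaying sinc tails make the quadrature accurate only to a few digits).

<syntaxhighlight lang="python">
import numpy as np

a = np.pi   # cutoff frequency (illustrative choice)

def K(x, y):
    """Reproducing kernel sin(a(y - x)) / (pi (y - x)) of the band-limited space."""
    # np.sinc(t) = sin(pi t) / (pi t)
    return (a / np.pi) * np.sinc(a * (y - x) / np.pi)

# f = K_0 is itself band-limited, hence lies in H and can be reproduced.
x = 0.7
y = np.arange(-500.0, 500.0, 0.01)   # wide grid: the sinc tails decay like 1/|y|

# Riemann-sum approximation of <f, K_x>_{L^2} = int f(y) K_x(y) dy.
inner = 0.01 * np.sum(K(0.0, y) * K(x, y))
print(inner, K(x, 0.0))              # both approximately 0.368
</syntaxhighlight>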
== Moore–Aronszajn theorem ==
We have now seen how a reproducing kernel Hilbert space defines a reproducing kernel function that is both symmetric and positive definite. The Moore–Aronszajn theorem goes in the other direction; it states that every symmetric, positive definite kernel defines a unique reproducing kernel Hilbert space. The theorem first appeared in Aronszajn's ''Theory of Reproducing Kernels'', although he attributes it to [[E. H. Moore]].
:'''Theorem'''. Suppose ''K'' is a symmetric, [[positive definite kernel]] on a set ''X''. Then there is a unique Hilbert space of functions on ''X'' for which ''K'' is a reproducing kernel.
'''Proof'''. For all ''x'' in ''X'', define ''K<sub>x</sub>'' = ''K''(''x'', ⋅ ). Let ''H''<sub>0</sub> be the [[linear span]] of {''K<sub>x</sub>'' : ''x'' ∈ ''X''}. Define an inner product on ''H''<sub>0</sub> by
:<math> \left\langle \sum_{j=1}^n b_j K_{y_j}, \sum_{i=1}^m a_i K_{x_i} \right \rangle_{H_0} = \sum_{i=1}^m \sum_{j=1}^n {a_i} b_j K(y_j, x_i),</math>
Let ''H'' be the completion of ''H''<sub>0</sub> with respect to this inner product.
Now we can check the reproducing property ({{EquationNote|2}}):
:<math>\langle f, K_x \rangle_H = \lim_{n \to \infty} \langle f_n, K_x \rangle_{H_0} = \lim_{n \to \infty} f_n(x) = f(x),</math>
where <math>(f_n)</math> is any sequence in ''H''<sub>0</sub> converging to <math>f</math> in ''H''.
To prove uniqueness, let ''G'' be another Hilbert space of functions for which ''K'' is a reproducing kernel. For every ''x'' and ''y'' in ''X'', ({{EquationNote|2}}) implies that
:<math>\langle K_x, K_y \rangle_H = K(x, y) = \langle K_x, K_y \rangle_G.</math>
By linearity, <math>\langle \cdot, \cdot \rangle_H = \langle \cdot, \cdot \rangle_G</math> on the span of <math>\{K_x : x \in X\}</math>. Then <math>H \subset G</math> because ''G'' is complete and contains ''H''<sub>0</sub> and hence contains its completion.
Now we need to prove that every element of ''G'' is in ''H''. Let <math> f </math> be an element of ''G''. Since ''H'' is a closed subspace of ''G'', we can write <math> f=f_H + f_{H^\bot} </math> where <math> f_H \in H </math> and <math> f_{H^\bot} \in H^\bot </math>. Now if <math> x \in X </math> then, since ''K'' is a reproducing kernel of ''G'' and of ''H'',
:<math>f(x) = \langle K_x , f \rangle_G = \langle K_x, f_H \rangle_G + \langle K_x, f_{H^\bot} \rangle_G = \langle K_x, f_H \rangle_H = f_H(x),</math>
where we have used the fact that <math> K_x </math> belongs to ''H'' so that its inner product with <math> f_{H^\bot} </math> in ''G'' is zero.
This shows that <math> f = f_H </math> in ''G'' and concludes the proof.
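The inner product on <math>H_0</math> built in the proof can be computed directly from the kernel, and for <math display="inline">f = \sum_i a_i K_{x_i}</math> the reproducing property <math>\langle f, K_x\rangle_{H_0} = f(x)</math> holds by construction, as the following sketch shows (the Gaussian kernel, the points, and the coefficients are illustrative assumptions):

<syntaxhighlight lang="python">
import numpy as np

def K(x, y, sigma=1.0):
    """A symmetric positive definite kernel (Gaussian, as an illustration)."""
    return np.exp(-((x - y) ** 2) / (2 * sigma**2))

# An element of H_0: f = sum_i a_i K_{x_i}, evaluated pointwise via the kernel.
xs = np.array([0.0, 1.0, 2.5])
a = np.array([1.0, -0.5, 2.0])
f = lambda y: a @ K(xs, y)

# The H_0 inner product of f with K_x is sum_i a_i K(x_i, x), which by
# construction equals f(x): the reproducing property on the dense subspace.
x = 0.8
print(a @ K(xs, x))   # <f, K_x>_{H_0}
print(f(x))           # f(x), identical
</syntaxhighlight>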
== Integral operators and Mercer's theorem ==
We may characterize a symmetric positive definite kernel <math>K</math> via the integral operator using [[Mercer's theorem]] and obtain an additional view of the RKHS. Let <math>X</math> be a compact space equipped with a strictly positive finite [[Borel measure]] <math>\mu</math> and <math>K: X \times X \to \R</math> a continuous, symmetric, and positive definite function. Define the integral operator <math>T_K: L_2(X) \to L_2(X)</math> as
:<math> [T_K f](\cdot) =\int_X K({}\cdot{},t) f(t)\, d\mu(t) </math>
where <math>L_2(X)</math> is the space of square integrable functions with respect to <math> \mu </math>.
Mercer's theorem states that the spectral decomposition of the integral operator <math>T_K</math> of <math>K</math> yields a series representation of <math>K</math> in terms of the eigenvalues and eigenfunctions of <math> T_K </math>. This then implies that <math>K</math> is a reproducing kernel so that the corresponding RKHS can be defined in terms of these eigenvalues and eigenfunctions.
Under these assumptions <math>T_K</math> is a compact, continuous, self-adjoint, and positive operator. The [[spectral theorem]] for self-adjoint operators implies that there is an at most countable decreasing sequence <math>(\sigma_i)_{i \ge 1}</math> of positive eigenvalues with <math display="inline">\lim_{i \to \infty}\sigma_i = 0</math> and corresponding continuous eigenfunctions <math>(\varphi_i)_{i \ge 1}</math>, so that <math>T_K\varphi_i = \sigma_i \varphi_i</math> and the <math>\{\varphi_i\}</math> form an orthonormal basis of <math>L_2(X)</math>. Then by Mercer's theorem <math>K</math> may be written in terms of the eigenvalues and continuous eigenfunctions as
:<math> K(x,y) = \sum_{j=1}^\infty \sigma_j \, \varphi_j(x) \, \varphi_j(y) </math>
for all <math>x, y \in X</math> such that
:<math> \lim_{n \to \infty}\sup_{u,v} \left |K(u,v) - \sum_{j=1}^n \sigma_j \, \varphi_j(u) \, \varphi_j(v) \right| = 0. </math>
This series representation is referred to as a Mercer kernel or Mercer representation of <math> K </math>.
Furthermore, it can be shown that the RKHS <math> H </math> of <math> K </math> is given by
:<math> H = \left \{ f \in L_2(X) \,\Bigg\vert\, \sum_{i=1}^\infty \frac{\left\langle f,\varphi_i \right\rangle^2_{L_2}}{\sigma_i} < \infty \right \} </math>
where the inner product of <math> H </math> is given by
:<math> \left\langle f,g \right\rangle_H = \sum_{i=1}^\infty \frac{\left\langle f,\varphi_i \right\rangle_{L_2} \left\langle g,\varphi_i \right\rangle_{L_2}}{\sigma_i}. </math>
This representation of the RKHS has application in probability and statistics, for example to the [[Karhunen–Loève theorem|Karhunen–Loève representation]] for stochastic processes and [[kernel PCA]].
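The Mercer representation can be illustrated by discretising the integral operator on a grid, a Nyström-type approximation: the eigendecomposition of the scaled Gram matrix yields approximate eigenpairs <math>(\sigma_j, \varphi_j)</math>, and the truncated series reconstructs the kernel. The kernel, the grid size, and the truncation level in this sketch are illustrative choices:

<syntaxhighlight lang="python">
import numpy as np

def K(x, y, sigma=0.5):
    """Continuous, symmetric, positive definite kernel on [0, 1]."""
    return np.exp(-((x - y) ** 2) / (2 * sigma**2))

# Discretise T_K on a uniform grid (Lebesgue measure), a Nystrom-style
# approximation: (T_K f)(x) ~ (1/n) sum_k K(x, t_k) f(t_k).
n = 200
t = np.linspace(0, 1, n)
G = K(t[:, None], t[None, :])
sig, V = np.linalg.eigh(G / n)      # eigenvalues and eigenvectors, ascending
sig, V = sig[::-1], V[:, ::-1]      # reorder as a decreasing sequence

# Eigenfunction values on the grid, normalised so (1/n) sum_k phi^2(t_k) = 1.
phi = np.sqrt(n) * V

# Truncated Mercer representation K(x, y) ~ sum_j sigma_j phi_j(x) phi_j(y).
m = 10
K_mercer = (phi[:, :m] * sig[:m]) @ phi[:, :m].T
print(np.max(np.abs(K_mercer - G)))  # small uniform error on the grid
</syntaxhighlight>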
==Feature maps==
A '''feature map''' is a map <math> \varphi\colon X \rightarrow F </math>, where <math> F </math> is a Hilbert space which we will call the feature space. The first sections presented the connection between bounded/continuous evaluation functions, positive definite functions, and integral operators and in this section we provide another representation of the RKHS in terms of feature maps.

Every feature map defines a kernel via
{{NumBlk|:|<math> K(x,y) = \langle \varphi(x), \varphi(y) \rangle_F. </math> |{{EquationRef|3}}}}
Clearly <math> K </math> is symmetric and positive definiteness follows from the properties of inner product in <math> F </math>. Conversely, every positive definite function and corresponding reproducing kernel Hilbert space has infinitely many associated feature maps such that ({{EquationNote|3}}) holds.
For example, we can trivially take <math> F = H </math> and <math> \varphi(x) = K_x </math> for all <math> x \in X </math>. Then ({{EquationNote|3}}) is satisfied by the reproducing property. Another classical example of a feature map relates to the previous section regarding integral operators by taking <math> F = \ell^2 </math> and <math> \varphi(x) = (\sqrt{\sigma_i} \varphi_i(x))_i </math>.
This connection between kernels and feature maps provides us with a new way to understand positive definite functions and hence reproducing kernels as inner products in <math> H </math>. Moreover, every feature map can naturally define a RKHS by means of the definition of a positive definite function.
Lastly, feature maps allow us to construct function spaces that reveal another perspective on the RKHS. Consider the linear space
:<math> H_{\varphi} = \{ f : X \to \mathbb{R} \mid \exists w \in F, \ f(x) = \langle w, \varphi(x)\rangle_F, \ \forall \ x \in X \}. </math>
We can define a norm on <math> H_{\varphi} </math> by
:<math> \|f\|_{\varphi} = \inf \{ \|w\|_F : w \in F, \ f(x) = \langle w, \varphi(x)\rangle_F, \ \forall \ x \in X \}. </math>
It can be shown that <math> H_{\varphi} </math> is a RKHS with kernel defined by <math> K(x,y) = \langle\varphi(x), \varphi(y)\rangle_F </math>. This representation implies that the elements of the RKHS are inner products of elements in the feature space and can accordingly be seen as hyperplanes. This view of the RKHS is related to the [[kernel trick]] in machine learning.<ref>Rosasco</ref>
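For a concrete finite-dimensional feature map, the quadratic polynomial kernel on <math>\mathbb{R}^2</math> admits an explicit six-dimensional <math>\varphi</math> with <math>K(x,y) = \langle\varphi(x), \varphi(y)\rangle_F</math>, as this sketch verifies (the kernel choice and the test points are illustrative assumptions):

<syntaxhighlight lang="python">
import numpy as np

def K(x, y):
    """Quadratic polynomial kernel (alpha = 1, d = 2)."""
    return (x @ y + 1) ** 2

def phi(x):
    """Explicit feature map on R^2 with K(x, y) = <phi(x), phi(y)>."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 1.0])
print(K(x, y), phi(x) @ phi(y))   # equal: the kernel is an inner product in F
</syntaxhighlight>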
==Properties==
* Let <math>(X_i)_{i=1}^p</math> be a sequence of sets and <math>(K_i)_{i=1}^p</math> be a collection of corresponding positive definite functions on <math> (X_i)_{i=1}^p.</math> It then follows that
*::<math>K((x_1,\ldots ,x_p),(y_1,\ldots,y_p)) = K_1(x_1,y_1)\cdots K_p(x_p,y_p)</math>
*:is a kernel on <math> X = X_1 \times \dots \times X_p.</math>
* Let <math>X_0 \subset X,</math> then the restriction of <math> K </math> to <math>X_0 \times X_0 </math> is also a reproducing kernel.
* Consider a normalized kernel <math>K</math> such that <math> K(x, x) = 1 </math> for all <math>x \in X </math>. Define a pseudo-metric on <math>X</math> as
*::<math> d_K(x,y) = \|K_x - K_y\|_H^2 = 2(1-K(x,y)) \qquad \forall x, y \in X . </math>
*:By the [[Cauchy–Schwarz inequality]],
*::<math> K(x,y)^2 \le K(x, x)K(y, y)=1 \qquad \forall x,y \in X.</math>
*:This inequality allows us to view <math>K</math> as a [[Similarity measure|measure of similarity]] between inputs. If <math>x, y \in X</math> are similar then <math>K(x,y)</math> will be closer to 1 while if <math>x,y \in X</math> are dissimilar then <math>K(x,y)</math> will be closer to 0 (see the sketch after this list).
*The closure of the span of <math> \{ K_x \mid x \in X \} </math> coincides with <math> H </math>.
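A numerical sketch of the similarity interpretation above, using a normalized Gaussian kernel (an illustrative choice): nearby inputs give kernel values near 1 and small distance <math>d_K</math>, while distant inputs give values near 0 and distance near 2.

<syntaxhighlight lang="python">
import numpy as np

def K(x, y, sigma=1.0):
    """Normalised Gaussian kernel: K(x, x) = 1 for every x."""
    return np.exp(-((x - y) ** 2) / (2 * sigma**2))

def d_K(x, y):
    """Squared RKHS distance ||K_x - K_y||_H^2 = 2 (1 - K(x, y))."""
    return 2 * (1 - K(x, y))

print(K(0.0, 0.1), d_K(0.0, 0.1))   # ~0.995 and ~0.01: similar inputs
print(K(0.0, 4.0), d_K(0.0, 4.0))   # ~0.0003 and ~2.0: dissimilar inputs
</syntaxhighlight>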
== Common examples ==
===Bilinear kernels===
:<math> K(x,y) = \langle x,y\rangle </math>
The RKHS <math>H</math> corresponding to this kernel is the dual space, consisting of functions <math>f(x) = \langle x,\beta\rangle</math> satisfying <math>\|f\|_H^2=\|\beta\|^2</math>.
===Polynomial kernels===
:<math> K(x,y) = (\alpha\langle x,y \rangle + 1)^d, \qquad \alpha \in \R, d \in \N </math>
===[[Radial basis function kernel]]s===
These are another common class of kernels which satisfy <math> K(x,y) = K(\|x - y\|) </math>. Examples include:
*'''Gaussian''' or '''squared exponential kernel''':
*::<math> K(x,y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}, \qquad \sigma > 0 </math>
* '''Laplacian Kernel''':
*::<math> K(x,y) = e^{-\frac{\|x - y\|}{\sigma}}, \qquad \sigma > 0 </math>
*:The squared norm of a function <math>f</math> in the RKHS <math>H</math> with this kernel is:<ref>Berlinet, Alain and Thomas, Christine. ''[https://books.google.com/books?hl=en&lr=&id=bX3TBwAAQBAJ&oi=fnd&pg=PP11&dq=%22Reproducing+kernel+Hilbert+spaces+in+Probability+and+Statistics%22&ots=jV1gYX6vJ5&sig=um-eULpDSuKtXcYhzTYXwX8ZZzA#v=onepage&q=%22Reproducing%20kernel%20Hilbert%20spaces%20in%20Probability%20and%20Statistics%22&f=false Reproducing kernel Hilbert spaces in Probability and Statistics]'', Kluwer Academic Publishers, 2004</ref>
*::<math>\|f\|_H^2=\int_{\mathbb R}\Big( \frac1{\sigma} f(x)^2 + \sigma f'(x)^2\Big) \mathrm{d}x.</math>
===[[Bergman kernel]]s===
:<math>K(x,y)=\begin{cases} 1 & x=y \\ 0 & x \neq y \end{cases}</math>
In this case, ''H'' is isomorphic to <math>\Complex^n</math>.
The case of <math>X= \mathbb{D}</math> (where <math>\mathbb{D}</math> denotes the [[unit disc]]) is more sophisticated. Here the [[Bergman space]] <math>H^2(\mathbb{D})</math> is the space of [[square-integrable]] [[holomorphic function]]s on <math>\mathbb{D}</math>. It can be shown that the reproducing kernel for <math>H^2(\mathbb{D})</math> is
:<math>K(x,y)=\frac{1}{\pi}\frac{1}{(1-x\overline{y})^2}.</math>
Lastly, the space of band limited functions in <math> L^2(\R) </math> with bandwidth <math>2a</math> is a RKHS with reproducing kernel
:<math>K(x,y)=\frac{\sin a (x - y)}{\pi (x-y)}.</math>
== Extension to vector-valued functions==
In this section we extend the definition of the RKHS to spaces of vector-valued functions as this extension is particularly important in [[multi-task learning]] and [[manifold regularization]]. The main difference is that the reproducing kernel <math> \Gamma </math> is a symmetric function that is now a positive semi-definite ''matrix'' for every <math> x,y </math> in <math> X </math>. More formally, we define a vector-valued RKHS (vvRKHS) as a Hilbert space of functions <math> f : X \to \mathbb{R}^T </math> such that for all <math> c \in \mathbb{R}^T </math> and <math> x \in X </math>
:<math> \Gamma_xc(y) = \Gamma(x, y)c \in H \text{ for } y \in X </math>
and
:<math> \langle f, \Gamma_x c \rangle_H = f(x)^\intercal c. </math>
This second property parallels the reproducing property for the scalar-valued case.
We can gain intuition for the vvRKHS by taking a component-wise perspective on these spaces. In particular, we find that every vvRKHS is isometrically [[isomorphic]] to a scalar-valued RKHS on a particular input space. Let <math>\Lambda = \{1, \dots, T \} </math>. Consider the space <math> X \times \Lambda </math> and the corresponding reproducing kernel
{{NumBlk|:|<math> \gamma : (X \times \Lambda) \times (X \times \Lambda) \to \mathbb{R}. </math>|{{EquationRef|4}}}}
As noted above, the RKHS associated to this reproducing kernel is given by the closure of the span of <math>\{ \gamma_{(x,t)} : x \in X, t \in \Lambda \} </math> where
<math> \gamma_{(x,t)}(y,s) = \gamma((x,t),(y,s)) </math> for every pair <math> (x,t), (y,s) \in X \times \Lambda </math>.
The connection to the scalar-valued RKHS can then be made by the fact that every matrix-valued kernel can be identified with a kernel of the form of ({{EquationNote|4}}) via
:<math> \Gamma(x,y)_{(t,s)} = \gamma((x,t), (y,s)). </math>
Moreover, every kernel with the form of ({{EquationNote|4}}) defines a matrix-valued kernel with the above expression. Now letting the map <math> D: H_\Gamma \to H_\gamma </math> be defined by
:<math> (Df)(x,t) = \langle f(x), e_t \rangle_{\mathbb{R}^T} </math>
where <math> e_t </math> is the <math> t^\text{th} </math> component of the canonical basis for <math> \mathbb{R}^T </math>, one can show that <math> D </math> is bijective and an isometry between <math> H_\Gamma </math> and <math> H_\gamma </math>.
While this view of the vvRKHS can be useful in multi-task learning, this isometry does not reduce the study of the vector-valued case to that of the scalar-valued case. In fact, this isometry procedure can make both the scalar-valued kernel and the input space too difficult to work with in practice as properties of the original kernels are often lost.<ref>De Vito</ref><ref>Zhang</ref><ref>Alvarez</ref>
An interesting class of matrix-valued kernels are ''separable'' kernels, which can be factorized as the product of a scalar-valued kernel and a <math>T</math>-dimensional symmetric positive semi-definite matrix. In light of our previous discussion these kernels are of the form
:<math> \gamma((x,t),(y,s)) = K(x,y) K_T(t,s) </math>
for all <math>x,y </math> in <math> X </math> and <math>t,s</math> in <math> T </math>. As the scalar-valued kernel encodes dependencies between the inputs, we can observe that the matrix-valued kernel encodes dependencies among both the inputs and the outputs.
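A separable kernel can be assembled concretely: for <math>n</math> inputs and <math>T</math> outputs, the joint Gram matrix is the [[Kronecker product]] of the scalar Gram matrix with the output matrix, and it inherits positive semi-definiteness. The scalar kernel, the output matrix <code>B</code>, and the points in this sketch are illustrative assumptions:

<syntaxhighlight lang="python">
import numpy as np

def k(x, y, sigma=1.0):
    """Scalar kernel on the inputs (Gaussian, as an illustration)."""
    return np.exp(-((x - y) ** 2) / (2 * sigma**2))

# Fixed symmetric positive semi-definite matrix encoding output dependencies.
B = np.array([[1.0, 0.5],
              [0.5, 1.0]])

def Gamma(x, y):
    """Separable matrix-valued kernel Gamma(x, y) = k(x, y) B."""
    return k(x, y) * B

# Over n inputs and T outputs the joint Gram matrix is the Kronecker
# product of the scalar Gram matrix with B, and is again PSD.
x = np.array([0.0, 1.0, 2.0])
K_scalar = k(x[:, None], x[None, :])
K_big = np.kron(K_scalar, B)                   # (nT) x (nT)
print(np.linalg.eigvalsh(K_big) >= -1e-12)     # all True
</syntaxhighlight>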
We lastly remark that the above theory can be further extended to spaces of functions with values in function spaces but obtaining kernels for these spaces is a more difficult task.<ref>Rosasco</ref>
== Connection between RKHSs and the ReLU function ==
The [[Rectifier (neural networks)|ReLU function]] is commonly defined as <math>f(x)=\max(0, x)</math> and is a mainstay in the architecture of neural networks, used as an activation function. One can construct a ReLU-like nonlinear function using the theory of reproducing kernel Hilbert spaces.

We will work with the Hilbert space <math> \mathcal{H}=L^1_2(0)[0, \infty) </math> of absolutely continuous functions with <math>f(0) = 0</math> and square integrable (i.e. <math>L_2</math>) derivative. It has the inner product
:<math> \langle f, g \rangle_{\mathcal{H}} = \int_0^\infty f'(x) g'(x) \, dx. </math>

To construct the reproducing kernel it suffices to consider a dense subspace, so let <math>f\in C^1[0, \infty)</math> and <math>f(0)=0</math>.
The Fundamental Theorem of Calculus then gives
:<math> f(y)=\int_{0}^{y} f'(x) \, dx=\int_{0}^{\infty} G(x,y) f'(x) \, dx = \int_{0}^{\infty} K_{y}'(x) f'(x) \, dx = \langle K_y, f \rangle_{\mathcal{H}} </math>
where
:<math>G(x,y)=
\begin{cases} 1, & x < y\\
0, & \text{otherwise}
\end{cases}</math>
and <math>K_y'(x)= G(x,y),\ K_y(0) = 0</math> i.e.
:<math>K(x, y)=K_y(x)=\int_0^x G(z, y) \, dz=
\begin{cases}
x, & \text{if } 0\leq x<y\\
y, & \text{otherwise}
\end{cases}=\min(x, y).</math>
This implies <math>K_y=K(\cdot, y)</math> reproduces <math>f</math>.

Moreover the minimum function on <math> X\times X = [0,\infty)\times [0,\infty) </math> has the following representations with the ReLU function:
: <math> \min(x,y) = x -\operatorname{ReLU}(x-y) = y - \operatorname{ReLU}(y-x). </math>
Using this formulation, we can apply the [[representer theorem]] to the RKHS, letting one prove the optimality of using ReLU activations in neural network settings.{{Citation needed|date=January 2022|reason=Optimal in what sense?}}
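Both identities of this section can be checked numerically, as in the sketch below: the ReLU representation of the minimum function, and the reproducing property <math display="inline">f(y) = \int_0^\infty G(x,y) f'(x)\,dx</math> for a test function. The choice <math>f(x) = 1 - e^{-x}</math> (which satisfies <math>f(0)=0</math> and has square-integrable derivative) and the quadrature step are illustrative assumptions:

<syntaxhighlight lang="python">
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# The kernel K(x, y) = min(x, y) expressed through the ReLU function.
x, y = 1.7, 0.9
print(min(x, y), x - relu(x - y), y - relu(y - x))   # all equal 0.9

# Reproducing property: f(y) = int_0^inf G(x, y) f'(x) dx, where
# G(., y) is the indicator of [0, y). Test function (an assumption):
# f(t) = 1 - exp(-t), so f(0) = 0 and f' is square integrable.
f = lambda t: 1.0 - np.exp(-t)
fprime = lambda t: np.exp(-t)

step = 1e-5
grid = np.arange(0.0, y, step)       # G truncates the integral at y
inner = step * np.sum(fprime(grid))  # Riemann sum of f' over [0, y)
print(inner, f(y))                   # both approximately 0.5934
</syntaxhighlight>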
== See also ==
*[[Positive definite kernel]]
*[[Kernel trick]]
*[[Kernel embedding of distributions]]
*[[Representer theorem]]
==Notes==
{{reflist}}

==References==
*Alvarez, Mauricio, Rosasco, Lorenzo and Lawrence, Neil, “Kernels for Vector-Valued Functions: a Review,” https://arxiv.org/abs/1106.6251, June 2011.
* {{cite journal
|first=Nachman |last=Aronszajn |author-link=Nachman Aronszajn
|title=Theory of Reproducing Kernels
|journal=[[Transactions of the American Mathematical Society]]
|volume=68 |issue=3 |year=1950 |pages=337–404
|doi=10.1090/S0002-9947-1950-0051437-7
|doi-access=free}}
* {{cite journal
|first1=Felipe |last1=Cucker
|first2=Steve |last2=Smale |author-link2=Stephen Smale
|title=On the Mathematical Foundations of Learning
|journal=[[Bulletin of the American Mathematical Society]]
|volume=39 |issue=1 |year=2002 |pages=1–49
|doi=10.1090/S0273-0979-01-00923-5
|mr=1864085
|doi-access=free}}
*De Vito, Ernest, Umanita, Veronica, and Villa, Silvia. "An extension of Mercer theorem to vector-valued measurable kernels."
*Durrett, Greg. 9.520 Course Notes, Massachusetts Institute of Technology, https://www.mit.edu/~9.520/scribe-notes/class03_gdurett.pdf, February 2010.
* {{cite journal
|first1=George |last1=Kimeldorf
|first2=Grace |last2=Wahba |author-link2=Grace Wahba
|url=http://www.stat.wisc.edu/~wahba/ftp1/oldie/kw71.pdf
|title=Some results on Tchebycheffian Spline Functions
|journal=[[Journal of Mathematical Analysis and Applications]]
|volume=33 |issue=1 |year=1971 |pages=82–95 |doi=10.1016/0022-247X(71)90184-3
|mr=290013
|doi-access=free}}
*Okutmustur, Baver. "Reproducing Kernel Hilbert Spaces," M.S. dissertation, Bilkent University.
*Paulsen, Vern. "An introduction to the theory of reproducing kernel Hilbert spaces."
* {{cite journal
|first1=Ingo |last1=Steinwart
|first2=Clint |last2=Scovel
|title= Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs
|journal=[[Constructive Approximation]]
|volume=35 |issue=3|year=2012 |pages=363–417
|mr=2914365 |doi=10.1007/s00365-012-9153-3
|s2cid=253885172
}}
* Rosasco, Lorenzo and Poggio, Thomas. "A Regularization Tour of Machine Learning – MIT 9.520 Lecture Notes" Manuscript, Dec. 2014.
* [[Grace Wahba|Wahba, Grace]], ''Spline Models for Observational Data'', [http://www.siam.org/books/ SIAM], 1990.
*{{cite journal | last1 = Zhang | first1 = Haizhang | last2 = Xu | first2 = Yuesheng | last3 = Zhang | first3 = Qinghui | year = 2012 | title = Refinement of Operator-valued Reproducing Kernels | journal = Journal of Machine Learning Research | volume = 13 }}
[[Category:Hilbert spaces]]