Universal approximation theorem

The "dual" versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem was proved for the arbitrary depth case by Zhou Lu et al. in 2017.<ref name=ZhouLu /> They showed that networks of width ''n''&nbsp;+&nbsp;4 with [[ReLU]] activation functions can approximate any [[Lebesgue integration|Lebesgue-integrable function]] on ''n''-dimensional input space with respect to [[L1 distance|<math>L^1</math> distance]] if network depth is allowed to grow. It was also shown that if the width was less than or equal to ''n'', this general expressive power to approximate any Lebesgue integrable function was lost. In the same paper<ref name=ZhouLu /> it was shown that [[ReLU]] networks with width ''n''&nbsp;+&nbsp;1 were sufficient to approximate any [[continuous function|continuous]] function of ''n''-dimensional input variables.<ref>Hanin, B. (2018). [[arxiv:1710.11278|Approximating Continuous Functions by ReLU Nets of Minimal Width]]. arXiv preprint arXiv:1710.11278.</ref> The following refinement, specifies the optimal minimum width for which such an approximation is possible and is due to.<ref>{{Cite journal |last=Park, Yun, Lee, Shin |first=Sejun, Chulhee, Jaeho, Jinwoo |date=2020-09-28 |title=Minimum Width for Universal Approximation |url=https://openreview.net/forum?id=O-XJwyoIF-k |journal=ICLR |arxiv=2006.08859 |language=en}}</ref>
 
{{math theorem
| name = Universal approximation theorem (''L<sup>p</sup>'' distance, ReLU activation, arbitrary depth, minimal width)
| math_statement = For any [[Bochner integral|Bochner–Lebesgue p-integrable]] function <math>f : \mathbb R^n \to \mathbb R^m</math> and any <math>\varepsilon > 0</math>, there exists a [[fully connected network|fully connected]] [[ReLU]] network <math>F</math> of width exactly <math>d_m = \max\{n + 1, m\}</math>, satisfying
<math display="block">\int_{\mathbb R^n} \|f(x) - F(x)\|^p \, \mathrm{d}x < \varepsilon.</math>
Moreover, there exists a function <math>f \in L^p(\mathbb{R}^n, \mathbb{R}^m)</math> and some <math>\varepsilon > 0</math> for which there is no [[fully connected network|fully connected]] [[ReLU]] network of width less than <math>d_m = \max\{n + 1, m\}</math> satisfying the above approximation bound.

''Remark:'' If the activation function is replaced by leaky-ReLU, and the input is restricted to a compact ___domain, then the exact minimum width is<ref name=":1" /> <math>d_m = \max\{n, m, 2\}</math>.

''Quantitative refinement:'' For ReLU networks with inputs from the unit cube <math>[0, 1]^n</math> and one-dimensional output, the exact depth and width required to achieve <math>\varepsilon</math> error are known.<ref>{{cite journal |last1=Shen |first1=Zuowei |last2=Yang |first2=Haizhao |last3=Zhang |first3=Shijun |title=Optimal approximation rate of ReLU networks in terms of width and depth |journal=Journal de Mathématiques Pures et Appliquées |date=January 2022 |volume=157 |pages=101–135 |doi=10.1016/j.matpur.2021.07.009 |arxiv=2103.00502 |s2cid=232075797 }}</ref> If, moreover, the target function <math>f</math> is smooth, then the required number of layers and their width can be exponentially smaller.<ref>{{cite journal |last1=Lu |first1=Jianfeng |last2=Shen |first2=Zuowei |last3=Yang |first3=Haizhao |last4=Zhang |first4=Shijun |title=Deep Network Approximation for Smooth Functions |journal=SIAM Journal on Mathematical Analysis |date=January 2021 |volume=53 |issue=5 |pages=5465–5506 |doi=10.1137/20M134695X |arxiv=2001.03040 |s2cid=210116459 }}</ref> Even if <math>f</math> is not smooth, the curse of dimensionality can be broken if <math>f</math> admits additional "compositional structure".<ref>{{Cite journal |last1=Juditsky |first1=Anatoli B. |last2=Lepski |first2=Oleg V. |last3=Tsybakov |first3=Alexandre B. |date=2009-06-01 |title=Nonparametric estimation of composite functions |journal=The Annals of Statistics |volume=37 |issue=3 |doi=10.1214/08-aos611 |s2cid=2471890 |issn=0090-5364 |doi-access=free }}</ref><ref>{{Cite journal |last1=Poggio |first1=Tomaso |last2=Mhaskar |first2=Hrushikesh |last3=Rosasco |first3=Lorenzo |last4=Miranda |first4=Brando |last5=Liao |first5=Qianli |date=2017-03-14 |title=Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review |journal=International Journal of Automation and Computing |volume=14 |issue=5 |pages=503–519 |doi=10.1007/s11633-017-1054-2 |s2cid=15562587 |issn=1476-8186 |doi-access=free |arxiv=1611.00740 }}</ref>
}}
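The objects in the statement can be made concrete with a short numerical sketch. The following code (assuming NumPy; the helper names, the random weights, and the restriction of the integral to a bounded box are illustrative only, and are not the construction used in the proofs) builds a fully connected ReLU network whose hidden layers all have the fixed width <math>\max\{n + 1, m\}</math> and estimates the <math>L^p</math> error against a target function by Monte Carlo:

<syntaxhighlight lang="python">
import numpy as np

def min_width_relu(n, m):
    """Minimal width max{n + 1, m} from the theorem above (Park et al., 2020)."""
    return max(n + 1, m)

def relu(x):
    return np.maximum(x, 0.0)

def narrow_relu_net(x, weights, biases):
    """Forward pass of a fully connected ReLU network on a batch of inputs.

    Every hidden layer has the same fixed width, so approximation power must
    come from depth, which is the regime described by the theorem above.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W.T + b)                  # hidden layers use ReLU
    return h @ weights[-1].T + biases[-1]      # linear (identity-activation) output layer

def lp_error_on_box(f, F, n, p=1, box=5.0, num_samples=200_000, seed=0):
    """Monte Carlo estimate of the integral of ||f(x) - F(x)||^p over [-box, box]^n.

    Restricting to a bounded box is purely illustrative; the integral in the
    theorem is taken over all of R^n.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-box, box, size=(num_samples, n))
    volume = (2.0 * box) ** n
    return volume * (np.linalg.norm(f(x) - F(x), axis=1) ** p).mean()

# Illustrative setup: n = 3 inputs, m = 1 output, hidden width max{n + 1, m} = 4, depth 6.
rng = np.random.default_rng(0)
n, m, depth = 3, 1, 6
width = min_width_relu(n, m)
dims = [n] + [width] * depth + [m]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]

target = lambda x: np.exp(-np.sum(x**2, axis=1, keepdims=True))   # integrable target
network = lambda x: narrow_relu_net(x, weights, biases)
print(lp_error_on_box(target, network, n=n, p=1))
</syntaxhighlight>

With random weights the estimated error is of course large; the theorem asserts only that ''some'' choice of weights, with enough layers at this fixed width, drives it below any prescribed <math>\varepsilon</math>.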
 
The central result of Kidger and Lyons<ref name=kidger /> yields the following universal approximation theorem for networks with bounded width (see also Gripenberg<ref name=gripenberg /> for the first result of this kind).
 
{{math theorem
| name = Universal approximation theorem (Uniform non-[[Affine transformation|affine]] activation, arbitrary [[Deep learning|depth]], constrained width)
| math_statement = Let <math>\mathcal{X}</math> be a [[Compact set|compact subset]] of <math>\mathbb{R}^d</math>. Let <math>\sigma:\mathbb{R} \to \mathbb{R}</math> be any non-[[Affine transformation|affine]] [[Continuous function|continuous]] function which is [[Differentiable function#Differentiability classes|continuously differentiable]] at at least one point, with nonzero [[derivative]] at that point. Let <math>\mathcal{N}_{d,D:d+D+2}^\sigma</math> denote the space of feed-forward neural networks with <math>d</math> input neurons, <math>D</math> output neurons, and an arbitrary number of hidden layers each with <math>d + D + 2</math> neurons, such that every hidden neuron has activation function <math>\sigma</math> and every output neuron has the [[identity function|identity]] as its activation function, with input layer <math>\phi</math> and output layer <math>\rho</math>. Then given any <math>\varepsilon > 0</math> and any <math>f \in C(\mathcal{X}, \mathbb{R}^D)</math>, there exists <math>\hat{f} \in \mathcal{N}_{d,D:d+D+2}^\sigma</math> such that
<math display="block">\sup_{x \in \mathcal{X}} \left\|\hat{f}(x) - f(x)\right\| < \varepsilon.</math>
 
''Quantitative refinement:'' The number of layers and the width of each layer required to approximate <math>f</math> to <math>\varepsilon</math> precision are known;<ref name="jmlr.org"/> moreover, the result holds true when <math>\mathcal{X}</math> and <math>\mathbb{R}^D</math> are replaced with any non-positively curved [[Riemannian manifold]].
}}
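The error measured in this statement is the uniform (sup-norm) error over the compact set <math>\mathcal{X}</math>. A minimal sketch (assuming NumPy and taking <math>\mathcal{X} = [0, 1]^d</math>; the grid evaluation and the helper name are illustrative) of how that quantity can be estimated for a candidate approximation:

<syntaxhighlight lang="python">
import numpy as np

def sup_error_on_cube(f, f_hat, d, points_per_axis=25):
    """Estimate sup_{x in [0,1]^d} ||f_hat(x) - f(x)|| on a regular grid.

    A finite grid can only lower-bound the true supremum; it is used here
    solely to illustrate the uniform (sup-norm) error in the statement above.
    """
    axes = [np.linspace(0.0, 1.0, points_per_axis)] * d
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    errors = np.linalg.norm(f_hat(grid) - f(grid), axis=1)
    return errors.max()

# Example with d = 2, D = 1: a smooth target against a crude constant approximation.
f = lambda x: np.sin(np.pi * x[:, :1]) * np.cos(np.pi * x[:, 1:2])
f_hat = lambda x: np.zeros((x.shape[0], 1))
print(sup_error_on_cube(f, f_hat, d=2))   # 1.0 for this pair
</syntaxhighlight>

In practice <code>f_hat</code> would be a network of width <math>d + D + 2</math> from the class <math>\mathcal{N}_{d,D:d+D+2}^\sigma</math>; the constant function here is only a placeholder.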
 
Certain necessary conditions for the bounded width, arbitrary depth case have been established, but there is still a gap between the known sufficient and necessary conditions.<ref name="ZhouLu" /><ref name=hanin /><ref name=johnson>{{cite conference |last=Johnson |first=Jesse |conference=International Conference on Learning Representations |date=2019 |url=https://openreview.net/forum?id=ryGgSsAcFQ |title=Deep, Skinny Neural Networks are not Universal Approximators}}</ref>