{{Short description|
{{Technical|date=July 2023}}
In the field of [[machine learning]], the '''universal approximation theorems''' state that [[Artificial neural network|neural networks]] with a certain structure can, in principle, approximate any [[continuous function]] to any desired degree of accuracy. These theorems provide a mathematical justification for using neural networks, assuring researchers that a sufficiently large or deep network can model the complex, non-linear relationships often found in real-world data.<ref name="MLP-UA" /><ref>Balázs Csanád Csáji (2001) Approximation with Artificial Neural Networks; Faculty of Sciences; Eötvös Loránd University, Hungary</ref>
The most well-known version of the theorem applies to [[Feedforward neural network|feedforward networks]] with a single hidden layer. It states that if the layer's [[activation function]] is non-[[polynomial]] (which is true for common choices like the [[sigmoid function]] or [[Rectifier (neural networks)|ReLU]]), then the network can act as a "universal approximator." Universality is achieved by increasing the number of neurons in the hidden layer, making the network "wider." Other versions of the theorem show that universality can also be achieved by keeping the network's width fixed but increasing its number of layers, making it "deeper."
It is important to note that these are [[Existence theorem|existence theorems]]. They guarantee that a network with the right structure ''exists'', but they do not provide a method for finding the network's parameters ([[Mathematical optimization|training]] it), nor do they specify exactly how large the network must be for a given function. Finding a suitable network remains a practical challenge that is typically addressed with optimization algorithms like [[backpropagation]].
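The role of width can be illustrated with a minimal sketch in Python. It assumes a one-dimensional target on <math>[0, 1]</math> and sets the weights by an explicit piecewise-linear construction rather than by training, so it illustrates existence only. The helper names (<code>build_relu_interpolant</code>, <code>network</code>, <code>max_error</code>) and the choice of target function are hypothetical and are not taken from the cited sources.

```python
import math


def relu(z):
    return max(0.0, z)


def build_relu_interpolant(f, width):
    """Weights of a one-hidden-layer ReLU network that linearly
    interpolates f at width + 1 equally spaced knots on [0, 1]."""
    knots = [k / width for k in range(width + 1)]
    values = [f(t) for t in knots]
    # Slope of the interpolant on each of the `width` sub-intervals.
    slopes = [(values[k + 1] - values[k]) * width for k in range(width)]
    # The output weight of hidden neuron k is the change in slope at knot k.
    coeffs = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, width)]
    bias = values[0]
    return knots[:width], coeffs, bias


def network(x, knots, coeffs, bias):
    # One hidden layer: x -> bias + sum_k c_k * ReLU(x - t_k).
    return bias + sum(c * relu(x - t) for t, c in zip(knots, coeffs))


def max_error(f, width, samples=2001):
    """Largest absolute error on an evenly spaced grid over [0, 1]."""
    knots, coeffs, bias = build_relu_interpolant(f, width)
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - network(x, knots, coeffs, bias)) for x in grid)


# Assumed target: a smooth function on [0, 1].
target = lambda x: math.sin(2 * math.pi * x)
errors = {w: max_error(target, w) for w in (4, 16, 64)}
```

Under these assumptions the measured maximum error shrinks as the hidden layer widens, which is the qualitative content of the arbitrary-width statement. The sketch says nothing about whether gradient-based training would find such weights.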
== Setup ==
=== Arbitrary width ===
The first examples were the ''arbitrary width'' case. [[George Cybenko]] in 1989 proved it for [[sigmoid function|sigmoid]] activation functions.<ref name="cyb">{{cite journal |citeseerx=10.1.1.441.7873 |doi=10.1007/BF02551274|title=Approximation by superpositions of a sigmoidal function|year=1989|last1=Cybenko|first1=G.|journal=Mathematics of Control, Signals, and Systems|volume=2|issue=4|pages=303–314|bibcode=1989MCSS....2..303C |s2cid=3958369}}</ref> {{ill|Kurt Hornik|de}}, Maxwell Stinchcombe, and [[Halbert White]] showed in 1989 that multilayer [[feed-forward network]]s with as few as one hidden layer are universal approximators.<ref name="MLP-UA">{{cite journal |last1=Hornik |first1=Kurt |last2=Stinchcombe |first2=Maxwell |last3=White |first3=Halbert |date=January 1989 |title=Multilayer feedforward networks are universal approximators |journal=Neural Networks |volume=2 |issue=5 |pages=359–366 |doi=10.1016/0893-6080(89)90020-8}}</ref> Hornik also showed in 1991<ref name="horn">{{Cite journal|doi=10.1016/0893-6080(91)90009-T|title=Approximation capabilities of multilayer feedforward networks|year=1991|last1=Hornik|first1=Kurt|journal=Neural Networks|volume=4|issue=2|pages=251–257|s2cid=7343126 }}</ref> that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. 
Moshe Leshno ''et al.'' in 1993<ref name="leshno">{{Cite journal|last1=Leshno|first1=Moshe|last2=Lin|first2=Vladimir Ya.|last3=Pinkus|first3=Allan|last4=Schocken|first4=Shimon|date=January 1993|title=Multilayer feedforward networks with a nonpolynomial activation function can approximate any function|journal=Neural Networks|volume=6|issue=6|pages=861–867|doi=10.1016/S0893-6080(05)80131-5|s2cid=206089312|url=http://archive.nyu.edu/handle/2451/14329 }}</ref> and later Allan Pinkus in 1999<ref name="pinkus">{{Cite journal|last=Pinkus|first=Allan|date=January 1999|title=Approximation theory of the MLP model in neural networks|journal=Acta Numerica|volume=8|pages=143–195|doi=10.1017/S0962492900002919|bibcode=1999AcNum...8..143P|s2cid=16800260 }}</ref> showed that a feedforward network has the universal approximation property if and only if its activation function is nonpolynomial.
=== Arbitrary depth ===
The ''arbitrary depth'' case was also studied by a number of authors, such as Gustaf Gripenberg in 2003,<ref name= gripenberg >{{Cite journal|last1=Gripenberg|first1=Gustaf|date=June 2003|title= Approximation by neural networks with a bounded number of nodes at each level|journal= Journal of Approximation Theory |volume=122|issue=2|pages=260–266|doi= 10.1016/S0021-9045(03)00078-9 |doi-access=}}</ref> Dmitry Yarotsky,<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |title=Error bounds for approximations with deep ReLU networks |journal=Neural Networks |date=October 2017 |volume=94 |pages=103–114 |doi=10.1016/j.neunet.2017.07.002 |pmid=28756334 |arxiv=1610.01145 |s2cid=426133 }}</ref> Zhou Lu ''et al.'' in 2017,<ref name="ZhouLu">{{cite journal |last1=Lu |first1=Zhou |last2=Pu |first2=Hongming |last3=Wang |first3=Feicheng |last4=Hu |first4=Zhiqiang |last5=Wang |first5=Liwei |title=The Expressive Power of Neural Networks: A View from the Width |journal=Advances in Neural Information Processing Systems |volume=30 |year=2017 |pages=6231–6239 |url=http://papers.nips.cc/paper/7203-the-expressive-power-of-neural-networks-a-view-from-the-width |publisher=Curran Associates |arxiv=1709.02540 }}</ref> and Boris Hanin and Mark Sellke in 2018,<ref name=hanin>{{cite arXiv |last1=Hanin|first1=Boris|last2=Sellke|first2=Mark|title=Approximating Continuous Functions by ReLU Nets of Minimal Width|eprint=1710.11278|class=stat.ML|date=2018}}</ref> who focused on neural networks with the ReLU activation function. In 2020, Patrick Kidger and Terry Lyons<ref name=kidger>{{Cite conference|last1=Kidger|first1=Patrick|last2=Lyons|first2=Terry|date=July 2020|title=Universal Approximation with Deep Narrow Networks|arxiv=1905.08539|conference=Conference on Learning Theory}}</ref> extended those results to neural networks with ''general activation functions'', e.g. tanh
One special case of arbitrary depth is that each composition component comes from a finite set of mappings. In 2024, Cai<ref name= cai2024 >{{Cite journal|last1=Cai|first1=Yongqiang|date=2024|title= Vocabulary for Universal Approximation: A Linguistic Perspective of Mapping Compositions|journal= ICML|pages=5189–5208 |arxiv=2305.12205 |url= https://proceedings.mlr.press/v235/cai24a.html}}</ref> constructed a finite set of mappings, named a vocabulary, such that any continuous function can be approximated by composing a sequence of mappings from the vocabulary. This is similar to the concept of compositionality in linguistics, which is the idea that a finite vocabulary of basic elements can be combined via grammar to express an infinite range of meanings.
In 2018, Guliyev and Ismailov<ref name="guliyev1">{{Cite journal |last1=Guliyev |first1=Namig |last2=Ismailov |first2=Vugar |date=November 2018 |title=Approximation capability of two hidden layer feedforward neural networks with fixed weights |journal=Neurocomputing |volume=316 |pages=262–269 |arxiv=2101.09181 |doi=10.1016/j.neucom.2018.07.075 |s2cid=52285996}}</ref> constructed a smooth sigmoidal activation function that provides the universal approximation property for two-hidden-layer feedforward neural networks with fewer units in the hidden layers. In 2018, they also constructed<ref name="guliyev2">{{Cite journal|last1=Guliyev|first1=Namig|last2=Ismailov|first2=Vugar|date=February 2018|title=On the approximation by single hidden layer feedforward neural networks with fixed weights|journal=Neural Networks|volume=98| pages=296–304|doi=10.1016/j.neunet.2017.12.007|pmid=29301110 |arxiv=1708.06219 |s2cid=4932839 }}</ref> single-hidden-layer networks with bounded width that are still universal approximators for univariate functions. However, this does not apply to multivariable functions.
In 2022, Shen ''et al.''<ref name=shen22>{{cite journal |last1=Shen |first1=Zuowei |last2=Yang |first2=Haizhao |last3=Zhang |first3=Shijun |date=January 2022 |title=Optimal approximation rate of ReLU networks in terms of width and depth |journal=Journal de Mathématiques Pures et Appliquées |volume=157 |pages=101–135 |arxiv=2103.00502 |doi=10.1016/j.matpur.2021.07.009 |s2cid=232075797}}</ref> obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.
=== Quantitative bounds ===
=== Reservoir computing and quantum reservoir computing===
In reservoir computing, a sparse recurrent neural network with fixed weights, equipped with fading memory and the echo state property, is followed by a trainable output layer. Its universality has been demonstrated separately for networks of rate neurons <ref>{{Cite journal |
=== Variants ===
random neural networks,<ref>{{Cite journal |last1=Gelenbe |first1=Erol |last2=Mao |first2=Zhi Hong |last3=Li |first3=Yan D. |year=1999 |title=Function approximation with spiked random networks |url=https://zenodo.org/record/6817275 |journal=IEEE Transactions on Neural Networks |volume=10 |issue=1 |pages=3–9 |doi=10.1109/72.737488 |pmid=18252498}}</ref> and alternative network architectures and topologies.<ref name="kidger" /><ref>{{Cite conference |last1=Lin |first1=Hongzhou |last2=Jegelka |first2=Stefanie|author2-link=Stefanie Jegelka |date=2018 |title=ResNet with one-neuron hidden layers is a Universal Approximator |url=https://papers.nips.cc/paper/7855-resnet-with-one-neuron-hidden-layers-is-a-universal-approximator |publisher=Curran Associates |volume=30 |pages=6169–6178 |journal=Advances in Neural Information Processing Systems}}</ref>
The universal approximation property of width-bounded networks has been studied as a ''dual'' of classical universal approximation results on depth-bounded networks. For input dimension
For function approximation on graphs (or rather on [[Graph isomorphism|graph isomorphism classes]]), popular [[Graph neural network|graph convolutional neural networks]] (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test.<ref name="PowerGNNs">{{Cite conference |last1=Xu |first1=Keyulu |last2=Hu |first2=Weihua |last3=Leskovec |first3=Jure |last4=Jegelka |first4=Stefanie|author4-link=Stefanie Jegelka |date=2019 |title=How Powerful are Graph Neural Networks? |url=https://openreview.net/forum?id=ryGs6iA5Km |journal=International Conference on Learning Representations}}</ref> In 2020,<ref name="UniversalGraphs">{{Cite conference |last1=Brüel-Gabrielsson |first1=Rickard |date=2020 |title=Universal Function Approximation on Graphs |url=https://proceedings.neurips.cc//paper/2020/hash/e4acb4c86de9d2d9a41364f93951028d-Abstract.html |publisher=Curran Associates |volume=33 |journal=Advances in Neural Information Processing Systems}}</ref> Brüel-Gabrielsson established a universal approximation result, showing that a graph representation with certain injective properties is sufficient for universal function approximation on bounded graphs and for restricted universal function approximation on unbounded graphs. The result came with an accompanying <math>\mathcal O(\left|V\right| \cdot \left|E\right|)</math>-runtime method that performed at the state of the art on a collection of benchmarks (where <math>V</math> and <math>E</math> are the sets of nodes and edges of the graph, respectively).
== Arbitrary-width case ==
A universal approximation theorem formally states that a family of neural network functions is a [[dense set]] within a larger space of functions they are intended to approximate. In more direct terms, for any function <math>f</math> from a given function space, there exists a sequence of neural networks <math>\phi_1, \phi_2, \dots</math> from the family, such that <math>\phi_n \to f</math> according to some criterion.<ref name="cyb" /><ref name="MLP-UA" />
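As an illustration, assume the criterion is uniform convergence on a compact set <math>K \subset \mathbb{R}^n</math>, which is one common choice (other theorems use <math>L^p</math> norms instead). Writing <math>\mathcal{N}</math> for the family of network functions, a symbol introduced here only for this illustration, density can then be spelled out as:

<math display="block">\forall f \in C(K, \mathbb{R}^m),\ \forall \varepsilon > 0,\ \exists \phi \in \mathcal{N} : \ \sup_{x \in K} \|f(x) - \phi(x)\| < \varepsilon .</math>

Choosing <math>\varepsilon = 1/k</math> for <math>k = 1, 2, \dots</math> and a corresponding <math>\phi_k</math> recovers a sequence with <math>\phi_k \to f</math> uniformly on <math>K</math>.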
A spate of papers in the 1980s and 1990s, by [[George Cybenko]], {{ill|Kurt Hornik|de}}, and others, established several universal approximation theorems for arbitrary width and bounded depth.<ref>{{cite journal |last1=Funahashi |first1=Ken-Ichi |title=On the approximate realization of continuous mappings by neural networks |journal=Neural Networks |date=January 1989 |volume=2 |issue=3 |pages=183–192 |doi=10.1016/0893-6080(89)90003-8 }}</ref><ref name=cyb /><ref name=":0">{{cite journal |last1=Hornik |first1=Kurt |last2=Stinchcombe |first2=Maxwell |last3=White |first3=Halbert |title=Multilayer feedforward networks are universal approximators |journal=Neural Networks |date=January 1989 |volume=2 |issue=5 |pages=359–366 |doi=10.1016/0893-6080(89)90020-8 }}</ref><ref name=horn /> See<ref>Haykin, Simon (1998). ''Neural Networks: A Comprehensive Foundation'', Volume 2, Prentice Hall. {{isbn|0-13-273350-1}}.</ref><ref>Hassoun, M. (1995) ''Fundamentals of Artificial Neural Networks'' MIT Press, p. 48</ref><ref name="pinkus" /> for reviews. The following is the most often quoted:{{math_theorem
| name = Universal approximation theorem|Let <math>C(X, \mathbb{R}^m)</math> denote the set of [[continuous functions]] from a subset <math>X </math> of the Euclidean space <math>\mathbb{R}^n</math> to the Euclidean space <math>\mathbb{R}^m</math>. Let <math>\sigma \in C(\mathbb{R}, \mathbb{R})</math>. For a vector <math>x</math>, <math>\sigma \circ x</math> denotes <math>\sigma</math> applied to each component of <math>x</math>, that is, <math>(\sigma \circ x)_i = \sigma(x_i)</math>.
The case where <math>\sigma</math> is a generic non-polynomial function is harder, and the reader is directed to.<ref name="pinkus" />}}
The above proof does not specify how one might use a ramp function to approximate arbitrary functions in <math>C_0(\R^n, \R)</math>. A sketch of the proof is that one can first construct flat bump functions, intersect them to obtain spherical bump functions that approximate the [[Dirac delta function]], then use those to approximate arbitrary functions in <math>C_0(\R^n, \R)</math>.<ref>{{Cite book |last=Nielsen |first=Michael A. |date=2015 |title=Neural Networks and Deep Learning |url=http://neuralnetworksanddeeplearning.com/ |language=en}}</ref> The original proofs, such as the one by Cybenko, use methods from functional analysis, including the [[Hahn–Banach theorem|Hahn–Banach]] and [[Riesz–Markov–Kakutani representation theorem|Riesz–Markov–Kakutani representation]] theorems. Cybenko first published the theorem in a technical report in 1988,<ref>G. Cybenko, "Continuous Valued Neural Networks with Two Hidden Layers are Sufficient", Technical Report, Department of Computer Science, Tufts University, 1988.</ref> then as a paper in 1989.<ref name="cyb" />
Notice also that the neural network is only required to approximate within a compact set <math>K</math>. The proof does not describe how the function would be extrapolated outside of the region.
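The bump-function idea can be illustrated in one dimension, where no intersection step is needed. The following Python sketch is an illustration under simplifying assumptions, not the construction used in any of the cited proofs. A ramp is built from two ReLU units, a flat bump from two ramps, and the target is approximated by a weighted sum of bumps over equal cells of <math>[0, 1]</math>. The names <code>ramp</code>, <code>bump</code>, <code>bump_sum</code> and the steepness factor are hypothetical choices. The error is measured only on the compact set <math>K = [0.1, 0.9]</math>, since the sum is not designed to match the target at or beyond the ends of the covered interval.

```python
import math


def relu(z):
    return max(0.0, z)


def ramp(z):
    # Two ReLU units: 0 for z <= 0, z on [0, 1], and 1 for z >= 1.
    return relu(z) - relu(z - 1.0)


def bump(x, left, right, steepness):
    # Difference of two shifted ramps: rises just after `left`,
    # stays at 1 on the plateau, and falls back to 0 just after `right`.
    return ramp(steepness * (x - left)) - ramp(steepness * (x - right))


def bump_sum(f, x, cells):
    """Approximate f(x) by bumps on `cells` equal cells of [0, 1],
    each weighted by the value of f at the cell centre."""
    width = 1.0 / cells
    # Assumption: transition zones are much narrower than a cell.
    steepness = 50.0 * cells
    return sum(
        f((i + 0.5) * width) * bump(x, i * width, (i + 1) * width, steepness)
        for i in range(cells)
    )


def max_error_on_K(f, cells, samples=801):
    # Error is measured on K = [0.1, 0.9], strictly inside [0, 1].
    grid = [0.1 + 0.8 * j / (samples - 1) for j in range(samples)]
    return max(abs(f(x) - bump_sum(f, x, cells)) for x in grid)


# Assumed target: a smooth function on [0, 1].
target = lambda x: math.sin(2 * math.pi * x)
coarse = max_error_on_K(target, 10)
fine = max_error_on_K(target, 100)
```

With these choices, using more and narrower bumps reduces the error on <math>K</math>, while the value of the sum outside the covered interval is unrelated to the target, in line with the remark about extrapolation.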
The problem with polynomials may be removed by allowing the outputs of the hidden layers to be multiplied together (the "pi-sigma networks"), yielding the generalization:<ref name="
{{math_theorem
| name = Universal approximation theorem for pi-sigma networks|With any nonconstant activation function, a one-hidden-layer pi-sigma network is a universal approximator.
== Arbitrary-depth case ==
The "dual" versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem was proved for the arbitrary depth case by Zhou Lu et al. in 2017.<ref name=ZhouLu /> They showed that networks of width ''n'' + 4 with [[ReLU]] activation functions can approximate any [[Lebesgue integration|Lebesgue-integrable function]] on ''n''-dimensional input space with respect to [[L1 distance|<math>L^1</math> distance]] if network depth is allowed to grow. It was also shown that if the width was less than or equal to ''n'', this general expressive power to approximate any Lebesgue integrable function was lost. In the same paper<ref name=ZhouLu /> it was shown that [[ReLU]] networks with width ''n'' + 1 were sufficient to approximate any [[continuous function|continuous]] function of ''n''-dimensional input variables.<ref
{{math theorem
Remark: If the activation is replaced by leaky-ReLU, and the input is restricted to a compact domain, then the exact minimum width is<ref name=":1" /> <math>d_m = \max\{n, m, 2\}</math>.
''Quantitative refinement:'' In the case where <math>f:[0, 1]^n \rightarrow \mathbb{R} </math>, (i.e. <math> m = 1 </math>) and <math>\sigma</math> is the [[Rectifier (neural networks)|ReLU activation function]], the exact depth and width for a ReLU network to achieve <math>\varepsilon</math> error is also known.<ref
}}
{{Differentiable computing}}
[[Category:Theorems in mathematical analysis]]
[[Category:Artificial neural networks]]
[[Category:Network architecture]]
[[Category:Networks]]
[[Category:Approximation theory]]