[[Artificial neural networks]] are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued [[vector (mathematics and physics)|vectors]] to real-valued [[vector (mathematics and physics)|vectors]]. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and the network's multiplicative parameters (its weights). A great deal of theoretical work has gone into characterizing these function spaces.
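As an illustrative sketch of this composition (the NumPy-based code, layer sizes, and weight values below are arbitrary choices for exposition, not taken from the cited literature), a two-layer network maps <math>\mathbb{R}^3</math> to <math>\mathbb{R}^2</math> by alternating affine maps with a componentwise nonlinearity:

<syntaxhighlight lang="python">
import numpy as np

def sigma(x):
    # a simple componentwise nonlinearity (here tanh)
    return np.tanh(x)

# arbitrary example weights and biases
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)   # affine map R^3 -> R^5
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal(2)   # affine map R^5 -> R^2

def network(x):
    # a vector-to-vector function built by composing simple pieces
    return W2 @ sigma(W1 @ x + b1) + b2

print(network(np.array([0.5, -1.0, 2.0])))   # an output vector in R^2
</syntaxhighlight>

Changing the weights changes the implemented function; the theorems discussed below describe which functions can be reached in this way.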
In the [[mathematics|mathematical]] theory of [[artificial neural networks]], '''universal approximation theorems''' are results<ref name=MLP-UA>{{cite journal |last1=Hornik |first1=Kurt |last2=Stinchcombe |first2=Maxwell |last3=White |first3=Halbert |year=1989 |title=Multilayer feedforward networks are universal approximators |journal=Neural Networks |volume=2 |issue=5 |pages=359–366 |doi=10.1016/0893-6080(89)90020-8}}</ref> that establish the density of an algorithmically generated class of functions within a given function space of interest. Typically, these results concern the approximation capabilities of the feedforward architecture on the space of continuous functions between two [[Euclidean space]]s, with respect to the [[compact convergence]] topology.
However, there are also a variety of results for [[non-Euclidean space]]s<ref name=NonEuclidean>{{Cite conference|last1=Kratsios|first1=Anastasis|last2=Bilokopytov|first2=Eugene|date=2020|title=Non-Euclidean Universal Approximation|url=https://papers.nips.cc/paper/2020/file/786ab8c4d7ee758f80d57e65582e609d-Paper.pdf|publisher=Curran Associates|journal=Advances in Neural Information Processing Systems |volume=33}}</ref> and for other commonly used architectures and, more generally, for algorithmically generated sets of functions, such as the [[convolutional neural network]] (CNN) architecture,<ref>{{cite journal |doi=10.1016/j.acha.2019.06.004 |arxiv=1805.10769|title=Universality of deep convolutional neural networks|year=2020|last1=Zhou|first1=Ding-Xuan|journal=[[Applied and Computational Harmonic Analysis]]|volume=48|issue=2|pages=787–794|s2cid=44113176}}</ref><ref>{{Cite journal|doi = 10.1109/LSP.2020.3005051|title = Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets|year = 2020|last1 = Heinecke|first1 = Andreas|last2 = Ho|first2 = Jinn|last3 = Hwang|first3 = Wen-Liang|journal = IEEE Signal Processing Letters|volume = 27|pages = 1175–1179|bibcode = 2020ISPL...27.1175H|s2cid = 220669183}}</ref> [[radial basis functions]],<ref>{{Cite journal|doi=10.1162/neco.1991.3.2.246|title=Universal Approximation Using Radial-Basis-Function Networks|year=1991|last1=Park|first1=J.|last2=Sandberg|first2=I. W.|journal=Neural Computation|volume=3|issue=2|pages=246–257|pmid=31167308|s2cid=34868087}}</ref> or neural networks with specific properties.<ref>{{cite journal |doi=10.1007/s00365-021-09546-1|arxiv=1804.10306|title=Universal Approximations of Invariant Maps by Neural Networks|year=2021|last1=Yarotsky|first1=Dmitry|journal=Constructive Approximation|volume=55 |pages=407–474 |s2cid=13745401}}</ref><ref>{{
Universal approximation theorems imply that neural networks can ''represent'' a wide variety of interesting functions with appropriate weights. On the other hand, they typically do not provide a construction for the weights, but merely state that such a construction is possible. To construct the weights, neural networks are trained, and training may or may not converge on correct weights (i.e., it can get stuck in a local optimum). If the network is too small for the dimensions of the input data, the universal approximation theorems do not apply and the network will not learn the target function. Classical results stating that a single hidden layer suffices assume that the width of that layer may be arbitrarily large; with bounded width, such a network is in general too shallow. The width of a network is likewise an important [[hyperparameter]]. The choice of [[activation function]] also matters: much of the work, and many of the proofs, assume for example [[ReLU]] (or [[sigmoid function|sigmoid]]) activations, while some activations, such as linear ones (or any polynomial), are known ''not'' to work.
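A minimal sketch of the point above, assuming a one-hidden-layer ReLU network trained by plain gradient descent (the target function, width, learning rate, and step count are illustrative choices, not taken from any cited result): the theorems guarantee that suitable weights exist, but the training loop below may or may not find them.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 200).reshape(-1, 1)   # 200 sample inputs in [-1, 1]
y = np.sin(3 * np.pi * x)                     # target function to approximate

width = 50                                    # hidden-layer width (a hyperparameter)
W1, b1 = rng.standard_normal((width, 1)), np.zeros(width)
W2, b2 = rng.standard_normal((1, width)) / width, np.zeros(1)

def relu(z):
    return np.maximum(z, 0.0)

lr = 1e-2                                     # learning rate (a hyperparameter)

for step in range(5000):
    h = relu(x @ W1.T + b1)                   # hidden activations, shape (200, width)
    pred = h @ W2.T + b2                      # network output, shape (200, 1)
    err = pred - y
    # gradients of the mean squared error with respect to each parameter
    gW2 = 2 * err.T @ h / len(x)
    gb2 = 2 * err.mean(0)
    dh = (2 * err @ W2 / len(x)) * (h > 0)    # backpropagate through the ReLU
    gW1 = dh.T @ x
    gb1 = dh.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# the loss typically decreases, but training may or may not reach a good approximation
print("final mean squared error:", float((err ** 2).mean()))
</syntaxhighlight>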
== History ==
One of the first versions of the ''arbitrary width'' case was proven by [[George Cybenko]] in 1989 for [[sigmoid function|sigmoid]] activation functions.<ref name=cyb>{{cite journal |citeseerx=10.1.1.441.7873 |doi=10.1007/BF02551274|title=Approximation by superpositions of a sigmoidal function|year=1989|last1=Cybenko|first1=G.|journal=Mathematics of Control, Signals, and Systems|volume=2|issue=4|pages=303–314|s2cid=3958369}}</ref> {{ill|Kurt Hornik|de}}, Maxwell Stinchcombe, and [[Halbert White]] showed in 1989 that multilayer [[feed-forward network]]s with as few as one hidden layer are universal approximators.<ref name="MLP-UA" /> Hornik also showed in 1991<ref name=horn>{{Cite journal|doi=10.1016/0893-6080(91)90009-T|title=Approximation capabilities of multilayer feedforward networks|year=1991|last1=Hornik|first1=Kurt|journal=Neural Networks|volume=4|issue=2|pages=251–257|s2cid=7343126 }}</ref> that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al.'' in 1993<ref name=leshno>{{Cite journal|last1=Leshno|first1=Moshe|last2=Lin|first2=Vladimir Ya.|last3=Pinkus|first3=Allan|last4=Schocken|first4=Shimon|date=January 1993|title=Multilayer feedforward networks with a nonpolynomial activation function can approximate any function|journal=Neural Networks|volume=6|issue=6|pages=861–867|doi=10.1016/S0893-6080(05)80131-5|s2cid=206089312|url=http://archive.nyu.edu/handle/2451/14329 }}</ref> and later Allan Pinkus in 1999<ref name=pinkus>{{Cite journal|last=Pinkus|first=Allan|date=January 1999|title=Approximation theory of the MLP model in neural networks|journal=Acta Numerica|volume=8|pages=143–195|doi=10.1017/S0962492900002919|bibcode=1999AcNum...8..143P|s2cid=16800260 }}</ref> showed that the universal approximation property is equivalent to having a nonpolynomial activation function. In 2022, Shen Zuowei, Haizhao Yang, and Shijun Zhang established precise quantitative bounds on the depth and width needed for ReLU networks to approximate a target function to a prescribed error.<ref>{{
The ''arbitrary depth'' case was also studied by a number of authors such as Gustaf Gripenberg in 2003,<ref name= gripenberg >{{Cite journal|last1=Gripenberg|first1=Gustaf|date=June 2003|title= Approximation by neural networks with a bounded number of nodes at each level|journal= Journal of Approximation Theory |volume=122|issue=2|pages=260–266|doi= 10.1016/S0021-9045(03)00078-9 |doi-access=}}</ref> Dmitry Yarotsky,<ref>{{
The question of the minimal possible width for universality was first studied in 2021: Park et al. obtained the minimum width required for the universal approximation of ''[[Lp space|L<sup>p</sup>]]'' functions using feed-forward neural networks with [[Rectifier (neural networks)|ReLU]] as activation functions.<ref name="park">{{Cite conference |last1=Park |first1=Sejun |last2=Yun |first2=Chulhee |last3=Lee |first3=Jaeho |last4=Shin |first4=Jinwoo |date=2021 |title=Minimum Width for Universal Approximation |conference=International Conference on Learning Representations |arxiv=2006.08859}}</ref> Similar results that can be directly applied to [[residual neural network]]s were also obtained in the same year by Paulo Tabuada and Bahman Gharesifard using [[Control theory|control-theoretic]] arguments.<ref>{{Cite conference |last1=Tabuada |first1=Paulo |last2=Gharesifard |first2=Bahman |date=2021 |title=Universal approximation power of deep residual neural networks via nonlinear control theory |conference=International Conference on Learning Representations |arxiv=2007.06007}}</ref><ref>{{
The bounded depth and bounded width case was first studied by Maiorov and Pinkus in 1999.<ref name=maiorov>{{Cite journal|last1=Maiorov|first1=Vitaly|last2=Pinkus|first2=Allan|date=April 1999|title=Lower bounds for approximation by MLP neural networks|journal=Neurocomputing|volume=25|issue=1–3|pages=81–91|doi=10.1016/S0925-2312(98)00111-8}}</ref> They showed that there exists an analytic sigmoidal activation function such that neural networks with two hidden layers and a bounded number of units in each hidden layer are universal approximators.
random neural networks,<ref>{{Cite journal|doi=10.1109/72.737488|title=Function approximation with spiked random networks|year=1999|last1=Gelenbe|first1=Erol|last2=Mao|first2= Zhi Hong|last3=Li|first3=Yan D.|journal=IEEE Transactions on Neural Networks|volume=10|issue=1|pages=3–9|pmid=18252498 |url=https://zenodo.org/record/6817275 }}</ref> and alternative network architectures and topologies.<ref name="kidger" /><ref>{{Cite conference|last1=Lin|first1=Hongzhou|last2=Jegelka|first2=Stefanie|date=2018|title=ResNet with one-neuron hidden layers is a Universal Approximator|url=https://papers.nips.cc/paper/7855-resnet-with-one-neuron-hidden-layers-is-a-universal-approximator|publisher=Curran Associates|pages=6169–6178|journal=Advances in Neural Information Processing Systems |volume=30}}</ref>
In 2023 it was shown that a three-layer neural network can approximate any function, whether continuous or discontinuous.
== Arbitrary-width case ==
A spate of papers in the 1980s–1990s, from [[George Cybenko]] and {{ill|Kurt Hornik|de}} among others, established several universal approximation theorems for arbitrary width and bounded depth.<ref>{{
| name = Universal approximation theorem|Let <math>C(X, \mathbb{R}^m)</math> denote the set of [[continuous functions]] from a subset <math>X</math> of Euclidean space <math>\mathbb{R}^n</math> to Euclidean space <math>\mathbb{R}^m</math>. Let <math>\sigma \in C(\mathbb{R}, \mathbb{R})</math>. Note that <math>(\sigma \circ x)_i = \sigma(x_i)</math>, so <math>\sigma \circ x</math> denotes <math>\sigma</math> applied to each component of <math>x</math>.
Remark: If the activation function is replaced by leaky-ReLU, and the input is restricted to a compact domain, then the exact minimum width is<ref name=":1" /> <math>d_m = \max\{n, m, 2\}</math> (see the illustrative sketch below).
''Quantitative refinement:'' When <math>\mathcal{X} = [0, 1]^d</math>, <math>D = 1</math>, and <math>\sigma</math> is the [[Rectifier (neural networks)|ReLU activation function]], the exact depth and width for a ReLU network to achieve <math>\varepsilon</math> error are also known.<ref>{{
</blockquote>
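As an illustrative sketch of the minimum-width remark above (not drawn from the cited references; the input and output dimensions, depth, and weight values are arbitrary choices for exposition), the following NumPy code instantiates a deep leaky-ReLU network whose every hidden layer has exactly <math>\max\{n, m, 2\}</math> units:

<syntaxhighlight lang="python">
import numpy as np

def leaky_relu(z, slope=0.01):
    # leaky-ReLU, applied componentwise
    return np.where(z > 0, z, slope * z)

n, m = 3, 2                       # input and output dimensions (arbitrary example)
width = max(n, m, 2)              # minimum width from the remark: max{n, m, 2} = 3
depth = 10                        # number of hidden layers (arbitrary; depth is unrestricted here)

rng = np.random.default_rng(2)
layers = [(rng.standard_normal((width, n)), rng.standard_normal(width))]
layers += [(rng.standard_normal((width, width)), rng.standard_normal(width))
           for _ in range(depth - 1)]
W_out, b_out = rng.standard_normal((m, width)), rng.standard_normal(m)

def narrow_deep_network(x):
    # every hidden layer has exactly `width` units; only the depth grows
    h = x
    for W, b in layers:
        h = leaky_relu(W @ h + b)
    return W_out @ h + b_out

print(narrow_deep_network(np.array([0.1, -0.2, 0.3])))   # an output vector in R^2
</syntaxhighlight>

For <math>n = 3</math> and <math>m = 2</math> the formula gives width <math>\max\{3, 2, 2\} = 3</math>; the sketch only exhibits the architecture, and the weights realizing a given approximation would still have to be found, e.g. by training.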