Universal approximation theorem: Difference between revisions

cleaning up lead section
Line 12:
== Setup ==
[[Artificial neural networks]] are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued [[vector (mathematics and physics)|vectors]] to real-valued [[vector (mathematics and physics)|vectors]]. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and its multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces.
 
Most universal approximation theorems are in one of two classes. The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons ("''arbitrary width''" case) and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("''arbitrary depth''" case). In addition to these two classes, there are also universal approximation theorems for neural networks with a bounded number of hidden layers and a limited number of neurons in each layer ("''bounded depth and bounded width''" case).
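As an illustration of the ''arbitrary width'' case, the following minimal sketch builds a network with one hidden layer of ReLU units that linearly interpolates a function on the interval [0, 1]. It assumes a one-dimensional input and a continuous target, and the names <code>build_relu_interpolant</code> and <code>max_error</code> are hypothetical helpers introduced here. It is not the construction used in any of the cited proofs; it only shows how the error can shrink as the hidden layer gets wider.

```python
import math


def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def build_relu_interpolant(f, width):
    """Return a one-hidden-layer ReLU network with `width` hidden units.

    The network linearly interpolates f at width + 1 equally spaced
    knots on [0, 1]. Hidden unit i computes relu(x - knots[i]).
    """
    knots = [i / width for i in range(width + 1)]
    values = [f(x) for x in knots]
    # Slope of the interpolant on each sub-interval (spacing is 1 / width).
    slopes = [(values[i + 1] - values[i]) * width for i in range(width)]
    # The output weight of unit i is the change of slope at knot i.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, width)]
    bias = values[0]

    def network(x):
        # zip stops after `width` pairs, so the last knot is not a hidden unit.
        return bias + sum(w * relu(x - k) for w, k in zip(weights, knots))

    return network


def max_error(f, network, samples=1000):
    """Largest absolute error on an evenly spaced grid over [0, 1]."""
    return max(abs(f(i / samples) - network(i / samples))
               for i in range(samples + 1))


def target(x):
    return math.sin(2 * math.pi * x)


narrow = build_relu_interpolant(target, 4)
wide = build_relu_interpolant(target, 64)
print(max_error(target, narrow), max_error(target, wide))
```

For this smooth target the wider network has the smaller error on the grid. The sketch says nothing about how such weights would be found by training.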
 
== History ==
 
=== Classical results ===
One of the first versions of the ''arbitrary width'' case was proven by [[George Cybenko]] in 1989 for [[sigmoid function|sigmoid]] activation functions.<ref name="cyb">{{cite journal |citeseerx=10.1.1.441.7873 |doi=10.1007/BF02551274|title=Approximation by superpositions of a sigmoidal function|year=1989|last1=Cybenko|first1=G.|journal=Mathematics of Control, Signals, and Systems|volume=2|issue=4|pages=303–314|s2cid=3958369}}</ref> {{ill|Kurt Hornik|de}}, Maxwell Stinchcombe, and [[Halbert White]] showed in 1989 that multilayer [[feed-forward network]]s with as few as one hidden layer are universal approximators.<ref name="MLP-UA" /> Hornik also showed in 1991<ref name="horn">{{Cite journal|doi=10.1016/0893-6080(91)90009-T|title=Approximation capabilities of multilayer feedforward networks|year=1991|last1=Hornik|first1=Kurt|journal=Neural Networks|volume=4|issue=2|pages=251–257|s2cid=7343126 }}</ref> that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al'' in 1993<ref name="leshno">{{Cite journal|last1=Leshno|first1=Moshe|last2=Lin|first2=Vladimir Ya.|last3=Pinkus|first3=Allan|last4=Schocken|first4=Shimon|date=January 1993|title=Multilayer feedforward networks with a nonpolynomial activation function can approximate any function|journal=Neural Networks|volume=6|issue=6|pages=861–867|doi=10.1016/S0893-6080(05)80131-5|s2cid=206089312|url=http://archive.nyu.edu/handle/2451/14329 }}</ref> and later Allan Pinkus in 1999<ref name="pinkus">{{Cite journal|last=Pinkus|first=Allan|date=January 1999|title=Approximation theory of the MLP model in neural networks|journal=Acta Numerica|volume=8|pages=143–195|doi=10.1017/S0962492900002919|bibcode=1999AcNum...8..143P|s2cid=16800260 }}</ref> showed that the universal approximation property is equivalent to having a nonpolynomial activation function. 
In 2022, Shen Zuowei, Haizhao Yang, and Shijun Zhang<ref>{{cite journal |last1=Shen |first1=Zuowei |last2=Yang |first2=Haizhao |last3=Zhang |first3=Shijun |title=Optimal approximation rate of ReLU networks in terms of width and depth |journal=Journal de Mathématiques Pures et Appliquées |date=January 2022 |volume=157 |pages=101–135 |doi=10.1016/j.matpur.2021.07.009 |s2cid=232075797 |arxiv=2103.00502 }}</ref> obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.
 
The classical guarantee that a single hidden layer suffices assumes that the layer may be arbitrarily wide; with a limited number of neurons, such a network is in general too shallow. The width of a network is also an important [[hyperparameter]]. The choice of [[activation function]] matters as well: many results and proofs assume a specific activation such as [[ReLU]] or [[sigmoid function|sigmoid]], while others, such as a linear activation or any polynomial, are known ''not'' to give universal approximation.
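The failure of linear activations can be checked directly: stacking layers with an identity activation yields a single linear map, so depth adds no expressive power. The sketch below uses arbitrary illustrative weights <code>W1</code> and <code>W2</code>, omits bias terms for brevity, and uses helper names that are assumptions made for this example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in left]


# Illustrative weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.0, 2.0]]


def two_layer_linear(x):
    """Two layers with identity (linear) activation."""
    return matvec(W2, matvec(W1, x))


# The same function written as a single layer.
W_collapsed = matmul(W2, W1)


def one_layer_linear(x):
    return matvec(W_collapsed, x)


print(two_layer_linear([1.0, 1.0]), one_layer_linear([1.0, 1.0]))
```

Both calls return the same value for every input, so such a network can only represent linear functions regardless of its depth or width.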
 
=== Modern results ===
There are also a variety of results between [[non-Euclidean space]]s<ref name="NonEuclidean">{{Cite conference |last1=Kratsios |first1=Anastasis |last2=Bilokopytov |first2=Eugene |date=2020 |title=Non-Euclidean Universal Approximation |url=https://papers.nips.cc/paper/2020/file/786ab8c4d7ee758f80d57e65582e609d-Paper.pdf |publisher=Curran Associates |volume=33 |journal=Advances in Neural Information Processing Systems}}</ref> and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the [[convolutional neural network]] (CNN) architecture,<ref>{{cite journal |last1=Zhou |first1=Ding-Xuan |year=2020 |title=Universality of deep convolutional neural networks |journal=[[Applied and Computational Harmonic Analysis]] |volume=48 |issue=2 |pages=787–794 |arxiv=1805.10769 |doi=10.1016/j.acha.2019.06.004 |s2cid=44113176}}</ref><ref>{{Cite journal |last1=Heinecke |first1=Andreas |last2=Ho |first2=Jinn |last3=Hwang |first3=Wen-Liang |year=2020 |title=Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets |journal=IEEE Signal Processing Letters |volume=27 |pages=1175–1179 |bibcode=2020ISPL...27.1175H |doi=10.1109/LSP.2020.3005051 |s2cid=220669183}}</ref> [[radial basis functions]],<ref>{{Cite journal |last1=Park |first1=J. |last2=Sandberg |first2=I. W. 
|year=1991 |title=Universal Approximation Using Radial-Basis-Function Networks |journal=Neural Computation |volume=3 |issue=2 |pages=246–257 |doi=10.1162/neco.1991.3.2.246 |pmid=31167308 |s2cid=34868087}}</ref> or neural networks with specific properties.<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |year=2021 |title=Universal Approximations of Invariant Maps by Neural Networks |journal=Constructive Approximation |volume=55 |pages=407–474 |arxiv=1804.10306 |doi=10.1007/s00365-021-09546-1 |s2cid=13745401}}</ref><ref>{{cite journal |last1=Zakwan |first1=Muhammad |last2=d’Angelo |first2=Massimiliano |last3=Ferrari-Trecate |first3=Giancarlo |date=2023 |title=Universal Approximation Property of Hamiltonian Deep Neural Networks |journal=IEEE Control Systems Letters |page=1 |arxiv=2303.12147 |doi=10.1109/LCSYS.2023.3288350 |s2cid=257663609}}</ref>
 
The ''arbitrary depth'' case was also studied by a number of authors, such as Gustaf Gripenberg in 2003,<ref name= gripenberg >{{Cite journal|last1=Gripenberg|first1=Gustaf|date=June 2003|title= Approximation by neural networks with a bounded number of nodes at each level|journal= Journal of Approximation Theory |volume=122|issue=2|pages=260–266|doi= 10.1016/S0021-9045(03)00078-9 |doi-access=}}</ref> Dmitry Yarotsky,<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |title=Error bounds for approximations with deep ReLU networks |journal=Neural Networks |date=October 2017 |volume=94 |pages=103–114 |doi=10.1016/j.neunet.2017.07.002 |pmid=28756334 |arxiv=1610.01145 |s2cid=426133 }}</ref> Zhou Lu ''et al.'' in 2017,<ref name="ZhouLu">{{cite journal |last1=Lu |first1=Zhou |last2=Pu |first2=Hongming |last3=Wang |first3=Feicheng |last4=Hu |first4=Zhiqiang |last5=Wang |first5=Liwei |title=The Expressive Power of Neural Networks: A View from the Width |journal=Advances in Neural Information Processing Systems |volume=30 |year=2017 |pages=6231–6239 |url=http://papers.nips.cc/paper/7203-the-expressive-power-of-neural-networks-a-view-from-the-width |publisher=Curran Associates |arxiv=1709.02540 }}</ref> and Boris Hanin and Mark Sellke in 2018,<ref name=hanin>{{cite arXiv |last1=Hanin|first1=Boris|last2=Sellke|first2=Mark|title=Approximating Continuous Functions by ReLU Nets of Minimal Width|eprint=1710.11278|class=stat.ML|date=2018}}</ref> who focused on neural networks with the ReLU activation function. In 2020, Patrick Kidger and Terry Lyons<ref name=kidger>{{Cite conference|last1=Kidger|first1=Patrick|last2=Lyons|first2=Terry|date=July 2020|title=Universal Approximation with Deep Narrow Networks|arxiv=1905.08539|conference=Conference on Learning Theory}}</ref> extended those results to neural networks with ''general activation functions'' such as tanh, GeLU, or Swish. In 2022, their result was made quantitative by Léonie Papon and Anastasis Kratsios,<ref name="jmlr.org">{{Cite journal |last1=Kratsios |first1=Anastasis |last2=Papon |first2=Léonie |date=2022 |title=Universal Approximation Theorems for Differentiable Geometric Deep Learning |url=http://jmlr.org/papers/v23/21-0716.html |journal=Journal of Machine Learning Research |volume=23 |issue=196 |pages=1–73 |arxiv=2101.05390 }}</ref> who derived explicit depth estimates depending on the regularity of the target function and of the activation function.
 
Line 35 ⟶ 31:
random neural networks,<ref>{{Cite journal|doi=10.1109/72.737488|title=Function approximation with spiked random networks|year=1999|last1=Gelenbe|first1=Erol|last2=Mao|first2= Zhi Hong|last3=Li|first3=Yan D.|journal=IEEE Transactions on Neural Networks|volume=10|issue=1|pages=3–9|pmid=18252498 |url=https://zenodo.org/record/6817275 }}</ref> and alternative network architectures and topologies.<ref name="kidger" /><ref>{{Cite conference|last1=Lin|first1=Hongzhou|last2=Jegelka|first2=Stefanie|date=2018|title=ResNet with one-neuron hidden layers is a Universal Approximator|url=https://papers.nips.cc/paper/7855-resnet-with-one-neuron-hidden-layers-is-a-universal-approximator|publisher=Curran Associates|pages=6169–6178|journal=Advances in Neural Information Processing Systems |volume=30}}</ref>
 
A three-layer neural network can approximate any function (''continuous'' and ''discontinuous'').<ref>{{cite journal |last1=Ismailov |first1=Vugar E. |date=July 2023 |title=A three layer neural network can represent any multivariate function |journal=Journal of Mathematical Analysis and Applications |volume=523 |issue=1 |pages=127096 |arxiv=2012.03016 |doi=10.1016/j.jmaa.2023.127096 |s2cid=265100963 }}</ref>
 
The universal approximation property of width-bounded networks has been studied as a ''dual'' of classical universal approximation results on depth-bounded networks. For input dimension ''d<sub>x</sub>'' and output dimension ''d<sub>y</sub>'', the minimum width required for the universal approximation of ''[[Lp space|L<sup>p</sup>]]'' functions is exactly max{''d<sub>x</sub>'' + 1, ''d<sub>y</sub>''} (for a ReLU network). <!-- ReLU alone is not sufficient in general "In light of Theorem 2, is it impossible to approximate C(K, R dy) in general while maintaining width max{dx + 1, dy}? Theorem 3 shows that an additional activation comes to rescue." --> More generally this also holds if ''both'' ReLU and a [[step function|threshold activation function]] are used.<ref name="park" />
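The width formula above can be evaluated for concrete dimensions. The function name <code>minimal_relu_width_lp</code> is a hypothetical helper introduced for this example, and the sketch only restates the bound quoted in the text for ReLU networks approximating ''L<sup>p</sup>'' functions.

```python
def minimal_relu_width_lp(d_x, d_y):
    """Return max{d_x + 1, d_y}, the minimum width stated above.

    d_x is the input dimension and d_y the output dimension; both are
    assumed to be positive integers.
    """
    if d_x < 1 or d_y < 1:
        raise ValueError("dimensions must be positive integers")
    return max(d_x + 1, d_y)


# A scalar function of one variable needs width 2.
print(minimal_relu_width_lp(1, 1))
# With 3 inputs and 10 outputs, the output dimension dominates.
print(minimal_relu_width_lp(3, 10))
```

When the input dimension is at least the output dimension, the bound is one more than the input dimension.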