Universal approximation theorem

[[Artificial neural networks]] are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued [[vector (mathematics and physics)|vectors]] to real-valued [[vector (mathematics and physics)|vectors]]. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and the network's multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces.
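For example, a [[feedforward neural network]] with one hidden layer of <math>N</math> units and activation function <math>\sigma</math> implements functions of the form
<math display="block">f(x) = \sum_{j=1}^{N} c_j \, \sigma(w_j \cdot x + b_j),</math>
where the weight vectors <math>w_j</math>, biases <math>b_j</math>, and output coefficients <math>c_j</math> are the parameters that determine which function in the corresponding space is realized.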
 
In the [[mathematics|mathematical]] theory of [[artificial neural networks]], '''universal approximation theorems''' are results<ref name=MLP-UA>{{cite journal |last1=Hornik |first1=Kurt |last2=Stinchcombe |first2=Maxwell |last3=White |first3=Halbert |title=Multilayer feedforward networks are universal approximators |url=http://cognitivemedium.com/magic_paper/assets/Hornik.pdf |journal=Neural Networks |date=January 1989 |volume=2 |issue=5 |pages=359–366 |doi=10.1016/0893-6080(89)90020-8 }}</ref><ref>Balázs Csanád Csáji (2001) ''Approximation with Artificial Neural Networks''; Faculty of Sciences, Eötvös Loránd University, Hungary.</ref> that put limits on what neural networks can theoretically learn. Specifically, given an algorithm that generates networks within a class of functions, the theorems establish the [[dense set|density]] of the generated functions within a given function space of interest. Typically, these results concern the approximation capabilities of the [[feedforward neural network|feedforward architecture]] on the space of continuous functions between two [[Euclidean space]]s, and the approximation is with respect to the [[compact convergence]] topology. It must be stressed that, while some functions can be approximated arbitrarily well on a given region, the proofs do not apply outside of that region; that is, the approximated functions do not [[extrapolate]] beyond it. This holds for all non-periodic [[activation function]]s, i.e. those used in practice and assumed in most proofs. In recent years, [https://doi.org/10.1126/science.aax6239 neocortical pyramidal neurons] with an oscillating activation function, capable of individually learning the XOR function, have been discovered in the human brain, and oscillating activation functions have since been explored and shown to outperform popular activation functions on a variety of benchmarks.<ref>{{cite journal |last1=Gidon |first1=Albert |last2=Zolnik |first2=Timothy Adam |last3=Fidzinski |first3=Pawel |last4=Bolduan |first4=Felix |last5=Papoutsi |first5=Athanasia |last6=Poirazi |first6=Panayiota |last7=Holtkamp |first7=Martin |last8=Vida |first8=Imre |last9=Larkum |first9=Matthew Evan |title=Dendritic action potentials and computation in human layer 2/3 cortical neurons |journal=Science |date=3 January 2020 |volume=367 |issue=6473 |pages=83–87 |doi=10.1126/science.aax6239 }}</ref>
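As a toy illustration of the latter point (not an example taken from the cited study), consider a single unit <math>y = \sigma(x_1 + x_2)</math> with the oscillating activation <math>\sigma(t) = \sin^2(\pi t / 2)</math>. On the inputs <math>\{0,1\}^2</math> it outputs <math>\sigma(0) = 0</math>, <math>\sigma(1) = 1</math>, and <math>\sigma(2) = 0</math>, i.e. it computes XOR exactly, whereas XOR is not [[Linear separability|linearly separable]] and so cannot be computed exactly by a single unit with a monotonic activation function.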
 
However, there are also a variety of results for approximation between [[non-Euclidean space]]s<ref name=NonEuclidean>{{Cite conference|last1=Kratsios|first1=Anastasis|last2=Bilokopytov|first2=Eugene|date=2020|title=Non-Euclidean Universal Approximation|url=https://papers.nips.cc/paper/2020/file/786ab8c4d7ee758f80d57e65582e609d-Paper.pdf|publisher=Curran Associates|journal=Advances in Neural Information Processing Systems |volume=33}}</ref> and for other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the [[convolutional neural network]] (CNN) architecture,<ref>{{cite journal |doi=10.1016/j.acha.2019.06.004 |arxiv=1805.10769|title=Universality of deep convolutional neural networks|year=2020|last1=Zhou|first1=Ding-Xuan|journal=[[Applied and Computational Harmonic Analysis]]|volume=48|issue=2|pages=787–794|s2cid=44113176}}</ref><ref>{{Cite journal|doi = 10.1109/LSP.2020.3005051|title = Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets|year = 2020|last1 = Heinecke|first1 = Andreas|last2 = Ho|first2 = Jinn|last3 = Hwang|first3 = Wen-Liang|journal = IEEE Signal Processing Letters|volume = 27|pages = 1175–1179|bibcode = 2020ISPL...27.1175H|s2cid = 220669183}}</ref> [[radial basis functions]],<ref>{{Cite journal|doi=10.1162/neco.1991.3.2.246|title=Universal Approximation Using Radial-Basis-Function Networks|year=1991|last1=Park|first1=J.|last2=Sandberg|first2=I. W.|journal=Neural Computation|volume=3|issue=2|pages=246–257|pmid=31167308|s2cid=34868087}}</ref> or neural networks with specific properties.<ref>{{cite journal |doi=10.1007/s00365-021-09546-1|arxiv=1804.10306|title=Universal Approximations of Invariant Maps by Neural Networks|year=2021|last1=Yarotsky|first1=Dmitry|journal=Constructive Approximation|volume=55 |pages=407–474 |s2cid=13745401}}</ref><ref>{{cite journal |last1=Zakwan |first1=Muhammad |last2=d’Angelo |first2=Massimiliano |last3=Ferrari-Trecate |first3=Giancarlo |date=2023 |title=Universal Approximation Property of Hamiltonian Deep Neural Networks |url=https://ieeexplore.ieee.org/document/10159005 |journal=IEEE Control Systems Letters |volume=7 |pages=2689–2694 |doi=10.1109/LCSYS.2023.3288350 |arxiv=2303.12147 |s2cid=257663609 |issn=2475-1456}}</ref> Most universal approximation theorems can be parsed into two classes. The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons ("''arbitrary width''" case), and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("''arbitrary depth''" case). In addition to these two classes, there are also universal approximation theorems for neural networks with a bounded number of hidden layers and a limited number of neurons in each layer ("''bounded depth and bounded width''" case).
 
Universal approximation theorems imply that neural networks can ''represent'' a wide variety of interesting functions with appropriate weights. On the other hand, they typically do not provide a construction for the weights, but merely state that such a construction is possible. To construct the weights, neural networks are trained, and they may or may not converge on correct weights (i.e. they can get stuck in a local optimum). If the network is too small for the dimensions of the input data, then the universal approximation theorems do not apply and the network will not learn. What was once proven about the depth of a network, namely that a single hidden layer is enough, only applies for one dimension; in general such a network is too shallow. The width of a network is also an important [[hyperparameter]]. The choice of an [[activation function]] is also important: some works and proofs assume, e.g., that [[ReLU]] (or [[sigmoid function|sigmoid]]) is used, while some activation functions, such as a linear one, are known ''not'' to work (nor does any polynomial).
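The following is a minimal sketch of how such weights are constructed in practice by [[gradient descent]] rather than given by the theorems; the target function, network width, learning rate, and number of steps are arbitrary illustrative choices, not taken from the references.

<syntaxhighlight lang="python">
# Illustrative sketch: train a one-hidden-layer ReLU network by full-batch
# gradient descent to approximate f(x) = sin(x) on the compact region [-3, 3].
# The universal approximation theorems guarantee that good weights *exist*
# for a sufficiently wide network; they do not provide this procedure,
# which may or may not find such weights.
import numpy as np

rng = np.random.default_rng(0)

# Target function and training data, restricted to a compact region.
f = np.sin
x = rng.uniform(-3.0, 3.0, size=(256, 1))
y = f(x)

# One hidden layer of width 50 with ReLU activation.
width = 50
W1 = rng.normal(scale=1.0, size=(1, width))
b1 = np.zeros(width)
W2 = rng.normal(scale=0.1, size=(width, 1))
b2 = np.zeros(1)

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # hidden layer: affine map followed by ReLU
    return h, h @ W2 + b2              # network output

lr = 1e-2
for step in range(20000):
    h, pred = forward(x)
    err = pred - y                     # gradient of 0.5 * mean squared error w.r.t. pred (times N)
    grad_W2 = h.T @ err / len(x)
    grad_b2 = err.mean(axis=0)
    dh = (err @ W2.T) * (h > 0)        # back-propagate through the ReLU
    grad_W1 = x.T @ dh / len(x)
    grad_b1 = dh.mean(axis=0)
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1

# Inside the training region the fit is typically good ...
x_in = np.linspace(-3, 3, 100).reshape(-1, 1)
print("max error on [-3, 3]:", np.abs(forward(x_in)[1] - f(x_in)).max())
# ... but nothing is guaranteed outside of it (no extrapolation).
x_out = np.linspace(5, 8, 100).reshape(-1, 1)
print("max error on [5, 8]: ", np.abs(forward(x_out)[1] - f(x_out)).max())
</syntaxhighlight>

Running such a sketch typically yields a small error on the training interval and a much larger one outside of it, consistent with the lack of extrapolation guarantees discussed above.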
 
== History ==
One of the first versions of the ''arbitrary width'' case was proven by [[George Cybenko]] in 1989 for [[sigmoid function|sigmoid]] activation functions.<ref name=cyb>{{cite journal |citeseerx=10.1.1.441.7873 |doi=10.1007/BF02551274|title=Approximation by superpositions of a sigmoidal function|year=1989|last1=Cybenko|first1=G.|journal=Mathematics of Control, Signals, and Systems|volume=2|issue=4|pages=303–314|s2cid=3958369}}</ref> {{ill|Kurt Hornik|de}}, Maxwell Stinchcombe, and [[Halbert White]] showed in 1989 that multilayer [[feed-forward network]]s with as few as one hidden layer are universal approximators.<ref name="MLP-UA" /> Hornik also showed in 1991<ref name=horn>{{Cite journal|doi=10.1016/0893-6080(91)90009-T|title=Approximation capabilities of multilayer feedforward networks|year=1991|last1=Hornik|first1=Kurt|journal=Neural Networks|volume=4|issue=2|pages=251–257|s2cid=7343126 }}</ref> that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al.'' in 1993<ref name=leshno>{{Cite journal|last1=Leshno|first1=Moshe|last2=Lin|first2=Vladimir Ya.|last3=Pinkus|first3=Allan|last4=Schocken|first4=Shimon|date=January 1993|title=Multilayer feedforward networks with a nonpolynomial activation function can approximate any function|journal=Neural Networks|volume=6|issue=6|pages=861–867|doi=10.1016/S0893-6080(05)80131-5|s2cid=206089312|url=http://archive.nyu.edu/handle/2451/14329 }}</ref> and later Allan Pinkus in 1999<ref name=pinkus>{{Cite journal|last=Pinkus|first=Allan|date=January 1999|title=Approximation theory of the MLP model in neural networks|journal=Acta Numerica|volume=8|pages=143–195|doi=10.1017/S0962492900002919|bibcode=1999AcNum...8..143P|s2cid=16800260 }}</ref> showed that the universal approximation property is equivalent to having a nonpolynomial activation function. In 2022, Zuowei Shen, Haizhao Yang, and Shijun Zhang<ref>{{cite journal |last1=Shen |first1=Zuowei |last2=Yang |first2=Haizhao |last3=Zhang |first3=Shijun |title=Optimal approximation rate of ReLU networks in terms of width and depth |url=https://linkinghub.elsevier.com/retrieve/pii/S0021782421001124 |journal=Journal de Mathématiques Pures et Appliquées |language=en |date=January 2022 |volume=157 |pages=101–135 |doi=10.1016/j.matpur.2021.07.009 |arxiv=2103.00502 |s2cid=232075797 }}</ref> obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.
 
The ''arbitrary depth'' case was also studied by a number of authors, such as Gustaf Gripenberg in 2003,<ref name= gripenberg >{{Cite journal|last1=Gripenberg|first1=Gustaf|date=June 2003|title= Approximation by neural networks with a bounded number of nodes at each level|journal= Journal of Approximation Theory |volume=122|issue=2|pages=260–266|doi= 10.1016/S0021-9045(03)00078-9 |doi-access=}}</ref> Dmitry Yarotsky,<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |title=Error bounds for approximations with deep ReLU networks |journal=Neural Networks |date=October 2017 |volume=94 |pages=103–114 |doi=10.1016/j.neunet.2017.07.002 }}</ref> Zhou Lu ''et al.'' in 2017,<ref name="ZhouLu">{{cite journal |last1=Lu |first1=Zhou |last2=Pu |first2=Hongming |last3=Wang |first3=Feicheng |last4=Hu |first4=Zhiqiang |last5=Wang |first5=Liwei |title=The Expressive Power of Neural Networks: A View from the Width |journal=Advances in Neural Information Processing Systems |volume=30 |year=2017 |pages=6231–6239 |url=http://papers.nips.cc/paper/7203-the-expressive-power-of-neural-networks-a-view-from-the-width |publisher=Curran Associates |arxiv=1709.02540 }}</ref> and Boris Hanin and Mark Sellke in 2018,<ref name=hanin>{{cite arXiv |last1=Hanin|first1=Boris|last2=Sellke|first2=Mark|title=Approximating Continuous Functions by ReLU Nets of Minimal Width|eprint=1710.11278|class=stat.ML|date=2018}}</ref> who focused on neural networks with the ReLU activation function. In 2020, Patrick Kidger and Terry Lyons<ref name=kidger>{{Cite conference|last1=Kidger|first1=Patrick|last2=Lyons|first2=Terry|date=July 2020|title=Universal Approximation with Deep Narrow Networks|arxiv=1905.08539|conference=Conference on Learning Theory}}</ref> extended those results to neural networks with ''general activation functions'' such as, e.g., tanh, GeLU, or Swish. In 2022, their result was made quantitative by Leonie Papon and Anastasis Kratsios,<ref name="jmlr.org">{{Cite journal |last1=Kratsios |first1=Anastasis |last2=Papon |first2=Léonie |date=2022 |title=Universal Approximation Theorems for Differentiable Geometric Deep Learning |url=http://jmlr.org/papers/v23/21-0716.html |journal=Journal of Machine Learning Research |volume=23 |issue=196 |pages=1–73 |arxiv=2101.05390 |issn=1533-7928}}</ref> who derived explicit depth estimates depending on the regularity of the target function and of the activation function.
 
The question of minimal possible width for universality was first studied in 2021, when Park ''et al.'' obtained the minimum width required for the universal approximation of ''[[Lp space|L<sup>p</sup>]]'' functions using feed-forward neural networks with [[Rectifier (neural networks)|ReLU]] as activation functions.<ref name="park">{{Cite conference |last1=Park |first1=Sejun |last2=Yun |first2=Chulhee |last3=Lee |first3=Jaeho |last4=Shin |first4=Jinwoo |date=2021 |title=Minimum Width for Universal Approximation |conference=International Conference on Learning Representations |arxiv=2006.08859}}</ref> Similar results, which can be directly applied to [[residual neural network]]s, were obtained in the same year by Paulo Tabuada and Bahman Gharesifard using [[Control theory|control-theoretic]] arguments.<ref>{{Cite conference |last1=Tabuada |first1=Paulo |last2=Gharesifard |first2=Bahman |date=2021 |title=Universal approximation power of deep residual neural networks via nonlinear control theory |conference=International Conference on Learning Representations |arxiv=2007.06007}}</ref><ref>{{cite journal |last1=Tabuada |first1=Paulo |last2=Gharesifard |first2=Bahman |title=Universal Approximation Power of Deep Residual Neural Networks Through the Lens of Control |url=https://ieeexplore.ieee.org/document/9827563 |journal=IEEE Transactions on Automatic Control |date=May 2023 |volume=68 |issue=5 |pages=2715–2728 |doi=10.1109/TAC.2022.3190051 |s2cid=250512115 |issn=1558-2523}}</ref> In 2023, Cai<ref name=":1">{{Cite journal |last=Cai |first=Yongqiang |date=2023-02-01 |title=Achieve the Minimum Width of Neural Networks for Universal Approximation |url=https://openreview.net/forum?id=hfUJ4ShyDEU |journal=ICLR |arxiv=2209.11395 |language=en}}</ref> obtained the optimal minimum width bound for universal approximation.
 
The bounded depth and bounded width case was first studied by Maiorov and Pinkus in 1999.<ref name=maiorov>{{Cite journal|last1=Maiorov|first1=Vitaly|last2=Pinkus|first2=Allan|date=April 1999|title=Lower bounds for approximation by MLP neural networks|journal=Neurocomputing|volume=25|issue=1–3|pages=81–91|doi=10.1016/S0925-2312(98)00111-8}}</ref> They showed that there exists an analytic sigmoidal activation function such that neural networks with two hidden layers and a bounded number of units in each hidden layer are universal approximators.
random neural networks,<ref>{{Cite journal|doi=10.1109/72.737488|title=Function approximation with spiked random networks|year=1999|last1=Gelenbe|first1=Erol|last2=Mao|first2= Zhi Hong|last3=Li|first3=Yan D.|journal=IEEE Transactions on Neural Networks|volume=10|issue=1|pages=3–9|pmid=18252498 |url=https://zenodo.org/record/6817275 }}</ref> and alternative network architectures and topologies.<ref name="kidger" /><ref>{{Cite conference|last1=Lin|first1=Hongzhou|last2=Jegelka|first2=Stefanie|date=2018|title=ResNet with one-neuron hidden layers is a Universal Approximator|url=https://papers.nips.cc/paper/7855-resnet-with-one-neuron-hidden-layers-is-a-universal-approximator|publisher=Curran Associates|pages=6169–6178|journal=Advances in Neural Information Processing Systems |volume=30}}</ref>
 
In 2023, it was shown that a three-layer neural network can approximate any function (continuous or discontinuous).<ref>{{cite journal |last1=Ismailov |first1=Vugar E. |title=A three layer neural network can represent any multivariate function |journal=Journal of Mathematical Analysis and Applications |date=July 2023 |volume=523 |issue=1 |pages=127096 |doi=10.1016/j.jmaa.2023.127096 }}</ref>
 
== Arbitrary-width case ==
A spate of papers in the 1980s and 1990s, by [[George Cybenko]], {{ill|Kurt Hornik|de}}, and others, established several universal approximation theorems for arbitrary width and bounded depth.<ref>{{cite journal |last1=Funahashi |first1=Ken-Ichi |title=On the approximate realization of continuous mappings by neural networks |journal=Neural Networks |date=January 1989 |volume=2 |issue=3 |pages=183–192 |doi=10.1016/0893-6080(89)90003-8 |issn=0893-6080}}</ref><ref name=cyb /><ref name=":0">{{cite journal |last1=Hornik |first1=Kurt |last2=Stinchcombe |first2=Maxwell |last3=White |first3=Halbert |title=Multilayer feedforward networks are universal approximators |journal=Neural Networks |date=January 1989 |volume=2 |issue=5 |pages=359–366 |doi=10.1016/0893-6080(89)90020-8 |s2cid=2757547 |issn=0893-6080}}</ref><ref name=horn /> See<ref>Haykin, Simon (1998). ''Neural Networks: A Comprehensive Foundation'', Volume 2, Prentice Hall. {{isbn|0-13-273350-1}}.</ref><ref>Hassoun, M. (1995) ''Fundamentals of Artificial Neural Networks'' MIT Press, p.&nbsp;48</ref><ref name="pinkus" /> for reviews. The following is the most often quoted:{{math_theorem
| name = Universal approximation theorem|Let <math>C(X, \mathbb{R}^m)</math> denote the set of [[continuous functions]] from a subset <math>X</math> of a Euclidean space <math>\mathbb{R}^n</math> to a Euclidean space <math>\mathbb{R}^m</math>. Let <math>\sigma \in C(\mathbb{R}, \mathbb{R})</math>. Note that <math>(\sigma \circ x)_i = \sigma(x_i)</math>, so <math>\sigma \circ x</math> denotes <math>\sigma</math> applied to each component of <math>x</math>.
 
Remark: If the activation function is replaced by leaky-ReLU, and the input is restricted to a compact ___domain, then the exact minimum width is<ref name=":1" /> <math>d_m = \max\{n, m, 2\}</math>.
 
''Quantitative refinement:'' In the case where <math>\mathcal{X} = [0, 1]^d</math>, <math>D = 1</math>, and <math>\sigma</math> is the [[Rectifier (neural networks)|ReLU activation function]], the exact depth and width for a ReLU network to achieve <math>\varepsilon</math> error are also known.<ref>{{cite journal |last1=Shen |first1=Zuowei |last2=Yang |first2=Haizhao |last3=Zhang |first3=Shijun |title=Optimal approximation rate of ReLU networks in terms of width and depth |url=https://www.sciencedirect.com/science/article/pii/S0021782421001124 |journal=Journal de Mathématiques Pures et Appliquées |language=en |date=January 2022 |volume=157 |pages=101–135 |doi=10.1016/j.matpur.2021.07.009 |arxiv=2103.00502 |s2cid=232075797 |issn=0021-7824}}</ref> If, moreover, the target function <math>f</math> is smooth, then the required number of layers and their width can be exponentially smaller.<ref>{{cite journal |last1=Lu |first1=Jianfeng |last2=Shen |first2=Zuowei |last3=Yang |first3=Haizhao |last4=Zhang |first4=Shijun |title=Deep Network Approximation for Smooth Functions |url=https://epubs.siam.org/doi/abs/10.1137/20M134695X |journal=SIAM Journal on Mathematical Analysis |date=January 2021 |volume=53 |issue=5 |pages=5465–5506 |doi=10.1137/20M134695X |arxiv=2001.03040 |s2cid=210116459 |issn=0036-1410}}</ref> Even if <math>f</math> is not smooth, the curse of dimensionality can be broken if <math>f</math> admits additional "compositional structure".<ref>{{Cite journal |last1=Juditsky |first1=Anatoli B. |last2=Lepski |first2=Oleg V. |last3=Tsybakov |first3=Alexandre B. |date=2009-06-01 |title=Nonparametric estimation of composite functions |journal=The Annals of Statistics |volume=37 |issue=3 |doi=10.1214/08-aos611 |s2cid=2471890 |issn=0090-5364|doi-access=free }}</ref><ref>{{Cite journal |last1=Poggio |first1=Tomaso |last2=Mhaskar |first2=Hrushikesh |last3=Rosasco |first3=Lorenzo |last4=Miranda |first4=Brando |last5=Liao |first5=Qianli |date=2017-03-14 |title=Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review |journal=International Journal of Automation and Computing |volume=14 |issue=5 |pages=503–519 |doi=10.1007/s11633-017-1054-2 |s2cid=15562587 |issn=1476-8186|doi-access=free |arxiv=1611.00740 }}</ref>
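As an illustrative (hypothetical) instance of such compositional structure, a function of <math>d = 8</math> variables of the form
<math display="block">f(x_1, \dots, x_8) = g\bigl(h_1(x_1, x_2),\, h_2(x_3, x_4),\, h_3(x_5, x_6),\, h_4(x_7, x_8)\bigr)</math>
is built from constituent functions that each depend on only two variables; in results of this type, it is the low dimensionality of the constituent functions, rather than the ambient dimension <math>d</math>, that governs the attainable approximation rate.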
</blockquote>