{{Short description|Theorems on the approximation capabilities of neural networks}}
{{Technical|date=July 2023}}
In the field of [[machine learning]], the '''universal approximation theorems''' state that [[Artificial neural network|neural networks]] with a certain structure can, in principle, approximate any [[continuous function]] to any desired degree of accuracy. These theorems provide a mathematical justification for using neural networks, assuring researchers that a sufficiently large or deep network can model the complex, non-linear relationships often found in real-world data.<ref name="MLP-UA" /><ref>Balázs Csanád Csáji (2001) Approximation with Artificial Neural Networks; Faculty of Sciences; Eötvös Loránd University, Hungary</ref>
The most well-known version of the theorem applies to [[Feedforward neural network|feedforward networks]] with a single hidden layer. It states that if the layer's [[activation function]] is non-[[polynomial]] (which is true for common choices like the [[sigmoid function]] or [[Rectifier (neural networks)|ReLU]]), then the network can act as a "universal approximator." Universality is achieved by increasing the number of neurons in the hidden layer, making the network "wider." Other versions of the theorem show that universality can also be achieved by keeping the network's width fixed but increasing its number of layers, making it "deeper."
It is important to note that these are [[Existence theorem|existence theorems]]. They guarantee that a network with the right structure ''exists'', but they do not provide a method for finding the network's parameters ([[Mathematical optimization|training]] it), nor do they specify exactly how large the network must be for a given function. Finding a suitable network remains a practical challenge that is typically addressed with optimization algorithms like [[backpropagation]].
== Setup ==
[[Artificial neural networks]] are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued [[vector (mathematics and physics)|vectors]] to real-valued [[vector (mathematics and physics)|vectors]]. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and its multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces.
Most universal approximation theorems are in one of two classes. The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons ("''arbitrary width''" case) and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("''arbitrary depth''" case). In addition to these two classes, there are also universal approximation theorems for neural networks with bounded number of hidden layers and a limited number of neurons in each layer ("''bounded depth and bounded width''" case).
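As an illustration of this setup, the following minimal sketch (all weights and dimensions are arbitrary, chosen only for the example) evaluates a small feedforward network as a composition of affine maps and a componentwise activation function, taking a real-valued vector to a real-valued vector:
<syntaxhighlight lang="python">
import numpy as np

def layer(x, W, b, sigma=np.tanh):
    # One "simple function": an affine map followed by a componentwise activation.
    return sigma(W @ x + b)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)   # 3 inputs -> 5 hidden units
W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)   # 5 -> 4 hidden units
W3, b3 = rng.normal(size=(2, 4)), rng.normal(size=2)   # 4 hidden units -> 2 outputs

def network(x):
    # The network is a composition of simple functions; the final (output) layer is affine.
    return W3 @ layer(layer(x, W1, b1), W2, b2) + b3

print(network(np.array([0.1, -0.2, 0.3])))  # a vector in R^2
</syntaxhighlight>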
== History ==
=== Arbitrary width ===
The first examples were the ''arbitrary width'' case. [[George Cybenko]] in 1989 proved it for [[sigmoid function|sigmoid]] activation functions.<ref name="cyb">{{cite journal |citeseerx=10.1.1.441.7873 |doi=10.1007/BF02551274|title=Approximation by superpositions of a sigmoidal function|year=1989|last1=Cybenko|first1=G.|journal=Mathematics of Control, Signals, and Systems|volume=2|issue=4|pages=303–314|bibcode=1989MCSS....2..303C |s2cid=3958369}}</ref> {{ill|Kurt Hornik|de}}, Maxwell Stinchcombe, and [[Halbert White]] showed in 1989 that multilayer [[feed-forward network]]s with as few as one hidden layer are universal approximators.<ref name="MLP-UA">{{cite journal |last1=Hornik |first1=Kurt |last2=Stinchcombe |first2=Maxwell |last3=White |first3=Halbert |date=January 1989 |title=Multilayer feedforward networks are universal approximators |journal=Neural Networks |volume=2 |issue=5 |pages=359–366 |doi=10.1016/0893-6080(89)90020-8}}</ref> Hornik also showed in 1991<ref name="horn">{{Cite journal|doi=10.1016/0893-6080(91)90009-T|title=Approximation capabilities of multilayer feedforward networks|year=1991|last1=Hornik|first1=Kurt|journal=Neural Networks|volume=4|issue=2|pages=251–257|s2cid=7343126 }}</ref> that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al'' in 1993<ref name="leshno">{{Cite journal|last1=Leshno|first1=Moshe|last2=Lin|first2=Vladimir Ya.|last3=Pinkus|first3=Allan|last4=Schocken|first4=Shimon|date=January 1993|title=Multilayer feedforward networks with a nonpolynomial activation function can approximate any function|journal=Neural Networks|volume=6|issue=6|pages=861–867|doi=10.1016/S0893-6080(05)80131-5|s2cid=206089312|url=http://archive.nyu.edu/handle/2451/14329 }}</ref> and later Allan Pinkus in 1999<ref name="pinkus">{{Cite journal|last=Pinkus|first=Allan|date=January 1999|title=Approximation theory of the MLP model in neural networks|journal=Acta Numerica|volume=8|pages=143–195|doi=10.1017/S0962492900002919|bibcode=1999AcNum...8..143P|s2cid=16800260 }}</ref> showed that the universal approximation property is equivalent to having a nonpolynomial activation function.
=== Arbitrary depth ===
The ''arbitrary depth'' case was also studied by a number of authors such as Gustaf Gripenberg in 2003,<ref name= gripenberg >{{Cite journal|last1=Gripenberg|first1=Gustaf|date=June 2003|title= Approximation by neural networks with a bounded number of nodes at each level|journal= Journal of Approximation Theory |volume=122|issue=2|pages=260–266|doi= 10.1016/S0021-9045(03)00078-9 |doi-access=}}</ref> Dmitry Yarotsky,<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |title=Error bounds for approximations with deep ReLU networks |journal=Neural Networks |date=October 2017 |volume=94 |pages=103–114 |doi=10.1016/j.neunet.2017.07.002 |pmid=28756334 |arxiv=1610.01145 |s2cid=426133 }}</ref> Zhou Lu ''et al.'' in 2017,<ref name="ZhouLu">{{cite journal |last1=Lu |first1=Zhou |last2=Pu |first2=Hongming |last3=Wang |first3=Feicheng |last4=Hu |first4=Zhiqiang |last5=Wang |first5=Liwei |title=The Expressive Power of Neural Networks: A View from the Width |journal=Advances in Neural Information Processing Systems |volume=30 |year=2017 |pages=6231–6239 |url=http://papers.nips.cc/paper/7203-the-expressive-power-of-neural-networks-a-view-from-the-width |publisher=Curran Associates |arxiv=1709.02540 }}</ref> and Boris Hanin and Mark Sellke in 2018,<ref name=hanin>{{cite arXiv |last1=Hanin|first1=Boris|last2=Sellke|first2=Mark|title=Approximating Continuous Functions by ReLU Nets of Minimal Width|eprint=1710.11278|class=stat.ML|date=2018}}</ref> who focused on neural networks with the ReLU activation function. In 2020, Patrick Kidger and Terry Lyons<ref name=kidger>{{Cite conference|last1=Kidger|first1=Patrick|last2=Lyons|first2=Terry|date=July 2020|title=Universal Approximation with Deep Narrow Networks|arxiv=1905.08539|conference=Conference on Learning Theory}}</ref> extended those results to neural networks with ''general activation functions'' such as tanh or GeLU.
One special case of arbitrary depth is that in which each composition component comes from a finite set of mappings. In 2024, Cai<ref name= cai2024 >{{Cite journal|last1=Yongqiang|first1=Cai|date=2024|title= Vocabulary for Universal Approximation: A Linguistic Perspective of Mapping Compositions|journal= ICML|pages=5189–5208 |arxiv=2305.12205 |url= https://proceedings.mlr.press/v235/cai24a.html}}</ref> constructed a finite set of mappings, named a vocabulary, such that any continuous function can be approximated by composing a sequence from the vocabulary. This is similar to the concept of compositionality in linguistics, which is the idea that a finite vocabulary of basic elements can be combined via grammar to express an infinite range of meanings.
=== Bounded depth and bounded width ===
The bounded depth and bounded width case was first studied by Maiorov and Pinkus in 1999.<ref name="maiorov">{{Cite journal|last1=Maiorov|first1=Vitaly|last2=Pinkus|first2=Allan|date=April 1999|title=Lower bounds for approximation by MLP neural networks|journal=Neurocomputing|volume=25|issue=1–3|pages=81–91|doi=10.1016/S0925-2312(98)00111-8}}</ref> They showed that there exists an analytic sigmoidal activation function such that two hidden layer neural networks with bounded number of units in hidden layers are universal approximators.
In 2018, Guliyev and Ismailov<ref name="guliyev1">{{Cite journal |last1=Guliyev |first1=Namig |last2=Ismailov |first2=Vugar |date=November 2018 |title=Approximation capability of two hidden layer feedforward neural networks with fixed weights |journal=Neurocomputing |volume=316 |pages=262–269 |arxiv=2101.09181 |doi=10.1016/j.neucom.2018.07.075 |s2cid=52285996}}</ref> constructed a smooth sigmoidal activation function providing the universal approximation property for two hidden layer feedforward neural networks with fewer units in the hidden layers. In 2018, they also constructed<ref name="guliyev2">{{Cite journal|last1=Guliyev|first1=Namig|last2=Ismailov|first2=Vugar|date=February 2018|title=On the approximation by single hidden layer feedforward neural networks with fixed weights|journal=Neural Networks|volume=98| pages=296–304|doi=10.1016/j.neunet.2017.12.007|pmid=29301110 |arxiv=1708.06219 |s2cid=4932839 }}</ref> single hidden layer networks with bounded width that are still universal approximators for univariate functions. However, this does not apply to multivariable functions.
In 2022, Shen ''et al.''<ref name=shen22>{{cite journal |last1=Shen |first1=Zuowei |last2=Yang |first2=Haizhao |last3=Zhang |first3=Shijun |date=January 2022 |title=Optimal approximation rate of ReLU networks in terms of width and depth |journal=Journal de Mathématiques Pures et Appliquées |volume=157 |pages=101–135 |arxiv=2103.00502 |doi=10.1016/j.matpur.2021.07.009 |s2cid=232075797}}</ref> obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.
=== Quantitative bounds ===
The question of the minimal possible width for universality was first studied in 2021, when Park et al. obtained the minimum width required for the universal approximation of ''[[Lp space|L<sup>p</sup>]]'' functions using feed-forward neural networks with [[Rectifier (neural networks)|ReLU]] as activation functions.<ref name="park">{{Cite conference |last1=Park |first1=Sejun |last2=Yun |first2=Chulhee |last3=Lee |first3=Jaeho |last4=Shin |first4=Jinwoo |date=2021 |title=Minimum Width for Universal Approximation |conference=International Conference on Learning Representations |arxiv=2006.08859}}</ref> Similar results that can be directly applied to [[residual neural network]]s were also obtained in the same year by Paulo Tabuada and Bahman Gharesifard using [[Control theory|control-theoretic]] arguments.<ref>{{Cite conference |last1=Tabuada |first1=Paulo |last2=Gharesifard |first2=Bahman |date=2021 |title=Universal approximation power of deep residual neural networks via nonlinear control theory |conference=International Conference on Learning Representations |arxiv=2007.06007}}</ref><ref>{{cite journal |last1=Tabuada |first1=Paulo |last2=Gharesifard |first2=Bahman |date=May 2023 |title=Universal Approximation Power of Deep Residual Neural Networks Through the Lens of Control |journal=IEEE Transactions on Automatic Control |volume=68 |issue=5 |pages=2715–2728 |doi=10.1109/TAC.2022.3190051 |s2cid=250512115}}{{Erratum|doi=10.1109/TAC.2024.3390099|checked=yes}}</ref> In 2023, Cai obtained the optimal minimum width bound for the universal approximation.<ref name=":1">{{Cite journal |last=Cai |first=Yongqiang |date=2023-02-01 |title=Achieve the Minimum Width of Neural Networks for Universal Approximation |url=https://openreview.net/forum?id=hfUJ4ShyDEU |journal=ICLR |language=en |arxiv=2209.11395}}</ref>
For the arbitrary depth case, Leonie Papon and Anastasis Kratsios derived explicit depth estimates depending on the regularity of the target function and of the activation function.<ref name="jmlr.org">{{Cite journal |last1=Kratsios |first1=Anastasis |last2=Papon |first2=Léonie |date=2022 |title=Universal Approximation Theorems for Differentiable Geometric Deep Learning |url=http://jmlr.org/papers/v23/21-0716.html |journal=Journal of Machine Learning Research |volume=23 |issue=196 |pages=1–73 |arxiv=2101.05390}}</ref>
=== Kolmogorov network ===
The [[Kolmogorov–Arnold representation theorem]] is similar in spirit. Indeed, certain neural network families can directly apply the Kolmogorov–Arnold theorem to yield a universal approximation theorem. [[Robert Hecht-Nielsen]] showed that a three-layer neural network can approximate any continuous multivariate function.<ref>{{Cite journal |last=Hecht-Nielsen |first=Robert |date=1987 |title=Kolmogorov's mapping neural network existence theorem |url=https://cir.nii.ac.jp/crid/1572543025788928512 |journal=Proceedings of International Conference on Neural Networks, 1987 |volume=3 |pages=11–13}}</ref> This was extended to the discontinuous case by Vugar Ismailov.<ref>{{cite journal |last1=Ismailov |first1=Vugar E. |date=July 2023 |title=A three layer neural network can represent any multivariate function |journal=Journal of Mathematical Analysis and Applications |volume=523 |issue=1 |pages=127096 |arxiv=2012.03016 |doi=10.1016/j.jmaa.2023.127096 |s2cid=265100963}}</ref> In 2024, Ziming Liu and co-authors showed a practical application.<ref>{{cite arXiv |last1=Liu |first1=Ziming |title=KAN: Kolmogorov-Arnold Networks |date=2024-05-24 |eprint=2404.19756 |last2=Wang |first2=Yixuan |last3=Vaidya |first3=Sachin |last4=Ruehle |first4=Fabian |last5=Halverson |first5=James |last6=Soljačić |first6=Marin |last7=Hou |first7=Thomas Y. |last8=Tegmark |first8=Max|class=cs.LG }}</ref>
=== Reservoir computing and quantum reservoir computing===
In [[reservoir computing]], a sparse recurrent neural network with fixed weights, equipped with fading memory and the echo state property, is followed by a trainable output layer. Its universality has been demonstrated separately for networks of rate neurons<ref>{{Cite journal |last1=Grigoryeva |first1=L. |last2=Ortega |first2=J.-P. |date=2018 |title=Echo state networks are universal |journal=Neural Networks |volume=108 |issue=1 |pages=495–508 |arxiv=1806.00797 |doi=10.1016/j.neunet.2018.08.025|pmid=30317134 }}</ref> and of spiking neurons.<ref>{{Cite journal |last1=Maass |first1=Wolfgang |last2=Markram |first2=Henry |date=2004 |title=On the computational power of circuits of spiking neurons |url=http://www.igi.tugraz.at/maass/psfiles/135.pdf |journal=Journal of Computer and System Sciences |volume=69 |issue=4 |pages=593–616|doi=10.1016/j.jcss.2004.04.001 }}</ref> In 2024, the framework was generalized and extended to quantum reservoirs, where the reservoir is based on qubits defined over Hilbert spaces.<ref>{{cite arXiv |last1=Monzani |first1=Francesco |title=Universality conditions of unified classical and quantum reservoir computing |date=2024|eprint=2401.15067 |last2=Prati |first2=Enrico |class=quant-ph }}</ref>
=== Variants ===
Universal approximation theorems have also been established for discontinuous activation functions,<ref name="leshno" /> noncompact domains,<ref name="kidger" /><ref>{{Cite journal |last1=van Nuland |first1=Teun |year=2024 |title=Noncompact uniform universal approximation |url=https://doi.org/10.1016/j.neunet.2024.106181 |journal=Neural Networks |volume=173|doi=10.1016/j.neunet.2024.106181 |pmid=38412737 |arxiv=2308.03812 }}</ref> certifiable networks,<ref>{{cite conference |last1=Baader |first1=Maximilian |last2=Mirman |first2=Matthew |last3=Vechev |first3=Martin |date=2020 |title=Universal Approximation with Certified Networks |url=https://openreview.net/forum?id=B1gX8kBtPr |conference=ICLR}}</ref>
random neural networks,<ref>{{Cite journal |last1=Gelenbe |first1=Erol |last2=Mao |first2=Zhi Hong |last3=Li |first3=Yan D. |year=1999 |title=Function approximation with spiked random networks |url=https://zenodo.org/record/6817275 |journal=IEEE Transactions on Neural Networks |volume=10 |issue=1 |pages=3–9 |doi=10.1109/72.737488 |pmid=18252498}}</ref> and alternative network architectures and topologies.<ref name="kidger" /><ref>{{Cite conference |last1=Lin |first1=Hongzhou |last2=Jegelka |first2=Stefanie|author2-link=Stefanie Jegelka |date=2018 |title=ResNet with one-neuron hidden layers is a Universal Approximator |url=https://papers.nips.cc/paper/7855-resnet-with-one-neuron-hidden-layers-is-a-universal-approximator |publisher=Curran Associates |volume=30 |pages=6169–6178 |journal=Advances in Neural Information Processing Systems}}</ref>
The universal approximation property of width-bounded networks has been studied as a ''dual'' of classical universal approximation results on depth-bounded networks. For input dimension <math>d_x</math> and output dimension <math>d_y</math> the minimum width required for the universal approximation of the ''[[Lp space|L<sup>p</sup>]]'' functions is exactly <math>\max\{d_x + 1, d_y\}</math> (for a ReLU network). <!-- ReLU alone is not sufficient in general "In light of Theorem 2, is it impossible to approximate <math>C(K, R d_y)</math> in general while maintaining width <math>max\{d_x + 1, d_y\}</math>? Theorem 3 shows that an additional activation comes to rescue." --> More generally this also holds if ''both'' ReLU and a [[step function|threshold activation function]] are used.<ref name="park" />
Universal function approximation on graphs (or rather on [[Graph isomorphism|graph isomorphism classes]]) by popular [[Graph neural network|graph convolutional neural networks]] (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test.<ref name="PowerGNNs">{{Cite conference |last1=Xu |first1=Keyulu |last2=Hu |first2=Weihua |last3=Leskovec |first3=Jure |last4=Jegelka |first4=Stefanie|author4-link=Stefanie Jegelka |date=2019 |title=How Powerful are Graph Neural Networks? |url=https://openreview.net/forum?id=ryGs6iA5Km |journal=International Conference on Learning Representations}}</ref> In 2020,<ref name="UniversalGraphs">{{Cite conference |last1=Brüel-Gabrielsson |first1=Rickard |date=2020 |title=Universal Function Approximation on Graphs |url=https://proceedings.neurips.cc//paper/2020/hash/e4acb4c86de9d2d9a41364f93951028d-Abstract.html |publisher=Curran Associates |volume=33 |journal=Advances in Neural Information Processing Systems}}</ref> a universal approximation theorem result was established by Brüel-Gabrielsson, showing that graph representation with certain injective properties is sufficient for universal function approximation on bounded graphs and restricted universal function approximation on unbounded graphs, with an accompanying <math>\mathcal O(\left|V\right| \cdot \left|E\right|)</math>-runtime method that performed at state of the art on a collection of benchmarks (where <math>V</math> and <math>E</math> are the sets of nodes and edges of the graph respectively).
There are also a variety of results between [[non-Euclidean space]]s<ref name="NonEuclidean">{{Cite conference |last1=Kratsios |first1=Anastasis |last2=Bilokopytov |first2=Eugene |date=2020 |title=Non-Euclidean Universal Approximation |url=https://papers.nips.cc/paper/2020/file/786ab8c4d7ee758f80d57e65582e609d-Paper.pdf |publisher=Curran Associates |volume=33 |journal=Advances in Neural Information Processing Systems}}</ref> and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the [[convolutional neural network]] (CNN) architecture,<ref>{{cite journal |last1=Zhou |first1=Ding-Xuan |year=2020 |title=Universality of deep convolutional neural networks |journal=[[Applied and Computational Harmonic Analysis]] |volume=48 |issue=2 |pages=787–794 |arxiv=1805.10769 |doi=10.1016/j.acha.2019.06.004 |s2cid=44113176}}</ref><ref>{{Cite journal |last1=Heinecke |first1=Andreas |last2=Ho |first2=Jinn |last3=Hwang |first3=Wen-Liang |year=2020 |title=Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets |journal=IEEE Signal Processing Letters |volume=27 |pages=1175–1179 |bibcode=2020ISPL...27.1175H |doi=10.1109/LSP.2020.3005051 |s2cid=220669183}}</ref> [[radial basis functions]],<ref>{{Cite journal |last1=Park |first1=J. |last2=Sandberg |first2=I. W. |year=1991 |title=Universal Approximation Using Radial-Basis-Function Networks |journal=Neural Computation |volume=3 |issue=2 |pages=246–257 |doi=10.1162/neco.1991.3.2.246 |pmid=31167308 |s2cid=34868087}}</ref> or neural networks with specific properties.<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |year=2021 |title=Universal Approximations of Invariant Maps by Neural Networks |journal=Constructive Approximation |volume=55 |pages=407–474 |arxiv=1804.10306 |doi=10.1007/s00365-021-09546-1 |s2cid=13745401}}</ref><ref>{{cite journal |last1=Zakwan |first1=Muhammad |last2=d’Angelo |first2=Massimiliano |last3=Ferrari-Trecate |first3=Giancarlo |date=2023 |title=Universal Approximation Property of Hamiltonian Deep Neural Networks |journal=IEEE Control Systems Letters |page=1 |arxiv=2303.12147 |doi=10.1109/LCSYS.2023.3288350 |s2cid=257663609}}</ref>
== Arbitrary-width case ==
A universal approximation theorem formally states that a family of neural network functions is a [[dense set]] within a larger space of functions they are intended to approximate. In more direct terms, for any function <math>f</math> from a given function space, there exists a sequence of neural networks <math>\phi_1, \phi_2, \dots</math> from the family, such that <math>\phi_n \to f</math> according to some criterion.<ref name="cyb" /><ref name="MLP-UA" />
A series of papers in the 1980s–1990s, by [[George Cybenko]], {{ill|Kurt Hornik|de}}, and others, established several universal approximation theorems for arbitrary width and bounded depth.<ref>{{cite journal |last1=Funahashi |first1=Ken-Ichi |title=On the approximate realization of continuous mappings by neural networks |journal=Neural Networks |date=January 1989 |volume=2 |issue=3 |pages=183–192 |doi=10.1016/0893-6080(89)90003-8 }}</ref><ref name="MLP-UA" /><ref name="cyb" /><ref name="horn" /> See<ref>Haykin, Simon (1998). ''Neural Networks: A Comprehensive Foundation'', Volume 2, Prentice Hall. {{isbn|0-13-273350-1}}.</ref><ref>Hassoun, M. (1995) ''Fundamentals of Artificial Neural Networks'' MIT Press, p. 48</ref><ref name="pinkus" /> for reviews. The following is the most often quoted:{{math_theorem
| name = Universal approximation theorem|Let <math>C(X, \mathbb{R}^m)</math> denote the set of [[continuous functions]] from a subset <math>X </math> of a Euclidean <math>\mathbb{R}^n</math> space to a Euclidean space <math>\mathbb{R}^m</math>. Let <math>\sigma \in C(\mathbb{R}, \mathbb{R})</math>. Note that <math>(\sigma \circ x)_i = \sigma(x_i)</math>, so <math>\sigma \circ x</math> denotes <math>\sigma</math> applied to each component of <math>x</math>.
Then <math>\sigma</math> is not [[polynomial]] [[if and only if]] for every <math>n \in \mathbb{N}</math>, <math>m \in \mathbb{N}</math>, compact <math>K \subseteq \mathbb{R}^n</math>, <math>f \in C(K, \mathbb{R}^m)</math>, and <math>\varepsilon > 0</math>, there exist <math>k \in \mathbb{N}</math>, <math>A \in \mathbb{R}^{k \times n}</math>, <math>b \in \mathbb{R}^k</math>, and <math>C \in \mathbb{R}^{m \times k}</math> such that
<math display="block">\sup_{x \in K} \|f(x) - g(x)\| < \varepsilon,</math>
where <math>g(x) = C \cdot (\sigma \circ (A \cdot x + b)).</math>
}}
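The theorem only asserts that, for each <math>f</math> and <math>\varepsilon</math>, some width <math>k</math> and parameters <math>A, b, C</math> exist. The following numerical sketch (illustrative only, not taken from the cited sources) fixes random hidden-layer parameters and fits only the output matrix <math>C</math> by least squares, showing how the achievable error on a compact set typically shrinks as the width grows:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 400).reshape(-1, 1)   # sample points of the compact set K = [-1, 1]
f = np.sin(4 * np.pi * x)                        # a target function f in C(K, R)

def sup_error(k):
    # g(x) = C·sigma(Ax + b) with random A, b; only C is fitted (by least squares).
    A = rng.normal(size=(k, 1)) * 5.0
    b = rng.normal(size=k) * 5.0
    H = np.tanh(x @ A.T + b)                     # hidden activations, shape (400, k)
    C, *_ = np.linalg.lstsq(H, f, rcond=None)
    return np.max(np.abs(H @ C - f))

for k in (5, 25, 200):
    print(f"width k = {k:3d}: sup error on K = {sup_error(k):.3f}")
</syntaxhighlight>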
Also, certain non-continuous activation functions can be used to approximate a sigmoid function, which then allows the above theorem to apply to those functions. For example, the [[step function]] works. In particular, this shows that a [[perceptron]] network with a single infinitely wide hidden layer can approximate arbitrary functions.
Such an <math>f</math> can also be approximated by a network of greater depth by using the same construction for the first layer and approximating the identity function with later layers.
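For example, with the ReLU activation <math>\sigma(x) = \max(x, 0)</math>, the identity can be represented exactly on any bounded range: for a constant <math>c</math> large enough that <math>x \ge -c</math> on the region of interest,
<math display="block">\sigma(x + c) - c = \max(x + c, 0) - c = x,</math>
so additional layers can pass the first layer's output through essentially unchanged (an illustrative identity, not taken from the cited sources).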
{{Math proof
| title = Proof sketch
| proof = It suffices to prove the case where <math>m = 1</math>, since uniform convergence in <math>\R^m</math> is just uniform convergence in each coordinate.
Let <math>F_\sigma</math> be the set of all one-hidden-layer neural networks constructed with <math>\sigma</math>. Let <math>C_0(\R^d, \R)</math> be the set of all <math>C(\R^d, \R)</math> with compact support.
If <math>\sigma</math> is a polynomial of degree <math>k</math>, then every member of <math>F_\sigma</math> is also a polynomial of degree at most <math>k</math>, so the closure of <math>F_\sigma</math> is contained in the space of such polynomials and cannot be all of <math>C_0(\R^d, \R)</math>.
Otherwise, we show that <math>F_\sigma</math>'s closure is all of <math>C_0(\R^d, \R)</math>. Suppose we can construct arbitrarily good approximations of the ramp function
<math display="block">r(x) = \begin{cases}
-1 & \text{if } x < -1 \\
\phantom{+}x & \text{if } |x|\leq 1 \\
+1 & \text{if } x > 1
\end{cases}</math>
then combinations of shifted and scaled copies of <math>r</math> can be used to approximate any function in <math>C_0(\R^d, \R)</math>; a sketch of this step is given after the proof.
The case where <math>\sigma</math> is a generic non-polynomial function is harder, and the reader is directed to Pinkus (1999).<ref name="pinkus" />}}
The above proof has not specified how one might use a ramp function to approximate arbitrary functions in <math>C_0(\R^n, \R)</math>. A sketch of the proof is that one can first construct flat bump functions, intersect them to obtain spherical bump functions that approximate the [[Dirac delta function]], then use those to approximate arbitrary functions in <math>C_0(\R^n, \R)</math>.
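For instance, with the ReLU activation the ramp function <math>r</math> is obtained exactly rather than approximately, and differences of shifted ramps give flat (plateau-shaped) bump functions; the following sketch (an illustration, not taken from the cited proofs) checks these identities numerically:
<syntaxhighlight lang="python">
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def ramp(x):
    # The ramp function r from the proof sketch, written exactly with ReLU:
    # relu(x + 1) - relu(x - 1) - 1 equals -1 for x < -1, x on [-1, 1], and 1 for x > 1.
    return relu(x + 1) - relu(x - 1) - 1

def flat_bump(x):
    # A flat bump built from two shifted ramps: equals 1 on [-1, 1], 0 outside [-3, 3],
    # and is linear in between.
    return (ramp(x + 2) - ramp(x - 2)) / 2

x = np.linspace(-4, 4, 9)    # the points -4, -3, ..., 4
print(ramp(x))               # -1 -1 -1 -1  0  1  1  1  1
print(flat_bump(x))          #  0  0  0.5  1  1  1  0.5  0  0
</syntaxhighlight>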
Notice also that the neural network is only required to approximate within a compact set <math>K</math>. The proof does not describe how the function would be extrapolated outside of the region.
The problem with polynomials may be removed by allowing the outputs of the hidden layers to be multiplied together (the "pi-sigma networks"), yielding the generalization:<ref name="MLP-UA" />
{{math_theorem
| name = Universal approximation theorem for pi-sigma networks|With any nonconstant activation function, a one-hidden-layer pi-sigma network is a universal approximator.
}}
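One way to formalize such a network (an illustrative reading of the construction, not a definition from the cited source) is as a weighted sum of products of hidden-unit outputs, which the following sketch evaluates:
<syntaxhighlight lang="python">
import numpy as np

def pi_sigma(x, A, B, c, sigma=np.tanh):
    # f(x) = sum_i c_i * prod_j sigma(A[i, j] . x + B[i, j]):
    # hidden-unit outputs are multiplied within each term, then summed with weights c.
    units = sigma(np.einsum('ijk,k->ij', A, x) + B)   # shape: (terms, factors per term)
    return c @ np.prod(units, axis=1)

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3, 2))   # 4 product terms, 3 factors each, input dimension 2
B = rng.normal(size=(4, 3))
c = rng.normal(size=4)
print(pi_sigma(np.array([0.5, -1.0]), A, B, c))
</syntaxhighlight>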
== Arbitrary-depth case ==
The "dual" versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem was proved for the arbitrary depth case by Zhou Lu et al. in 2017.<ref name=ZhouLu /> They showed that networks of width ''n'' + 4 with [[ReLU]] activation functions can approximate any [[Lebesgue integration|Lebesgue-integrable function]] on ''n''-dimensional input space with respect to [[L1 distance|<math>L^1</math> distance]] if network depth is allowed to grow. It was also shown that if the width was less than or equal to ''n'', this general expressive power to approximate any Lebesgue integrable function was lost. In the same paper<ref name=ZhouLu /> it was shown that [[ReLU]] networks with width ''n'' + 1 were sufficient to approximate any [[continuous function|continuous]] function of ''n''-dimensional input variables.<ref
{{math theorem
| math_statement = For any [[Bochner integral|Bochner–Lebesgue p-integrable]] function <math>f : \mathbb R^n \to \mathbb R^m</math> and any <math>\varepsilon > 0</math>, there exists a [[fully connected network|fully connected]] [[ReLU]] network <math>F</math> of width exactly <math>d_m = \max\{n + 1, m\}</math>, satisfying <math display="block">\int_{\mathbb{R}^n} \left\|f(x) - F(x)\right\|^p \, \mathrm{d}x < \varepsilon.</math> Moreover, there exists a function <math>f \in L^p(\mathbb{R}^n, \mathbb{R}^m)</math> and some <math>\varepsilon > 0</math>, for which there is no [[fully connected network|fully connected]] [[ReLU]] network of width less than <math>d_m = \max\{n + 1 ,m\}</math> satisfying the above approximation bound.
Remark: If the activation is replaced by leaky-ReLU, and the input is restricted in a compact ___domain, then the exact minimum width is<ref name=":1" /> <math>d_m = \max\{n, m, 2\}</math>.
''Quantitative refinement:'' In the case where <math>f : [0,1]^n \to \mathbb{R}</math> (i.e. <math>m = 1</math>) and <math>\sigma</math> is the [[Rectifier (neural networks)|ReLU]] activation function, the exact depth and width needed to achieve <math>\varepsilon</math> error are also known.<ref name=shen22 />
}}
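Structurally, such a network keeps every hidden layer at the fixed width <math>d_m = \max\{n+1, m\}</math> and grows only in depth. The following sketch (arbitrary, untrained weights; for illustration only) shows this shape:
<syntaxhighlight lang="python">
import numpy as np

def deep_narrow_relu(x, layers, C, c0):
    # Evaluate a fully connected ReLU network whose hidden layers all share one fixed width;
    # the (W, b) pairs in `layers` are applied in sequence, followed by an affine output
    # layer (C, c0). The theorem asserts suitable parameters exist, not how to find them.
    h = x
    for W, b in layers:
        h = np.maximum(W @ h + b, 0.0)
    return C @ h + c0

n, m = 3, 2                        # input and output dimensions
width = max(n + 1, m)              # minimum width from the theorem: max{n + 1, m}
rng = np.random.default_rng(3)
layers = [(rng.normal(size=(width, n)), rng.normal(size=width))]   # lift input to the fixed width
layers += [(rng.normal(size=(width, width)), rng.normal(size=width)) for _ in range(5)]
C, c0 = rng.normal(size=(m, width)), rng.normal(size=m)
print(deep_narrow_relu(rng.normal(size=n), layers, C, c0))
</syntaxhighlight>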
Together, the central result of Kidger and Lyons<ref name=kidger /> yields the following universal approximation theorem for networks with bounded width (see also<ref name=gripenberg /> for the first result of this kind).
{{math theorem
| math_statement = Let <math>\mathcal{X}</math> be a [[Compact set|compact subset]] of <math>\mathbb{R}^d</math>. Let <math>\sigma:\mathbb{R} \to \mathbb{R}</math> be any non-[[Affine transformation|affine]] [[Continuous function|continuous]] function which is [[Differentiable function#Differentiability classes|continuously differentiable]] at at least one point, with nonzero [[derivative]] at that point. Let <math>\mathcal{N}_{d,D:d+D+2}^\sigma</math> denote the space of feed-forward neural networks with <math>d</math> input neurons, <math>D</math> output neurons, and an arbitrary number of hidden layers each with <math>d + D + 2</math> neurons, such that every hidden neuron has activation function <math>\sigma</math> and every output neuron has the [[identity function|identity]] as its activation function, with input layer <math>\phi</math> and output layer <math>\rho</math>. Then given any <math>\varepsilon > 0</math> and any <math>f \in C(\mathcal{X}, \mathbb{R}^D)</math>, there exists <math>\hat{f} \in \mathcal{N}_{d,D:d+D+2}^\sigma</math> such that <math display="block">
\sup_{x \in \mathcal{X}} \left\|\hat{f}(x) - f(x)\right\| < \varepsilon.
</math>
''Quantitative refinement:'' The number of layers and the width of each layer required to approximate <math>f</math> to <math>\varepsilon</math> precision are known;<ref name="jmlr.org"/> moreover, the result holds true when <math>\mathcal{X}</math> and <math>\mathbb{R}^D</math> are replaced with any non-positively curved [[Riemannian manifold]].
}}
Certain necessary conditions for the bounded width, arbitrary depth case have been established, but there is still a gap between the known sufficient and necessary conditions.<ref name="ZhouLu" /><ref name=hanin /><ref name=johnson>{{cite conference |last=Johnson |first=Jesse |conference=International Conference on Learning Representations |date=2019 |url=https://openreview.net/forum?id=ryGgSsAcFQ |title=Deep, Skinny Neural Networks are not Universal Approximators}}</ref>
== Bounded depth and bounded width case ==
The first result on the approximation capabilities of neural networks with a bounded number of layers, each containing a limited number of artificial neurons, was obtained by Maiorov and Pinkus.<ref name=maiorov /> They showed that such networks can be universal approximators and that two hidden layers are enough to achieve this property.
{{math theorem
| math_statement= There exists an activation function <math>\sigma</math> which is analytic, strictly increasing and sigmoidal and has the following property: For any <math> f\in C[0,1]^{d}</math> and <math>
\varepsilon >0</math> there exist constants <math>d_{i}, c_{ij}, \theta _{ij}, \gamma _{i}</math>, and vectors <math> \mathbf{w}^{ij}\in \mathbb{R}^{d}</math> for which
<math display='block'> \left\vert f(\mathbf{x})-\sum_{i=1}^{6d+3} d_{i}\sigma \left(
\sum_{j=1}^{3d} c_{ij} \sigma (\mathbf{w}^{ij}\cdot \mathbf{x} - \theta_{ij}) - \gamma_{i}\right) \right\vert <\varepsilon </math>
for all <math> \mathbf{x}=(x_{1},...,x_{d})\in [0,1]^{d}</math>.
}}
This is an existence result. It says that activation functions providing the universal approximation property for bounded depth bounded width networks exist. Using certain algorithmic and computer programming techniques, Guliyev and Ismailov efficiently constructed such activation functions depending on a numerical parameter. The developed algorithm allows one to compute the activation functions at any point of the real axis instantly. For the algorithm and the corresponding computer code, see Guliyev and Ismailov.<ref name=guliyev1 /> The theoretical result can be formulated as follows.
{{math theorem
| math_statement = Let <math> [a,b]</math> be a finite segment of the real line, <math> s=b-a</math> and <math> \lambda</math> be any positive number. Then one can algorithmically construct a computable sigmoidal activation function <math> \sigma \colon \mathbb{R} \to \mathbb{R}</math>, which is infinitely differentiable, strictly increasing on <math> (-\infty, s) </math>, <math>\lambda</math>-strictly increasing on <math> [s,+\infty) </math>, and satisfies the following properties:
# For any continuous function <math>F</math> on the <math>d</math>-dimensional box <math>[a,b]^{d}</math> and <math>\varepsilon > 0</math>, there exist constants <math>e_p</math>, <math>c_{pq}</math>, <math>\theta_{pq}</math> and <math>\zeta_p</math> such that the inequality <math display='block'> \left| F(\mathbf{x}) - \sum_{p=1}^{2d+2} e_p \sigma \left( \sum_{q=1}^{d} c_{pq} \sigma(\mathbf{w}^{q} \cdot \mathbf{x} - \theta_{pq}) - \zeta_p \right) \right| < \varepsilon</math> holds for all <math>\mathbf{x} = (x_1, \ldots, x_d) \in [a, b]^{d}</math>. Here the weights <math>\mathbf{w}^{q}</math>, <math>q = 1, \ldots, d</math>, are fixed as follows: <math display='block'> \mathbf{w}^{1} = (1, 0, \ldots, 0), \quad \mathbf{w}^{2} = (0, 1, \ldots, 0), \quad \ldots, \quad \mathbf{w}^{d} = (0, 0, \ldots, 1). </math> In addition, all the coefficients <math>e_p</math>, except one, are equal.
}}
Here “<math> \sigma \colon \mathbb{R} \to \mathbb{R}</math> is <math>\lambda</math>-strictly increasing on some set <math>X</math>” means that there exists a strictly increasing function <math>u \colon X \to \mathbb{R}</math> such that <math>|\sigma(x) - u(x)| \le \lambda</math> for all <math>x \in X</math>. Clearly, a <math>\lambda</math>-increasing function behaves like a usual increasing function as <math>\lambda</math> gets small.
In the "''depth-width''" terminology, the above theorem says that for certain activation functions depth-<math>2</math> width-<math>2</math> networks are universal approximators for univariate functions and depth-<math>3</math> width-<math> (2d+2) </math> networks are universal approximators for <math>d</math>-variable functions (<math>d>1</math>).
== See also ==
* [[Kolmogorov–Arnold representation theorem]]

== References ==
{{Reflist}}
{{Differentiable computing}}
[[Category:Theorems in mathematical analysis]]
[[Category:Artificial neural networks]]
[[Category:Network architecture]]
[[Category:Networks]]
[[Category:Approximation theory]]