{{Technical|date=July 2023}}
 
In the [[mathematics|mathematical]] theory of [[artificial neural networks]], '''universal approximation theorems''' are theorems<ref name="MLP-UA">{{cite journal |last1=Hornik |first1=Kurt |last2=Stinchcombe |first2=Maxwell |last3=White |first3=Halbert |title=Multilayer feedforward networks are universal approximators |journal=Neural Networks |date=January 1989 |volume=2 |issue=5 |pages=359–366 |doi=10.1016/0893-6080(89)90020-8 }}</ref><ref>Balázs Csanád Csáji (2001) Approximation with Artificial Neural Networks; Faculty of Sciences; Eötvös Loránd University, Hungary</ref> of the following form: Given a family of neural networks, for each function <math>f</math> from a certain [[function space]], there exists a sequence of neural networks <math>\phi_1, \phi_2, \dots</math> from the family, such that <math>\phi_n \to f</math> according to some criterion. That is, the family of neural networks is [[Dense set|dense]] in the function space.
 
The most popular version states that [[feedforward neural network|feedforward networks]] with non-polynomial activation functions are dense in the space of continuous functions between two [[Euclidean space]]s, with respect to the [[compact convergence]] topology.
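Concretely, density with respect to this topology means that for every continuous function <math>f\colon \R^n \to \R^m</math>, every [[compact space|compact]] subset <math>K \subseteq \R^n</math>, and every <math>\varepsilon > 0</math>, there is a network <math>\phi</math> in the family satisfying <math>\sup_{x \in K} \|\phi(x) - f(x)\| < \varepsilon</math>.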
However, while some functions can be arbitrarily well approximated in a region, the proofs do not apply outside of that region, i.e. the approximated functions do not [[extrapolate]] beyond it. This applies to the non-periodic [[activation function]]s that are used in practice and assumed in most proofs.

In recent years, neocortical [[pyramidal neurons]] with an oscillating activation function, which can individually learn the XOR function, have been discovered in the human brain, and oscillating activation functions have been explored and shown to outperform popular activation functions on a variety of benchmarks.<ref>{{cite journal |last1=Gidon |first1=Albert |last2=Zolnik |first2=Timothy Adam |last3=Fidzinski |first3=Pawel |last4=Bolduan |first4=Felix |last5=Papoutsi |first5=Athanasia |last6=Poirazi |first6=Panayiota |last7=Holtkamp |first7=Martin |last8=Vida |first8=Imre |last9=Larkum |first9=Matthew Evan |title=Dendritic action potentials and computation in human layer 2/3 cortical neurons |journal=Science |date=3 January 2020 |volume=367 |issue=6473 |pages=83–87 |doi=10.1126/science.aax6239 |pmid=31896716 |bibcode=2020Sci...367...83G }}</ref>
 
Universal approximation theorems are existence theorems: They simply state that there ''exists'' such a sequence <math>\phi_1, \phi_2, \dots \to f</math>, and do not provide any way to actually find such a sequence. They also do not guarantee that any method, such as [[backpropagation]], will actually find such a sequence. Any method for searching the space of neural networks, including backpropagation, might find a converging sequence, or not (i.e. backpropagation might get stuck in a local optimum).
 
Universal approximation theorems are limit theorems: They state that for any <math>f</math> and any criterion of closeness <math>\epsilon > 0</math>, if there are ''enough'' neurons in a neural network, then there exists a neural network with that many neurons that approximates <math>f</math> to within <math>\epsilon</math>. There is no guarantee that any finite size, say, 10000 neurons, is enough.
 
== Setup ==
[[Artificial neural networks]] are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued [[vector (mathematics and physics)|vectors]] to real-valued [[vector (mathematics and physics)|vectors]]. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and its multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces.

Neural networks with an unbounded (non-polynomial) activation function have the universal approximation property.<ref>{{Cite journal |last1=Sonoda |first1=Sho |last2=Murata |first2=Noboru |date=September 2017 |title=Neural Network with Unbounded Activation Functions is Universal Approximator |journal=Applied and Computational Harmonic Analysis |volume=43 |issue=2 |pages=233–268 |doi=10.1016/j.acha.2015.12.005|arxiv=1505.03654 |s2cid=12149203 }}</ref>
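For example, a feedforward network with a single hidden layer of <math>N</math> neurons and activation function <math>\sigma</math> implements functions of the form
:<math>\phi(x) = \sum_{k=1}^{N} a_k\, \sigma\!\left(w_k \cdot x + b_k\right), \qquad x \in \R^n,</math>
where <math>w_k \in \R^n</math> and <math>a_k, b_k \in \R</math> are the adjustable parameters; this single-hidden-layer form is only an illustration, and the results below also cover deeper and otherwise structured architectures.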
 
 
== History ==
 
=== Classical results ===
One of the first versions of the ''arbitrary width'' case was proven by [[George Cybenko]] in 1989 for [[sigmoid function|sigmoid]] activation functions.<ref name="cyb">{{cite journal |citeseerx=10.1.1.441.7873 |doi=10.1007/BF02551274|title=Approximation by superpositions of a sigmoidal function|year=1989|last1=Cybenko|first1=G.|journal=Mathematics of Control, Signals, and Systems|volume=2|issue=4|pages=303–314|s2cid=3958369}}</ref> {{ill|Kurt Hornik|de}}, Maxwell Stinchcombe, and [[Halbert White]] showed in 1989 that multilayer [[feed-forward network]]s with as few as one hidden layer are universal approximators.<ref name="MLP-UA" /> Hornik also showed in 1991<ref name="horn">{{Cite journal|doi=10.1016/0893-6080(91)90009-T|title=Approximation capabilities of multilayer feedforward networks|year=1991|last1=Hornik|first1=Kurt|journal=Neural Networks|volume=4|issue=2|pages=251–257|s2cid=7343126 }}</ref> that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al'' in 1993<ref name="leshno">{{Cite journal|last1=Leshno|first1=Moshe|last2=Lin|first2=Vladimir Ya.|last3=Pinkus|first3=Allan|last4=Schocken|first4=Shimon|date=January 1993|title=Multilayer feedforward networks with a nonpolynomial activation function can approximate any function|journal=Neural Networks|volume=6|issue=6|pages=861–867|doi=10.1016/S0893-6080(05)80131-5|s2cid=206089312|url=http://archive.nyu.edu/handle/2451/14329 }}</ref> and later Allan Pinkus in 1999<ref name="pinkus">{{Cite journal|last=Pinkus|first=Allan|date=January 1999|title=Approximation theory of the MLP model in neural networks|journal=Acta Numerica|volume=8|pages=143–195|doi=10.1017/S0962492900002919|bibcode=1999AcNum...8..143P|s2cid=16800260 }}</ref> showed that the universal approximation property is equivalent to having a nonpolynomial activation function. In 2022, Shen Zuowei, Haizhao Yang, and Shijun Zhang<ref>{{cite journal |last1=Shen |first1=Zuowei |last2=Yang |first2=Haizhao |last3=Zhang |first3=Shijun |title=Optimal approximation rate of ReLU networks in terms of width and depth |journal=Journal de Mathématiques Pures et Appliquées |date=January 2022 |volume=157 |pages=101–135 |doi=10.1016/j.matpur.2021.07.009 |s2cid=232075797 |arxiv=2103.00502 }}</ref> obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.
 
Universal approximation theorems imply that neural networks can ''represent'' a wide variety of interesting functions with appropriate weights. On the other hand, they typically do not provide a construction for the weights, but merely state that such a construction is possible. To construct the weights, neural networks are trained, and they may or may not converge on correct weights (i.e. training can get stuck in a local optimum). If the network is too small (for the dimensions of the input data), then the universal approximation theorems do not apply, i.e. the network will not learn. What was once proven about the depth of a network, i.e. that a single hidden layer is enough, only applies for one dimension; in general such a network is too shallow. The width of a network is also an important [[hyperparameter]]. The choice of an [[activation function]] is also important: some works and proofs assume, e.g., that [[ReLU]] (or [[sigmoid function|sigmoid]]) is used, while some choices, such as a linear activation (or any polynomial), are known to ''not'' work.
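The following is a minimal, self-contained sketch of this training process (for illustration only; the target function, width, learning rate, and iteration count are arbitrary choices rather than values from the cited works). It fits a one-hidden-layer tanh network to <math>\sin(x)</math> by gradient descent; the theorems above guarantee that weights achieving a small error ''exist'' for sufficient width, not that such a training run will find them.

<syntaxhighlight lang="python">
# Illustrative sketch: fit a one-hidden-layer tanh network to sin(x) on [-pi, pi]
# by full-batch gradient descent. Width, learning rate and step count are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
N = 40                                          # hidden width (a hyperparameter)
x = np.linspace(-np.pi, np.pi, 200)[:, None]    # inputs, shape (200, 1)
y = np.sin(x)                                   # target values

# Parameters of phi(x) = sum_k a_k * tanh(w_k * x + b_k) + c
w = rng.normal(size=(1, N))
b = rng.normal(size=(1, N))
a = rng.normal(size=(N, 1)) * 0.1
c = np.zeros((1, 1))

lr = 0.01
for step in range(20000):
    h = np.tanh(x @ w + b)                      # hidden activations, shape (200, N)
    pred = h @ a + c                            # network outputs, shape (200, 1)
    err = pred - y
    # Gradients of the mean squared error with respect to each parameter
    grad_a = h.T @ err / len(x)
    grad_c = err.mean(axis=0, keepdims=True)
    dh = (err @ a.T) * (1.0 - h**2)             # backpropagate through tanh
    grad_w = x.T @ dh / len(x)
    grad_b = dh.mean(axis=0, keepdims=True)
    # Gradient descent step; it may or may not reach a good optimum
    a -= lr * grad_a
    c -= lr * grad_c
    w -= lr * grad_w
    b -= lr * grad_b

print("max |phi(x) - sin(x)| on the grid:", float(np.abs(pred - y).max()))
</syntaxhighlight>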
 
=== Modern results ===
Most universal approximation theorems can be parsed into two classes. The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons ("''arbitrary width''" case) and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("''arbitrary depth''" case). In addition to these two classes, there are also universal approximation theorems for neural networks with a bounded number of hidden layers and a limited number of neurons in each layer ("''bounded depth and bounded width''" case).
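Schematically, with <math>\sigma</math> applied componentwise and <math>A_1, \dots, A_L</math> affine maps, the ''arbitrary width'' class concerns networks with a single hidden layer whose width <math>N</math> may grow without bound, while the ''arbitrary depth'' class concerns compositions
:<math>\phi = A_L \circ \sigma \circ A_{L-1} \circ \cdots \circ \sigma \circ A_1</math>
in which every layer has width at most some fixed bound but the number of layers <math>L</math> may grow without bound.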
 
There are also a variety of results between [[non-Euclidean space]]s<ref name="NonEuclidean">{{Cite conference |last1=Kratsios |first1=Anastasis |last2=Bilokopytov |first2=Eugene |date=2020 |title=Non-Euclidean Universal Approximation |url=https://papers.nips.cc/paper/2020/file/786ab8c4d7ee758f80d57e65582e609d-Paper.pdf |publisher=Curran Associates |journal=Advances in Neural Information Processing Systems |volume=33}}</ref> and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the [[convolutional neural network]] (CNN) architecture,<ref>{{cite journal |last1=Zhou |first1=Ding-Xuan |year=2020 |title=Universality of deep convolutional neural networks |journal=[[Applied and Computational Harmonic Analysis]] |volume=48 |issue=2 |pages=787–794 |arxiv=1805.10769 |doi=10.1016/j.acha.2019.06.004 |s2cid=44113176}}</ref><ref>{{Cite journal |last1=Heinecke |first1=Andreas |last2=Ho |first2=Jinn |last3=Hwang |first3=Wen-Liang |year=2020 |title=Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets |journal=IEEE Signal Processing Letters |volume=27 |pages=1175–1179 |bibcode=2020ISPL...27.1175H |doi=10.1109/LSP.2020.3005051 |s2cid=220669183}}</ref> [[radial basis functions]],<ref>{{Cite journal |last1=Park |first1=J. |last2=Sandberg |first2=I. W. |year=1991 |title=Universal Approximation Using Radial-Basis-Function Networks |journal=Neural Computation |volume=3 |issue=2 |pages=246–257 |doi=10.1162/neco.1991.3.2.246 |pmid=31167308 |s2cid=34868087}}</ref> or neural networks with specific properties.<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |year=2021 |title=Universal Approximations of Invariant Maps by Neural Networks |journal=Constructive Approximation |volume=55 |pages=407–474 |arxiv=1804.10306 |doi=10.1007/s00365-021-09546-1 |s2cid=13745401}}</ref><ref>{{cite journal |last1=Zakwan |first1=Muhammad |last2=d’Angelo |first2=Massimiliano |last3=Ferrari-Trecate |first3=Giancarlo |date=2023 |title=Universal Approximation Property of Hamiltonian Deep Neural Networks |journal=IEEE Control Systems Letters |page=1 |arxiv=2303.12147 |doi=10.1109/LCSYS.2023.3288350 |s2cid=257663609}}</ref>
 
The ''arbitrary depth'' case was also studied by a number of authors such as Gustaf Gripenberg in 2003,<ref name= gripenberg >{{Cite journal|last1=Gripenberg|first1=Gustaf|date=June 2003|title= Approximation by neural networks with a bounded number of nodes at each level|journal= Journal of Approximation Theory |volume=122|issue=2|pages=260–266|doi= 10.1016/S0021-9045(03)00078-9 |doi-access=}}</ref> Dmitry Yarotsky,<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |title=Error bounds for approximations with deep ReLU networks |journal=Neural Networks |date=October 2017 |volume=94 |pages=103–114 |doi=10.1016/j.neunet.2017.07.002 |pmid=28756334 |arxiv=1610.01145 |s2cid=426133 }}</ref> Zhou Lu ''et al'' in 2017,<ref name="ZhouLu">{{cite journal |last1=Lu |first1=Zhou |last2=Pu |first2=Hongming |last3=Wang |first3=Feicheng |last4=Hu |first4=Zhiqiang |last5=Wang |first5=Liwei |title=The Expressive Power of Neural Networks: A View from the Width |journal=Advances in Neural Information Processing Systems |volume=30 |year=2017 |pages=6231–6239 |url=http://papers.nips.cc/paper/7203-the-expressive-power-of-neural-networks-a-view-from-the-width |publisher=Curran Associates |arxiv=1709.02540 }}</ref> Boris Hanin and Mark Sellke in 2018<ref name=hanin>{{cite arXiv |last1=Hanin|first1=Boris|last2=Sellke|first2=Mark|title=Approximating Continuous Functions by ReLU Nets of Minimal Width|eprint=1710.11278|class=stat.ML|date=2018}}</ref> who focused on neural networks with ReLU activation function. In 2020, Patrick Kidger and Terry Lyons<ref name=kidger>{{Cite conference|last1=Kidger|first1=Patrick|last2=Lyons|first2=Terry|date=July 2020|title=Universal Approximation with Deep Narrow Networks|arxiv=1905.08539|conference=Conference on Learning Theory}}</ref> extended those results to neural networks with ''general activation functions'' such as tanh, GeLU, or Swish, and in 2022, their result was made quantitative by Leonie Papon and Anastasis Kratsios<ref name="jmlr.org">{{Cite journal |last1=Kratsios |first1=Anastasis |last2=Papon |first2=Léonie |date=2022 |title=Universal Approximation Theorems for Differentiable Geometric Deep Learning |url=http://jmlr.org/papers/v23/21-0716.html |journal=Journal of Machine Learning Research |volume=23 |issue=196 |pages=1–73 |arxiv=2101.05390 }}</ref> who derived explicit depth estimates depending on the regularity of the target function and of the activation function.
random neural networks,<ref>{{Cite journal|doi=10.1109/72.737488|title=Function approximation with spiked random networks|year=1999|last1=Gelenbe|first1=Erol|last2=Mao|first2= Zhi Hong|last3=Li|first3=Yan D.|journal=IEEE Transactions on Neural Networks|volume=10|issue=1|pages=3–9|pmid=18252498 |url=https://zenodo.org/record/6817275 }}</ref> and alternative network architectures and topologies.<ref name="kidger" /><ref>{{Cite conference|last1=Lin|first1=Hongzhou|last2=Jegelka|first2=Stefanie|date=2018|title=ResNet with one-neuron hidden layers is a Universal Approximator|url=https://papers.nips.cc/paper/7855-resnet-with-one-neuron-hidden-layers-is-a-universal-approximator|publisher=Curran Associates|pages=6169–6178|journal=Advances in Neural Information Processing Systems |volume=30}}</ref>
 
In 2023, it was shown that a three-layer neural network can approximate any function (''continuous'' and ''discontinuous''),<ref>{{cite journal |last1=Ismailov |first1=Vugar E. |title=A three layer neural network can represent any multivariate function |journal=Journal of Mathematical Analysis and Applications |date=July 2023 |volume=523 |issue=1 |pages=127096 |doi=10.1016/j.jmaa.2023.127096 |arxiv=2012.03016 |s2cid=265100963 }}</ref> although the publication does not provide an efficient learning algorithm for approximating discontinuous functions.
 
The universal approximation property of width-bounded networks has been studied as a ''dual'' of classical universal approximation results on depth-bounded networks. For input dimension <math>d_x</math> and output dimension <math>d_y</math> <!-- In this work, we provide the first definitive result in this direction for networks using the ReLU activation functions: The --> the minimum width required for the universal approximation of the ''[[Lp space|L<sup>p</sup>]]'' functions is exactly <math>\max\{d_x + 1, d_y\}</math> (for a ReLU network). <!-- ReLU alone is not sufficient in general "In light of Theorem 2, is it impossible to approximate C(K, R dy) in general while maintaining width max{dx + 1, dy}? Theorem 3 shows that an additional activation comes to rescue." --> More generally this also holds if ''both'' ReLU and a [[step function|threshold activation function]] are used.<ref name="park" /> <!-- Ours (Theorem 4) L p (K, R dy ) conti. nonpoly‡ wmin ≤ max{dx + 2, dy + 1} -->
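For example, for a ReLU network with input dimension <math>d_x = 2</math> and output dimension <math>d_y = 3</math>, this minimum width is <math>\max\{2 + 1,\, 3\} = 3</math>.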
 
== Arbitrary-width case ==
 
The above proof has not specified how one might use a ramp function to approximate arbitrary functions in <math>C_0(\R^n, \R)</math>. A sketch of the proof is that one can first construct flat bump functions, intersect them to obtain spherical bump functions that approximate the [[Dirac delta function]], then use those to approximate arbitrary functions in <math>C_0(\R^n, \R)</math>.<ref>{{Cite journal |last=Nielsen |first=Michael A. |date=2015 |title=Neural Networks and Deep Learning |url=http://neuralnetworksanddeeplearning.com/ |language=en}}</ref> The original proofs, such as the one by Cybenko, use methods from functional analysis, including the [[Hahn–Banach theorem|Hahn-Banach]] and [[Riesz representation theorem|Riesz representation]] theorems.
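For instance, in one dimension and with a sigmoidal <math>\sigma</math> (a schematic special case), the difference of two shifted and sharpened units, <math>x \mapsto \sigma(w(x-a)) - \sigma(w(x-b))</math> with <math>a < b</math>, tends as <math>w \to \infty</math> to the indicator function of the interval <math>[a, b]</math>; combining such one-dimensional bumps across coordinates produces the localized bump functions referred to above.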
 
Notice also that the neural network is only required to approximate within a compact set <math>K</math>. The proof does not describe how the function would be extrapolated outside of the region.
 
The problem with polynomials may be removed by allowing the outputs of the hidden layers to be multiplied together (the "pi-sigma networks"), yielding the generalization:<ref name=":0" />