Universal approximation theorem

{{Technical|date=July 2023}}
 
In the field of [[machine learning]], the '''universal approximation theorems''' state that [[Artificial neural network|neural networks]] with a certain structure can, in principle, approximate any [[continuous function]] to any desired degree of accuracy. These theorems provide a mathematical justification for using neural networks, assuring researchers that a sufficiently large or deep network can model the complex, non-linear relationships often found in real-world data.<ref name="MLP-UA">{{cite journal |last1=Hornik |first1=Kurt |last2=Stinchcombe |first2=Maxwell |last3=White |first3=Halbert |date=January 1989 |title=Multilayer feedforward networks are universal approximators |journal=Neural Networks |volume=2 |issue=5 |pages=359–366 |doi=10.1016/0893-6080(89)90020-8 }}</ref><ref>Balázs Csanád Csáji (2001) Approximation with Artificial Neural Networks; Faculty of Sciences; Eötvös Loránd University, Hungary</ref> Formally, these theorems take the following form: given a family of neural networks, for each function <math>f</math> from a certain [[function space]], there exists a sequence of neural networks <math>\phi_1, \phi_2, \dots</math> from the family such that <math>\phi_n \to f</math> according to some criterion. That is, the family of neural networks is [[Dense set|dense]] in the function space.
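For example, a criterion that is often used (though not the only one) is uniform convergence on compact sets: <math>\phi_n \to f</math> then means that for every compact set <math>K</math> in the domain and every error tolerance <math>\epsilon > 0</math>, all sufficiently large <math>n</math> satisfy
<math display=block>\sup_{x \in K} \left\| f(x) - \phi_n(x) \right\| < \epsilon.</math>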
 
The most well-known version of the theorem applies to [[Feedforward neural network|feedforward networks]] with a single hidden layer. It states that if the layer's [[activation function]] is non-[[polynomial]] (which is true for common choices like the [[sigmoid function]] or [[Rectifier (neural networks)|ReLU]]), then the network can act as a "universal approximator." More precisely, such networks are dense in the space of [[continuous function]]s between two [[Euclidean space]]s, with respect to the [[compact convergence]] [[topology]]. Universality is achieved by increasing the number of neurons in the hidden layer, making the network "wider." Other versions of the theorem show that universality can also be achieved by keeping the network's width fixed but increasing its number of layers, making it "deeper."
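As a concrete illustration (a minimal sketch, not taken from the cited sources, and assuming the [[NumPy]] library), the following builds, rather than trains, a one-hidden-layer ReLU network in one dimension: each hidden unit is hinged at a grid point, and the output layer reproduces the piecewise-linear interpolant of the target function on that grid, so the maximum error shrinks as the hidden layer is made wider. The names <code>fit_relu_interpolant</code> and <code>phi</code> are arbitrary.

<syntaxhighlight lang="python">
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fit_relu_interpolant(f, a, b, width):
    """Build (not train) a one-hidden-layer ReLU network that linearly
    interpolates f at width + 1 equally spaced knots on [a, b]."""
    knots = np.linspace(a, b, width + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)                # slope of each segment
    coeffs = np.concatenate(([slopes[0]], np.diff(slopes)))  # change of slope at each knot
    def phi(x):
        # Hidden layer: one ReLU unit hinged at each knot except the last;
        # output layer: an affine combination of the hidden activations.
        hidden = relu(x[:, None] - knots[:-1][None, :])
        return values[0] + hidden @ coeffs
    return phi

target = np.sin
xs = np.linspace(0.0, 2 * np.pi, 2000)
for width in (4, 16, 64):
    phi = fit_relu_interpolant(target, 0.0, 2 * np.pi, width)
    print(width, np.max(np.abs(target(xs) - phi(xs))))  # error shrinks as the width grows
</syntaxhighlight>

Since every continuous function on a closed interval can be approximated arbitrarily well by piecewise-linear interpolants on a fine enough grid, widening the hidden layer drives the error to zero, which is the arbitrary-width statement in this one-dimensional special case.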
 
These results are [[Existence theorem|existence theorems]]: they guarantee that a network with the right structure ''exists'', but they do not provide a method for finding the network's parameters ([[Mathematical optimization|training]] it), nor do they specify exactly how large the network must be for a given function. Finding a suitable network remains a practical challenge that is typically addressed with optimization algorithms such as [[backpropagation]], and no such method is guaranteed to find a converging sequence of networks; for example, backpropagation can get stuck in a local optimum.
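The sketch below (again a minimal illustration assuming [[NumPy]], not taken from the cited sources; the names <code>train</code>, <code>width</code> and <code>lr</code> are arbitrary) fits a one-hidden-layer tanh network to a target function with plain gradient descent and backpropagation. The quality of the result depends on the width, the learning rate and the random initialisation, and nothing in the universal approximation theorems guarantees that this search succeeds.

<syntaxhighlight lang="python">
import numpy as np

# Sample the target function on [0, 2*pi]; this is the training grid.
xs = np.linspace(0.0, 2 * np.pi, 200)[:, None]
ys = np.sin(xs)

def train(width, steps=20000, lr=1e-2, seed=0):
    """Plain gradient descent on a one-hidden-layer tanh network.
    Returns the maximum absolute error on the training grid."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(1, width))
    b1 = np.zeros(width)
    W2 = rng.normal(size=(width, 1))
    b2 = np.zeros(1)
    for _ in range(steps):
        hidden = np.tanh(xs @ W1 + b1)               # forward pass
        pred = hidden @ W2 + b2
        err = pred - ys
        # Backpropagation: gradients of the (halved) mean squared error.
        gW2 = hidden.T @ err / len(xs)
        gb2 = err.mean(axis=0)
        dhid = (err @ W2.T) * (1.0 - hidden ** 2)    # tanh'(z) = 1 - tanh(z)^2
        gW1 = xs.T @ dhid / len(xs)
        gb1 = dhid.mean(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return np.max(np.abs(pred - ys))

for seed in range(3):  # different initialisations can give noticeably different errors
    print(seed, train(width=20, seed=seed))
</syntaxhighlight>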
 
Universal approximation theorems are also limit theorems: they state that for any <math>f</math> and any error tolerance <math>\epsilon > 0</math>, a network with sufficiently many neurons exists that approximates <math>f</math> to within <math>\epsilon</math>. There is no guarantee that any particular finite size, say 10,000 neurons, is enough.
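In symbols, writing (purely for illustration) <math>\mathcal{N}_k</math> for the networks of the family with at most <math>k</math> neurons and <math>\|\cdot\|</math> for whichever closeness criterion is used, the guarantee has the form
<math display=block>\forall f \,\forall \epsilon > 0 \,\exists k \,\exists \phi \in \mathcal{N}_k : \quad \| f - \phi \| < \epsilon,</math>
where the required size <math>k</math> depends on both <math>f</math> and <math>\epsilon</math> and may be arbitrarily large.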
 
== Setup ==
 
=== Arbitrary width ===
The first results concerned the ''arbitrary width'' case. [[George Cybenko]] in 1989 proved such a theorem for [[sigmoid function|sigmoid]] activation functions.<ref name="cyb">{{cite journal |citeseerx=10.1.1.441.7873 |doi=10.1007/BF02551274|title=Approximation by superpositions of a sigmoidal function|year=1989|last1=Cybenko|first1=G.|journal=Mathematics of Control, Signals, and Systems|volume=2|issue=4|pages=303–314|bibcode=1989MCSS....2..303C |s2cid=3958369}}</ref> {{ill|Kurt Hornik|de}}, Maxwell Stinchcombe, and [[Halbert White]] showed in 1989 that multilayer [[feed-forward network]]s with as few as one hidden layer are universal approximators.<ref name="MLP-UA" /> Hornik also showed in 1991<ref name="horn">{{Cite journal|doi=10.1016/0893-6080(91)90009-T|title=Approximation capabilities of multilayer feedforward networks|year=1991|last1=Hornik|first1=Kurt|journal=Neural Networks|volume=4|issue=2|pages=251–257|s2cid=7343126 }}</ref> that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al.'' in 1993<ref name="leshno">{{Cite journal|last1=Leshno|first1=Moshe|last2=Lin|first2=Vladimir Ya.|last3=Pinkus|first3=Allan|last4=Schocken|first4=Shimon|date=January 1993|title=Multilayer feedforward networks with a nonpolynomial activation function can approximate any function|journal=Neural Networks|volume=6|issue=6|pages=861–867|doi=10.1016/S0893-6080(05)80131-5|s2cid=206089312|url=http://archive.nyu.edu/handle/2451/14329 }}</ref> and later Allan Pinkus in 1999<ref name="pinkus">{{Cite journal|last=Pinkus|first=Allan|date=January 1999|title=Approximation theory of the MLP model in neural networks|journal=Acta Numerica|volume=8|pages=143–195|doi=10.1017/S0962492900002919|bibcode=1999AcNum...8..143P|s2cid=16800260 }}</ref> showed that the universal approximation property is equivalent to having a nonpolynomial activation function.
 
=== Arbitrary depth ===
 
== Arbitrary-width case ==
A universal approximation theorem formally states that a family of neural network functions is a [[dense set]] within a larger space of functions they are intended to approximate. That is, for any function <math>f</math> from a given function space, there exists a sequence of neural networks <math>\phi_1, \phi_2, \dots</math> from the family such that <math>\phi_n \to f</math> according to some criterion.<ref name="cyb" /><ref name="MLP-UA" />

A spate of papers in the 1980s–1990s, from [[George Cybenko]], {{ill|Kurt Hornik|de}}, and others, established several universal approximation theorems for arbitrary width and bounded depth.<ref>{{cite journal |last1=Funahashi |first1=Ken-Ichi |title=On the approximate realization of continuous mappings by neural networks |journal=Neural Networks |date=January 1989 |volume=2 |issue=3 |pages=183–192 |doi=10.1016/0893-6080(89)90003-8 }}</ref><ref name="MLP-UA" /><ref name="cyb" /><ref name="horn" /> See<ref>Haykin, Simon (1998). ''Neural Networks: A Comprehensive Foundation'', Volume 2, Prentice Hall. {{isbn|0-13-273350-1}}.</ref><ref>Hassoun, M. (1995) ''Fundamentals of Artificial Neural Networks'' MIT Press, p.&nbsp;48</ref><ref name="pinkus" /> for reviews. The following is the most often quoted:{{math_theorem
| name = Universal approximation theorem|Let <math>C(X, \mathbb{R}^m)</math> denote the set of [[continuous functions]] from a subset <math>X</math> of a Euclidean space <math>\mathbb{R}^n</math> to a Euclidean space <math>\mathbb{R}^m</math>. Let <math>\sigma \in C(\mathbb{R}, \mathbb{R})</math>. Note that <math>(\sigma \circ x)_i = \sigma(x_i)</math>, so <math>\sigma \circ x</math> denotes <math>\sigma</math> applied to each component of <math>x</math>.