[[Artificial neural networks]] are combinations of many simple mathematical functions that together implement more complicated functions from (typically) real-valued [[vector (mathematics and physics)|vectors]] to real-valued [[vector (mathematics and physics)|vectors]]. The space of multivariate functions that a given network can implement is determined by the structure of the network, the set of simple functions, and the network's parameters (its weights). A great deal of theoretical work has gone into characterizing these function spaces.
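For example (in one common notation), a feedforward network with a single hidden layer of <math>N</math> units and activation function <math>\sigma</math> implements functions of the form
<math display="block">f(x) = \sum_{i=1}^{N} a_i\,\sigma\!\left(w_i^{\mathsf T} x + b_i\right), \qquad x \in \mathbb{R}^d,</math>
where the vectors <math>w_i \in \mathbb{R}^d</math> and scalars <math>b_i, a_i \in \mathbb{R}</math> are the network's parameters.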
In the [[mathematics|mathematical]] theory of [[artificial neural networks]], '''universal approximation theorems''' are results<ref name=MLP-UA>{{cite journal |last1=Hornik |first1=Kurt |last2=Stinchcombe |first2=Maxwell |last3=White |first3=Halbert |title=Multilayer feedforward networks are universal approximators |journal=Neural Networks |date=January 1989 |volume=2 |issue=5 |pages=359–366 |doi=10.1016/0893-6080(89)90020-8 }}</ref><ref>Balázs Csanád Csáji (2001) Approximation with Artificial Neural Networks; Faculty of Sciences; Eötvös Loránd University, Hungary</ref> that characterize what classes of functions neural networks can theoretically learn. Specifically, given a class of functions generated by some algorithm or architecture, the theorems establish the [[dense set|density]] of the generated functions within a given function space of interest. Typically, these results concern the approximation capabilities of the [[feedforward neural network|feedforward architecture]] on the space of continuous functions between two [[Euclidean space]]s, with approximation measured in the [[compact convergence]] topology. It must be stressed that although a function can be approximated arbitrarily well on a given region, the proofs do not apply outside that region, i.e. the approximating networks do not [[extrapolate]] beyond it. This holds for all non-periodic [[activation function]]s, which are what is used in practice and what most proofs assume. In recent years, neocortical pyramidal neurons with an oscillating activation function, able to individually learn the XOR function, have been discovered in the human brain, and oscillating activation functions have been explored and shown to outperform popular activation functions on a variety of benchmarks.<ref>{{cite journal |last1=Gidon |first1=Albert |last2=Zolnik |first2=Timothy Adam |last3=Fidzinski |first3=Pawel |last4=Bolduan |first4=Felix |last5=Papoutsi |first5=Athanasia |last6=Poirazi |first6=Panayiota |last7=Holtkamp |first7=Martin |last8=Vida |first8=Imre |last9=Larkum |first9=Matthew Evan |title=Dendritic action potentials and computation in human layer 2/3 cortical neurons |journal=Science |date=3 January 2020 |volume=367 |issue=6473 |pages=83–87 |doi=10.1126/science.aax6239 |pmid=31896716 |bibcode=2020Sci...367...83G }}</ref>
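Concretely, the classical single-hidden-layer statement can be phrased informally as follows (notation as above): for every continuous function <math>f</math> on a compact set <math>K \subset \mathbb{R}^d</math> and every <math>\varepsilon > 0</math>, there exist a width <math>N</math> and parameters <math>a_i, w_i, b_i</math> such that
<math display="block">\sup_{x \in K}\,\left| f(x) - \sum_{i=1}^{N} a_i\,\sigma\!\left(w_i^{\mathsf T} x + b_i\right) \right| < \varepsilon,</math>
provided the activation function <math>\sigma</math> satisfies suitable conditions (for example, being sigmoidal or, more generally, non-polynomial). The bound says nothing about the behaviour of the network for <math>x \notin K</math>.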
However, there are also a variety of results between [[non-Euclidean space]]s<ref name=NonEuclidean>{{Cite conference|last1=Kratsios|first1=Anastasis|last2=Bilokopytov|first2=Eugene|date=2020|title=Non-Euclidean Universal Approximation|url=https://papers.nips.cc/paper/2020/file/786ab8c4d7ee758f80d57e65582e609d-Paper.pdf|publisher=Curran Associates|journal=Advances in Neural Information Processing Systems |volume=33}}</ref> and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the [[convolutional neural network]] (CNN) architecture,<ref>{{cite journal |doi=10.1016/j.acha.2019.06.004 |arxiv=1805.10769|title=Universality of deep convolutional neural networks|year=2020|last1=Zhou|first1=Ding-Xuan|journal=[[Applied and Computational Harmonic Analysis]]|volume=48|issue=2|pages=787–794|s2cid=44113176}}</ref><ref>{{Cite journal|doi = 10.1109/LSP.2020.3005051|title = Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets|year = 2020|last1 = Heinecke|first1 = Andreas|last2 = Ho|first2 = Jinn|last3 = Hwang|first3 = Wen-Liang|journal = IEEE Signal Processing Letters|volume = 27|pages = 1175–1179|bibcode = 2020ISPL...27.1175H|s2cid = 220669183}}</ref> [[radial basis functions]],<ref>{{Cite journal|doi=10.1162/neco.1991.3.2.246|title=Universal Approximation Using Radial-Basis-Function Networks|year=1991|last1=Park|first1=J.|last2=Sandberg|first2=I. W.|journal=Neural Computation|volume=3|issue=2|pages=246–257|pmid=31167308|s2cid=34868087}}</ref> or neural networks with specific properties.<ref>{{cite journal |doi=10.1007/s00365-021-09546-1|arxiv=1804.10306|title=Universal Approximations of Invariant Maps by Neural Networks|year=2021|last1=Yarotsky|first1=Dmitry|journal=Constructive Approximation|volume=55 |pages=407–474 |s2cid=13745401}}</ref><ref>{{cite journal |last1=Zakwan |first1=Muhammad |last2=d’Angelo |first2=Massimiliano |last3=Ferrari-Trecate |first3=Giancarlo |title=Universal Approximation Property of Hamiltonian Deep Neural Networks |journal=IEEE Control Systems Letters |date=2023 }}</ref>
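As one illustrative example of such an alternative family (the notation here is generic, not taken from the cited works), a radial basis function network implements functions of the form
<math display="block">f(x) = \sum_{i=1}^{N} a_i\,\varphi\!\left(\lVert x - c_i \rVert\right),</math>
where <math>\varphi</math> is a fixed radial kernel and the centres <math>c_i</math> and coefficients <math>a_i</math> are the parameters; universal approximation results for this family are analogous in spirit to those for feedforward networks.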
Universal approximation theorems imply that neural networks can ''represent'' a wide variety of interesting functions when given appropriate weights. On the other hand, they typically do not provide a construction for the weights, but merely state that such a construction is possible. In practice, the weights are found by training the network, which may or may not converge to suitable weights (for example, training can get stuck in a local optimum). If the network is too small for the dimensionality of the input data, the universal approximation theorems do not apply and the network will fail to learn the target function. Classical results showing that a single hidden layer suffices assume that the width may grow without bound; such a shallow network may need to be impractically wide, so both the depth and the width of a network are important [[hyperparameter]]s. The choice of [[activation function]] also matters: most proofs assume a specific activation function such as [[ReLU]] or [[sigmoid function|sigmoid]], while linear activation functions (and, more generally, any polynomial) are known ''not'' to work.
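As a minimal sketch of these points (assuming only NumPy; the hyperparameters and variable names below are illustrative and not taken from any of the cited works), the following trains a one-hidden-layer tanh network by gradient descent to approximate <math>\sin(x)</math> on a compact interval. The fit is only expected to be good on the training interval, not outside it:

<syntaxhighlight lang="python">
# Minimal illustrative sketch (NumPy only): fit a one-hidden-layer tanh network
# to sin(x) on the compact interval [-3, 3] by full-batch gradient descent.
# All hyperparameters below are ad hoc choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
WIDTH, LR, STEPS = 50, 0.05, 20000

# Training data are confined to a compact region; the universal approximation
# theorems say nothing about the fit outside it (no extrapolation guarantee).
x = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
y = np.sin(x)

# Parameters of  f(x) = a^T tanh(w x + b) + c
w = rng.normal(size=(1, WIDTH))
b = np.zeros((1, WIDTH))
a = 0.1 * rng.normal(size=(WIDTH, 1))
c = np.zeros((1, 1))

for _ in range(STEPS):
    h = np.tanh(x @ w + b)            # hidden activations, shape (200, WIDTH)
    err = (h @ a + c) - y             # residual, used as the output error signal
    grad_a = h.T @ err / len(x)
    grad_c = err.mean(axis=0, keepdims=True)
    dh = (err @ a.T) * (1.0 - h**2)   # tanh'(z) = 1 - tanh(z)^2
    grad_w = x.T @ dh / len(x)
    grad_b = dh.mean(axis=0, keepdims=True)
    a -= LR * grad_a
    c -= LR * grad_c
    w -= LR * grad_w
    b -= LR * grad_b

def net(t):
    return np.tanh(t @ w + b) @ a + c

x_out = np.linspace(6.0, 9.0, 50).reshape(-1, 1)   # outside the training region
print("max error on [-3, 3]:", float(np.max(np.abs(net(x) - y))))
print("max error on [6, 9]: ", float(np.max(np.abs(net(x_out) - np.sin(x_out)))))
</syntaxhighlight>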
One of the first versions of the ''arbitrary width'' case was proven by [[George Cybenko]] in 1989 for [[sigmoid function|sigmoid]] activation functions.<ref name=cyb>{{cite journal |citeseerx=10.1.1.441.7873 |doi=10.1007/BF02551274|title=Approximation by superpositions of a sigmoidal function|year=1989|last1=Cybenko|first1=G.|journal=Mathematics of Control, Signals, and Systems|volume=2|issue=4|pages=303–314|s2cid=3958369}}</ref> {{ill|Kurt Hornik|de}}, Maxwell Stinchcombe, and [[Halbert White]] showed in 1989 that multilayer [[feed-forward network]]s with as few as one hidden layer are universal approximators.<ref name="MLP-UA" /> Hornik also showed in 1991<ref name=horn>{{Cite journal|doi=10.1016/0893-6080(91)90009-T|title=Approximation capabilities of multilayer feedforward networks|year=1991|last1=Hornik|first1=Kurt|journal=Neural Networks|volume=4|issue=2|pages=251–257|s2cid=7343126 }}</ref> that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al'' in 1993<ref name=leshno>{{Cite journal|last1=Leshno|first1=Moshe|last2=Lin|first2=Vladimir Ya.|last3=Pinkus|first3=Allan|last4=Schocken|first4=Shimon|date=January 1993|title=Multilayer feedforward networks with a nonpolynomial activation function can approximate any function|journal=Neural Networks|volume=6|issue=6|pages=861–867|doi=10.1016/S0893-6080(05)80131-5|s2cid=206089312|url=http://archive.nyu.edu/handle/2451/14329 }}</ref> and later Allan Pinkus in 1999<ref name=pinkus>{{Cite journal|last=Pinkus|first=Allan|date=January 1999|title=Approximation theory of the MLP model in neural networks|journal=Acta Numerica|volume=8|pages=143–195|doi=10.1017/S0962492900002919|bibcode=1999AcNum...8..143P|s2cid=16800260 }}</ref> showed that the universal approximation property is equivalent to having a nonpolynomial activation function. In 2022, Shen Zuowei, Haizhao Yang, and Shijun Zhang<ref>{{cite journal |last1=Shen |first1=Zuowei |last2=Yang |first2=Haizhao |last3=Zhang |first3=Shijun |title=Optimal approximation rate of ReLU networks in terms of width and depth |journal=Journal de Mathématiques Pures et Appliquées |date=January 2022 |volume=157 |pages=101–135 |doi=10.1016/j.matpur.2021.07.009 |s2cid=232075797 |arxiv=2103.00502 }}</ref> obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.
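In the form given by Pinkus, this characterization reads: for a continuous activation function <math>\sigma</math>, the family of single-hidden-layer networks
<math display="block">\operatorname{span}\left\{\, x \mapsto \sigma\!\left(w^{\mathsf T} x + b\right) : w \in \mathbb{R}^d,\ b \in \mathbb{R} \,\right\}</math>
is dense in the space of continuous functions on every compact subset of <math>\mathbb{R}^d</math> (with respect to the uniform norm) if and only if <math>\sigma</math> is not a polynomial.<ref name=pinkus />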
The ''arbitrary depth'' case was also studied by a number of authors, such as Gustaf Gripenberg in 2003,<ref name=gripenberg>{{Cite journal|last1=Gripenberg|first1=Gustaf|date=June 2003|title=Approximation by neural networks with a bounded number of nodes at each level|journal=Journal of Approximation Theory|volume=122|issue=2|pages=260–266|doi=10.1016/S0021-9045(03)00078-9}}</ref> Dmitry Yarotsky,<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |title=Error bounds for approximations with deep ReLU networks |journal=Neural Networks |date=October 2017 |volume=94 |pages=103–114 |doi=10.1016/j.neunet.2017.07.002 |pmid=28756334 |arxiv=1610.01145 |s2cid=426133 }}</ref> Zhou Lu ''et al.'' in 2017,<ref name="ZhouLu">{{cite journal |last1=Lu |first1=Zhou |last2=Pu |first2=Hongming |last3=Wang |first3=Feicheng |last4=Hu |first4=Zhiqiang |last5=Wang |first5=Liwei |title=The Expressive Power of Neural Networks: A View from the Width |journal=Advances in Neural Information Processing Systems |volume=30 |year=2017 |pages=6231–6239 |url=http://papers.nips.cc/paper/7203-the-expressive-power-of-neural-networks-a-view-from-the-width |publisher=Curran Associates |arxiv=1709.02540 }}</ref> and Boris Hanin and Mark Sellke in 2018,<ref name=hanin>{{cite arXiv |last1=Hanin|first1=Boris|last2=Sellke|first2=Mark|title=Approximating Continuous Functions by ReLU Nets of Minimal Width|eprint=1710.11278|class=stat.ML|date=2018}}</ref> the latter two works focusing on neural networks with the ReLU activation function. In 2020, Patrick Kidger and Terry Lyons<ref name=kidger>{{Cite conference|last1=Kidger|first1=Patrick|last2=Lyons|first2=Terry|date=July 2020|title=Universal Approximation with Deep Narrow Networks|arxiv=1905.08539|conference=Conference on Learning Theory}}</ref> extended those results to neural networks with ''general activation functions'' such as tanh, GeLU, or Swish. In 2022, their result was made quantitative by Leonie Papon and Anastasis Kratsios,<ref name="jmlr.org">{{Cite journal |last1=Kratsios |first1=Anastasis |last2=Papon |first2=Léonie |date=2022 |title=Universal Approximation Theorems for Differentiable Geometric Deep Learning |url=http://jmlr.org/papers/v23/21-0716.html |journal=Journal of Machine Learning Research |volume=23 |issue=196 |pages=1–73 |arxiv=2101.05390 }}</ref> who derived explicit depth estimates depending on the regularity of the target function and of the activation function.
The question of minimal possible width for universality was first studied in 2021, when Park ''et al.'' obtained the minimum width required for the universal approximation of ''[[Lp space|L<sup>p</sup>]]'' functions using feed-forward neural networks with [[Rectifier (neural networks)|ReLU]] as the activation function.<ref name="park">{{Cite conference |last1=Park |first1=Sejun |last2=Yun |first2=Chulhee |last3=Lee |first3=Jaeho |last4=Shin |first4=Jinwoo |date=2021 |title=Minimum Width for Universal Approximation |conference=International Conference on Learning Representations |arxiv=2006.08859}}</ref> Similar results, directly applicable to [[residual neural network]]s, were obtained in the same year by Paulo Tabuada and Bahman Gharesifard using [[Control theory|control-theoretic]] arguments.<ref>{{Cite conference |last1=Tabuada |first1=Paulo |last2=Gharesifard |first2=Bahman |date=2021 |title=Universal approximation power of deep residual neural networks via nonlinear control theory |conference=International Conference on Learning Representations |arxiv=2007.06007}}</ref><ref>{{cite journal |last1=Tabuada |first1=Paulo |last2=Gharesifard |first2=Bahman |title=Universal Approximation Power of Deep Residual Neural Networks Through the Lens of Control |journal=IEEE Transactions on Automatic Control |date=May 2023 |volume=68 |issue=5 |pages=2715–2728 |doi=10.1109/TAC.2022.3190051 |s2cid=250512115 }}</ref> In 2023, Cai<ref name=":1">{{Cite journal |last=Cai |first=Yongqiang |date=2023-02-01 |title=Achieve the Minimum Width of Neural Networks for Universal Approximation |url=https://openreview.net/forum?id=hfUJ4ShyDEU |journal=ICLR |arxiv=2209.11395 |language=en}}</ref> obtained the optimal minimum width bound for universal approximation.
random neural networks,<ref>{{Cite journal|doi=10.1109/72.737488|title=Function approximation with spiked random networks|year=1999|last1=Gelenbe|first1=Erol|last2=Mao|first2= Zhi Hong|last3=Li|first3=Yan D.|journal=IEEE Transactions on Neural Networks|volume=10|issue=1|pages=3–9|pmid=18252498 |url=https://zenodo.org/record/6817275 }}</ref> and alternative network architectures and topologies.<ref name="kidger" /><ref>{{Cite conference|last1=Lin|first1=Hongzhou|last2=Jegelka|first2=Stefanie|date=2018|title=ResNet with one-neuron hidden layers is a Universal Approximator|url=https://papers.nips.cc/paper/7855-resnet-with-one-neuron-hidden-layers-is-a-universal-approximator|publisher=Curran Associates|pages=6169–6178|journal=Advances in Neural Information Processing Systems |volume=30}}</ref>
In 2023, it was shown that a three-layer neural network can approximate any function, whether continuous or discontinuous.<ref>{{cite journal |last1=Ismailov |first1=Vugar E. |title=A three layer neural network can represent any multivariate function |journal=Journal of Mathematical Analysis and Applications |date=July 2023 |volume=523 |issue=1 |pages=127096 |doi=10.1016/j.jmaa.2023.127096 |arxiv=2012.03016 |s2cid=265100963 }}</ref>
== Arbitrary-width case ==