== History ==
 
=== Arbitrary width ===
The first examples were the ''arbitrary width'' case. [[George Cybenko]] in 1989 proved it for [[sigmoid function|sigmoid]] activation functions.<ref name="cyb">{{cite journal |citeseerx=10.1.1.441.7873 |doi=10.1007/BF02551274|title=Approximation by superpositions of a sigmoidal function|year=1989|last1=Cybenko|first1=G.|journal=Mathematics of Control, Signals, and Systems|volume=2|issue=4|pages=303–314|s2cid=3958369}}</ref> {{ill|Kurt Hornik|de}}, Maxwell Stinchcombe, and [[Halbert White]] showed in 1989 that multilayer [[feed-forward network]]s with as few as one hidden layer are universal approximators.<ref name="MLP-UA" /> Hornik also showed in 1991<ref name="horn">{{Cite journal|doi=10.1016/0893-6080(91)90009-T|title=Approximation capabilities of multilayer feedforward networks|year=1991|last1=Hornik|first1=Kurt|journal=Neural Networks|volume=4|issue=2|pages=251–257|s2cid=7343126 }}</ref> that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al'' in 1993<ref name="leshno">{{Cite journal|last1=Leshno|first1=Moshe|last2=Lin|first2=Vladimir Ya.|last3=Pinkus|first3=Allan|last4=Schocken|first4=Shimon|date=January 1993|title=Multilayer feedforward networks with a nonpolynomial activation function can approximate any function|journal=Neural Networks|volume=6|issue=6|pages=861–867|doi=10.1016/S0893-6080(05)80131-5|s2cid=206089312|url=http://archive.nyu.edu/handle/2451/14329 }}</ref> and later Allan Pinkus in 1999<ref name="pinkus">{{Cite journal|last=Pinkus|first=Allan|date=January 1999|title=Approximation theory of the MLP model in neural networks|journal=Acta Numerica|volume=8|pages=143–195|doi=10.1017/S0962492900002919|bibcode=1999AcNum...8..143P|s2cid=16800260 }}</ref> showed that the universal approximation property is equivalent to having a nonpolynomial activation function.
 
=== Arbitrary depth ===
The ''arbitrary depth'' case was also studied by a number of authors, such as Gustaf Gripenberg in 2003,<ref name="gripenberg">{{Cite journal|last1=Gripenberg|first1=Gustaf|date=June 2003|title= Approximation by neural networks with a bounded number of nodes at each level|journal= Journal of Approximation Theory |volume=122|issue=2|pages=260–266|doi= 10.1016/S0021-9045(03)00078-9 |doi-access=}}</ref> Dmitry Yarotsky,<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |title=Error bounds for approximations with deep ReLU networks |journal=Neural Networks |date=October 2017 |volume=94 |pages=103–114 |doi=10.1016/j.neunet.2017.07.002 |pmid=28756334 |arxiv=1610.01145 |s2cid=426133 }}</ref> Zhou Lu ''et al'' in 2017,<ref name="ZhouLu">{{cite journal |last1=Lu |first1=Zhou |last2=Pu |first2=Hongming |last3=Wang |first3=Feicheng |last4=Hu |first4=Zhiqiang |last5=Wang |first5=Liwei |title=The Expressive Power of Neural Networks: A View from the Width |journal=Advances in Neural Information Processing Systems |volume=30 |year=2017 |pages=6231–6239 |url=http://papers.nips.cc/paper/7203-the-expressive-power-of-neural-networks-a-view-from-the-width |publisher=Curran Associates |arxiv=1709.02540 }}</ref> and Boris Hanin and Mark Sellke in 2018,<ref name="hanin">{{cite arXiv |last1=Hanin|first1=Boris|last2=Sellke|first2=Mark|title=Approximating Continuous Functions by ReLU Nets of Minimal Width|eprint=1710.11278|class=stat.ML|date=2018}}</ref> who focused on neural networks with the ReLU activation function. In 2020, Patrick Kidger and Terry Lyons<ref name="kidger">{{Cite conference|last1=Kidger|first1=Patrick|last2=Lyons|first2=Terry|date=July 2020|title=Universal Approximation with Deep Narrow Networks|arxiv=1905.08539|conference=Conference on Learning Theory}}</ref> extended those results to neural networks with ''general activation functions'' such as tanh, GeLU, or Swish. In 2022, their result was made quantitative by Leonie Papon and Anastasis Kratsios,<ref name="jmlr.org">{{Cite journal |last1=Kratsios |first1=Anastasis |last2=Papon |first2=Léonie |date=2022 |title=Universal Approximation Theorems for Differentiable Geometric Deep Learning |url=http://jmlr.org/papers/v23/21-0716.html |journal=Journal of Machine Learning Research |volume=23 |issue=196 |pages=1–73 |arxiv=2101.05390 }}</ref> who derived explicit depth estimates depending on the regularity of the target function and of the activation function.
 
=== Bounded depth and bounded width ===
The bounded depth and bounded width case was first studied by Maiorov and Pinkus in 1999.<ref name="maiorov">{{Cite journal|last1=Maiorov|first1=Vitaly|last2=Pinkus|first2=Allan|date=April 1999|title=Lower bounds for approximation by MLP neural networks|journal=Neurocomputing|volume=25|issue=1–3|pages=81–91|doi=10.1016/S0925-2312(98)00111-8}}</ref> They showed that there exists an analytic sigmoidal activation function such that two-hidden-layer neural networks with a bounded number of units in the hidden layers are universal approximators.
 
Guliyev and Ismailov<ref name="guliyev1">{{Cite journal |last1=Guliyev |first1=Namig |last2=Ismailov |first2=Vugar |date=November 2018 |title=Approximation capability of two hidden layer feedforward neural networks with fixed weights |journal=Neurocomputing |volume=316 |pages=262–269 |arxiv=2101.09181 |doi=10.1016/j.neucom.2018.07.075 |s2cid=52285996}}</ref> constructed a smooth sigmoidal activation function providing the universal approximation property for two-hidden-layer feedforward neural networks with fewer units in the hidden layers.
 
Guliyev and Ismailov also constructively proved that single-hidden-layer networks with bounded width are still universal approximators for univariate functions, but this property no longer holds for multivariable functions.<ref name="guliyev2">{{Cite journal|last1=Guliyev|first1=Namig|last2=Ismailov|first2=Vugar|date=February 2018|title=On the approximation by single hidden layer feedforward neural networks with fixed weights|journal=Neural Networks|volume=98| pages=296–304|doi=10.1016/j.neunet.2017.12.007|pmid=29301110 |arxiv=1708.06219 |s2cid=4932839 }}</ref>
 
In 2023, Vugar Ismailov<ref>{{cite journal |last1=Ismailov |first1=Vugar E. |date=July 2023 |title=A three layer neural network can represent any multivariate function |journal=Journal of Mathematical Analysis and Applications |volume=523 |issue=1 |pages=127096 |arxiv=2012.03016 |doi=10.1016/j.jmaa.2023.127096 |s2cid=265100963}}</ref> showed that a three-layer neural network can approximate any function (''continuous'' and ''discontinuous'').
 
In 2022, Zuowei Shen, Haizhao Yang, and Shijun Zhang<ref>{{cite journal |last1=Shen |first1=Zuowei |last2=Yang |first2=Haizhao |last3=Zhang |first3=Shijun |date=January 2022 |title=Optimal approximation rate of ReLU networks in terms of width and depth |journal=Journal de Mathématiques Pures et Appliquées |volume=157 |pages=101–135 |arxiv=2103.00502 |doi=10.1016/j.matpur.2021.07.009 |s2cid=232075797}}</ref> obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.
 
=== Quantitative bounds ===
The question of minimal possible width for universality was first studied in 2021 by Park ''et al'', who obtained the minimum width required for the universal approximation of ''[[Lp space|L<sup>p</sup>]]'' functions using feed-forward neural networks with [[Rectifier (neural networks)|ReLU]] as activation functions.<ref name="park">{{Cite conference |last1=Park |first1=Sejun |last2=Yun |first2=Chulhee |last3=Lee |first3=Jaeho |last4=Shin |first4=Jinwoo |date=2021 |title=Minimum Width for Universal Approximation |conference=International Conference on Learning Representations |arxiv=2006.08859}}</ref> Similar results that can be directly applied to [[residual neural network]]s were also obtained in the same year by Paulo Tabuada and Bahman Gharesifard using [[Control theory|control-theoretic]] arguments.<ref>{{Cite conference |last1=Tabuada |first1=Paulo |last2=Gharesifard |first2=Bahman |date=2021 |title=Universal approximation power of deep residual neural networks via nonlinear control theory |conference=International Conference on Learning Representations |arxiv=2007.06007}}</ref><ref>{{cite journal |last1=Tabuada |first1=Paulo |last2=Gharesifard |first2=Bahman |date=May 2023 |title=Universal Approximation Power of Deep Residual Neural Networks Through the Lens of Control |journal=IEEE Transactions on Automatic Control |volume=68 |issue=5 |pages=2715–2728 |doi=10.1109/TAC.2022.3190051 |s2cid=250512115 }}{{Erratum|doi=10.1109/TAC.2024.3390099|checked=yes}}</ref> In 2023, Cai<ref name=":1">{{Cite journal |last=Cai |first=Yongqiang |date=2023-02-01 |title=Achieve the Minimum Width of Neural Networks for Universal Approximation |url=https://openreview.net/forum?id=hfUJ4ShyDEU |journal=ICLR |language=en |arxiv=2209.11395}}</ref> obtained the optimal minimum width bound for universal approximation.
 
For the arbitrary depth case, Leonie Papon and Anastasis Kratsios<ref name="jmlr.org" /> derived explicit depth estimates depending on the regularity of the target function and of the activation function.
 
=== Kolmogorov network ===
The [[Kolmogorov–Arnold representation theorem]] is similar in spirit. Indeed, certain neural network families can directly apply the Kolmogorov–Arnold theorem to yield a universal approximation theorem.
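
In its standard form, the representation theorem states that every continuous function <math>f\colon[0,1]^n\to\mathbb{R}</math> can be written as a superposition of continuous univariate functions,
:<math>f(x_1,\ldots,x_n)=\sum_{q=0}^{2n}\Phi_q\!\left(\sum_{p=1}^{n}\varphi_{q,p}(x_p)\right),</math>
which already has the shape of a two-hidden-layer network built from univariate maps; universal approximation results in this spirit replace the exact inner and outer functions by learnable approximants.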
 
Robert Hecht-Nielsen<ref>{{Cite journal |last=Hecht-Nielsen |first=Robert |date=1987 |title=Kolmogorov's mapping neural network existence theorem |url=https://cir.nii.ac.jp/crid/1572543025788928512 |journal=Proceedings of International Conference on Neural Networks, 1987 |volume=3 |pages=11–13}}</ref> showed in 1987 that a three-layer neural network can approximate any continuous multivariate function. This was extended to the discontinuous case by Ismailov in 2023.<ref>{{cite journal |last1=Ismailov |first1=Vugar E. |date=July 2023 |title=A three layer neural network can represent any multivariate function |journal=Journal of Mathematical Analysis and Applications |volume=523 |issue=1 |pages=127096 |arxiv=2012.03016 |doi=10.1016/j.jmaa.2023.127096 |s2cid=265100963}}</ref>
 
Liu ''et al''<ref>{{Citation |last=Liu |first=Ziming |title=KAN: Kolmogorov-Arnold Networks |date=2024-05-24 |url=http://arxiv.org/abs/2404.19756 |access-date=2024-06-03 |doi=10.48550/arXiv.2404.19756 |last2=Wang |first2=Yixuan |last3=Vaidya |first3=Sachin |last4=Ruehle |first4=Fabian |last5=Halverson |first5=James |last6=Soljačić |first6=Marin |last7=Hou |first7=Thomas Y. |last8=Tegmark |first8=Max}}</ref> present Kolmogorov–Arnold Networks (KANs), a practical application of this idea.
 
=== Variants ===
Several variants of the theorem exist, covering discontinuous activation functions,<ref name="leshno" /> noncompact domains,<ref name="kidger" /> certifiable networks,<ref>{{cite conference |last1=Baader |first1=Maximilian |last2=Mirman |first2=Matthew |last3=Vechev |first3=Martin |date=2020 |title=Universal Approximation with Certified Networks |url=https://openreview.net/forum?id=B1gX8kBtPr |conference=ICLR}}</ref> random neural networks,<ref>{{Cite journal |last1=Gelenbe |first1=Erol |last2=Mao |first2=Zhi Hong |last3=Li |first3=Yan D. |year=1999 |title=Function approximation with spiked random networks |url=https://zenodo.org/record/6817275 |journal=IEEE Transactions on Neural Networks |volume=10 |issue=1 |pages=3–9 |doi=10.1109/72.737488 |pmid=18252498}}</ref> and alternative network architectures and topologies.<ref name="kidger" /><ref>{{Cite conference |last1=Lin |first1=Hongzhou |last2=Jegelka |first2=Stefanie |date=2018 |title=ResNet with one-neuron hidden layers is a Universal Approximator |url=https://papers.nips.cc/paper/7855-resnet-with-one-neuron-hidden-layers-is-a-universal-approximator |publisher=Curran Associates |pages=6169–6178 |journal=Advances in Neural Information Processing Systems |volume=30}}</ref>
 
The universal approximation property of width-bounded networks has been studied as a ''dual'' of classical universal approximation results on depth-bounded networks. For input dimension <math>d_x</math> and output dimension <math>d_y</math>, the minimum width required for the universal approximation of ''[[Lp space|L<sup>p</sup>]]'' functions is exactly <math>\max\{d_x + 1, d_y\}</math> (for a ReLU network). <!-- ReLU alone is not sufficient in general "In light of Theorem 2, is it impossible to approximate C(K, R dy) in general while maintaining width max{dx + 1, dy}? Theorem 3 shows that an additional activation comes to rescue." --> More generally, this also holds if ''both'' ReLU and a [[step function|threshold activation function]] are used.<ref name="park" />
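
As an illustration of the width bound (a minimal sketch assuming a PyTorch-style API; the helper names below are invented for this example and are not taken from the cited papers), the following constructs a maximally narrow ReLU network of hidden width <math>\max\{d_x + 1, d_y\}</math>, so that any additional expressive power must come from depth rather than width:

<syntaxhighlight lang="python">
# Minimal illustrative sketch: a deep, narrow ReLU network whose hidden
# width equals the max{d_x + 1, d_y} bound. Assumes PyTorch is available;
# the helper names are chosen for this example only.
import torch.nn as nn

def min_width(d_x: int, d_y: int) -> int:
    """Minimum hidden width for L^p universal approximation with ReLU."""
    return max(d_x + 1, d_y)

def narrow_relu_net(d_x: int, d_y: int, depth: int) -> nn.Sequential:
    """Build a ReLU network of the given depth at the minimum width.

    Universality is obtained by letting `depth` grow; the width stays fixed.
    """
    w = min_width(d_x, d_y)
    layers = [nn.Linear(d_x, w), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(w, w), nn.ReLU()]
    layers.append(nn.Linear(w, d_y))
    return nn.Sequential(*layers)

# Example: d_x = 2, d_y = 1 gives hidden width max{2 + 1, 1} = 3.
net = narrow_relu_net(d_x=2, d_y=1, depth=8)
</syntaxhighlight>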
 
Universal function approximation on graphs (or rather on [[Graph isomorphism|graph isomorphism classes]]) by popular [[Graph neural network|graph convolutional neural networks]] (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test.<ref name="PowerGNNs">{{Cite conference |last1=Xu |first1=Keyulu |last2=Hu |first2=Weihua |last3=Leskovec |first3=Jure |last4=Jegelka |first4=Stefanie |date=2019 |title=How Powerful are Graph Neural Networks? |url=https://openreview.net/forum?id=ryGs6iA5Km |journal=International Conference on Learning Representations}}</ref> In 2020, Brüel-Gabrielsson established a universal approximation theorem result,<ref name="UniversalGraphs">{{Cite conference |last1=Brüel-Gabrielsson |first1=Rickard |date=2020 |title=Universal Function Approximation on Graphs |url=https://proceedings.neurips.cc//paper/2020/hash/e4acb4c86de9d2d9a41364f93951028d-Abstract.html |publisher=Curran Associates |volume=33 |journal=Advances in Neural Information Processing Systems}}</ref> showing that a graph representation with certain injective properties is sufficient for universal function approximation on bounded graphs and restricted universal function approximation on unbounded graphs, with an accompanying <math>\mathcal O(\left|V\right| \cdot \left|E\right|)</math>-runtime method that performed at state of the art on a collection of benchmarks (where <math>V</math> and <math>E</math> are the sets of nodes and edges of the graph respectively).
 
There are also a variety of results between [[non-Euclidean space]]s<ref name="NonEuclidean">{{Cite conference |last1=Kratsios |first1=Anastasis |last2=Bilokopytov |first2=Eugene |date=2020 |title=Non-Euclidean Universal Approximation |url=https://papers.nips.cc/paper/2020/file/786ab8c4d7ee758f80d57e65582e609d-Paper.pdf |publisher=Curran Associates |volume=33 |journal=Advances in Neural Information Processing Systems}}</ref> and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the [[convolutional neural network]] (CNN) architecture,<ref>{{cite journal |last1=Zhou |first1=Ding-Xuan |year=2020 |title=Universality of deep convolutional neural networks |journal=[[Applied and Computational Harmonic Analysis]] |volume=48 |issue=2 |pages=787–794 |arxiv=1805.10769 |doi=10.1016/j.acha.2019.06.004 |s2cid=44113176}}</ref><ref>{{Cite journal |last1=Heinecke |first1=Andreas |last2=Ho |first2=Jinn |last3=Hwang |first3=Wen-Liang |year=2020 |title=Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets |journal=IEEE Signal Processing Letters |volume=27 |pages=1175–1179 |bibcode=2020ISPL...27.1175H |doi=10.1109/LSP.2020.3005051 |s2cid=220669183}}</ref> [[radial basis functions]],<ref>{{Cite journal |last1=Park |first1=J. |last2=Sandberg |first2=I. W. |year=1991 |title=Universal Approximation Using Radial-Basis-Function Networks |journal=Neural Computation |volume=3 |issue=2 |pages=246–257 |doi=10.1162/neco.1991.3.2.246 |pmid=31167308 |s2cid=34868087}}</ref> or neural networks with specific properties.<ref>{{cite journal |last1=Yarotsky |first1=Dmitry |year=2021 |title=Universal Approximations of Invariant Maps by Neural Networks |journal=Constructive Approximation |volume=55 |pages=407–474 |arxiv=1804.10306 |doi=10.1007/s00365-021-09546-1 |s2cid=13745401}}</ref><ref>{{cite journal |last1=Zakwan |first1=Muhammad |last2=d’Angelo |first2=Massimiliano |last3=Ferrari-Trecate |first3=Giancarlo |date=2023 |title=Universal Approximation Property of Hamiltonian Deep Neural Networks |journal=IEEE Control Systems Letters |page=1 |arxiv=2303.12147 |doi=10.1109/LCSYS.2023.3288350 |s2cid=257663609}}</ref>
 
== Arbitrary-width case ==
Here “<math> \sigma \colon \mathbb{R} \to \mathbb{R}</math> is <math>\lambda</math>-strictly increasing on some set <math>X</math>” means that there exists a strictly increasing function <math>u \colon X \to \mathbb{R}</math> such that <math>|\sigma(x) - u(x)| \le \lambda</math> for all <math>x \in X</math>. Clearly, a <math>\lambda</math>-increasing function behaves like a usual increasing function as <math>\lambda</math> gets small.
In the "''depth-width''" terminology, the above theorem says that for certain activation functions depth-<math>2</math> width-<math>2</math> networks are universal approximators for univariate functions and depth-<math>3</math> width-<math> (2d+2) </math> networks are universal approximators for <math>d</math>-variable functions (<math>d>1</math>).
 
== See also ==