Universal approximation theorem: Difference between revisions

fix citations
mNo edit summary
 
(2 intermediate revisions by 2 users not shown)
Line 28:
In 2018, Guliyev and Ismailov<ref name="guliyev1">{{Cite journal |last1=Guliyev |first1=Namig |last2=Ismailov |first2=Vugar |date=November 2018 |title=Approximation capability of two hidden layer feedforward neural networks with fixed weights |journal=Neurocomputing |volume=316 |pages=262–269 |arxiv=2101.09181 |doi=10.1016/j.neucom.2018.07.075 |s2cid=52285996}}</ref> constructed a smooth sigmoidal activation function providing the universal approximation property for two hidden layer feedforward neural networks with fewer units in the hidden layers. In 2018, they also constructed<ref name="guliyev2">{{Cite journal|last1=Guliyev|first1=Namig|last2=Ismailov|first2=Vugar|date=February 2018|title=On the approximation by single hidden layer feedforward neural networks with fixed weights|journal=Neural Networks|volume=98| pages=296–304|doi=10.1016/j.neunet.2017.12.007|pmid=29301110 |arxiv=1708.06219 |s2cid=4932839 }}</ref> single hidden layer networks with bounded width that are still universal approximators for univariate functions. However, this does not apply to multivariable functions.
 
In 2022, Shen ''et al.''<ref name=shen22>{{cite journal |last1=Shen |first1=Zuowei |last2=Yang |first2=Haizhao |last3=Zhang |first3=Shijun |date=January 2022 |title=Optimal approximation rate of ReLU networks in terms of width and depth |journal=Journal de Mathématiques Pures et Appliquées |volume=157 |pages=101–135 |arxiv=2103.00502 |doi=10.1016/j.matpur.2021.07.009 |s2cid=232075797}}</ref> obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.
 
=== Quantitative bounds ===
Line 45:
random neural networks,<ref>{{Cite journal |last1=Gelenbe |first1=Erol |last2=Mao |first2=Zhi Hong |last3=Li |first3=Yan D. |year=1999 |title=Function approximation with spiked random networks |url=https://zenodo.org/record/6817275 |journal=IEEE Transactions on Neural Networks |volume=10 |issue=1 |pages=3–9 |doi=10.1109/72.737488 |pmid=18252498}}</ref> and alternative network architectures and topologies.<ref name="kidger" /><ref>{{Cite conference |last1=Lin |first1=Hongzhou |last2=Jegelka |first2=Stefanie|author2-link=Stefanie Jegelka |date=2018 |title=ResNet with one-neuron hidden layers is a Universal Approximator |url=https://papers.nips.cc/paper/7855-resnet-with-one-neuron-hidden-layers-is-a-universal-approximator |publisher=Curran Associates |volume=30 |pages=6169–6178 |journal=Advances in Neural Information Processing Systems}}</ref>
 
The universal approximation property of width-bounded networks has been studied as a ''dual'' of classical universal approximation results on depth-bounded networks. For input dimension <math>d_x</math> and output dimension <math>d_y</math>, the minimum width required for the universal approximation of ''[[Lp space|L<sup>p</sup>]]'' functions is exactly <math>\max\{d_x + 1, d_y\}</math> (for a ReLU network). <!-- ReLU alone is not sufficient in general "In light of Theorem 2, is it impossible to approximate <math>C(K, \mathbb{R}^{d_y})</math> in general while maintaining width <math>\max\{d_x + 1, d_y\}</math>? Theorem 3 shows that an additional activation comes to rescue." --> More generally this also holds if ''both'' ReLU and a [[step function|threshold activation function]] are used.<ref name="park" />
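
As a small illustration of the width formula above, the following Python sketch evaluates <math>\max\{d_x + 1, d_y\}</math> for a few dimension pairs. The function name <code>min_width_relu_lp</code> is hypothetical and chosen only for this example; it encodes the stated bound and does not construct a network.

```python
def min_width_relu_lp(d_x: int, d_y: int) -> int:
    """Minimum width max{d_x + 1, d_y} quoted above for ReLU networks
    approximating L^p functions (hypothetical helper, illustration only)."""
    if d_x < 1 or d_y < 1:
        raise ValueError("dimensions must be positive integers")
    return max(d_x + 1, d_y)


# Evaluate the bound for a few (input dimension, output dimension) pairs.
for d_x, d_y in [(1, 1), (3, 1), (2, 5)]:
    print(d_x, d_y, min_width_relu_lp(d_x, d_y))
```

When the output dimension is small, the input dimension sets the width; when <math>d_y > d_x + 1</math>, the output dimension does.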
 
For function approximation on graphs (or rather on [[Graph isomorphism|graph isomorphism classes]]), popular [[Graph neural network|graph convolutional neural networks]] (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test.<ref name="PowerGNNs">{{Cite conference |last1=Xu |first1=Keyulu |last2=Hu |first2=Weihua |last3=Leskovec |first3=Jure |last4=Jegelka |first4=Stefanie|author4-link=Stefanie Jegelka |date=2019 |title=How Powerful are Graph Neural Networks? |url=https://openreview.net/forum?id=ryGs6iA5Km |journal=International Conference on Learning Representations}}</ref> In 2020, Brüel-Gabrielsson<ref name="UniversalGraphs">{{Cite conference |last1=Brüel-Gabrielsson |first1=Rickard |date=2020 |title=Universal Function Approximation on Graphs |url=https://proceedings.neurips.cc//paper/2020/hash/e4acb4c86de9d2d9a41364f93951028d-Abstract.html |publisher=Curran Associates |volume=33 |journal=Advances in Neural Information Processing Systems}}</ref> established a universal approximation result, showing that a graph representation with certain injective properties is sufficient for universal function approximation on bounded graphs and restricted universal function approximation on unbounded graphs, with an accompanying <math>\mathcal O(\left|V\right| \cdot \left|E\right|)</math>-runtime method that performed at the state of the art on a collection of benchmarks (where <math>V</math> and <math>E</math> are the sets of nodes and edges of the graph respectively).
Line 101:
 
== Arbitrary-depth case ==
The "dual" versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem was proved for the arbitrary depth case by Zhou Lu et al. in 2017.<ref name=ZhouLu /> They showed that networks of width ''n''&nbsp;+&nbsp;4 with [[ReLU]] activation functions can approximate any [[Lebesgue integration|Lebesgue-integrable function]] on ''n''-dimensional input space with respect to [[L1 distance|<math>L^1</math> distance]] if network depth is allowed to grow. It was also shown that if the width is less than or equal to ''n'', this general expressive power to approximate any Lebesgue-integrable function is lost. In the same paper<ref name=ZhouLu /> it was shown that [[ReLU]] networks with width ''n''&nbsp;+&nbsp;1 are sufficient to approximate any [[continuous function|continuous]] function of ''n''-dimensional input variables.<ref>Hanin, B. (2018). [[arxiv:1710.11278|Approximating Continuous Functions by ReLU Nets of Minimal Width]]. arXiv preprint arXiv:1710.11278.</ref> The following refinement, due to Park et al., specifies the optimal minimum width for which such an approximation is possible.<ref>{{Cite journal |last1=Park |first1=Sejun |last2=Yun |first2=Chulhee |last3=Lee |first3=Jaeho |last4=Shin |first4=Jinwoo |date=2020-09-28 |title=Minimum Width for Universal Approximation |url=https://openreview.net/forum?id=O-XJwyoIF-k |journal=ICLR |arxiv=2006.08859 |language=en}}</ref>
 
{{math theorem
Line 111:
Remark: If the activation is replaced by leaky-ReLU, and the input is restricted to a compact domain, then the exact minimum width is<ref name=":1" /> <math>d_m = \max\{n, m, 2\}</math>.
 
''Quantitative refinement:'' In the case where <math>f:[0, 1]^n \rightarrow \mathbb{R} </math> (i.e. <math> m = 1 </math>) and <math>\sigma</math> is the [[Rectifier (neural networks)|ReLU activation function]], the exact depth and width required for a ReLU network to achieve error <math>\varepsilon</math> are also known.<ref name=shen22 /> If, moreover, the target function <math>f</math> is smooth, then the required number of layers and their widths can be exponentially smaller.<ref>{{cite journal |last1=Lu |first1=Jianfeng |last2=Shen |first2=Zuowei |last3=Yang |first3=Haizhao |last4=Zhang |first4=Shijun |title=Deep Network Approximation for Smooth Functions |journal = SIAM Journal on Mathematical Analysis |date=January 2021 |volume=53 |issue=5 |pages=5465–5506 |doi=10.1137/20M134695X |arxiv=2001.03040 |s2cid=210116459 }}</ref> Even if <math>f</math> is not smooth, the curse of dimensionality can be broken if <math>f</math> admits additional "compositional structure".<ref>{{Cite journal |last1=Juditsky |first1=Anatoli B. |last2=Lepski |first2=Oleg V. |last3=Tsybakov |first3=Alexandre B. 
|date=2009-06-01 |title=Nonparametric estimation of composite functions |journal=The Annals of Statistics |volume=37 |issue=3 |doi=10.1214/08-aos611 |s2cid=2471890 |issn=0090-5364|doi-access=free |arxiv=0906.0865 }}</ref><ref>{{Cite journal |last1=Poggio |first1=Tomaso |last2=Mhaskar |first2=Hrushikesh |last3=Rosasco |first3=Lorenzo |last4=Miranda |first4=Brando |last5=Liao |first5=Qianli |date=2017-03-14 |title=Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review |journal=International Journal of Automation and Computing |volume=14 |issue=5 |pages=503–519 |doi=10.1007/s11633-017-1054-2 |s2cid=15562587 |issn=1476-8186|doi-access=free |arxiv=1611.00740 }}</ref>
}}
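
The role of the width threshold in the arbitrary-depth case can be seen in a toy one-dimensional example (''n''&nbsp;=&nbsp;1). The sketch below is an illustration under stated assumptions, not the construction used in the cited papers: the helper names <code>width1_net</code>, <code>width2_abs</code> and <code>is_monotone</code> and the sample weights are hypothetical. A ReLU network whose hidden layers each hold a single neuron is a composition of monotone maps and is therefore monotone at any depth, whereas two ReLU neurons already represent the non-monotone function <math>|x|</math> exactly.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def width1_net(x, layers, out):
    """ReLU network with one neuron per hidden layer (hypothetical helper).

    layers: list of (weight, bias) pairs, one pair per hidden layer.
    out: (weight, bias) of the final affine output map.
    """
    h = x
    for a, b in layers:
        h = relu(a * h + b)
    c, d = out
    return c * h + d


def width2_abs(x):
    """Two ReLU neurons in one hidden layer: relu(x) + relu(-x) = |x|."""
    return relu(x) + relu(-x)


def is_monotone(values):
    """True if the sequence is non-decreasing or non-increasing."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)


grid = [i / 10 for i in range(-10, 11)]

# Arbitrary sample weights for a depth-3, width-1 network (assumption).
narrow = [width1_net(x, [(2.0, 0.5), (-1.0, 1.0), (3.0, -0.2)], (1.5, 0.0))
          for x in grid]
wide = [width2_abs(x) for x in grid]

print(is_monotone(narrow))  # width 1: monotone on the grid
print(is_monotone(wide))    # width 2: |x| is not monotone
```

Because every width-1 network is monotone, no choice of depth or weights lets it uniformly approximate <math>|x|</math> on <math>[-1, 1]</math>; this matches the statement above that width at most ''n'' is insufficient.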