The case where <math>\sigma</math> is a generic non-polynomial function is harder, and the reader is directed to the reference for a proof.<ref name="pinkus" />}}
The above proof does not specify how one might use a ramp function to approximate arbitrary functions in <math>C_0(\R^n, \R)</math>. A sketch of such a construction is to first build flat bump functions, intersect them to obtain spherical bump functions that approximate the [[Dirac delta function]], and then use those to approximate arbitrary functions in <math>C_0(\R^n, \R)</math>.<ref>{{Cite book |last=Nielsen |first=Michael A. |date=2015 |title=Neural Networks and Deep Learning |url=http://neuralnetworksanddeeplearning.com/ |language=en}}</ref> The original proofs, such as the one by Cybenko, use methods from functional analysis, including the [[Hahn–Banach theorem|Hahn–Banach]] and [[Riesz–Markov–Kakutani representation theorem|Riesz–Markov–Kakutani representation]] theorems. Cybenko first published the theorem in a technical report in 1988,<ref>G. Cybenko, "Continuous Valued Neural Networks with Two Hidden Layers are Sufficient", Technical Report, Department of Computer Science, Tufts University, 1988.</ref> then as a paper in 1989.<ref name="cyb" />
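For illustration, the following sketch (in Python with NumPy; it is a one-dimensional simplification of the bump-function idea above, not a construction taken from the cited sources, and the function names and parameters are illustrative) approximates a continuous function on the compact interval [0, 1] by a one-hidden-layer network of ramp units, combined three at a time into triangular bump functions whose output weights are the target values at the bump centers:

<syntaxhighlight lang="python">
import numpy as np

def relu(z):
    """Ramp function max(0, z)."""
    return np.maximum(0.0, z)

def bump(x, center, h):
    """Triangular bump of height 1 at `center` and half-width `h`,
    built from three ramp units (a piecewise-linear analogue of a bump function)."""
    return (relu(x - (center - h)) - 2.0 * relu(x - center) + relu(x - (center + h))) / h

def shallow_relu_approximation(f, x, n_bumps=50):
    """Approximate f on [0, 1] by a weighted sum of triangular bumps on a uniform grid,
    i.e. a one-hidden-layer ReLU network with 3 * n_bumps hidden units."""
    centers = np.linspace(0.0, 1.0, n_bumps)
    h = centers[1] - centers[0]
    # The output-layer weights are simply the target values at the bump centers.
    return sum(f(c) * bump(x, c, h) for c in centers)

if __name__ == "__main__":
    f = lambda x: np.sin(2.0 * np.pi * x)      # target function on the compact set [0, 1]
    x = np.linspace(0.0, 1.0, 1000)
    approx = shallow_relu_approximation(f, x, n_bumps=50)
    print("max error:", np.max(np.abs(f(x) - approx)))  # decreases as n_bumps grows
</syntaxhighlight>

Because the weighted sum of triangular bumps is exactly the piecewise-linear interpolant of the target at the grid points, the uniform error shrinks as the number of hidden units increases, matching the qualitative statement of the theorem on a compact set.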
Notice also that the neural network is only required to approximate the target function within a compact set <math>K</math>. The proof does not describe how the approximation behaves outside this region.