Mathematics of neural networks in machine learning: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 16:31, 20 August 2019 edit Michael Hardy (talk \| contribs) Administrators 210,577 edits No edit summary ← Previous edit		Latest revision as of 18:13, 30 June 2025 edit undo 1blulere (talk \| contribs) 17 edits m Update main template with correct article name; if anyone else wishes to normalise 'ANN' -> 'neural network' throughout the article, feel free to Do so Tag: Visual edit
(24 intermediate revisions by 18 users not shown)
Line 1: {{~~Main~~Short description\|~~Artificial~~Type ~~neural~~of network}} {{Main\|Neural network (machine learning)}} An '''artificial neural network''' (ANN) or '''neural network''' combines biological principles with advanced statistics to solve problems in domains such as [[pattern recognition]] and game-play. ANNs adopt the basic model of neuron analogues connected to each other in a variety of ways. == Structure == === Neuron === A neuron with label <math>j</math> receiving an input <math>p_j(t)</math> from predecessor neurons consists of the following components:<ref name="Zell1994ch5.2">{{Cite book~~\|url=http://worldcat.org/oclc/249017987~~\|title=Simulation neuronaler Netze\|last=Zell\|first=Andreas\|date=2003\|publisher=Addison-Wesley\|isbn=978-3-89319-554-1\|edition=1st\|language=German\|trans-title=Simulation of Neural Networks\|chapter=chapter 5.2\|oclc=249017987}}</ref> * an ''activation'' <math>a_j(t)</math>, the neuron's state, depending on a discrete time parameter, Line 12 ⟶ 13: * an ''activation function'' <math>f</math> that computes the new activation at a given time <math>t+1</math> from <math>a_j(t)</math>, <math>\theta_j</math> and the net input <math>p_j(t)</math> giving rise to the relation :: <math> a_j(t+1) = f(a_j(t), p_j(t), \theta_j), </math> * and an ''output function'' <math>f_\text{out}</math> computing the output from the activation :: <math> o_j(t) = f_\text{out}(a_j(t)). </math> Often the output function is simply the [[identity function]]. Line 23 ⟶ 24: === Propagation function === The ''propagation function'' computes the ''input'' <math>p_j(t)</math> to the neuron <math>j</math> from the outputs <math>o_i(t)</math>and typically has the form<ref name="Zell1994ch5.222">{{Cite book\|url=http://worldcat.org/oclc/249017987\|title=Simulation neuronaler Netze\|last=Zell\|first=Andreas\|date=2003\|publisher=Addison-Wesley\|isbn=978-3-89319-554-1\|edition=1st\|language=German\|trans-title=Simulation of Neural Networks\|chapter=chapter 5.2\|oclc=249017987}}</~~ref~~> : <math> p_j(t) = \~~sum_{i}~~sum_i o_i(t) w_{ij}. </math> === Bias === A bias term can be added, changing the form to the following:<ref name="DAWSON1998">{{cite journal\|last1=DAWSON\|first1=CHRISTIAN W\|year=1998\|title=An artificial neural network approach to rainfall-runoff modelling\|journal=Hydrological Sciences Journal\|volume=43\|issue=1\|pages=47–66\|doi=10.1080/02626669809492102\|doi-access=free\|bibcode=1998HydSJ..43...47D }}</ref> : <math> p_j(t) = \sum_i o_i(t) w_{ij}+ w_{0j}, </math> where <math>w_{0j}</math> is a bias. == Neural networks as functions == {{See also\|Graphical models}} ~~{{See also\|Graphical models}}Neural network models can be viewed as defining a function that takes an input (observation) and produces an output (decision).~~ Neural network models can be viewed as defining a function that takes an input (observation) and produces an output (decision) <math>\textstyle f : X \rightarrow Y </math> or a distribution over <math>\textstyle X</math> or both <math>\textstyle X</math> and <math>\textstyle Y</math>. Sometimes models are intimately associated with a particular learning rule. A common use of the phrase "ANN model" is really the definition of a ''class'' of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons, number of layers or their connectivity). Mathematically, a neuron's network function <math>\textstyle f(x)</math> is defined as a composition of other functions <math>\textstyle g_i(x)</math>, that can further be decomposed into other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between functions. A widely used type of composition is the ''nonlinear weighted sum'', where <math>\textstyle f (x) = K \left(\sum_i w_i g_i(x)\right) </math>, where <math>\textstyle K</math> (commonly referred to as the [[activation function]]<ref>{{Cite web\|url=http://www.cse.unsw.edu.au/~billw/mldict.html#activnfn\|title=The Machine Learning Dictionary\|website=www.cse.unsw.edu.au\|access-date=2019-08-18\|archive-url=https://web.archive.org/web/20180826151959/http://www.cse.unsw.edu.au/~billw/mldict.html#activnfn\|archive-date=2018-08-26\|url-status=dead}}</ref>) is some predefined function, such as the [[Hyperbolic function#Standard analytic expressions\|hyperbolic tangent]], [[sigmoid function]], [[softmax function]], or [[ReLU\|rectifier function]]. The important characteristic of the activation function is that it provides a smooth transition as input values change, i.e. a small change in input produces a small change in output. The following refers to a collection of functions <math>\textstyle g_i</math> as a [[Vector (mathematics and physics)\|vector]] <math>\textstyle g = (g_1, g_2, \ldots, g_n)</math>. [[~~File:Ann_dependency_(graph).svg\|link=https://en.wikipedia.org/wiki/~~File:Ann_dependency_(graph).svg\|thumb\|150x150px\|ANN dependency graph]] This figure depicts such a decomposition of <math>\textstyle f</math>, with dependencies between variables indicated by arrows. These can be interpreted in two ways. Line 46 ⟶ 47: The two views are largely equivalent. In either case, for this particular architecture, the components of individual layers are independent of each other (e.g., the components of <math>\textstyle g</math> are independent of each other given their input <math>\textstyle h</math>). This naturally enables a degree of parallelism in the implementation. [[~~File:Recurrent_ann_dependency_graph.png\|link=https://en.wikipedia.org/wiki/~~File:Recurrent_ann_dependency_graph.png\|thumb\|134x134px\|Two separate depictions of the recurrent ANN dependency graph]] Networks such as the previous one are commonly called [[Feedforward neural network\|feedforward]], because their graph is a [[directed acyclic graph]]. Networks with [[Cycle (graph theory)\|cycles]] are commonly called [[Recurrent neural network\|recurrent]]. Such networks are commonly depicted in the manner shown at the top of the figure, where <math>\textstyle f</math> is shown as dependent upon itself. However, an implied temporal dependence is not shown. Line 52 ⟶ 53: Backpropagation training algorithms fall into three categories: * [[Gradient descent\|steepest descent]] (with variable [[learning rate]] and [[Gradient descent~~#The~~ ~~momentum~~with ~~method~~momentum\|momentum]], [[Rprop\|resilient backpropagation]]); * quasi-Newton ([[Broyden–Fletcher–Goldfarb–Shanno algorithm\|Broyden–Fletcher–Goldfarb–Shanno]], [[Secant method\|one step secant]]); * [[Levenberg–Marquardt algorithm\|Levenberg–Marquardt]] and [[Conjugate gradient method\|conjugate gradient]] (Fletcher–Reeves update, Polak–Ribiére update, Powell–Beale restart, scaled conjugate gradient).<ref>{{cite conference\|author1=M. Forouzanfar\|author2=H. R. Dajani\|author3=V. Z. Groza\|author4=M. Bolic\|author5=S. Rajan\|~~last~~name-~~author~~list-~~amp~~style=~~yes~~amp\|date=July 2010\|title=Comparison of Feed-Forward Neural Network Training Algorithms for Oscillometric Blood Pressure Estimation\|url=https://www.researchgate.net/publication/224173336\|conference=4th Int. Workshop Soft Computing Applications\|___location=Arad, Romania\|publisher=IEEE}}</ref> === Algorithm === Let <math>N</math> be a network with <math>e</math> connections, <math>m</math> inputs and <math>n</math> outputs. Below, <math>x_1,x_2,\dots</math> ~~denotes~~denote vectors in <math>\mathbb{R}^m</math>, <math>y_1,y_2,\dots</math> vectors in <math>\mathbb{R}^n</math>, and <math>w_0, w_1, w_2, \ldots</math> vectors in <math>\mathbb{R}^e</math>. These are called ''inputs'', ''outputs'' and ''weights'', respectively. The network corresponds to a function <math>y = f_N(w, x)</math> which, given a weight <math>w</math>, maps an input <math>x</math> to an output <math>y</math>. In [[supervised learning]], a sequence of ''[[Training set\|training examples]]'' <math>(x_1,y_1), \dots, (x_p, y_p)</math> produces a sequence of weights <math>w_0, w_1, \dots, w_p</math> starting from some initial weight <math>w_0</math>, usually chosen at random. These weights are computed in turn: first compute <math>w_i</math> using only <math>(x_i, y_i, w_{i-1})</math> for <math>i = 1, \dots, p</math>. The output of the algorithm is then <math>w_p</math>, giving a new function <math>x \mapsto f_N(w_p, x)</math>. The computation is the same in each step, hence only the case <math>i = 1</math> is described. <math>~~W_1~~w_1</math> is calculated from <math>(x_1, y_1, w_0)</math> ~~is calculated~~ by considering a variable weight <math>w</math> and applying [[gradient descent]] to the function <math>w\mapsto E(f_N(w, x_1), y_1)</math> to find a local minimum, starting at <math>w = w_0</math>. This makes <math>w_1</math> the minimizing weight found by gradient descent. == Learning ~~pseudo-code~~pseudocode == To implement the algorithm above, explicit formulas are required for the gradient of the function <math>w \mapsto E(f_N(w, x), y)</math> where the function is <math>E(y,y')= \|y-y'\|^2</math>. Line 95 ⟶ 96: === Pseudocode === [[Pseudocode]] for a [[stochastic gradient descent]] algorithm for training a three-layer network (one hidden layer): initialize network weights (often small random values). '''do''' '''~~forEach~~for each''' training example named ex '''do''' prediction = <u>neural-net-output</u>(network, ex) ''// forward pass'' actual = <u>teacher-output</u>(ex) Line 104 ⟶ 106: {{nowrap\|compute <math>\Delta w_i</math> for all weights from input layer to hidden layer}} ''// backward pass continued'' update network weights ''// input layer not modified by error estimate'' '''until''' error rate becomes acceptably low '''return''' the network The lines labeled "backward pass" can be implemented using the backpropagation algorithm, which calculates the gradient of the error of the network regarding the network's modifiable weights.<ref>Werbos, Paul J. (1994). ''The Roots of Backpropagation''. From Ordered Derivatives to Neural Networks and Political Forecasting. New York, NY: John Wiley & Sons, Inc.</ref> Line 111 ⟶ 114: {{Reflist}} {{Mathematics of}} ~~== External links ==~~ [[Category:Computational statistics]] [[Category:Artificial neural networks\| ]]