For recursively computing the partial derivatives, RTRL has a time complexity of O(number of hidden units × number of weights) per time step for computing the [[Jacobian matrix|Jacobian matrices]], while BPTT only takes O(number of weights) per time step, at the cost of storing all forward activations within the given time horizon.<ref name="Ollivier2015">{{Cite arXiv |eprint=1507.07680 |class=cs.NE |first1=Yann |last1=Ollivier |first2=Corentin |last2=Tallec |title=Training recurrent networks online without backtracking |date=2015-07-28 |first3=Guillaume |last3=Charpiat}}</ref> An online hybrid between BPTT and RTRL with intermediate complexity exists,<ref>{{Cite journal |last=Schmidhuber |first=Jürgen |date=1992-03-01 |title=A Fixed Size Storage O(n3) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks |journal=Neural Computation |volume=4 |issue=2 |pages=243–248 |doi=10.1162/neco.1992.4.2.243 |s2cid=11761172}}</ref><ref>{{cite report |url=http://citeseerx.ist.psu.edu/showciting?cid=128036 |title=Complexity of exact gradient computation algorithms for recurrent neural networks |last=Williams |first=Ronald J. |publisher=Northeastern University, College of Computer Science |___location=Boston (MA) |access-date=2017-07-02 |archive-url=https://web.archive.org/web/20171020033840/http://citeseerx.ist.psu.edu/showciting?cid=128036 |archive-date=2017-10-20 |url-status=dead |series=Technical Report NU-CCS-89-27 |year=1989}}</ref> along with variants for continuous time.<ref>{{Cite journal |last=Pearlmutter |first=Barak A. |date=1989-06-01 |title=Learning State Space Trajectories in Recurrent Neural Networks |url=http://repository.cmu.edu/cgi/viewcontent.cgi?article=2865&context=compsci |journal=Neural Computation |volume=1 |issue=2 |pages=263–269 |doi=10.1162/neco.1989.1.2.263 |s2cid=16813485}}</ref>
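The storage/compute trade-off can be illustrated with a minimal NumPy sketch of BPTT for a vanilla tanh RNN with a squared-error loss (an illustrative example only, not taken from the cited sources; the function name and loss are assumptions): the forward pass stores every hidden state within the horizon, and the backward pass reuses those stored activations to accumulate the weight gradients.

<syntaxhighlight lang="python">
# Minimal sketch of BPTT for a vanilla tanh RNN (illustrative only).
# All hidden states within the horizon are stored during the forward pass;
# this memory cost is the price BPTT pays for its lower per-step time
# complexity compared with RTRL.
import numpy as np

def bptt_gradients(x_seq, y_seq, W_xh, W_hh, W_hy):
    """x_seq, y_seq: lists of input/target vectors; returns gradients of a squared-error loss."""
    T = len(x_seq)
    h = [np.zeros(W_hh.shape[0])]            # stored forward activations, one per step
    # Forward pass: store every hidden state in the horizon.
    for t in range(T):
        h.append(np.tanh(W_xh @ x_seq[t] + W_hh @ h[-1]))
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros_like(h[0])
    # Backward pass through time, reusing the stored activations.
    for t in reversed(range(T)):
        dy = (W_hy @ h[t + 1]) - y_seq[t]    # gradient of 0.5 * ||y_hat - y||^2
        dW_hy += np.outer(dy, h[t + 1])
        dh = W_hy.T @ dy + dh_next           # gradient from output plus future time steps
        dz = dh * (1.0 - h[t + 1] ** 2)      # backpropagate through tanh
        dW_xh += np.outer(dz, x_seq[t])
        dW_hh += np.outer(dz, h[t])
        dh_next = W_hh.T @ dz
    return dW_xh, dW_hh, dW_hy
</syntaxhighlight>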
A major problem with gradient descent for standard RNN architectures is that [[Vanishing gradient problem|error gradients vanish]] exponentially quickly with the size of the time lag between important events.<ref name="hochreiter1991" /><ref name="HOCH2001">{{cite book |last=Hochreiter |first=Sepp |title=A Field Guide to Dynamical Recurrent Networks |date=15 January 2001 |publisher=John Wiley & Sons |isbn=978-0-7803-5369-5 |editor-last1=Kolen |editor-first1=John F. |chapter=Gradient flow in recurrent nets: the difficulty of learning long-term dependencies |display-authors=etal |editor-last2=Kremer |editor-first2=Stefan C. |chapter-url={{google books |plainurl=y |id=NWOcMVA64aAC }}}}</ref> LSTM combined with a BPTT/RTRL hybrid learning method attempts to overcome these problems.<ref name="lstm" /> This problem is also addressed in the independently recurrent neural network (IndRNN)<ref name="auto" /> by reducing the context of a neuron to its own past state; cross-neuron information can then be explored in the following layers. Memories of different ranges, including long-term memory, can be learned without the gradient vanishing and exploding problems.
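The exponential decay of the error signal can be seen by multiplying the per-step Jacobians of a vanilla tanh RNN over increasing time lags. The following NumPy sketch is illustrative only (the network size and weight scaling are arbitrary assumptions, not from the cited sources); it prints the norm of the accumulated Jacobian, which shrinks rapidly with the lag.

<syntaxhighlight lang="python">
# Illustrative sketch of the vanishing-gradient effect in a vanilla tanh RNN.
# The gradient of the state at step T with respect to the initial state is a
# product of T per-step Jacobians; with a contractive recurrent weight matrix
# its norm decays exponentially with the time lag.
import numpy as np

rng = np.random.default_rng(0)
H = 50                                                   # hidden size (arbitrary)
W_hh = 0.5 * rng.standard_normal((H, H)) / np.sqrt(H)    # small spectral radius (assumption)
h = np.zeros(H)
J = np.eye(H)                                            # accumulated Jacobian dh_t / dh_0
for t in range(50):
    h = np.tanh(W_hh @ h + rng.standard_normal(H))       # one recurrent step with random input
    J = (np.diag(1.0 - h ** 2) @ W_hh) @ J               # one-step Jacobian: diag(1 - h^2) W_hh
    if t % 10 == 9:
        print(f"step {t + 1:2d}: ||dh_t/dh_0|| = {np.linalg.norm(J):.2e}")
</syntaxhighlight>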
The on-line algorithm called causal recursive backpropagation (CRBP) implements and combines the BPTT and RTRL paradigms for locally recurrent networks.<ref>{{Cite journal |last1=Campolucci |first1=Paolo |last2=Uncini |first2=Aurelio |last3=Piazza |first3=Francesco |last4=Rao |first4=Bhaskar D. |year=1999 |title=On-Line Learning Algorithms for Locally Recurrent Neural Networks |journal=IEEE Transactions on Neural Networks |volume=10 |issue=2 |pages=253–271 |citeseerx=10.1.1.33.7550 |doi=10.1109/72.750549 |pmid=18252525}}</ref> It works with the most general locally recurrent networks and can minimize the global error term, which improves the stability of the algorithm and provides a unifying view of gradient calculation techniques for recurrent networks with local feedback.