For recursively computing the partial derivatives, RTRL has a time complexity of O(number of hidden units × number of weights) per time step for computing the [[Jacobian matrix|Jacobian matrices]], while BPTT only takes O(number of weights) per time step, at the cost of storing all forward activations within the given time horizon.<ref name="Ollivier2015">{{Cite arXiv |eprint=1507.07680 |class=cs.NE |first1=Yann |last1=Ollivier |first2=Corentin |last2=Tallec |title=Training recurrent networks online without backtracking |date=2015-07-28 |first3=Guillaume |last3=Charpiat}}</ref> An online hybrid between BPTT and RTRL with intermediate complexity exists,<ref>{{Cite journal |last=Schmidhuber |first=Jürgen |date=1992-03-01 |title=A Fixed Size Storage O(n3) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks |journal=Neural Computation |volume=4 |issue=2 |pages=243–248 |doi=10.1162/neco.1992.4.2.243 |s2cid=11761172}}</ref><ref>{{cite report |url=http://citeseerx.ist.psu.edu/showciting?cid=128036 |title=Complexity of exact gradient computation algorithms for recurrent neural networks |last=Williams |first=Ronald J. |publisher=Northeastern University, College of Computer Science |___location=Boston (MA) |access-date=2017-07-02 |archive-url=https://web.archive.org/web/20171020033840/http://citeseerx.ist.psu.edu/showciting?cid=128036 |archive-date=2017-10-20 |url-status=dead |series=Technical Report NU-CCS-89-27 |year=1989}}</ref> along with variants for continuous time.<ref>{{Cite journal |last=Pearlmutter |first=Barak A. |date=1989-06-01 |title=Learning State Space Trajectories in Recurrent Neural Networks |url=http://repository.cmu.edu/cgi/viewcontent.cgi?article=2865&context=compsci |journal=Neural Computation |volume=1 |issue=2 |pages=263–269 |doi=10.1162/neco.1989.1.2.263 |s2cid=16813485}}</ref>
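The storage/compute trade-off can be illustrated with a minimal NumPy sketch of BPTT for a vanilla tanh RNN with a squared-error loss (an illustrative example only, not taken from the cited sources; the function name and loss are assumptions): the forward pass stores every hidden state within the horizon, and the backward pass reuses those stored activations to accumulate the weight gradients.

<syntaxhighlight lang="python">
# Minimal sketch of BPTT for a vanilla tanh RNN (illustrative only).
# All hidden states within the horizon are stored during the forward pass;
# this memory cost is the price BPTT pays for its lower per-step time
# complexity compared with RTRL.
import numpy as np

def bptt_gradients(x_seq, y_seq, W_xh, W_hh, W_hy):
    """x_seq, y_seq: lists of input/target vectors; returns gradients of a squared-error loss."""
    T = len(x_seq)
    h = [np.zeros(W_hh.shape[0])]            # stored forward activations, one per step
    # Forward pass: store every hidden state in the horizon.
    for t in range(T):
        h.append(np.tanh(W_xh @ x_seq[t] + W_hh @ h[-1]))
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros_like(h[0])
    # Backward pass through time, reusing the stored activations.
    for t in reversed(range(T)):
        dy = (W_hy @ h[t + 1]) - y_seq[t]    # gradient of 0.5 * ||y_hat - y||^2
        dW_hy += np.outer(dy, h[t + 1])
        dh = W_hy.T @ dy + dh_next           # gradient from output plus future time steps
        dz = dh * (1.0 - h[t + 1] ** 2)      # backpropagate through tanh
        dW_xh += np.outer(dz, x_seq[t])
        dW_hh += np.outer(dz, h[t])
        dh_next = W_hh.T @ dz
    return dW_xh, dW_hh, dW_hy
</syntaxhighlight>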
A major problem with gradient descent for standard RNN architectures is that [[Vanishing gradient problem|error gradients vanish]] exponentially quickly with the size of the time lag between important events.<ref name="hochreiter1991" /><ref name="HOCH2001">{{cite book |last=Hochreiter |first=Sepp |title=A Field Guide to Dynamical Recurrent Networks |date=15 January 2001 |publisher=John Wiley & Sons |isbn=978-0-7803-5369-5 |editor-last1=Kolen |editor-first1=John F. |chapter=Gradient flow in recurrent nets: the difficulty of learning long-term dependencies |display-authors=etal |editor-last2=Kremer |editor-first2=Stefan C. |chapter-url={{google books |plainurl=y |id=NWOcMVA64aAC }}}}</ref> LSTM combined with a BPTT/RTRL hybrid learning method attempts to overcome these problems.<ref name="lstm" /> This problem is also addressed in the independently recurrent neural network (IndRNN)<ref name="auto" /> by reducing the context of a neuron to its own past state; cross-neuron information can then be explored in the following layers. Memories of different ranges, including long-term memory, can be learned without the gradient vanishing and exploding problems.
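The exponential decay of the error signal can be seen by multiplying the per-step Jacobians of a vanilla tanh RNN over increasing time lags. The following NumPy sketch is illustrative only (the network size and weight scaling are arbitrary assumptions, not from the cited sources); it prints the norm of the accumulated Jacobian, which shrinks rapidly with the lag.

<syntaxhighlight lang="python">
# Illustrative sketch of the vanishing-gradient effect in a vanilla tanh RNN.
# The gradient of the state at step T with respect to the initial state is a
# product of T per-step Jacobians; with a contractive recurrent weight matrix
# its norm decays exponentially with the time lag.
import numpy as np

rng = np.random.default_rng(0)
H = 50                                                   # hidden size (arbitrary)
W_hh = 0.5 * rng.standard_normal((H, H)) / np.sqrt(H)    # small spectral radius (assumption)
h = np.zeros(H)
J = np.eye(H)                                            # accumulated Jacobian dh_t / dh_0
for t in range(50):
    h = np.tanh(W_hh @ h + rng.standard_normal(H))       # one recurrent step with random input
    J = (np.diag(1.0 - h ** 2) @ W_hh) @ J               # one-step Jacobian: diag(1 - h^2) W_hh
    if t % 10 == 9:
        print(f"step {t + 1:2d}: ||dh_t/dh_0|| = {np.linalg.norm(J):.2e}")
</syntaxhighlight>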
The on-line algorithm called causal recursive backpropagation (CRBP) implements and combines the BPTT and RTRL paradigms for locally recurrent networks.<ref>{{Cite journal |last1=Campolucci |first1=Paolo |last2=Uncini |first2=Aurelio |last3=Piazza |first3=Francesco |last4=Rao |first4=Bhaskar D. |year=1999 |title=On-Line Learning Algorithms for Locally Recurrent Neural Networks |journal=IEEE Transactions on Neural Networks |volume=10 |issue=2 |pages=253–271 |citeseerx=10.1.1.33.7550 |doi=10.1109/72.750549 |pmid=18252525}}</ref> It works with the most general locally recurrent networks and can minimize the global error term, which improves the stability of the algorithm and provides a unifying view of gradient calculation techniques for recurrent networks with local feedback.