=== Gradient-based optimization ===
For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using [[gradient descent]]. These techniques were first applied to neural networks.<ref>{{cite book |last1=Larsen|first1=Jan|last2=Hansen|first2=Lars Kai|last3=Svarer|first3=Claus|last4=Ohlsson|first4=M|title=Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop |chapter=Design and regularization of neural networks: The optimal use of a validation set |date=1996|pages=62–71|doi=10.1109/NNSP.1996.548336|isbn=0-7803-3550-3|citeseerx=10.1.1.415.3266|s2cid=238874|chapter-url=http://orbit.dtu.dk/files/4545571/Svarer.pdf}}</ref> Since then, these methods have been extended to other models such as [[support vector machine]]s<ref>{{cite journal |author1=Olivier Chapelle |author2=Vladimir Vapnik |author3=Olivier Bousquet |author4=Sayan Mukherjee |title=Choosing multiple parameters for support vector machines |journal=Machine Learning |year=2002 |volume=46 |pages=131–159 |url=http://www.chapelle.cc/olivier/pub/mlj02.pdf |doi=10.1023/a:1012450327387 |doi-access=free }}</ref> or logistic regression.<ref>{{cite journal |author1=Chuong B. Do |author2=Chuan-Sheng Foo |author3=Andrew Y. Ng |journal=Advances in Neural Information Processing Systems |volume=20 |title=Efficient multiple hyperparameter learning for log-linear models |year=2008 |url=http://papers.nips.cc/paper/3286-efficient-multiple-hyperparameter-learning-for-log-linear-models.pdf}}</ref>
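The basic idea can be illustrated with a model whose trained weights are a differentiable function of a hyperparameter, so that the validation loss itself can be differentiated with respect to that hyperparameter. The following is a minimal, illustrative sketch (not taken from the cited works) using ridge regression, whose closed-form solution depends smoothly on the regularization strength, and the JAX library; all variable names are hypothetical.

<syntaxhighlight lang="python">
import jax
import jax.numpy as jnp

def ridge_weights(log_lam, X, y):
    # Closed-form ridge regression solution; differentiable with respect to the hyperparameter
    lam = jnp.exp(log_lam)
    return jnp.linalg.solve(X.T @ X + lam * jnp.eye(X.shape[1]), X.T @ y)

def validation_loss(log_lam, X_tr, y_tr, X_val, y_val):
    w = ridge_weights(log_lam, X_tr, y_tr)
    return jnp.mean((X_val @ w - y_val) ** 2)

# Hypergradient: derivative of the validation loss with respect to log(lambda)
hypergrad = jax.grad(validation_loss)

# Toy data and plain gradient descent on the hyperparameter
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
X_tr, X_val = jax.random.normal(k1, (80, 5)), jax.random.normal(k2, (20, 5))
w_true = jnp.arange(1.0, 6.0)
y_tr = X_tr @ w_true + 0.5 * jax.random.normal(k3, (80,))
y_val = X_val @ w_true
log_lam = jnp.array(0.0)
for _ in range(100):
    log_lam = log_lam - 0.1 * hypergrad(log_lam, X_tr, y_tr, X_val, y_val)
</syntaxhighlight>

The same pattern applies whenever the trained model is a differentiable function of its hyperparameters; the only requirement is that the validation loss be differentiable through the fitting procedure.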
A different approach to obtaining a gradient with respect to hyperparameters consists of differentiating the steps of an iterative optimization algorithm using [[automatic differentiation]].<ref>{{cite journal|last1=Domke|first1=Justin|title=Generic Methods for Optimization-Based Modeling|journal=Aistats|date=2012|volume=22|url=http://www.jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|access-date=2017-12-09|archive-date=2014-01-24|archive-url=https://web.archive.org/web/20140124182520/http://jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|url-status=dead}}</ref><ref name=abs1502.03492>{{cite arXiv |last1=Maclaurin|first1=Dougal|last2=Duvenaud|first2=David|last3=Adams|first3=Ryan P.|eprint=1502.03492|title=Gradient-based Hyperparameter Optimization through Reversible Learning|class=stat.ML|date=2015}}</ref><ref>{{cite journal |last1=Franceschi |first1=Luca |last2=Donini |first2=Michele |last3=Frasconi |first3=Paolo |last4=Pontil |first4=Massimiliano |title=Forward and Reverse Gradient-Based Hyperparameter Optimization |journal=Proceedings of the 34th International Conference on Machine Learning |date=2017 |arxiv=1703.01785 |bibcode=2017arXiv170301785F |url=http://proceedings.mlr.press/v70/franceschi17a/franceschi17a-supp.pdf}}</ref>
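A minimal sketch of this idea (an illustration, not the exact algorithms of the cited papers) is to unroll a fixed number of gradient-descent steps on the training loss and let reverse-mode automatic differentiation backpropagate through the entire training trajectory, which yields the gradient of the validation loss with respect to the hyperparameter. The sketch below is written in JAX with hypothetical names, using a regularization hyperparameter:

<syntaxhighlight lang="python">
import jax
import jax.numpy as jnp

def train_loss(w, lam, X, y):
    # Regularized training objective; lam is the (log) regularization hyperparameter
    return jnp.mean((X @ w - y) ** 2) + jnp.exp(lam) * jnp.sum(w ** 2)

def unrolled_sgd(lam, w0, X, y, lr=0.05, steps=50):
    # Unroll a fixed number of gradient steps; reverse-mode autodiff then
    # backpropagates through the whole training trajectory
    w = w0
    for _ in range(steps):
        w = w - lr * jax.grad(train_loss)(w, lam, X, y)
    return w

def validation_loss(lam, w0, X_tr, y_tr, X_val, y_val):
    w = unrolled_sgd(lam, w0, X_tr, y_tr)
    return jnp.mean((X_val @ w - y_val) ** 2)

# Hypergradient of the validation loss with respect to the hyperparameter
hypergrad = jax.grad(validation_loss)

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
X_tr, X_val = jax.random.normal(k1, (80, 5)), jax.random.normal(k2, (20, 5))
w_true = jnp.arange(1.0, 6.0)
y_tr = X_tr @ w_true + 0.5 * jax.random.normal(k3, (80,))
y_val = X_val @ w_true
lam, w0 = jnp.array(0.0), jnp.zeros(5)
for _ in range(50):
    lam = lam - 0.1 * hypergrad(lam, w0, X_tr, y_tr, X_val, y_val)
</syntaxhighlight>

Reverse-mode differentiation through many unrolled steps can require storing the whole training trajectory; the cited work on reversible learning is aimed at reducing this memory cost.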
In a different approach,<ref>{{cite arXiv | eprint=1802.09419 | last1=Lorraine | first1=Jonathan | last2=Duvenaud | first2=David | title=Stochastic Hyperparameter Optimization through Hypernetworks | date=2018 | class=cs.LG }}</ref> a hypernetwork is trained to approximate the best-response function, which maps hyperparameters to the weights that minimize the training objective for those hyperparameters. One of the advantages of this method is that it can also handle discrete hyperparameters. Self-tuning networks<ref>{{cite arXiv | eprint=1903.03088 | last1=MacKay | first1=Matthew | last2=Vicol | first2=Paul | last3=Lorraine | first3=Jon | last4=Duvenaud | first4=David | last5=Grosse | first5=Roger | title=Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions | date=2019 | class=cs.LG }}</ref> offer a memory-efficient version of this approach by choosing a compact representation for the hypernetwork. More recently, Δ-STN<ref>{{cite arXiv | eprint=2010.13514 | last1=Bae | first1=Juhan | last2=Grosse | first2=Roger | title=Delta-STN: Efficient Bilevel Optimization for Neural Networks using Structured Response Jacobians | date=2020 | class=cs.LG }}</ref> has improved this method further with a slight reparameterization of the hypernetwork that speeds up training. Δ-STN also yields a better approximation of the best-response Jacobian by linearizing the network in the weights, hence removing unnecessary nonlinear effects of large changes in the weights.
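As a rough illustration of the best-response idea (a simplified sketch, not the architecture of the cited papers), a small hypernetwork can map a hyperparameter to a full set of model weights; the hypernetwork is fitted on the training loss over sampled hyperparameters, while the hyperparameter itself is updated on the validation loss through the hypernetwork. All names below are hypothetical; the example again uses JAX:

<syntaxhighlight lang="python">
import jax
import jax.numpy as jnp

def hypernet(phi, lam):
    # Tiny hypernetwork: maps the hyperparameter lam to a weight vector w(lam)
    h = jnp.tanh(phi["W1"] * lam + phi["b1"])
    return phi["W2"] @ h + phi["b2"]

def train_loss(w, lam, X, y):
    return jnp.mean((X @ w - y) ** 2) + jnp.exp(lam) * jnp.sum(w ** 2)

def hypernet_loss(phi, key, X, y):
    # Fit the hypernetwork to approximate the best response over sampled hyperparameters
    lams = jax.random.uniform(key, (16,), minval=-4.0, maxval=2.0)
    return jnp.mean(jax.vmap(lambda l: train_loss(hypernet(phi, l), l, X, y))(lams))

def validation_loss(lam, phi, X_val, y_val):
    # Validation loss evaluated through the hypernetwork; differentiable in lam
    return jnp.mean((X_val @ hypernet(phi, lam) - y_val) ** 2)

k1, k2, k3, k4, key = jax.random.split(jax.random.PRNGKey(0), 5)
X_tr, X_val = jax.random.normal(k1, (80, 5)), jax.random.normal(k2, (20, 5))
w_true = jnp.arange(1.0, 6.0)
y_tr = X_tr @ w_true + 0.5 * jax.random.normal(k3, (80,))
y_val = X_val @ w_true
hidden = 16
phi = {"W1": 0.1 * jax.random.normal(k4, (hidden,)), "b1": jnp.zeros(hidden),
       "W2": 0.1 * jax.random.normal(k4, (5, hidden)), "b2": jnp.zeros(5)}
lam = jnp.array(0.0)
for _ in range(200):
    key, sub = jax.random.split(key)
    # Alternate updates: hypernetwork weights on the training loss,
    # hyperparameter on the validation loss
    phi = jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, phi,
                                 jax.grad(hypernet_loss)(phi, sub, X_tr, y_tr))
    lam = lam - 0.05 * jax.grad(validation_loss)(lam, phi, X_val, y_val)
</syntaxhighlight>

Because the hyperparameter only enters the validation loss through the hypernetwork's output, gradients with respect to it are available even when the underlying training procedure is not differentiated directly.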