Hyperparameter optimization: Difference between revisions

Content deleted Content added
No information added, just a citation. Undid revision 1140546203 by Neeyoy (talk)
Pipatron (talk | contribs)
m Gradient-based optimization: Interwiki link to gradient descent was missing.
Line 105:
 
=== Gradient-based optimization ===
For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using [[gradient descent]]. The first usage of these techniques was focused on neural networks.<ref>{{cite journal |last1=Larsen|first1=Jan|last2= Hansen |first2=Lars Kai|last3=Svarer|first3=Claus|last4=Ohlsson|first4=M|title=Design and regularization of neural networks: the optimal use of a validation set|journal=Proceedings of the 1996 IEEE Signal Processing Society Workshop|date=1996|pages=62–71|doi=10.1109/NNSP.1996.548336|isbn=0-7803-3550-3|citeseerx=10.1.1.415.3266|s2cid=238874|url=http://orbit.dtu.dk/files/4545571/Svarer.pdf}}</ref> Since then, these methods have been extended to other models such as [[support vector machine]]s<ref>{{cite journal |author1=Olivier Chapelle |author2=Vladimir Vapnik |author3=Olivier Bousquet |author4=Sayan Mukherjee |title=Choosing multiple parameters for support vector machines |journal=Machine Learning |year=2002 |volume=46 |pages=131–159 |url=http://www.chapelle.cc/olivier/pub/mlj02.pdf | doi = 10.1023/a:1012450327387 |doi-access=free }}</ref> or logistic regression.<ref>{{cite journal |author1 =Chuong B|author2= Chuan-Sheng Foo|author3=Andrew Y Ng|journal = Advances in Neural Information Processing Systems 20|title = Efficient multiple hyperparameter learning for log-linear models|year =2008|url=http://papers.nips.cc/paper/3286-efficient-multiple-hyperparameter-learning-for-log-linear-models.pdf}}</ref>
 
A different approach in order to obtain a gradient with respect to hyperparameters consists in differentiating the steps of an iterative optimization algorithm using [[automatic differentiation]].<ref>{{cite journal|last1=Domke|first1=Justin|title=Generic Methods for Optimization-Based Modeling|journal=Aistats|date=2012|volume=22|url=http://www.jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|access-date=2017-12-09|archive-date=2014-01-24|archive-url=https://web.archive.org/web/20140124182520/http://jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|url-status=dead}}</ref><ref name=abs1502.03492>{{cite arXiv |last1=Maclaurin|first1=Douglas|last2=Duvenaud|first2=David|last3=Adams|first3=Ryan P.|eprint=1502.03492|title=Gradient-based Hyperparameter Optimization through Reversible Learning|class=stat.ML|date=2015}}</ref><ref>{{cite journal |last1=Franceschi |first1=Luca |last2=Donini |first2=Michele |last3=Frasconi |first3=Paolo |last4=Pontil |first4=Massimiliano |title=Forward and Reverse Gradient-Based Hyperparameter Optimization |journal=Proceedings of the 34th International Conference on Machine Learning |date=2017 |arxiv=1703.01785 |bibcode=2017arXiv170301785F |url=http://proceedings.mlr.press/v70/franceschi17a/franceschi17a-supp.pdf}}</ref><ref>Shaban, A., Cheng, C. A., Hatch, N., & Boots, B. (2019, April). [https://arxiv.org/pdf/1810.10667.pdf Truncated back-propagation for bilevel optimization]. In ''The 22nd International Conference on Artificial Intelligence and Statistics'' (pp. 1723-1732). PMLR.</ref> A more recent work along this direction uses the [[implicit function theorem]] to calculate hypergradients and proposes a stable approximation of the inverse Hessian. The method scales to millions of hyperparameters and requires constant memory.