Hyperparameter optimization
For specific learning algorithms, it is possible to compute the gradient of the validation loss with respect to the hyperparameters and then optimize the hyperparameters using gradient descent. These techniques were first applied to neural networks.<ref>{{cite journal |last1=Larsen|first1=Jan|last2= Hansen |first2=Lars Kai|last3=Svarer|first3=Claus|last4=Ohlsson|first4=M|title=Design and regularization of neural networks: the optimal use of a validation set|journal=Proceedings of the 1996 IEEE Signal Processing Society Workshop|date=1996|pages=62–71|doi=10.1109/NNSP.1996.548336|isbn=0-7803-3550-3|citeseerx=10.1.1.415.3266|s2cid=238874|url=http://orbit.dtu.dk/files/4545571/Svarer.pdf}}</ref> Since then, these methods have been extended to other models such as [[support vector machine]]s<ref>{{cite journal |author1=Olivier Chapelle |author2=Vladimir Vapnik |author3=Olivier Bousquet |author4=Sayan Mukherjee |title=Choosing multiple parameters for support vector machines |journal=Machine Learning |year=2002 |volume=46 |pages=131–159 |url=http://www.chapelle.cc/olivier/pub/mlj02.pdf | doi = 10.1023/a:1012450327387 |doi-access=free }}</ref> or [[logistic regression]].<ref>{{cite journal |author1 =Chuong B. Do|author2= Chuan-Sheng Foo|author3=Andrew Y Ng|journal = Advances in Neural Information Processing Systems 20|title = Efficient multiple hyperparameter learning for log-linear models|year =2008|url=http://papers.nips.cc/paper/3286-efficient-multiple-hyperparameter-learning-for-log-linear-models.pdf}}</ref>
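As an illustrative sketch (not drawn from the cited works), the following example tunes the L2 regularization strength of a ridge-regression model by gradient descent on the validation loss. The data, model, and all variable names are hypothetical and chosen only to keep the example short; JAX is used for automatic differentiation through the closed-form inner solution.

<syntaxhighlight lang="python">
# Minimal sketch: gradient-based tuning of a regularization hyperparameter.
# Synthetic data and a ridge-regression inner problem, chosen for illustration only.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
X_train = jax.random.normal(k1, (100, 10))
w_true = jax.random.normal(k2, (10,))
y_train = X_train @ w_true + 0.1 * jax.random.normal(k3, (100,))
X_val = jax.random.normal(jax.random.PRNGKey(1), (50, 10))
y_val = X_val @ w_true

def inner_solution(log_lam):
    # Closed-form ridge solution w*(lambda) = (X^T X + lambda I)^{-1} X^T y
    lam = jnp.exp(log_lam)
    A = X_train.T @ X_train + lam * jnp.eye(X_train.shape[1])
    return jnp.linalg.solve(A, X_train.T @ y_train)

def val_loss(log_lam):
    # Validation loss as a differentiable function of the hyperparameter
    w = inner_solution(log_lam)
    return jnp.mean((X_val @ w - y_val) ** 2)

log_lam = jnp.array(0.0)        # optimize log(lambda) to keep lambda positive
grad_fn = jax.grad(val_loss)
for step in range(100):
    log_lam = log_lam - 0.1 * grad_fn(log_lam)   # hypergradient descent step

print("tuned lambda:", float(jnp.exp(log_lam)))
</syntaxhighlight>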
 
A different approach to obtaining gradients with respect to hyperparameters is to differentiate the steps of an iterative optimization algorithm using [[automatic differentiation]].<ref>{{cite journal|last1=Domke|first1=Justin|title=Generic Methods for Optimization-Based Modeling|journal=Aistats |date=2012|volume=22|url=http://www.jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|access-date=2017-12-09|archive-date=2014-01-24|archive-url=https://web.archive.org/web/20140124182520/http://jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|url-status=dead}}</ref><ref name=abs1502.03492>{{cite arXiv |last1=Maclaurin|first1=Douglas|last2=Duvenaud|first2=David|last3=Adams|first3=Ryan P.|eprint=1502.03492|title=Gradient-based Hyperparameter Optimization through Reversible Learning|class=stat.ML|date=2015}}</ref><ref>{{cite journal |last1=Franceschi |first1=Luca |last2=Donini |first2=Michele |last3=Frasconi |first3=Paolo |last4=Pontil |first4=Massimiliano |title=Forward and Reverse Gradient-Based Hyperparameter Optimization |journal=Proceedings of the 34th International Conference on Machine Learning |date=2017 |arxiv=1703.01785 |bibcode=2017arXiv170301785F |url=http://proceedings.mlr.press/v70/franceschi17a/franceschi17a-supp.pdf}}</ref><ref>Shaban, A., Cheng, C. A., Hatch, N., & Boots, B. (2019, April). [https://arxiv.org/pdf/1810.10667.pdf Truncated back-propagation for bilevel optimization]. In ''The 22nd International Conference on Artificial Intelligence and Statistics'' (pp. 1723-1732). PMLR.</ref> More recent work in this direction uses the [[implicit function theorem]] to calculate hypergradients and proposes a stable approximation of the inverse Hessian; this method scales to millions of hyperparameters and requires only constant memory.<ref>Lorraine, J., Vicol, P., & Duvenaud, D. (2020). [[arxiv:1911.02590|Optimizing millions of hyperparameters by implicit differentiation]]. In ''International Conference on Artificial Intelligence and Statistics''. PMLR.</ref>
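As a rough illustration of unrolled differentiation (again not taken from the cited papers), the sketch below runs a fixed number of inner gradient-descent steps on a small logistic-regression training loss and applies reverse-mode automatic differentiation through the whole unrolled loop to obtain the hypergradient of the validation loss with respect to an assumed L2 regularization hyperparameter.

<syntaxhighlight lang="python">
# Minimal sketch: hypergradients by differentiating through unrolled inner optimization.
# Tiny logistic-regression problem and fixed step counts, chosen for illustration only.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
X_train = jax.random.normal(k1, (200, 5))
y_train = (X_train[:, 0] > 0).astype(jnp.float32)
X_val = jax.random.normal(k2, (100, 5))
y_val = (X_val[:, 0] > 0).astype(jnp.float32)

def train_loss(w, log_lam):
    logits = X_train @ w
    nll = jnp.mean(jnp.logaddexp(0.0, logits) - y_train * logits)
    return nll + jnp.exp(log_lam) * jnp.sum(w ** 2)      # L2 penalty

def inner_optimize(log_lam, steps=50, lr=0.5):
    # Unrolled inner loop: every gradient step stays in the computation graph.
    w = jnp.zeros(X_train.shape[1])
    for _ in range(steps):
        w = w - lr * jax.grad(train_loss)(w, log_lam)
    return w

def val_loss(log_lam):
    w = inner_optimize(log_lam)
    logits = X_val @ w
    return jnp.mean(jnp.logaddexp(0.0, logits) - y_val * logits)

hypergrad = jax.grad(val_loss)      # reverse mode through the unrolled loop
log_lam = jnp.array(-2.0)
for _ in range(20):
    log_lam = log_lam - 0.5 * hypergrad(log_lam)
print("tuned lambda:", float(jnp.exp(log_lam)))
</syntaxhighlight>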
 
In a different approach,<ref>Lorraine, J., & Duvenaud, D. (2018). [[arxiv:1802.09419|Stochastic hyperparameter optimization through hypernetworks]]. ''arXiv preprint arXiv:1802.09419''.</ref> a hypernetwork is trained to approximate the best-response function, which maps hyperparameters to the weights that minimize the training loss. An advantage of this method is that it can also handle discrete hyperparameters. Self-tuning networks<ref>MacKay, M., Vicol, P., Lorraine, J., Duvenaud, D., & Grosse, R. (2019). [[arxiv:1903.03088|Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions]]. ''arXiv preprint arXiv:1903.03088''.</ref> offer a memory-efficient version of this approach by choosing a compact representation for the hypernetwork. More recently, Δ-STN<ref>Bae, J., & Grosse, R. B. (2020). [[arxiv:2010.13514|Delta-STN: Efficient bilevel optimization for neural networks using structured response Jacobians]]. ''Advances in Neural Information Processing Systems'', ''33'', 21725-21737.</ref> has improved this method further with a slight reparameterization of the hypernetwork that speeds up training. Δ-STN also yields a better approximation of the best-response Jacobian by linearizing the network in the weights, hence removing unnecessary nonlinear effects of large changes in the weights.
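The sketch below gives a deliberately simplified illustration of the best-response idea; it assumes an affine hypernetwork and a ridge-regularized linear model, which is a loose simplification and not the construction used in the cited papers. The hypernetwork is fitted on hyperparameters sampled around the current value, and the hyperparameter is then updated by descending the validation loss through the hypernetwork.

<syntaxhighlight lang="python">
# Minimal sketch: approximating the best-response function with a hypernetwork.
# Affine hypernetwork and synthetic linear-regression data, chosen for illustration only.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
X_train = jax.random.normal(k1, (200, 8))
w_true = jax.random.normal(k2, (8,))
y_train = X_train @ w_true + 0.1 * jax.random.normal(k3, (200,))
X_val = jax.random.normal(jax.random.PRNGKey(1), (100, 8))
y_val = X_val @ w_true

def best_response(phi, log_lam):
    # Hypernetwork: an affine map from the hyperparameter to model weights.
    W, b = phi
    return W * log_lam + b

def train_loss(w, log_lam):
    return jnp.mean((X_train @ w - y_train) ** 2) + jnp.exp(log_lam) * jnp.sum(w ** 2)

def hypernet_objective(phi, log_lam, key):
    # Fit the hypernetwork on hyperparameters sampled near the current value,
    # so it approximates the best response w*(lambda) locally.
    log_lams = log_lam + 0.5 * jax.random.normal(key, (16,))
    losses = jax.vmap(lambda ll: train_loss(best_response(phi, ll), ll))(log_lams)
    return jnp.mean(losses)

phi = (jnp.zeros(8), jnp.zeros(8))
log_lam = jnp.array(0.0)
for step in range(300):
    # Alternate: update the hypernetwork parameters, then the hyperparameter.
    g_phi = jax.grad(hypernet_objective)(phi, log_lam, jax.random.PRNGKey(step))
    phi = jax.tree_util.tree_map(lambda p, g: p - 0.05 * g, phi, g_phi)
    val_fn = lambda ll: jnp.mean((X_val @ best_response(phi, ll) - y_val) ** 2)
    log_lam = log_lam - 0.05 * jax.grad(val_fn)(log_lam)
print("tuned lambda:", float(jnp.exp(log_lam)))
</syntaxhighlight>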