Hyperparameter optimization

{{Short description|Process of finding the optimal set of variables for a machine learning algorithm}}
In [[machine learning]], '''hyperparameter optimization'''<ref>Matthias Feurer and Frank Hutter. [https://link.springer.com/content/pdf/10.1007%2F978-3-030-05318-5_1.pdf Hyperparameter optimization]. In: ''AutoML: Methods, Systems, Challenges'', pages 3–38.</ref> or tuning is the problem of choosing a set of optimal [[Hyperparameter (machine learning)|hyperparameters]] for a learning algorithm. A hyperparameter is a [[parameter]] whose value is used to control the learning process, which must be configured before the process starts.<ref>{{cite journal |last1=Yang|first1=Li|title=On hyperparameter optimization of machine learning algorithms: Theory and practice|journal=Neurocomputing|year=2020|volume=415|pages=295–316|doi=10.1016/j.neucom.2020.07.061|arxiv=2007.15745 }}</ref><ref>{{cite arXiv |vauthors=Franceschi L, Donini M, Perrone V, Klein A, Archambeau C, Seeger M, Pontil M, Frasconi P |title=Hyperparameter Optimization in Machine Learning |year=2024 |class=stat.ML |eprint=2410.22854 }}</ref>
 
Hyperparameter optimization determines the set of hyperparameters that yields an optimal model which minimizes a predefined [[loss function]] on a given [[data set]].<ref name=abs1502.02127>{{cite arXiv |eprint=1502.02127|last1=Claesen|first1=Marc|title=Hyperparameter Search in Machine Learning|author2=Bart De Moor|class=cs.LG|year=2015}}</ref> The objective function takes a set of hyperparameters and returns the associated loss.<ref name=abs1502.02127/> [[Cross-validation (statistics)|Cross-validation]] is often used to estimate this generalization performance, and therefore to choose the set of hyperparameter values that maximizes it.<ref name="bergstra">{{cite journal|last1=Bergstra|first1=James|last2=Bengio|first2=Yoshua|year=2012|title=Random Search for Hyper-Parameter Optimization|url=http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf|journal=Journal of Machine Learning Research|volume=13|pages=281–305}}</ref>
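
For illustration, the sketch below scores a few candidate values of a single hyperparameter by cross-validation and keeps the best one. It assumes scikit-learn; the dataset, the support-vector classifier, and the candidate grid are illustrative choices and are not prescribed by the text above.

<syntaxhighlight lang="python">
# Minimal sketch: rank candidate values of one hyperparameter by cross-validated
# accuracy.  The dataset, model (SVC) and candidate grid are illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]          # candidate values of C
scores = {}
for C in candidates:
    # 5-fold cross-validation estimates generalization accuracy for this setting
    scores[C] = cross_val_score(SVC(C=C, kernel="rbf", gamma="scale"), X, y, cv=5).mean()

best_C = max(scores, key=scores.get)
print(scores, "best C:", best_C)
</syntaxhighlight>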
 
== Approaches ==
| bibcode = 2012arXiv1208.3719T
| arxiv = 1208.3719
}}</ref><ref name="krnc">{{Citation
|last=Kernc
|title=SAMBO: Sequential And Model-Based Optimization: Efficient global optimization in Python
|date=2024
|url=https://zenodo.org/records/14461363
|access-date=2025-01-30
|doi=10.5281/zenodo.14461363
}}</ref> to obtain better results in fewer evaluations compared to grid search and random search, due to the ability to reason about the quality of experiments before they are run.
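
As a minimal illustration of the surrogate-model idea, the sketch below fits a Gaussian-process surrogate to past evaluations and proposes each new hyperparameter value where the expected improvement over the best result so far is largest. It assumes scikit-learn and SciPy, and the objective function is a cheap stand-in for an expensive evaluation such as a cross-validated loss; none of these choices come from the cited works.

<syntaxhighlight lang="python">
# Minimal sketch of sequential model-based (Bayesian) optimization of one hyperparameter.
# Assumes scikit-learn for the Gaussian-process surrogate; the objective is a stand-in
# for an expensive evaluation such as cross-validated loss.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(log_lambda):                 # placeholder for e.g. CV loss at this setting
    return (log_lambda - 1.0) ** 2 + 0.1 * np.sin(5 * log_lambda)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(3, 1))        # a few random initial evaluations
y = np.array([objective(x[0]) for x in X])

grid = np.linspace(-3, 3, 500).reshape(-1, 1)
for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    best = y.min()
    # expected improvement (for minimization) over the best observation so far
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = grid[np.argmax(ei)]           # evaluate where the surrogate is most promising
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print("best hyperparameter:", X[np.argmin(y)][0], "loss:", y.min())
</syntaxhighlight>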
 
For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using [[gradient descent]]. The first usage of these techniques was focused on neural networks.<ref>{{cite book |last1=Larsen|first1=Jan|last2= Hansen |first2=Lars Kai|last3=Svarer|first3=Claus|last4=Ohlsson|first4=M|title=Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop |chapter=Design and regularization of neural networks: The optimal use of a validation set |date=1996|pages=62–71|doi=10.1109/NNSP.1996.548336|isbn=0-7803-3550-3|citeseerx=10.1.1.415.3266|s2cid=238874|chapter-url=http://orbit.dtu.dk/files/4545571/Svarer.pdf}}</ref> Since then, these methods have been extended to other models such as [[support vector machine]]s<ref>{{cite journal |author1=Olivier Chapelle |author2=Vladimir Vapnik |author3=Olivier Bousquet |author4=Sayan Mukherjee |title=Choosing multiple parameters for support vector machines |journal=Machine Learning |year=2002 |volume=46 |pages=131–159 |url=http://www.chapelle.cc/olivier/pub/mlj02.pdf | doi = 10.1023/a:1012450327387 |doi-access=free }}</ref> or logistic regression.<ref>{{cite journal |author1 =Chuong B|author2= Chuan-Sheng Foo|author3=Andrew Y Ng|journal = Advances in Neural Information Processing Systems |volume=20|title = Efficient multiple hyperparameter learning for log-linear models|year =2008|url=http://papers.nips.cc/paper/3286-efficient-multiple-hyperparameter-learning-for-log-linear-models.pdf}}</ref>
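
For instance, in [[ridge regression]] the trained weights are available in closed form, so the derivative of the validation loss with respect to the regularization strength can be written down explicitly and followed by gradient descent. The sketch below is a toy NumPy illustration of this idea, not an implementation from the cited works; the data and step sizes are arbitrary.

<syntaxhighlight lang="python">
# Toy illustration (NumPy): gradient descent on the regularization strength of ridge
# regression, using the closed-form solution w(lam) = (X'X + lam*I)^{-1} X'y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=80)
X_tr, y_tr, X_val, y_val = X[:50], y[:50], X[50:], y[50:]

def fit(lam):
    A = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
    w = np.linalg.solve(A, X_tr.T @ y_tr)
    return A, w

log_lam = 0.0                                     # optimize log(lambda) so lambda stays positive
for _ in range(100):
    lam = np.exp(log_lam)
    A, w = fit(lam)
    dw_dlam = -np.linalg.solve(A, w)              # derivative of the weights w.r.t. lambda
    resid = X_val @ w - y_val
    dL_dw = 2 * X_val.T @ resid / len(y_val)      # gradient of validation MSE w.r.t. the weights
    dL_dlam = dw_dlam @ dL_dw                     # chain rule: the hypergradient
    log_lam -= 0.1 * dL_dlam * lam                # gradient step in log-space

print("tuned lambda:", np.exp(log_lam))
</syntaxhighlight>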
 
A different approach to obtaining a gradient with respect to hyperparameters is to differentiate the steps of an iterative optimization algorithm using [[automatic differentiation]].<ref>{{cite journal|last1=Domke|first1=Justin|title=Generic Methods for Optimization-Based Modeling|journal=Aistats|date=2012|volume=22|url=http://www.jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|access-date=2017-12-09|archive-date=2014-01-24|archive-url=https://web.archive.org/web/20140124182520/http://jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|url-status=dead}}</ref><ref name=abs1502.03492>{{cite arXiv |last1=Maclaurin|first1=Dougal|last2=Duvenaud|first2=David|last3=Adams|first3=Ryan P.|eprint=1502.03492|title=Gradient-based Hyperparameter Optimization through Reversible Learning|class=stat.ML|date=2015}}</ref><ref>{{cite journal |last1=Franceschi |first1=Luca |last2=Donini |first2=Michele |last3=Frasconi |first3=Paolo |last4=Pontil |first4=Massimiliano |title=Forward and Reverse Gradient-Based Hyperparameter Optimization |journal=Proceedings of the 34th International Conference on Machine Learning |date=2017 |arxiv=1703.01785 |bibcode=2017arXiv170301785F |url=http://proceedings.mlr.press/v70/franceschi17a/franceschi17a-supp.pdf}}</ref><ref>{{cite arXiv | eprint=1810.10667 | last1=Shaban | first1=Amirreza | last2=Cheng | first2=Ching-An | last3=Hatch | first3=Nathan | last4=Boots | first4=Byron | title=Truncated Back-propagation for Bilevel Optimization | date=2018 | class=cs.LG }}</ref> More recent work along this direction uses the [[implicit function theorem]] to calculate hypergradients and proposes a stable approximation of the inverse Hessian. The method scales to millions of hyperparameters and requires constant memory.<ref>{{cite arXiv | eprint=1911.02590 | last1=Lorraine | first1=Jonathan | last2=Vicol | first2=Paul | last3=Duvenaud | first3=David | title=Optimizing Millions of Hyperparameters by Implicit Differentiation | date=2019 | class=cs.LG }}</ref>
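
As a simplified illustration of differentiating through the optimizer, the toy NumPy sketch below (not code from the cited works) unrolls gradient descent on an L2-regularized least-squares training loss while propagating, in forward mode, the derivative of the weights with respect to the regularization strength, and then assembles the hypergradient of the validation loss by the chain rule. The returned value could drive a gradient step on the hyperparameter, as in the previous sketch.

<syntaxhighlight lang="python">
# Toy illustration (NumPy): forward-mode differentiation through unrolled gradient
# descent, giving the derivative of the validation loss w.r.t. the regularization
# strength lam.  All data and constants are arbitrary choices for the sketch.
import numpy as np

rng = np.random.default_rng(1)
X_tr, X_val = rng.normal(size=(60, 8)), rng.normal(size=(40, 8))
w_true = rng.normal(size=8)
y_tr = X_tr @ w_true + 0.3 * rng.normal(size=60)
y_val = X_val @ w_true + 0.3 * rng.normal(size=40)

def hypergradient(lam, steps=200, lr=0.01):
    n = len(y_tr)
    w = np.zeros(8)               # weights being trained
    v = np.zeros(8)               # v = dw/dlam, propagated alongside the weights
    H_data = 2 * X_tr.T @ X_tr / n
    for _ in range(steps):
        g = 2 * X_tr.T @ (X_tr @ w - y_tr) / n + 2 * lam * w   # training gradient
        dg_dlam = (H_data + 2 * lam * np.eye(8)) @ v + 2 * w    # d(training gradient)/d(lam)
        w = w - lr * g                                          # ordinary descent step
        v = v - lr * dg_dlam                                    # matching step for the tangent
    dL_dw = 2 * X_val.T @ (X_val @ w - y_val) / len(y_val)      # validation gradient w.r.t. weights
    return v @ dL_dw                                            # chain rule: hypergradient

print("hypergradient at lam=0.1:", hypergradient(0.1))
</syntaxhighlight>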
 
In a different approach,<ref>{{cite arXiv | eprint=1802.09419 | last1=Lorraine | first1=Jonathan | last2=Duvenaud | first2=David | title=Stochastic Hyperparameter Optimization through Hypernetworks | date=2018 | class=cs.LG }}</ref> a hypernetwork is trained to approximate the best response function. One advantage of this method is that it can also handle discrete hyperparameters. Self-tuning networks<ref>{{cite arXiv | eprint=1903.03088 | last1=MacKay | first1=Matthew | last2=Vicol | first2=Paul | last3=Lorraine | first3=Jon | last4=Duvenaud | first4=David | last5=Grosse | first5=Roger | title=Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions | date=2019 | class=cs.LG }}</ref> offer a memory-efficient version of this approach by choosing a compact representation for the hypernetwork. More recently, Δ-STN<ref>{{cite arXiv | eprint=2010.13514 | last1=Bae | first1=Juhan | last2=Grosse | first2=Roger | title=Delta-STN: Efficient Bilevel Optimization for Neural Networks using Structured Response Jacobians | date=2020 | class=cs.LG }}</ref> has improved this method further by a slight reparameterization of the hypernetwork which speeds up training. Δ-STN also yields a better approximation of the best-response Jacobian by linearizing the network in the weights, hence removing unnecessary nonlinear effects of large changes in the weights.
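
The sketch below is a heavily simplified PyTorch illustration of the hypernetwork idea, a global variant rather than the algorithm of the cited papers: a small network maps the regularization strength of a linear model to approximately best-responding weights, after which the hyperparameter itself is tuned by gradient descent through the hypernetwork on the validation loss. Data, architecture and step counts are arbitrary.

<syntaxhighlight lang="python">
# Toy PyTorch illustration of the hypernetwork idea; a simplified global variant,
# not the method of the cited papers.
import torch

torch.manual_seed(0)
X_tr, X_val = torch.randn(60, 8), torch.randn(40, 8)
w_true = torch.randn(8, 1)
y_tr = X_tr @ w_true + 0.3 * torch.randn(60, 1)
y_val = X_val @ w_true + 0.3 * torch.randn(40, 1)

# hypernetwork: maps log(lambda) to approximate best-response weights of a linear model
hypernet = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 8))
opt_h = torch.optim.Adam(hypernet.parameters(), lr=0.01)

for step in range(3000):                              # phase 1: train the hypernetwork
    log_lam = torch.empty(1, 1).uniform_(-4.0, 2.0)   # sample a hyperparameter value
    w = hypernet(log_lam).reshape(8, 1)               # approximate best-response weights
    train_loss = ((X_tr @ w - y_tr) ** 2).mean() + log_lam.exp().squeeze() * (w ** 2).sum()
    opt_h.zero_grad(); train_loss.backward(); opt_h.step()

log_lam = torch.zeros(1, 1, requires_grad=True)       # phase 2: tune lambda itself
opt_l = torch.optim.Adam([log_lam], lr=0.05)
for step in range(200):
    w = hypernet(log_lam).reshape(8, 1)
    val_loss = ((X_val @ w - y_val) ** 2).mean()      # validation loss through the hypernetwork
    opt_l.zero_grad(); val_loss.backward(); opt_l.step()

print("tuned lambda:", log_lam.exp().item())
</syntaxhighlight>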
 
Apart from hypernetwork approaches, gradient-based methods can also be used to optimize discrete hyperparameters by adopting a continuous relaxation of the parameters.<ref>{{cite arXiv | eprint=1806.09055 | last1=Liu | first1=Hanxiao | last2=Simonyan | first2=Karen | last3=Yang | first3=Yiming | title=DARTS: Differentiable Architecture Search | date=2018 | class=cs.LG }}</ref> Such methods have been extensively used for the optimization of architecture hyperparameters in [[neural architecture search]].
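
A stripped-down PyTorch illustration of such a relaxation (not the DARTS implementation itself): a discrete choice between two candidate operations is replaced by a softmax-weighted mixture, and the architecture logits are updated by gradient descent on the validation loss while the ordinary weights are updated on the training loss. The data and candidate operations are arbitrary choices for the sketch.

<syntaxhighlight lang="python">
# Toy illustration (PyTorch) of a continuous relaxation of a discrete architecture
# choice: mix two candidate operations with softmax weights and alternate gradient
# updates of model weights (training loss) and architecture logits (validation loss).
import torch

torch.manual_seed(0)
X_tr, X_val = torch.randn(64, 4), torch.randn(64, 4)
y_tr = torch.tanh(X_tr.sum(dim=1, keepdim=True))         # target favours the tanh branch
y_val = torch.tanh(X_val.sum(dim=1, keepdim=True))

lin = torch.nn.Linear(4, 1)                               # shared weights of both branches
alpha = torch.zeros(2, requires_grad=True)                # architecture logits (relaxed choice)
opt_w = torch.optim.SGD(lin.parameters(), lr=0.05)
opt_a = torch.optim.Adam([alpha], lr=0.05)

def mixed(x):
    p = torch.softmax(alpha, dim=0)
    return p[0] * lin(x) + p[1] * torch.tanh(lin(x))      # relaxed "linear vs. tanh" choice

for step in range(500):
    opt_w.zero_grad()
    torch.nn.functional.mse_loss(mixed(X_tr), y_tr).backward()
    opt_w.step()                                          # weight step on training data
    opt_a.zero_grad()
    torch.nn.functional.mse_loss(mixed(X_val), y_val).backward()
    opt_a.step()                                          # architecture step on validation data

print("softmax over operations:", torch.softmax(alpha, dim=0).detach())
</syntaxhighlight>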
 
=== Evolutionary optimization ===
 
# Create an initial population of random solutions (i.e., randomly generate tuples of hyperparameters, typically 100+)
# Evaluate the hyperparameter tuples and acquire their [[fitness function]] (e.g., 10-fold [[Cross-validation (statistics)|cross-validation]] accuracy of the machine learning algorithm with those hyperparameters)
# Rank the hyperparameter tuples by their relative fitness
# Replace the worst-performing hyperparameter tuples with new ones generated via [[crossover (genetic algorithm)|crossover]] and [[mutation (genetic algorithm)|mutation]]
# Repeat steps 2–4 until satisfactory performance is reached or performance no longer improves (a minimal sketch of this loop appears below).
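
A compact sketch of this loop is shown below. It assumes scikit-learn, tunes the C and gamma hyperparameters of a support-vector classifier, and uses a smaller population than is typical; all of these are illustrative choices.

<syntaxhighlight lang="python">
# Minimal sketch of evolutionary hyperparameter search for an SVM, assuming
# scikit-learn; population size, mutation scale and generation count are arbitrary.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

def fitness(ind):
    C, gamma = np.exp(ind)                               # individuals live in log-space
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=10).mean()

pop = rng.uniform(-4, 4, size=(20, 2))                   # 1. random initial population (log C, log gamma)
for generation in range(10):
    scores = np.array([fitness(ind) for ind in pop])     # 2. evaluate fitness by cross-validation
    pop = pop[np.argsort(scores)[::-1]]                  # 3. rank by fitness (best first)
    parents = pop[:10]
    children = []
    for _ in range(10):                                  # 4. crossover + mutation of the best half
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        mask = rng.integers(0, 2, size=2).astype(bool)
        children.append(np.where(mask, a, b) + rng.normal(scale=0.3, size=2))
    pop = np.vstack([parents, children])                 # replace the worst-performing half

# best individual of the final evaluated generation
print("best hyperparameters (C, gamma):", np.exp(pop[0]))
</syntaxhighlight>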
 
Evolutionary optimization has been used in hyperparameter optimization for statistical machine learning algorithms,<ref name="bergstra11" /> [[automated machine learning]], typical neural network <ref name="kousiouris1">{{cite journal |vauthors=Kousiouris G, Cuccinotta T, Varvarigou T | year = 2011 | title= The effects of scheduling, workload type and consolidation scenarios on virtual machine performance and their prediction through optimized artificial neural networks | url= https://www.sciencedirect.com/science/article/abs/pii/S0164121211000951 | journal = Journal of Systems and Software | volume = 84 | issue = 8 | pages = 1270–1291| doi = 10.1016/j.jss.2011.04.013 | hdl = 11382/361472 | hdl-access = free }}</ref> and [[Deep learning#Deep neural networks|deep neural network]] architecture search,<ref name="miikkulainen1">{{cite arXiv | vauthors = Miikkulainen R, Liang J, Meyerson E, Rawal A, Fink D, Francon O, Raju B, Shahrzad H, Navruzyan A, Duffy N, Hodjat B | year = 2017 | title = Evolving Deep Neural Networks |eprint=1703.00548| class = cs.NE }}</ref><ref name="jaderberg1">{{cite arXiv | vauthors = Jaderberg M, Dalibard V, Osindero S, Czarnecki WM, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, Fernando C, Kavukcuoglu K | year = 2017 | title = Population Based Training of Neural Networks |eprint=1711.09846| class = cs.LG }}</ref> as well as training of the weights in deep neural networks.<ref name="such1">{{cite arXiv | vauthors = Such FP, Madhavan V, Conti E, Lehman J, Stanley KO, Clune J | year = 2017 | title = Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning |eprint=1712.06567| class = cs.NE }}</ref>
* [[Self-tuning]]
* [[XGBoost]]
* [[Optuna]]
 
== References ==