{{Short description|Problem of choosing optimal parameters for a machine learning algorithm}}
In [[machine learning]], '''hyperparameter optimization'''<ref>Matthias Feurer and Frank Hutter. [https://link.springer.com/content/pdf/10.1007%2F978-3-030-05318-5_1.pdf Hyperparameter optimization]. In: ''AutoML: Methods, Systems, Challenges'', pages 3–38.</ref> or '''tuning''' is the problem of choosing a set of optimal [[Hyperparameter (machine learning)|hyperparameters]] for a learning algorithm. A hyperparameter is a [[parameter]] whose value is used to control the learning process and which must be configured before the learning process starts.<ref>{{cite journal |last1=Yang|first1=Li|title=On hyperparameter optimization of machine learning algorithms: Theory and practice|journal=Neurocomputing|year=2020|volume=415|pages=295–316|doi=10.1016/j.neucom.2020.07.061|arxiv=2007.15745 }}</ref><ref>{{cite arXiv |vauthors=Franceschi L, Donini M, Perrone V, Klein A, Archambeau C, Seeger M, Pontil M, Frasconi P |title=Hyperparameter Optimization in Machine Learning |year=2024 |class=stat.ML |eprint=2410.22854 }}</ref>
Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined [[loss function]] on given independent data.
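Hyperparameter optimization can therefore be framed as finding
:<math>\lambda^{*} = \underset{\lambda \in \Lambda}{\operatorname{arg\,min}}\; \mathcal{L}(\lambda),</math>
where <math>\Lambda</math> denotes the space of admissible hyperparameter configurations and <math>\mathcal{L}(\lambda)</math> is the loss (for example, the validation error) obtained by training the learning algorithm with hyperparameters <math>\lambda</math> and evaluating the resulting model on independent data.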
== Approaches ==
=== Grid search ===
The traditional way of performing hyperparameter optimization has been ''grid search'', or a ''parameter sweep'', which is simply an [[exhaustive search]] through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by [[Cross-validation (statistics)|cross-validation]] on the training set or evaluation on a hold-out validation set.<ref>{{cite journal | vauthors = Chicco D | title = Ten quick tips for machine learning in computational biology | journal = BioData Mining | date = 2017 | volume = 10 | pages = 35 | doi = 10.1186/s13040-017-0155-3 | pmid = 29234465 | pmc = 5721660 | doi-access = free }}</ref>
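For illustration, the following Python sketch performs an exhaustive search over a small, manually specified grid of [[support vector machine]] hyperparameters, scoring each combination by cross-validation; the estimator, parameter values and dataset are arbitrary illustrative choices rather than part of any standard:

<syntaxhighlight lang="python">
# Illustrative exhaustive grid search over two SVM hyperparameters,
# scoring each combination by 5-fold cross-validation.
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "C": [0.1, 1, 10, 100],          # regularization strength
    "gamma": [0.001, 0.01, 0.1, 1],  # RBF kernel width
}

best_score, best_params = -1.0, None
for C, gamma in product(param_grid["C"], param_grid["gamma"]):
    # Mean cross-validated accuracy for this hyperparameter combination.
    score = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, {"C": C, "gamma": gamma}

print(best_params, best_score)
</syntaxhighlight>

Machine-learning libraries commonly provide equivalent built-in utilities, such as <code>GridSearchCV</code> in [[scikit-learn]].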
=== Bayesian optimization ===
[[Bayesian optimization]] builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set, and uses it to select the most promising hyperparameter configuration to evaluate next. It has been shown<ref>{{cite book
 | vauthors = Thornton C, Hutter F, Hoos HH, Leyton-Brown K
 | year = 2013
 | chapter = Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms
 | title = Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
| bibcode = 2012arXiv1208.3719T
| arxiv = 1208.3719
}}</ref><ref name="krnc">{{Citation
|last=Kernc
|title=SAMBO: Sequential And Model-Based Optimization: Efficient global optimization in Python
|date=2024
|url=https://zenodo.org/records/14461363
|access-date=2025-01-30
|doi=10.5281/zenodo.14461363
}}</ref> to obtain better results in fewer evaluations compared to grid search and random search, due to the ability to reason about the quality of experiments before they are run.
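The following Python sketch illustrates this loop using a [[Gaussian process]] surrogate model and the expected-improvement acquisition function; the single tuned hyperparameter, the dataset and all settings are arbitrary illustrative choices and do not correspond to any particular library:

<syntaxhighlight lang="python">
# Illustrative Bayesian optimization of one hyperparameter (an SVM's
# regularization strength C) with a Gaussian-process surrogate and the
# expected-improvement acquisition function.
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_digits
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def objective(log_c):
    # Negative cross-validated accuracy: lower is better.
    return -cross_val_score(SVC(C=10.0 ** log_c), X, y, cv=3).mean()

bounds = (-3.0, 3.0)                          # search log10(C) in [1e-3, 1e3]
rng = np.random.default_rng(0)
samples = list(rng.uniform(*bounds, size=3))  # a few random initial points
losses = [objective(s) for s in samples]

for _ in range(10):
    # Fit the surrogate model to all observations made so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(samples).reshape(-1, 1), np.array(losses))

    # Expected improvement over the best loss so far, on a dense grid.
    grid = np.linspace(*bounds, 200).reshape(-1, 1)
    mu, sigma = gp.predict(grid, return_std=True)
    best = min(losses)
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    nxt = grid[np.argmax(ei), 0]              # most promising point to try next
    samples.append(nxt)
    losses.append(objective(nxt))

print("best log10(C):", samples[int(np.argmin(losses))])
</syntaxhighlight>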
=== Gradient-based optimization ===
For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using [[gradient descent]]. The first usage of these techniques was focused on neural networks.<ref>{{cite book |last1=Larsen|first1=Jan|last2= Hansen |first2=Lars Kai|last3=Svarer|first3=Claus|last4=Ohlsson|first4=M|title=Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop |chapter=Design and regularization of neural networks: The optimal use of a validation set |date=1996|pages=62–71|doi=10.1109/NNSP.1996.548336|isbn=0-7803-3550-3|citeseerx=10.1.1.415.3266|s2cid=238874|chapter-url=http://orbit.dtu.dk/files/4545571/Svarer.pdf}}</ref> Since then, these methods have been extended to other models such as [[support vector machine]]s<ref>{{cite journal |author1=Olivier Chapelle |author2=Vladimir Vapnik |author3=Olivier Bousquet |author4=Sayan Mukherjee |title=Choosing multiple parameters for support vector machines |journal=Machine Learning |year=2002 |volume=46 |pages=131–159 |url=http://www.chapelle.cc/olivier/pub/mlj02.pdf | doi = 10.1023/a:1012450327387 |doi-access=free }}</ref> or logistic regression.<ref>{{cite journal |author1 =Chuong B|author2= Chuan-Sheng Foo|author3=Andrew Y Ng|journal = Advances in Neural Information Processing Systems |volume=20|title = Efficient multiple hyperparameter learning for log-linear models|year =2008|url=http://papers.nips.cc/paper/3286-efficient-multiple-hyperparameter-learning-for-log-linear-models.pdf}}</ref>
A different approach to obtaining a gradient with respect to hyperparameters consists of differentiating the steps of an iterative optimization algorithm using [[automatic differentiation]].<ref>{{cite journal|last1=Domke|first1=Justin|title=Generic Methods for Optimization-Based Modeling|journal=Aistats|date=2012|volume=22|url=http://www.jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|access-date=2017-12-09|archive-date=2014-01-24|archive-url=https://web.archive.org/web/20140124182520/http://jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|url-status=dead}}</ref><ref name=abs1502.03492>{{cite arXiv |last1=Maclaurin|first1=Dougal|last2=Duvenaud|first2=David|last3=Adams|first3=Ryan P.|eprint=1502.03492|title=Gradient-based Hyperparameter Optimization through Reversible Learning|class=stat.ML|date=2015}}</ref><ref>{{cite journal |last1=Franceschi |first1=Luca |last2=Donini |first2=Michele |last3=Frasconi |first3=Paolo |last4=Pontil |first4=Massimiliano |title=Forward and Reverse Gradient-Based Hyperparameter Optimization |journal=Proceedings of the 34th International Conference on Machine Learning |date=2017 |arxiv=1703.01785 |bibcode=2017arXiv170301785F |url=http://proceedings.mlr.press/v70/franceschi17a/franceschi17a-supp.pdf}}</ref>
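The following NumPy sketch illustrates this idea on a regularized [[least squares]] problem: the derivative of the validation loss with respect to the regularization strength is accumulated in forward mode through a fixed number of unrolled gradient steps, and the hyperparameter is then updated by gradient descent; the data, model and constants are arbitrary illustrative choices:

<syntaxhighlight lang="python">
# Illustrative hypergradient via forward-mode differentiation of an
# unrolled inner optimization of a ridge-style least-squares model.
import numpy as np

rng = np.random.default_rng(0)
X_tr, X_val = rng.normal(size=(80, 5)), rng.normal(size=(40, 5))
true_w = rng.normal(size=5)
y_tr = X_tr @ true_w + 0.5 * rng.normal(size=80)
y_val = X_val @ true_w + 0.5 * rng.normal(size=40)

def hypergradient(lam, inner_steps=200, inner_lr=0.01):
    """Return (validation loss, d loss / d lam) after unrolled inner training."""
    w = np.zeros(5)
    dw_dlam = np.zeros(5)          # forward-mode derivative of w w.r.t. lam
    n = len(y_tr)
    for _ in range(inner_steps):
        # Gradient of the regularized training loss with respect to w.
        grad_w = 2 * X_tr.T @ (X_tr @ w - y_tr) / n + 2 * lam * w
        # Derivative of that gradient with respect to lam (w depends on lam).
        dgrad_dlam = 2 * X_tr.T @ (X_tr @ dw_dlam) / n + 2 * w + 2 * lam * dw_dlam
        w = w - inner_lr * grad_w
        dw_dlam = dw_dlam - inner_lr * dgrad_dlam
    resid = X_val @ w - y_val
    val_loss = resid @ resid / len(y_val)
    dval_dlam = 2 * X_val.T @ resid / len(y_val) @ dw_dlam
    return val_loss, dval_dlam

# Hyperparameter optimization: plain gradient descent on lam itself.
lam, hyper_lr = 1.0, 0.05
for _ in range(50):
    loss, g = hypergradient(lam)
    lam = max(lam - hyper_lr * g, 0.0)   # keep the regularization non-negative
print("tuned lambda:", lam, "validation loss:", loss)
</syntaxhighlight>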
In a different approach, a hypernetwork is trained to map hyperparameter values to (approximately) optimal model weights, so that the hyperparameters themselves can be optimized by gradient descent through the hypernetwork.
Apart from hypernetwork approaches, gradient-based methods can also be used to optimize discrete hyperparameters by adopting a continuous relaxation of the parameters.
=== Evolutionary optimization ===
In hyperparameter optimization, evolutionary optimization uses [[evolutionary algorithm]]s to search the space of hyperparameters, following a process inspired by the biological concept of [[evolution]]:
# Create an initial population of random solutions (i.e., randomly generate tuples of hyperparameters, typically 100+)
# Evaluate the hyperparameter tuples and acquire their [[Fitness function|fitness]] (e.g., 10-fold [[Cross-validation (statistics)|cross-validation]] accuracy of the machine learning algorithm with those hyperparameters)
# Rank the hyperparameter tuples by their relative fitness
# Replace the worst-performing hyperparameter tuples with new ones generated through [[Crossover (genetic algorithm)|crossover]] and [[Mutation (genetic algorithm)|mutation]]
# Repeat steps 2–4 until satisfactory algorithm performance is reached or performance stops improving, as illustrated in the sketch below
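The following Python sketch follows these steps with an arbitrary population size, search ranges, and crossover and mutation scheme, using cross-validated accuracy as the fitness function:

<syntaxhighlight lang="python">
# Illustrative evolutionary search over two SVM hyperparameters: random
# initial population, fitness by cross-validation, ranking, replacement of
# the worst tuples via crossover and mutation, repeated for a few generations.
import random

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
random.seed(0)

def random_tuple():
    # Log-uniform samples for (C, gamma).
    return (10 ** random.uniform(-2, 2), 10 ** random.uniform(-3, 1))

def fitness(params):
    C, gamma = params
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

population = [random_tuple() for _ in range(10)]             # step 1
for generation in range(5):                                  # step 5: repeat
    scored = sorted(population, key=fitness, reverse=True)   # steps 2-3
    survivors = scored[: len(scored) // 2]
    children = []
    while len(survivors) + len(children) < len(population):  # step 4
        a, b = random.sample(survivors, 2)
        child = (a[0], b[1])                                            # crossover
        child = tuple(v * 10 ** random.gauss(0, 0.2) for v in child)    # mutation
        children.append(child)
    population = survivors + children

best = max(population, key=fitness)
print("best (C, gamma):", best)
</syntaxhighlight>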
Evolutionary optimization has been used in hyperparameter optimization for statistical machine learning algorithms,<ref name="bergstra11" /> [[automated machine learning]], typical neural network <ref name="kousiouris1">{{cite journal |vauthors=Kousiouris G, Cuccinotta T, Varvarigou T | year = 2011 | title= The effects of scheduling, workload type and consolidation scenarios on virtual machine performance and their prediction through optimized artificial neural networks | url= https://www.sciencedirect.com/science/article/abs/pii/S0164121211000951 | journal = Journal of Systems and Software | volume = 84 | issue = 8 | pages = 1270–1291| doi = 10.1016/j.jss.2011.04.013 | hdl = 11382/361472 | hdl-access = free }}</ref> and [[Deep learning#Deep neural networks|deep neural network]] architecture search,<ref name="miikkulainen1">{{cite arXiv | vauthors = Miikkulainen R, Liang J, Meyerson E, Rawal A, Fink D, Francon O, Raju B, Shahrzad H, Navruzyan A, Duffy N, Hodjat B | year = 2017 | title = Evolving Deep Neural Networks |eprint=1703.00548| class = cs.NE }}</ref><ref name="jaderberg1">{{cite arXiv | vauthors = Jaderberg M, Dalibard V, Osindero S, Czarnecki WM, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, Fernando C, Kavukcuoglu K | year = 2017 | title = Population Based Training of Neural Networks |eprint=1711.09846| class = cs.LG }}</ref> as well as training of the weights in deep neural networks.<ref name="such1">{{cite arXiv | vauthors = Such FP, Madhavan V, Conti E, Lehman J, Stanley KO, Clune J | year = 2017 | title = Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning |eprint=1712.06567| class = cs.NE }}</ref>
=== Early stopping-based ===
[[File:Successive-halving-for-eight-arbitrary-hyperparameter-configurations.png|thumb|Successive halving for eight arbitrary hyperparameter configurations. The approach starts with eight models with different configurations and consecutively applies successive halving until only one model remains.]]
A class of early stopping-based hyperparameter optimization algorithms is purpose-built for large search spaces of continuous and discrete hyperparameters, particularly when the computational cost to evaluate the performance of a set of hyperparameters is high. Irace implements the iterated racing algorithm, which focuses the search around the most promising configurations, using statistical tests to discard the ones that perform poorly.<ref name="irace">{{cite journal |last1=López-Ibáñez |first1=Manuel |last2=Dubois-Lacoste |first2=Jérémie |last3=Pérez Cáceres |first3=Leslie |last4=Stützle |first4=Thomas |last5=Birattari |first5=Mauro |date=2016 |title=The irace package: Iterated Racing for Automatic Algorithm Configuration |journal=Operations Research Perspective |volume=3 |issue=3 |pages=43–58 |doi=10.1016/j.orp.2016.09.002|doi-access=free |hdl=10419/178265 |hdl-access=free }}</ref><ref name="race">{{cite journal |last1=Birattari |first1=Mauro |last2=Stützle |first2=Thomas |last3=Paquete |first3=Luis |last4=Varrentrapp |first4=Klaus |date=2002 |title=A Racing Algorithm for Configuring Metaheuristics |journal=Gecco 2002 |pages=11–18}}</ref>
Another early stopping hyperparameter optimization algorithm is successive halving (SHA),<ref>{{cite arXiv|last1=Jamieson|first1=Kevin|last2=Talwalkar|first2=Ameet|date=2015-02-27|title=Non-stochastic Best Arm Identification and Hyperparameter Optimization|eprint=1502.07943|class=cs.LG}}</ref> which begins as a random search but periodically prunes low-performing models, thereby focusing computational resources on more promising models. Asynchronous successive halving (ASHA)<ref>{{cite arXiv|last1=Li|first1=Liam|last2=Jamieson|first2=Kevin|last3=Rostamizadeh|first3=Afshin|last4=Gonina|first4=Ekaterina|last5=Hardt|first5=Moritz|last6=Recht|first6=Benjamin|last7=Talwalkar|first7=Ameet|date=2020-03-16|title=A System for Massively Parallel Hyperparameter Tuning|class=cs.LG|eprint=1810.05934v5}}</ref> further improves upon SHA's resource utilization profile by removing the need to synchronously evaluate and prune low-performing models. Hyperband<ref>{{cite journal|last1=Li|first1=Lisha|last2=Jamieson|first2=Kevin|last3=DeSalvo|first3=Giulia|last4=Rostamizadeh|first4=Afshin|last5=Talwalkar|first5=Ameet|date=2020-03-16|title=Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization|journal=Journal of Machine Learning Research|volume=18|pages=1–52|arxiv=1603.06560}}</ref> is a higher-level early stopping-based algorithm that invokes SHA or ASHA multiple times with varying levels of pruning aggressiveness, in order to be more widely applicable and to require fewer inputs.
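The following Python sketch illustrates successive halving; for simplicity, surviving configurations are retrained from scratch at each rung with a doubled epoch budget rather than resumed, and the candidate configurations, dataset and budgets are arbitrary illustrative choices:

<syntaxhighlight lang="python">
# Illustrative successive halving: start many randomly sampled configurations
# with a small training budget, keep the better half at each rung, and double
# the budget for the survivors. Budgets here are numbers of SGD epochs.
import random

from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
random.seed(0)

# Sample candidate hyperparameter configurations at random.
configs = [{"alpha": 10 ** random.uniform(-6, -1)} for _ in range(16)]

budget = 5                                     # initial number of epochs
while len(configs) > 1:
    # Evaluate every surviving configuration under the current budget.
    scores = []
    for cfg in configs:
        model = SGDClassifier(alpha=cfg["alpha"], max_iter=budget,
                              tol=None, random_state=0)
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_val, y_val))
    # Keep the top half ("halving") and double the budget for the next rung.
    ranked = sorted(zip(scores, range(len(configs))), reverse=True)
    configs = [configs[i] for _, i in ranked[: len(configs) // 2]]
    budget *= 2

print("selected configuration:", configs[0])
</syntaxhighlight>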
== See also ==
* [[Self-tuning]]
* [[XGBoost]]
* [[Optuna]]
== References ==
{{Reflist}}