Hyperparameter optimization

{{Short description|Process of finding the optimal set of variables for a machine learning algorithm}}
In [[machine learning]], '''hyperparameter optimization'''<ref>Matthias Feurer and Frank Hutter. [https://link.springer.com/content/pdf/10.1007%2F978-3-030-05318-5_1.pdf Hyperparameter optimization]. In: ''AutoML: Methods, Systems, Challenges'', pages 3–38.</ref> or tuning is the problem of choosing a set of optimal [[Hyperparameter (machine learning)|hyperparameters]] for a learning algorithm. A hyperparameter is a [[parameter]] whose value is used to control the learning process and which must be configured before the learning process starts.<ref>{{cite journal |last1=Yang |first1=Li |title=On hyperparameter optimization of machine learning algorithms: Theory and practice |journal=Neurocomputing |year=2020 |volume=415 |pages=295–316 |doi=10.1016/j.neucom.2020.07.061 |arxiv=2007.15745}}</ref><ref>{{cite arXiv |vauthors=Franceschi L, Donini M, Perrone V, Klein A, Archambeau C, Seeger M, Pontil M, Frasconi P |title=Hyperparameter Optimization in Machine Learning |year=2024 |class=stat.ML |eprint=2410.22854}}</ref>
 
The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization determines the set of hyperparameters that yields an optimal model which minimizes a predefined [[loss function]] on a given independent [[data set]].<ref name=abs1502.02127>{{cite arXiv |eprint=1502.02127|last1=Claesen|first1=Marc|title=Hyperparameter Search in Machine Learning|author2=Bart De Moor|class=cs.LG|year=2015}}</ref> The objective function takes a set of hyperparameters and returns the associated loss.<ref name=abs1502.02127/> [[Cross-validation (statistics)|Cross-validation]] is often used to estimate this generalization performance, and therefore choose the set of values for hyperparameters that maximize it.<ref name="bergstra">{{cite journal|last1=Bergstra|first1=James|last2=Bengio|first2=Yoshua|year=2012|title=Random Search for Hyper-Parameter Optimization|url=http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf|journal=Journal of Machine Learning Research|volume=13|pages=281–305}}</ref>
 
== Approaches ==
 
=== Grid search ===
The traditional method for hyperparameter optimization has been ''grid search'', or a ''parameter sweep'', which is simply an [[Brute-force search|exhaustive search]] through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by [[Cross-validation (statistics)|cross-validation]] on the training set<ref>Chin-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin (2010). [http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf A practical guide to support vector classification]. Technical Report, [[National Taiwan University]].</ref>
or evaluation on a hold-out validation set.<ref>{{cite journal |vauthors=Chicco D |year=2017 |title=Ten quick tips for machine learning in computational biology |journal=BioData Mining |volume=10 |pages=35 |pmid=29234465 |pmc=5721660 |doi=10.1186/s13040-017-0155-3 |doi-access=free}}</ref>
 
Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.
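For illustration, a minimal sketch of a grid search using [[scikit-learn]] might look as follows; the dataset, estimator and grid values are arbitrary placeholders rather than recommended settings. Every combination in the hand-specified grid is trained and scored by cross-validation.

<syntaxhighlight lang="python">
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)  # placeholder dataset

# Manually specified, discretized subset of the hyperparameter space.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

# Each of the 16 combinations is trained and scored by 5-fold cross-validation.
search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
</syntaxhighlight>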
 
=== Random search ===
Random search replaces the exhaustive enumeration of all combinations by selecting them randomly. This can be applied directly to the discrete setting described above, but the approach also generalizes to continuous and mixed spaces. A benefit over grid search is that random search can explore many more values than grid search could for continuous hyperparameters. It can outperform grid search, especially when only a small number of hyperparameters affects the final performance of the machine learning algorithm.<ref name="bergstra" /> In this case, the optimization problem is said to have a low intrinsic dimensionality.<ref>{{Cite journal|last1=Wang|first1=Ziyu|last2=Hutter|first2=Frank|last3=Zoghi|first3=Masrour|last4=Matheson|first4=David|last5=de Freitas|first5=Nando|date=2016|title=Bayesian Optimization in a Billion Dimensions via Random Embeddings|journal=Journal of Artificial Intelligence Research|language=en|volume=55|pages=361–387|doi=10.1613/jair.4806|arxiv=1301.1942|s2cid=279236}}</ref> Random search is also [[embarrassingly parallel]], and additionally allows the inclusion of prior knowledge by specifying the distribution from which to sample. Despite its simplicity, random search remains one of the important baselines against which to compare the performance of new hyperparameter optimization methods.
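A minimal sketch of this approach, again using [[scikit-learn]] with placeholder data and priors, samples both hyperparameters of an SVM from log-uniform distributions instead of enumerating a grid:

<syntaxhighlight lang="python">
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

X, y = datasets.load_iris(return_X_y=True)  # placeholder dataset

# Prior knowledge enters through the sampling distributions (here log-uniform).
param_distributions = {"C": loguniform(1e-2, 1e3), "gamma": loguniform(1e-4, 1e1)}

# 50 randomly sampled configurations, each scored by 5-fold cross-validation.
search = RandomizedSearchCV(svm.SVC(kernel="rbf"), param_distributions,
                            n_iter=50, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
</syntaxhighlight>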
 
[[File:Hyperparameter Optimization using Tree-Structured Parzen Estimators.svg|thumb|Methods such as Bayesian optimization smartly explore the space of potential choices of hyperparameters by deciding which combination to explore next based on previous observations.]]
| last3 = Leyton-Brown
| first3 = Kevin
| chapter = Sequential Model-Based Optimization for General Algorithm Configuration
| title = Learning and Intelligent Optimization
| volume = 6683
| pages = 507–523
| bibcode = 2012arXiv1208.3719T
| arxiv = 1208.3719
}}</ref><ref name="krnc">{{Citation
|last=Kernc
|title=SAMBO: Sequential And Model-Based Optimization: Efficient global optimization in Python
|date=2024
|url=https://zenodo.org/records/14461363
|access-date=2025-01-30
|doi=10.5281/zenodo.14461363
}}</ref> to obtain better results in fewer evaluations compared to grid search and random search, due to the ability to reason about the quality of experiments before they are run.
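The following self-contained sketch illustrates the sequential model-based loop with a [[Gaussian process]] surrogate and an expected-improvement acquisition function; the one-dimensional synthetic objective stands in for an expensive evaluation such as a cross-validated loss, and the bounds, kernel and evaluation budget are arbitrary placeholders.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def objective(x):
    """Stand-in for an expensive evaluation (e.g. a cross-validated loss)."""
    return np.sin(3 * x) + 0.1 * x ** 2

# A few initial random evaluations of the hyperparameter (here 1-D in [-3, 3]).
X = rng.uniform(-3, 3, size=(4, 1))
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(20):  # sequential model-based optimization loop
    gp.fit(X, y)                                   # fit the surrogate to all observations
    candidates = rng.uniform(-3, 3, size=(500, 1))
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y.min()
    # Expected improvement acquisition function (for minimization).
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (best - mu) / sigma
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        ei[sigma == 0.0] = 0.0
    x_next = candidates[np.argmax(ei)].reshape(1, -1)  # most promising candidate
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best hyperparameter:", X[np.argmin(y)], "loss:", y.min())
</syntaxhighlight>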
 
=== Gradient-based optimization ===
For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using [[gradient descent]]. The first usage of these techniques was focused on neural networks.<ref>{{cite book |last1=Larsen|first1=Jan|last2= Hansen |first2=Lars Kai|last3=Svarer|first3=Claus|last4=Ohlsson|first4=M|title=Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop |chapter=Design and regularization of neural networks: The optimal use of a validation set |journal=Proceedings of the 1996 IEEE Signal Processing Society Workshop|date=1996|pages=62–71|doi=10.1109/NNSP.1996.548336|isbn=0-7803-3550-3|citeseerx=10.1.1.415.3266|s2cid=238874|chapter-url=http://orbit.dtu.dk/files/4545571/Svarer.pdf}}</ref> Since then, these methods have been extended to other models such as [[support vector machine]]s<ref>{{cite journal |author1=Olivier Chapelle |author2=Vladimir Vapnik |author3=Olivier Bousquet |author4=Sayan Mukherjee |title=Choosing multiple parameters for support vector machines |journal=Machine Learning |year=2002 |volume=46 |pages=131–159 |url=http://www.chapelle.cc/olivier/pub/mlj02.pdf | doi = 10.1023/a:1012450327387 |doi-access=free }}</ref> or logistic regression.<ref>{{cite journal |author1 =Chuong B|author2= Chuan-Sheng Foo|author3=Andrew Y Ng|journal = Advances in Neural Information Processing Systems |volume=20|title = Efficient multiple hyperparameter learning for log-linear models|year =2008|url=http://papers.nips.cc/paper/3286-efficient-multiple-hyperparameter-learning-for-log-linear-models.pdf}}</ref>
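For a model whose inner training problem has a closed-form solution, such as [[Tikhonov regularization|ridge regression]], the hypergradient can be written down explicitly and used for gradient descent on the hyperparameter. The sketch below uses synthetic placeholder data and an arbitrary step size, and optimizes the logarithm of the regularization strength so that it stays positive:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Synthetic train/validation split (placeholder data).
X_tr, X_val = rng.normal(size=(80, 10)), rng.normal(size=(40, 10))
w_true = rng.normal(size=10)
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=80)
y_val = X_val @ w_true + 0.5 * rng.normal(size=40)

log_lam, lr = 0.0, 0.1                     # optimize log(lambda) to keep lambda > 0
for _ in range(100):
    lam = np.exp(log_lam)
    A = X_tr.T @ X_tr + lam * np.eye(10)
    w = np.linalg.solve(A, X_tr.T @ y_tr)          # inner ridge solution w*(lambda)
    resid = X_val @ w - y_val
    dw_dlam = -np.linalg.solve(A, w)               # derivative dw*/dlambda of the inner solution
    dL_dlam = resid @ X_val @ dw_dlam              # chain rule through the validation loss
    log_lam -= lr * lam * dL_dlam                  # gradient step on log(lambda)

print("tuned regularization strength:", np.exp(log_lam))
</syntaxhighlight>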
 
A different approach to obtaining a gradient with respect to hyperparameters consists in differentiating the steps of an iterative optimization algorithm using [[automatic differentiation]].<ref>{{cite journal|last1=Domke|first1=Justin|title=Generic Methods for Optimization-Based Modeling|journal=Aistats|date=2012|volume=22|url=http://www.jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|access-date=2017-12-09|archive-date=2014-01-24|archive-url=https://web.archive.org/web/20140124182520/http://jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|url-status=dead}}</ref><ref name=abs1502.03492>{{cite arXiv |last1=Maclaurin|first1=Dougal|last2=Duvenaud|first2=David|last3=Adams|first3=Ryan P.|eprint=1502.03492|title=Gradient-based Hyperparameter Optimization through Reversible Learning|class=stat.ML|date=2015}}</ref><ref>{{cite journal |last1=Franceschi |first1=Luca |last2=Donini |first2=Michele |last3=Frasconi |first3=Paolo |last4=Pontil |first4=Massimiliano |title=Forward and Reverse Gradient-Based Hyperparameter Optimization |journal=Proceedings of the 34th International Conference on Machine Learning |date=2017 |arxiv=1703.01785 |bibcode=2017arXiv170301785F |url=http://proceedings.mlr.press/v70/franceschi17a/franceschi17a-supp.pdf}}</ref><ref>{{cite arXiv | eprint=1810.10667 | last1=Shaban | first1=Amirreza | last2=Cheng | first2=Ching-An | last3=Hatch | first3=Nathan | last4=Boots | first4=Byron | title=Truncated Back-propagation for Bilevel Optimization | date=2018 | class=cs.LG }}</ref> A more recent work along this direction uses the [[implicit function theorem]] to calculate hypergradients and proposes a stable approximation of the inverse Hessian. The method scales to millions of hyperparameters and requires constant memory.<ref>{{cite arXiv | eprint=1911.02590 | last1=Lorraine | first1=Jonathan | last2=Vicol | first2=Paul | last3=Duvenaud | first3=David | title=Optimizing Millions of Hyperparameters by Implicit Differentiation | date=2019 | class=cs.LG }}</ref>
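A minimal sketch of the forward-mode variant for a ridge-regression inner problem is shown below: the derivative of the weights with respect to the regularization strength is propagated alongside the gradient-descent updates and combined with the validation gradient at the final iterate. The synthetic data, step size and number of steps are placeholders; practical implementations rely on an automatic-differentiation framework rather than hand-derived updates.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
X_tr, X_val = rng.normal(size=(80, 10)), rng.normal(size=(40, 10))
w_true = rng.normal(size=10)
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=80)
y_val = X_val @ w_true + 0.5 * rng.normal(size=40)

def unrolled_hypergradient(lam, steps=300, eta=0.005):
    """Inner gradient descent on the ridge training loss, with the
    derivative z_t = dw_t/dlambda propagated forward alongside w_t."""
    w = np.zeros(10)
    z = np.zeros(10)
    H = X_tr.T @ X_tr
    for _ in range(steps):
        # Differentiate the update rule w <- w - eta * grad before applying it.
        z = z - eta * ((H + lam * np.eye(10)) @ z + w)
        w = w - eta * (X_tr.T @ (X_tr @ w - y_tr) + lam * w)
    # Chain rule: d(validation loss)/dlambda = (dL_val/dw)^T (dw/dlambda).
    return X_val.T @ (X_val @ w - y_val) @ z

print("hypergradient at lambda = 1:", unrolled_hypergradient(lam=1.0))
</syntaxhighlight>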
 
In a different approach,<ref>{{cite arXiv | eprint=1802.09419 | last1=Lorraine | first1=Jonathan | last2=Duvenaud | first2=David | title=Stochastic Hyperparameter Optimization through Hypernetworks | date=2018 | class=cs.LG }}</ref> a hypernetwork is trained to approximate the best response function. One of the advantages of this method is that it can handle discrete hyperparameters as well. Self-tuning networks<ref>{{cite arXiv | eprint=1903.03088 | last1=MacKay | first1=Matthew | last2=Vicol | first2=Paul | last3=Lorraine | first3=Jon | last4=Duvenaud | first4=David | last5=Grosse | first5=Roger | title=Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions | date=2019 | class=cs.LG }}</ref> offer a memory efficient version of this approach by choosing a compact representation for the hypernetwork. More recently, Δ-STN<ref>{{cite arXiv | eprint=2010.13514 | last1=Bae | first1=Juhan | last2=Grosse | first2=Roger | title=Delta-STN: Efficient Bilevel Optimization for Neural Networks using Structured Response Jacobians | date=2020 | class=cs.LG }}</ref> has improved this method further by a slight reparameterization of the hypernetwork which speeds up training. Δ-STN also yields a better approximation of the best-response Jacobian by linearizing the network in the weights, hence removing unnecessary nonlinear effects of large changes in the weights.
 
Apart from hypernetwork approaches, gradient-based methods can also be used to optimize discrete hyperparameters by adopting a continuous relaxation of the parameters.<ref>{{cite arXiv | eprint=1806.09055 | last1=Liu | first1=Hanxiao | last2=Simonyan | first2=Karen | last3=Yang | first3=Yiming | title=DARTS: Differentiable Architecture Search | date=2018 | class=cs.LG }}</ref> Such methods have been extensively used for the optimization of architecture hyperparameters in [[neural architecture search]].
 
=== Evolutionary optimization ===
 
# Create an initial population of random solutions (i.e., randomly generate tuples of hyperparameters, typically 100+)
# Evaluate the hyperparameter tuples and acquire their [[fitness function]] (e.g., 10-fold [[Cross-validation (statistics)|cross-validation]] accuracy of the machine learning algorithm with those hyperparameters)
# Rank the hyperparameter tuples by their relative fitness
# Replace the worst-performing hyperparameter tuples with new ones generated via [[crossover (genetic algorithm)|crossover]] and [[mutation (genetic algorithm)|mutation]]
# Repeat steps 2–4 until satisfactory performance is reached or performance is no longer improving (a minimal sketch of this loop is given below).
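The sketch below implements these steps for two hyperparameters with a toy fitness function standing in for cross-validated accuracy; the search ranges, population size and mutation scale are arbitrary placeholders.

<syntaxhighlight lang="python">
import random

random.seed(0)

def fitness(params):
    """Placeholder for e.g. 10-fold cross-validation accuracy of a model
    trained with these hyperparameters (higher is better)."""
    lr, reg = params
    return -((lr - 0.1) ** 2 + (reg - 0.01) ** 2)   # toy surrogate objective

def random_individual():
    return (random.uniform(1e-4, 1.0), random.uniform(1e-5, 0.1))

def crossover(a, b):
    # Each child hyperparameter is inherited from one of the two parents.
    return tuple(random.choice(pair) for pair in zip(a, b))

def mutate(p, scale=0.1):
    # Multiplicative Gaussian perturbation, clipped away from zero.
    return tuple(max(1e-6, v * (1 + scale * random.gauss(0, 1))) for v in p)

population = [random_individual() for _ in range(100)]      # step 1: initial population
for generation in range(30):
    ranked = sorted(population, key=fitness, reverse=True)  # steps 2-3: evaluate and rank
    survivors = ranked[: len(ranked) // 2]
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(len(population) - len(survivors))]
    population = survivors + children                       # step 4: replace the worst

print("best hyperparameters:", max(population, key=fitness))
</syntaxhighlight>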
 
Evolutionary optimization has been used in hyperparameter optimization for statistical machine learning algorithms,<ref name="bergstra11" /> [[automated machine learning]], typical neural network <ref name="kousiouris1">{{cite journal |vauthors=Kousiouris G, Cuccinotta T, Varvarigou T | year = 2011 | title= The effects of scheduling, workload type and consolidation scenarios on virtual machine performance and their prediction through optimized artificial neural networks | url= https://www.sciencedirect.com/science/article/abs/pii/S0164121211000951 | journal = Journal of Systems and Software | volume = 84 | issue = 8 | pages = 1270–1291| doi = 10.1016/j.jss.2011.04.013 | hdl = 11382/361472 | hdl-access = free }}</ref> and [[Deep learning#Deep neural networks|deep neural network]] architecture search,<ref name="miikkulainen1">{{cite arXiv | vauthors = Miikkulainen R, Liang J, Meyerson E, Rawal A, Fink D, Francon O, Raju B, Shahrzad H, Navruzyan A, Duffy N, Hodjat B | year = 2017 | title = Evolving Deep Neural Networks |eprint=1703.00548| class = cs.NE }}</ref><ref name="jaderberg1">{{cite arXiv | vauthors = Jaderberg M, Dalibard V, Osindero S, Czarnecki WM, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, Fernando C, Kavukcuoglu K | year = 2017 | title = Population Based Training of Neural Networks |eprint=1711.09846| class = cs.LG }}</ref> as well as training of the weights in deep neural networks.<ref name="such1">{{cite arXiv | vauthors = Such FP, Madhavan V, Conti E, Lehman J, Stanley KO, Clune J | year = 2017 | title = Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning |eprint=1712.06567| class = cs.NE }}</ref>
 
=== Early stopping-based ===
[[File:Successive-halving-for-eight-arbitrary-hyperparameter-configurations.png|thumb|Successive halving for eight arbitrary hyperparameter configurations. The approach starts with eight models with different configurations and consecutively applies successive halving until only one model remains.]]
A class of early stopping-based hyperparameter optimization algorithms is purpose-built for large search spaces of continuous and discrete hyperparameters, particularly when the computational cost of evaluating the performance of a set of hyperparameters is high. Irace implements the iterated racing algorithm, which focuses the search around the most promising configurations, using statistical tests to discard those that perform poorly.<ref name="irace">{{cite journal |last1=López-Ibáñez |first1=Manuel |last2=Dubois-Lacoste |first2=Jérémie |last3=Pérez Cáceres |first3=Leslie |last4=Stützle |first4=Thomas |last5=Birattari |first5=Mauro |date=2016 |title=The irace package: Iterated Racing for Automatic Algorithm Configuration |journal=Operations Research Perspectives |volume=3 |issue=3 |pages=43–58 |doi=10.1016/j.orp.2016.09.002|doi-access=free |hdl=10419/178265 |hdl-access=free }}</ref><ref name="race">{{cite journal |last1=Birattari |first1=Mauro |last2=Stützle |first2=Thomas |last3=Paquete |first3=Luis |last4=Varrentrapp |first4=Klaus |date=2002 |title=A Racing Algorithm for Configuring Metaheuristics |journal=Gecco 2002 |pages=11–18}}</ref>
Another early stopping hyperparameter optimization algorithm is successive halving (SHA),<ref>{{cite arXiv|last1=Jamieson|first1=Kevin|last2=Talwalkar|first2=Ameet|date=2015-02-27|title=Non-stochastic Best Arm Identification and Hyperparameter Optimization|eprint=1502.07943|class=cs.LG}}</ref> which begins as a random search but periodically prunes low-performing models, thereby focusing computational resources on more promising models. Asynchronous successive halving (ASHA)<ref>{{cite arXiv|last1=Li|first1=Liam|last2=Jamieson|first2=Kevin|last3=Rostamizadeh|first3=Afshin|last4=Gonina|first4=Ekaterina|last5=Hardt|first5=Moritz|last6=Recht|first6=Benjamin|last7=Talwalkar|first7=Ameet|date=2020-03-16|title=A System for Massively Parallel Hyperparameter Tuning|class=cs.LG|eprint=1810.05934v5}}</ref> further improves upon SHA's resource utilization profile by removing the need to synchronously evaluate and prune low-performing models. Hyperband<ref>{{cite journal|last1=Li|first1=Lisha|last2=Jamieson|first2=Kevin|last3=DeSalvo|first3=Giulia|last4=Rostamizadeh|first4=Afshin|last5=Talwalkar|first5=Ameet|date=2020-03-16|title=Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization|journal=Journal of Machine Learning Research|volume=18|pages=1–52|arxiv=1603.06560}}</ref> is a higher level early stopping-based algorithm that invokes SHA or ASHA multiple times with varying levels of pruning aggressiveness, in order to be more widely applicable and with fewer required inputs.
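A short sketch of the successive-halving loop for eight arbitrary configurations, as in the figure above, follows; the partial_score function is a placeholder for the measured performance of a partially trained model, and the budget schedule is illustrative.

<syntaxhighlight lang="python">
import math
import random

random.seed(0)

def partial_score(config, budget):
    """Placeholder for the validation score of `config` after training
    with `budget` units of resource (e.g. epochs)."""
    quality, speed = config
    return quality * (1 - math.exp(-speed * budget)) + random.gauss(0, 0.01)

# Eight arbitrary hyperparameter configurations.
configs = [(random.random(), random.random()) for _ in range(8)]

budget = 1
while len(configs) > 1:
    scores = [partial_score(c, budget) for c in configs]
    ranked = sorted(zip(scores, configs), reverse=True)
    configs = [c for _, c in ranked[: len(configs) // 2]]  # prune the worse half
    budget *= 2  # survivors get twice the budget in the next round
                 # (in practice they resume training rather than restart)

print("surviving configuration:", configs[0])
</syntaxhighlight>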
 
 
=== Others ===
== See also ==
* [[Self-tuning]]
* [[XGBoost]]
* [[Optuna]]
 
== References ==