{{Short description|Process of finding the optimal set of variables for a machine learning algorithm}}
In [[machine learning]], '''hyperparameter optimization'''<ref>Matthias Feurer and Frank Hutter. [https://link.springer.com/content/pdf/10.1007%2F978-3-030-05318-5_1.pdf Hyperparameter optimization]. In: ''AutoML: Methods, Systems, Challenges'', pages 3–38.</ref> or tuning is the problem of choosing a set of optimal [[Hyperparameter (machine learning)|hyperparameters]] for a learning algorithm. A hyperparameter is a [[parameter]] whose value is used to control the learning process and which must be set before the learning process starts.<ref>{{cite journal |last1=Yang|first1=Li|title=On hyperparameter optimization of machine learning algorithms: Theory and practice|journal=Neurocomputing|year=2020|volume=415|pages=295–316|doi=10.1016/j.neucom.2020.07.061|arxiv=2007.15745 }}</ref><ref>{{cite arXiv |vauthors=Franceschi L, Donini M, Perrone V, Klein A, Archambeau C, Seeger M, Pontil M, Frasconi P |title=Hyperparameter Optimization in Machine Learning |year=2024 |class=stat.ML |eprint=2410.22854 }}</ref>
== Approaches ==
=== Grid search ===
The traditional way of performing hyperparameter optimization has been ''grid search'', or a ''parameter sweep'', which is simply an [[Brute-force search|exhaustive search]] through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by [[Cross-validation (statistics)|cross-validation]] on the training set or evaluation on a held-out validation set.<ref>{{cite journal
| vauthors = Chicco D
| title = Ten quick tips for machine learning in computational biology
| journal = BioData Mining
| pmid = 29234465
| doi = 10.1186/s13040-017-0155-3
| pmc= 5721660
| doi-access = free
}}</ref>
Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.
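For illustration, the following minimal sketch (assuming the scikit-learn library, an [[support vector machine|SVM]] classifier, and a hand-chosen discretization of its <code>C</code> and <code>gamma</code> hyperparameters as placeholder values) evaluates every combination in the grid by cross-validation:
<syntaxhighlight lang="python">
# Illustrative grid search with cross-validation (assumes scikit-learn is installed).
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# Manually discretized grid: every combination of C and gamma is evaluated.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
</syntaxhighlight>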
=== Random search ===
Random search replaces the exhaustive enumeration of all combinations by selecting them randomly. This can be simply applied to the discrete setting described above, but the method also generalizes to continuous and mixed spaces. A benefit over grid search is that random search can explore many more values of a continuous hyperparameter than grid search could. Random search can outperform grid search, especially when only a small number of hyperparameters affects the final performance of the machine learning algorithm.<ref name="bergstra" /> In this case, the optimization problem is said to have a low intrinsic dimensionality.<ref>{{Cite journal|
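A minimal sketch of random search under the same assumptions as above (scikit-learn, an SVM classifier, and placeholder log-uniform sampling ranges) replaces the fixed grid with sampling from continuous distributions:
<syntaxhighlight lang="python">
# Illustrative random search over continuous hyperparameter ranges
# (assumes scikit-learn and SciPy are installed).
from scipy.stats import loguniform
from sklearn import datasets
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# Continuous distributions: values are sampled instead of enumerated on a grid.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}

search = RandomizedSearchCV(SVC(kernel="rbf"), param_distributions,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
</syntaxhighlight>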
[[File:Hyperparameter Optimization using Tree-Structured Parzen Estimators.svg|thumb|Methods such as Bayesian optimization smartly explore the space of potential choices of hyperparameters by deciding which combination to explore next based on previous observations.]]
=== Bayesian optimization ===
Bayesian optimization is a global optimization method for noisy black-box functions. Applied to hyperparameter optimization, Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating it, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the ___location of the optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected close to the optimum). In practice, Bayesian optimization has been shown<ref name="hutter">{{Citation
| last1 = Hutter
| first1 = Frank
| last2 = Hoos
| first2 = Holger
| last3 = Leyton-Brown
| first3 = Kevin
| title = Sequential Model-Based Optimization for General Algorithm Configuration
| journal = Learning and Intelligent Optimization
| volume = 6683
| pages = 507–523
| series = Lecture Notes in Computer Science
| isbn = 978-3-642-25565-6
| s2cid = 6944647
}}</ref><ref name="bergstra11">{{Citation
| last1 = Bergstra
| first1 = James
| last2 = Bardenet
| first2 = Remi
| year = 2011
| url = http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf }}</ref><ref name="snoek">{{cite journal
| last1 = Snoek
| first1 = Jasper
| last2 = Larochelle
| first2 = Hugo
| arxiv = 1206.2944
}}</ref><ref name="thornton">{{cite journal
| last1 = Thornton
| first1 = Chris
| last2 = Hutter
| first2 = Frank
| bibcode = 2012arXiv1208.3719T
| arxiv = 1208.3719
}}</ref><ref name="krnc">{{Citation
|last=Kernc
|title=SAMBO: Sequential And Model-Based Optimization: Efficient global optimization in Python
|date=2024
|url=https://zenodo.org/records/14461363
|access-date=2025-01-30
|doi=10.5281/zenodo.14461363
}}</ref> to obtain better results in fewer evaluations compared to grid search and random search, due to the ability to reason about the quality of experiments before they are run.
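The following sketch illustrates the surrogate-model loop described above, with a Gaussian-process regressor as the probabilistic model and expected improvement as the acquisition function; the objective function, search range, and budget are placeholders, and the code assumes scikit-learn, SciPy and NumPy rather than any particular Bayesian optimization library:
<syntaxhighlight lang="python">
# Illustrative Bayesian optimization over one hyperparameter
# (assumes scikit-learn, SciPy and NumPy; the objective is a placeholder).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(log10_c):
    # Placeholder black-box objective: pretend validation error as a function of log10(C).
    return (log10_c - 0.5) ** 2 + 0.1 * np.sin(5 * log10_c)

bounds = (-2.0, 2.0)                       # search range for log10(C)
rng = np.random.default_rng(0)

# Initial random evaluations of the objective.
X = rng.uniform(*bounds, size=(3, 1))
y = np.array([objective(x[0]) for x in X])

# Probabilistic surrogate model of the objective.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)

for _ in range(15):
    gp.fit(X, y)
    # Expected improvement (for minimization) over a dense candidate set.
    candidates = np.linspace(*bounds, 200).reshape(-1, 1)
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    best = y.min()
    z = (best - mu) / sigma
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    # Evaluate the most promising configuration and update the model.
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print("best log10(C):", X[np.argmin(y)][0], "value:", y.min())
</syntaxhighlight>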
=== Gradient-based optimization ===
For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using [[gradient descent]]. The first usage of these techniques was focused on neural networks.<ref>{{cite
A different approach to obtaining a gradient with respect to hyperparameters consists of differentiating the steps of an iterative optimization algorithm using [[automatic differentiation]].<ref>{{cite journal|last1=Domke|first1=Justin|title=Generic Methods for Optimization-Based Modeling|journal=AISTATS
In a different approach,<ref>{{cite arXiv | eprint=1802.09419 | last1=Lorraine | first1=Jonathan | last2=Duvenaud | first2=David | title=Stochastic Hyperparameter Optimization through Hypernetworks | date=2018 | class=cs.LG }}</ref> a hypernetwork is trained to approximate the best response function. One of the advantages of this method is that it can handle discrete hyperparameters as well. Self-tuning networks<ref>{{cite arXiv | eprint=1903.03088 | last1=MacKay | first1=Matthew | last2=Vicol | first2=Paul | last3=Lorraine | first3=Jon | last4=Duvenaud | first4=David | last5=Grosse | first5=Roger | title=Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions | date=2019 | class=cs.LG }}</ref> offer a memory efficient version of this approach by choosing a compact representation for the hypernetwork. More recently, Δ-STN<ref>{{cite arXiv | eprint=2010.13514 | last1=Bae | first1=Juhan | last2=Grosse | first2=Roger | title=Delta-STN: Efficient Bilevel Optimization for Neural Networks using Structured Response Jacobians | date=2020 | class=cs.LG }}</ref> has improved this method further by a slight reparameterization of the hypernetwork which speeds up training. Δ-STN also yields a better approximation of the best-response Jacobian by linearizing the network in the weights, hence removing unnecessary nonlinear effects of large changes in the weights.
Apart from hypernetwork approaches, gradient-based methods can be used to optimize discrete hyperparameters also by adopting a continuous relaxation of the parameters.<ref>{{cite arXiv | eprint=1806.09055 | last1=Liu | first1=Hanxiao | last2=Simonyan | first2=Karen | last3=Yang | first3=Yiming | title=DARTS: Differentiable Architecture Search | date=2018 | class=cs.LG }}</ref> Such methods have been extensively used for the optimization of architecture hyperparameters in [[neural architecture search]].
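As a toy illustration (not the method of any of the cited papers), the sketch below computes the exact hypergradient of a validation loss with respect to the regularization strength of [[Tikhonov regularization|ridge regression]], whose inner problem has a closed-form solution, and follows it with gradient descent on the logarithm of the hyperparameter; the data and step size are placeholders:
<syntaxhighlight lang="python">
# Toy gradient-based hyperparameter optimization (assumes NumPy).
# The inner problem (ridge regression) has a closed-form solution, so the
# validation loss can be differentiated analytically with respect to lambda.
import numpy as np

rng = np.random.default_rng(0)
X_tr, X_val = rng.normal(size=(80, 5)), rng.normal(size=(40, 5))
w_true = rng.normal(size=5)
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=80)
y_val = X_val @ w_true + 0.5 * rng.normal(size=40)

log_lam = 0.0          # optimize log(lambda) so that lambda stays positive
lr = 0.5               # step size for the outer (hyperparameter) updates
for _ in range(200):
    lam = np.exp(log_lam)
    A = X_tr.T @ X_tr + lam * np.eye(5)
    w = np.linalg.solve(A, X_tr.T @ y_tr)       # inner (training) solution w(lambda)
    r = X_val @ w - y_val                       # validation residuals
    # dw/dlambda = -A^{-1} w, so the mean-squared validation loss has gradient:
    dw_dlam = -np.linalg.solve(A, w)
    dL_dlam = 2.0 * r @ (X_val @ dw_dlam) / len(r)
    log_lam -= lr * dL_dlam * lam               # chain rule: dL/dlog(lambda) = lambda * dL/dlambda

print("selected lambda:", np.exp(log_lam), "validation MSE:", np.mean(r ** 2))
</syntaxhighlight>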
=== Evolutionary optimization ===
Evolutionary optimization uses [[evolutionary algorithm]]s to search the space of hyperparameters for a given algorithm. Evolutionary hyperparameter optimization follows a process inspired by the biological concept of [[evolution]]:
# Create an initial population of random solutions (i.e., randomly generate tuples of hyperparameters, typically 100+)
# Evaluate the hyperparameter tuples and acquire their [[fitness function|fitness]] (e.g., the 10-fold [[Cross-validation (statistics)|cross-validation]] accuracy of the machine learning algorithm with those hyperparameters)
# Rank the hyperparameter tuples by their relative fitness
# Replace the worst-performing hyperparameter tuples with new ones generated through [[Crossover (genetic algorithm)|crossover]] and [[Mutation (genetic algorithm)|mutation]]
# Repeat steps 2–4 until satisfactory algorithm performance is reached or performance stops improving (a minimal sketch of this loop is shown below)
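A toy sketch of this loop (assuming scikit-learn and NumPy, with an SVM classifier, a deliberately small population, and placeholder mutation noise) is:
<syntaxhighlight lang="python">
# Toy evolutionary search over two SVM hyperparameters (assumes scikit-learn and NumPy),
# following the generate / evaluate / rank / replace loop described above.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
rng = np.random.default_rng(0)

def fitness(log_c, log_gamma):
    # Step 2: fitness = cross-validated accuracy of the model with these hyperparameters.
    model = SVC(C=10.0 ** log_c, gamma=10.0 ** log_gamma)
    return cross_val_score(model, X, y, cv=5).mean()

# Step 1: initial random population (kept small here; real runs use far more tuples).
population = [(rng.uniform(-2, 2), rng.uniform(-4, 0)) for _ in range(20)]

for generation in range(10):
    # Step 3: rank the hyperparameter tuples by fitness.
    ranked = sorted(population, key=lambda t: fitness(*t), reverse=True)
    parents = ranked[: len(ranked) // 2]
    # Step 4: replace the worst half with mutated copies of the best half.
    children = [(c + rng.normal(0, 0.3), g + rng.normal(0, 0.3)) for c, g in parents]
    population = parents + children

best = max(population, key=lambda t: fitness(*t))
print("best C:", 10.0 ** best[0], "best gamma:", 10.0 ** best[1])
</syntaxhighlight>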
Evolutionary optimization has been used in hyperparameter optimization for statistical machine learning algorithms,<ref name="bergstra11" /> [[automated machine learning]], typical neural network <ref name="kousiouris1">
=== Population-based ===
Population Based Training (PBT) learns both hyperparameter values and network weights. Multiple learning processes operate independently, using different hyperparameters. As with evolutionary methods, poorly performing models are iteratively replaced with models that adopt modified hyperparameter values and weights based on the better performers. This warm starting of the replacement models is the primary differentiator between PBT and other evolutionary methods. PBT thus allows the hyperparameters to evolve during training and eliminates the need for manual hyperparameter tuning. The process makes no assumptions regarding model architecture, loss functions or training procedures.
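A toy sketch of this exploit-and-explore loop (not the original implementation) on a synthetic training objective, where each worker's only hyperparameter is its learning rate and all names are placeholders:
<syntaxhighlight lang="python">
# Toy sketch of population based training on a synthetic objective (assumes NumPy):
# each worker trains its own weights with its own learning rate, and poorly
# performing workers periodically copy (warm start) weights and hyperparameters
# from better ones before perturbing the hyperparameter.
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    return float(np.sum(theta ** 2))           # stand-in for a training/validation loss

def grad(theta):
    return 2 * theta

n_workers = 8
quarter = n_workers // 4
thetas = [rng.normal(size=10) for _ in range(n_workers)]       # model weights
lrs = [10.0 ** rng.uniform(-4, -1) for _ in range(n_workers)]  # one hyperparameter per worker

for interval in range(20):
    # Each worker trains independently for a few steps with its own hyperparameter.
    for i in range(n_workers):
        for _ in range(10):
            thetas[i] = thetas[i] - lrs[i] * grad(thetas[i])
    # Exploit/explore: the bottom quarter copies weights and learning rate from the top quarter.
    ranked = sorted(range(n_workers), key=lambda i: loss(thetas[i]))
    for bad, good in zip(ranked[-quarter:], ranked[:quarter]):
        thetas[bad] = thetas[good].copy()                  # warm start from the better model
        lrs[bad] = lrs[good] * rng.choice([0.8, 1.2])      # perturb the hyperparameter

best = min(range(n_workers), key=lambda i: loss(thetas[i]))
print("best learning rate:", lrs[best], "loss:", loss(thetas[best]))
</syntaxhighlight>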
PBT and its variants are adaptive methods: they update hyperparameters during the training of the models. In contrast, non-adaptive methods use the sub-optimal strategy of assigning a constant set of hyperparameters for the whole training run.
=== Early stopping-based ===
[[File:Successive-halving-for-eight-arbitrary-hyperparameter-configurations.png|thumb|Successive halving for eight arbitrary hyperparameter configurations. The approach starts with eight models with different configurations and consecutively applies successive halving until only one model remains.]]
A class of early stopping-based hyperparameter optimization algorithms is purpose built for large search spaces of continuous and discrete hyperparameters, particularly when the computational cost of evaluating the performance of a set of hyperparameters is high. Irace implements the iterated racing algorithm, which focuses the search around the most promising configurations, using statistical tests to discard those that perform poorly.<ref name="irace">{{cite journal |last1=López-Ibáñez |first1=Manuel |last2=Dubois-Lacoste |first2=Jérémie |last3=Pérez Cáceres |first3=Leslie |last4=Stützle |first4=Thomas |last5=Birattari |first5=Mauro |date=2016 |title=The irace package: Iterated Racing for Automatic Algorithm Configuration |journal=Operations Research Perspectives |volume=3 |issue=3 |pages=43–58 }}</ref>
Another early stopping hyperparameter optimization algorithm is successive halving (SHA),<ref>{{cite
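A toy sketch of successive halving, using a synthetic stand-in for the "train for a given budget and measure validation loss" step and placeholder learning-rate candidates:
<syntaxhighlight lang="python">
# Toy sketch of successive halving (assumes NumPy): start with many random
# configurations on a small budget, then repeatedly keep the better half and
# double the budget until a single configuration remains.
import numpy as np

rng = np.random.default_rng(0)

def validation_loss(lr, budget):
    # Placeholder for "train with this learning rate for `budget` epochs and
    # measure validation loss"; larger budgets reveal the better configurations.
    return np.exp(-budget * lr) + 0.1 * abs(np.log10(lr) + 2)

configs = [10.0 ** rng.uniform(-4, 0) for _ in range(8)]   # candidate learning rates
budget = 1
while len(configs) > 1:
    losses = [validation_loss(lr, budget) for lr in configs]
    ranked = [lr for _, lr in sorted(zip(losses, configs))]
    configs = ranked[: len(configs) // 2]    # discard the worse half
    budget *= 2                              # give the survivors more resources

print("selected learning rate:", configs[0])
</syntaxhighlight>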
=== Others ===
[[Radial basis function|RBF]]<ref name=abs1705.08520>{{cite
== Issues with hyperparameter optimization ==
When hyperparameter optimization is done, the set of hyperparameters is often fitted on a training set and selected based on the generalization performance, or score, of a validation set. However, this procedure risks overfitting the hyperparameters to the validation set. Therefore, the generalization performance score on the validation set (which can be several sets in the case of a cross-validation procedure) cannot simultaneously be used to estimate the generalization performance of the final model. To do so, the generalization performance has to be evaluated on a set independent of (i.e., with no intersection with) the set or sets used to optimize the hyperparameters; otherwise, the performance estimate may be overly optimistic (too large). This can be done on a second test set, or through an outer [[Cross-validation (statistics)|cross-validation]] procedure called nested cross-validation, which allows an unbiased estimation of the generalization performance of the model, taking into account the bias due to the hyperparameter optimization.
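The sketch below (assuming scikit-learn) illustrates nested cross-validation: an inner hyperparameter search is wrapped in an outer cross-validation loop, so the final score is computed on folds never seen by the tuning procedure; the model and grid are placeholders:
<syntaxhighlight lang="python">
# Illustrative nested cross-validation (assumes scikit-learn): the inner loop
# (GridSearchCV) selects hyperparameters, while the outer loop estimates the
# generalization performance of the whole tuning procedure on unseen folds.
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

inner = GridSearchCV(SVC(kernel="rbf"),
                     {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
                     cv=5)                          # inner loop: hyperparameter selection
outer_scores = cross_val_score(inner, X, y, cv=5)   # outer loop: performance estimate
print("estimated generalization accuracy:", outer_scores.mean())
</syntaxhighlight>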
== See also ==
* [[Self-tuning]]
* [[XGBoost]]
* [[Optuna]]
== References ==
{{Reflist}}