{{Short description|Process of finding the optimal set of variables for a machine learning algorithm}}
In [[machine learning]], '''hyperparameter optimization'''<ref>Matthias Feurer and Frank Hutter. [https://link.springer.com/content/pdf/10.1007%2F978-3-030-05318-5_1.pdf Hyperparameter optimization]. In: ''AutoML: Methods, Systems, Challenges'', pages 3–38.</ref> or tuning is the problem of choosing a set of optimal [[Hyperparameter (machine learning)|hyperparameters]] for a learning algorithm. A hyperparameter is a [[parameter]] whose value is used to control the learning process and must be configured before the process starts.<ref>{{cite journal |last1=Yang|first1=Li|title=On hyperparameter optimization of machine learning algorithms: Theory and practice|journal=Neurocomputing|year=2020|volume=415|pages=295–316|doi=10.1016/j.neucom.2020.07.061|arxiv=2007.15745 }}</ref><ref>{{cite arXiv |vauthors=Franceschi L, Donini M, Perrone V, Klein A, Archambeau C, Seeger M, Pontil M, Frasconi P |title=Hyperparameter Optimization in Machine Learning |year=2024 |class=stat.ML |eprint=2410.22854 }}</ref> By contrast, the values of other parameters (typically node weights) are learned.
 
The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization determines the set of hyperparameters that yields an optimal model which minimizes a predefined [[loss function]] on a given independent [[data set]].<ref name=abs1502.02127>{{cite arXiv |eprint=1502.02127|last1=Claesen|first1=Marc|title=Hyperparameter Search in Machine Learning|author2=Bart De Moor|class=cs.LG|year=2015}}</ref> The objective function takes a set of hyperparameters and returns the associated loss.<ref name=abs1502.02127/> [[Cross-validation (statistics)|Cross-validation]] is often used to estimate this generalization performance, and therefore choose the set of values for hyperparameters that maximize it.<ref name="bergstra">{{cite journal|last1=Bergstra|first1=James|last2=Bengio|first2=Yoshua|year=2012|title=Random Search for Hyper-Parameter Optimization|url=http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf|journal=Journal of Machine Learning Research|volume=13|pages=281–305}}</ref>
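For illustration, such an objective function can be written as a routine that accepts a candidate set of hyperparameters and returns a cross-validated loss. The following is a minimal sketch, assuming [[scikit-learn]] with a [[support vector machine]] classifier; the data set and hyperparameter names are illustrative only.

<syntaxhighlight lang="python">
# Minimal sketch of an objective function for hyperparameter optimization
# (illustrative only; assumes scikit-learn and an SVM as the model to tune).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(hyperparameters):
    """Map a set of hyperparameters to an estimated generalization loss."""
    model = SVC(C=hyperparameters["C"], gamma=hyperparameters["gamma"])
    # Cross-validation estimates generalization performance on held-out folds.
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    return 1.0 - accuracy  # lower loss is better

print(objective({"C": 1.0, "gamma": 0.1}))
</syntaxhighlight>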
 
== Approaches ==
 
=== Grid search ===
The traditional method for hyperparameter optimization has been ''grid search'', or a ''parameter sweep'', which is simply an [[Brute-force search|exhaustive search]] through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by [[Cross-validation (statistics)|cross-validation]] on the training set<ref>Chin-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin (2010). [http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf A practical guide to support vector classification]. Technical Report, [[National Taiwan University]].</ref>
or evaluation on a hold-out validation set.<ref>{{cite journal
| vauthors = Chicco D
| title = Ten quick tips for machine learning in computational biology
| pmid = 29234465
| doi = 10.1186/s13040-017-0155-3
| pmc= 5721660
| doi-access = free
}}</ref>
 
Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.
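For illustration, a grid search over two hyperparameters of an SVM might be sketched as follows, assuming [[scikit-learn]]; the model and the grid values are arbitrary choices, not recommendations.

<syntaxhighlight lang="python">
# Sketch of a grid search with cross-validation using scikit-learn
# (the model and the grid values are illustrative assumptions).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Manually discretized, bounded grid over two hyperparameters.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(SVC(), param_grid, cv=5)  # 4 x 4 = 16 configurations
search.fit(X, y)
print(search.best_params_, search.best_score_)
</syntaxhighlight>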
 
=== Random search ===
Random search replaces the exhaustive enumeration of all combinations by selecting them randomly. This can be simply applied to the discrete setting described above, but also generalizes to continuous and mixed spaces. A benefit over grid search is that random search can explore many more values than grid search could for continuous hyperparameters. It can outperform grid search, especially when only a small number of hyperparameters affects the final performance of the machine learning algorithm.<ref name="bergstra" /> In this case, the optimization problem is said to have a low intrinsic dimensionality.<ref>{{Cite journal|last1=Wang|first1=Ziyu|last2=Hutter|first2=Frank|last3=Zoghi|first3=Masrour|last4=Matheson|first4=David|last5=de Freitas|first5=Nando|date=2016|title=Bayesian Optimization in a Billion Dimensions via Random Embeddings|journal=Journal of Artificial Intelligence Research|language=en|volume=55|pages=361–387|doi=10.1613/jair.4806|arxiv=1301.1942|s2cid=279236}}</ref> Random search is also [[embarrassingly parallel]], and additionally allows the inclusion of prior knowledge by specifying the distribution from which to sample. Despite its simplicity, random search remains one of the important baselines against which to compare the performance of new hyperparameter optimization methods.
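A random search over the same kind of space can be sketched as follows, again assuming [[scikit-learn]]; the sampling distributions encode an assumed prior over plausible ranges and are illustrative only.

<syntaxhighlight lang="python">
# Sketch of a random search using scikit-learn; the sampling distributions
# encode prior knowledge about plausible hyperparameter ranges (illustrative).
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_distributions = {
    "C": loguniform(1e-2, 1e2),      # continuous, log-uniform prior
    "gamma": loguniform(1e-4, 1e0),
}

search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=5,
                            random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
</syntaxhighlight>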
 
[[File:Hyperparameter Optimization using Tree-Structured Parzen Estimators.svg|thumb|Methods such as Bayesian optimization smartly explore the space of potential choices of hyperparameters by deciding which combination to explore next based on previous observations.]]
 
Bayesian optimization is a global optimization method for noisy black-box functions. Applied to hyperparameter optimization, Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating it, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the ___location of the optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected close to the optimum). In practice, Bayesian optimization has been shown<ref name="hutter">{{Citation
| last1 = Hutter
| first1 = Frank
| last2 = Hoos
| first2 = Holger
| last3 = Leyton-Brown
| first3 = Kevin
| chapter = Sequential Model-Based Optimization for General Algorithm Configuration
| title = Learning and Intelligent Optimization
| volume = 6683
| pages = 507–523
| series = Lecture Notes in Computer Science
| isbn = 978-3-642-25565-6
| s2cid = 6944647
}}</ref><ref name="bergstra11">{{Citation
| last1 = Bergstra
| first1 = James
| last2 = Bardenet
| first2 = Remi
| year = 2011
| url = http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf }}</ref><ref name="snoek">{{cite journal
| last1 = Snoek
| first1 = Jasper
| last2 = Larochelle
| first2 = Hugo
| arxiv = 1206.2944
}}</ref><ref name="thornton">{{cite journal
| last1 = Thornton
| first1 = Chris
| last2 = Hutter
| first2 = Frank
| bibcode = 2012arXiv1208.3719T
| arxiv = 1208.3719
}}</ref><ref name="krnc">{{Citation
|last=Kernc
|title=SAMBO: Sequential And Model-Based Optimization: Efficient global optimization in Python
|date=2024
|url=https://zenodo.org/records/14461363
|access-date=2025-01-30
|doi=10.5281/zenodo.14461363
}}</ref> to obtain better results in fewer evaluations compared to grid search and random search, due to the ability to reason about the quality of experiments before they are run.
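As a minimal sketch, Bayesian optimization with a Gaussian-process surrogate can be expressed using scikit-optimize (one of the libraries listed below); the objective function and search space here are illustrative assumptions, not part of any cited method.

<syntaxhighlight lang="python">
# Sketch of Bayesian optimization with a Gaussian-process surrogate, using
# scikit-optimize; the objective and search space are illustrative assumptions.
from skopt import gp_minimize
from skopt.space import Real
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(params):
    C, gamma = params
    return 1.0 - cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()

# The surrogate model proposes the next configuration by trading off
# exploration (high uncertainty) against exploitation (low predicted loss).
result = gp_minimize(objective,
                     [Real(1e-2, 1e2, prior="log-uniform"),   # C
                      Real(1e-4, 1e0, prior="log-uniform")],  # gamma
                     n_calls=30, random_state=0)
print(result.x, result.fun)
</syntaxhighlight>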
 
=== Gradient-based optimization ===
For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using [[gradient descent]]. The first usage of these techniques was focused on neural networks.<ref>{{cite book |last1=Larsen|first1=Jan|last2=Hansen|first2=Lars Kai|last3=Svarer|first3=Claus|last4=Ohlsson|first4=M|title=Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop |chapter=Design and regularization of neural networks: The optimal use of a validation set|date=1996|pages=62–71|doi=10.1109/NNSP.1996.548336|isbn=0-7803-3550-3|citeseerx=10.1.1.415.3266|s2cid=238874|chapter-url=http://orbit.dtu.dk/files/4545571/Svarer.pdf}}</ref> Since then, these methods have been extended to other models such as [[support vector machine]]s<ref>{{cite journal |author1=Olivier Chapelle |author2=Vladimir Vapnik |author3=Olivier Bousquet |author4=Sayan Mukherjee |title=Choosing multiple parameters for support vector machines |journal=Machine Learning |year=2002 |volume=46 |pages=131–159 |url=http://www.chapelle.cc/olivier/pub/mlj02.pdf | doi = 10.1023/a:1012450327387 |doi-access=free }}</ref> or logistic regression.<ref>{{cite journal |author1 =Chuong B|author2= Chuan-Sheng Foo|author3=Andrew Y Ng|journal = Advances in Neural Information Processing Systems |volume=20|title = Efficient multiple hyperparameter learning for log-linear models|year =2008|url=http://papers.nips.cc/paper/3286-efficient-multiple-hyperparameter-learning-for-log-linear-models.pdf}}</ref>
 
A different approach in order to obtain a gradient with respect to hyperparameters consists in differentiating the steps of an iterative optimization algorithm using [[automatic differentiation]].<ref>{{cite journal|last1=Domke|first1=Justin|title=Generic Methods for Optimization-Based Modeling|journal=Aistats |date=2012|volume=22|url=http://www.jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|access-date=2017-12-09|archive-date=2014-01-24|archive-url=https://web.archive.org/web/20140124182520/http://jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|url-status=dead}}</ref><ref name=abs1502.03492>{{cite arXiv |last1=Maclaurin|first1=Dougal|last2=Duvenaud|first2=David|last3=Adams|first3=Ryan P.|eprint=1502.03492|title=Gradient-based Hyperparameter Optimization through Reversible Learning|class=stat.ML|date=2015}}</ref><ref>{{cite journal |last1=Franceschi |first1=Luca |last2=Donini |first2=Michele |last3=Frasconi |first3=Paolo |last4=Pontil |first4=Massimiliano |title=Forward and Reverse Gradient-Based Hyperparameter Optimization |journal=Proceedings of the 34th International Conference on Machine Learning |date=2017 |arxiv=1703.01785 |bibcode=2017arXiv170301785F |url=http://proceedings.mlr.press/v70/franceschi17a/franceschi17a-supp.pdf}}</ref><ref>{{cite arXiv | eprint=1810.10667 | last1=Shaban | first1=Amirreza | last2=Cheng | first2=Ching-An | last3=Hatch | first3=Nathan | last4=Boots | first4=Byron | title=Truncated Back-propagation for Bilevel Optimization | date=2018 | class=cs.LG }}</ref> A more recent work along this direction uses the [[implicit function theorem]] to calculate hypergradients and proposes a stable approximation of the inverse Hessian. The method scales to millions of hyperparameters and requires constant memory.<ref>{{cite arXiv | eprint=1911.02590 | last1=Lorraine | first1=Jonathan | last2=Vicol | first2=Paul | last3=Duvenaud | first3=David | title=Optimizing Millions of Hyperparameters by Implicit Differentiation | date=2019 | class=cs.LG }}</ref>
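As a toy illustration of a hypergradient, ridge regression admits a closed-form inner solution, so the gradient of a validation loss with respect to the regularization strength can be obtained directly via the implicit function theorem. The NumPy sketch below is illustrative only and does not implement any of the cited methods; data, step size, and dimensions are arbitrary.

<syntaxhighlight lang="python">
# Sketch of a hypergradient for ridge regression via the implicit function
# theorem (a toy illustration, not the implementation of any cited method).
# The hyperparameter is the regularization strength lam.
import numpy as np

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(50, 5)), rng.normal(size=50)
X_val, y_val = rng.normal(size=(30, 5)), rng.normal(size=30)

lam = 1.0
for step in range(100):
    A = X_tr.T @ X_tr + lam * np.eye(5)
    w = np.linalg.solve(A, X_tr.T @ y_tr)          # inner problem: w(lam)
    g = X_val.T @ (X_val @ w - y_val)              # dL_val/dw
    dw_dlam = -np.linalg.solve(A, w)               # implicit function theorem
    hypergradient = g @ dw_dlam                    # dL_val/dlam
    lam = max(lam - 0.1 * hypergradient, 1e-6)     # gradient step on lam

print(lam)
</syntaxhighlight>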
 
In a different approach,<ref>{{cite arXiv | eprint=1802.09419 | last1=Lorraine | first1=Jonathan | last2=Duvenaud | first2=David | title=Stochastic Hyperparameter Optimization through Hypernetworks | date=2018 | class=cs.LG }}</ref> a hypernetwork is trained to approximate the best response function. One of the advantages of this method is that it can handle discrete hyperparameters as well. Self-tuning networks<ref>{{cite arXiv | eprint=1903.03088 | last1=MacKay | first1=Matthew | last2=Vicol | first2=Paul | last3=Lorraine | first3=Jon | last4=Duvenaud | first4=David | last5=Grosse | first5=Roger | title=Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions | date=2019 | class=cs.LG }}</ref> offer a memory efficient version of this approach by choosing a compact representation for the hypernetwork. More recently, Δ-STN<ref>{{cite arXiv | eprint=2010.13514 | last1=Bae | first1=Juhan | last2=Grosse | first2=Roger | title=Delta-STN: Efficient Bilevel Optimization for Neural Networks using Structured Response Jacobians | date=2020 | class=cs.LG }}</ref> has improved this method further by a slight reparameterization of the hypernetwork which speeds up training. Δ-STN also yields a better approximation of the best-response Jacobian by linearizing the network in the weights, hence removing unnecessary nonlinear effects of large changes in the weights.
 
Apart from hypernetwork approaches, gradient-based methods can be used to optimize discrete hyperparameters also by adopting a continuous relaxation of the parameters.<ref>{{cite arXiv | eprint=1806.09055 | last1=Liu | first1=Hanxiao | last2=Simonyan | first2=Karen | last3=Yang | first3=Yiming | title=DARTS: Differentiable Architecture Search | date=2018 | class=cs.LG }}</ref> Such methods have been extensively used for the optimization of architecture hyperparameters in [[neural architecture search]].
 
=== Evolutionary optimization ===
 
# Create an initial population of random solutions (i.e., randomly generate tuples of hyperparameters, typically 100+)
# Evaluate the hyperparameter tuples and acquire their [[fitness function]] (e.g., 10-fold [[Cross-validation (statistics)|cross-validation]] accuracy of the machine learning algorithm with those hyperparameters)
# Rank the hyperparameter tuples by their relative fitness
# Replace the worst-performing hyperparameter tuples with new ones generated via [[crossover (genetic algorithm)|crossover]] and [[mutation (genetic algorithm)|mutation]]
# Repeat steps 2-4 until satisfactory algorithm performance is reached or algorithm performance is no longer improving.
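This loop can be sketched compactly as follows; the population size, mutation scale, and the choice of an SVM evaluated by cross-validation are illustrative assumptions rather than recommended settings.

<syntaxhighlight lang="python">
# Compact sketch of the evolutionary loop above (population size, mutation
# scale, and the choice of an SVM are illustrative assumptions).
import random
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def fitness(h):  # step 2: cross-validated fitness of one hyperparameter tuple
    return cross_val_score(SVC(C=h["C"], gamma=h["gamma"]), X, y, cv=5).mean()

def random_config():
    return {"C": 10 ** random.uniform(-2, 2), "gamma": 10 ** random.uniform(-4, 0)}

population = [random_config() for _ in range(20)]            # step 1
for generation in range(10):
    ranked = sorted(population, key=fitness, reverse=True)   # steps 2-3
    parents = ranked[: len(ranked) // 2]
    children = []
    for _ in parents:                                        # step 4: crossover + mutation
        a, b = random.sample(parents, 2)
        child = {k: random.choice([a[k], b[k]]) * 10 ** random.gauss(0, 0.1)
                 for k in a}
        children.append(child)
    population = parents + children

print(max(population, key=fitness))
</syntaxhighlight>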
 
Evolutionary optimization has been used in hyperparameter optimization for statistical machine learning algorithms,<ref name="bergstra11" /> [[automated machine learning]], typical neural network<ref name="kousiouris1">{{cite journal |vauthors=Kousiouris G, Cuccinotta T, Varvarigou T | year = 2011 | title= The effects of scheduling, workload type and consolidation scenarios on virtual machine performance and their prediction through optimized artificial neural networks | url= https://www.sciencedirect.com/science/article/abs/pii/S0164121211000951 | journal = Journal of Systems and Software | volume = 84 | issue = 8 | pages = 1270–1291 | doi = 10.1016/j.jss.2011.04.013 | hdl = 11382/361472 | hdl-access = free }}</ref> and [[Deep learning#Deep neural networks|deep neural network]] architecture search,<ref name="miikkulainen1">{{cite arXiv | vauthors = Miikkulainen R, Liang J, Meyerson E, Rawal A, Fink D, Francon O, Raju B, Shahrzad H, Navruzyan A, Duffy N, Hodjat B | year = 2017 | title = Evolving Deep Neural Networks |eprint=1703.00548| class = cs.NE }}</ref><ref name="jaderberg1">{{cite arXiv | vauthors = Jaderberg M, Dalibard V, Osindero S, Czarnecki WM, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, Fernando C, Kavukcuoglu K | year = 2017 | title = Population Based Training of Neural Networks |eprint=1711.09846| class = cs.LG }}</ref> as well as training of the weights in deep neural networks.<ref name="such1">{{cite arXiv | vauthors = Such FP, Madhavan V, Conti E, Lehman J, Stanley KO, Clune J | year = 2017 | title = Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning |eprint=1712.06567| class = cs.NE }}</ref>
 
=== Population-based ===
Population Based Training (PBT) learns both hyperparameter values and network weights. Multiple learning processes operate independently, using different hyperparameters. As with evolutionary methods, poorly performing models are iteratively replaced with models that adopt modified hyperparameter values and weights based on the better performers. This warm starting of the replacement models is the primary differentiator between PBT and other evolutionary methods. PBT thus allows the hyperparameters to evolve and eliminates the need for manual hypertuning. The process makes no assumptions regarding model architecture, loss functions or training procedures.

PBT and its variants are adaptive methods: they update hyperparameters during the training of the models. By contrast, non-adaptive methods use the sub-optimal strategy of assigning a constant set of hyperparameters for the whole training run.<ref>{{cite arXiv|last1=Li|first1=Ang|last2=Spyra|first2=Ola|last3=Perel|first3=Sagi|last4=Dalibard|first4=Valentin|last5=Jaderberg|first5=Max|last6=Gu|first6=Chenjie|last7=Budden|first7=David|last8=Harley|first8=Tim|last9=Gupta|first9=Pramod|date=2019-02-05|title=A Generalized Framework for Population Based Training|eprint=1902.01894|class=cs.AI}}</ref>
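A toy sketch of the exploit/explore cycle on a simple quadratic problem is shown below; the learning rate plays the role of the evolving hyperparameter, and all numbers are illustrative assumptions, not a faithful reproduction of the cited method.

<syntaxhighlight lang="python">
# Toy sketch of population based training on a simple quadratic problem
# (illustrative only): each member trains with its own learning rate, and
# poor performers periodically copy (warm start from) better ones and
# perturb the copied hyperparameter.
import random
import numpy as np

def loss(w):
    return float(np.sum(w ** 2))

population = [{"w": np.ones(10), "lr": 10 ** random.uniform(-3, -1)}
              for _ in range(8)]

for step in range(200):
    for m in population:                       # independent training steps
        m["w"] -= m["lr"] * 2 * m["w"]         # gradient of the quadratic loss
    if step % 20 == 0:                         # exploit / explore phase
        ranked = sorted(population, key=lambda m: loss(m["w"]))
        for bad in ranked[-2:]:
            good = random.choice(ranked[:2])
            bad["w"] = good["w"].copy()                         # warm start weights
            bad["lr"] = good["lr"] * random.choice([0.8, 1.2])  # perturb hyperparameter

print(min(loss(m["w"]) for m in population))
</syntaxhighlight>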
 
=== Early stopping-based ===
[[File:Successive-halving-for-eight-arbitrary-hyperparameter-configurations.png|thumb|Successive halving for eight arbitrary hyperparameter configurations. The approach starts with eight models with different configurations and consecutively applies successive halving until only one model remains.]]
A class of early stopping-based hyperparameter optimization algorithms is purpose built for large search spaces of continuous and discrete hyperparameters, particularly when the computational cost to evaluate the performance of a set of hyperparameters is high. Irace implements the iterated racing algorithm, that focuses the search around the most promising configurations, using statistical tests to discard the ones that perform poorly.<ref name="irace">{{cite journal |last1=López-Ibáñez |first1=Manuel |last2=Dubois-Lacoste |first2=Jérémie |last3=Pérez Cáceres |first3=Leslie |last4=Stützle |first4=Thomas |last5=Birattari |first5=Mauro |date=2016 |title=The irace package: Iterated Racing for Automatic Algorithm Configuration |journal=Operations Research Perspectives |volume=3 |issue=3 |pages=43–58 |doi=10.1016/j.orp.2016.09.002|doi-access=free |hdl=10419/178265 |hdl-access=free }}</ref><ref name="race">{{cite journal |last1=Birattari |first1=Mauro |last2=Stützle |first2=Thomas |last3=Paquete |first3=Luis |last4=Varrentrapp |first4=Klaus |date=2002 |title=A Racing Algorithm for Configuring Metaheuristics |journal=GECCO 2002 |pages=11–18}}</ref>
Another early stopping hyperparameter optimization algorithm is successive halving (SHA),<ref>{{cite arXiv|last1=Jamieson|first1=Kevin|last2=Talwalkar|first2=Ameet|date=2015-02-27|title=Non-stochastic Best Arm Identification and Hyperparameter Optimization|eprint=1502.07943|class=cs.LG}}</ref> which begins as a random search but periodically prunes low-performing models, thereby focusing computational resources on more promising models. Asynchronous successive halving (ASHA)<ref>{{cite arXiv|last1=Li|first1=Liam|last2=Jamieson|first2=Kevin|last3=Rostamizadeh|first3=Afshin|last4=Gonina|first4=Ekaterina|last5=Hardt|first5=Moritz|last6=Recht|first6=Benjamin|last7=Talwalkar|first7=Ameet|date=2020-03-16|title=A System for Massively Parallel Hyperparameter Tuning|class=cs.LG|eprint=1810.05934v5}}</ref> further improves upon SHA's resource utilization profile by removing the need to synchronously evaluate and prune low-performing models. Hyperband<ref>{{cite journal|last1=Li|first1=Lisha|last2=Jamieson|first2=Kevin|last3=DeSalvo|first3=Giulia|last4=Rostamizadeh|first4=Afshin|last5=Talwalkar|first5=Ameet|title=Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization|journal=Journal of Machine Learning Research|volume=18|pages=1–52|arxiv=1603.06560}}</ref> is a higher level early stopping-based algorithm that invokes SHA or ASHA multiple times with varying levels of pruning aggressiveness, in order to be more widely applicable and with fewer required inputs.
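A minimal sketch of successive halving is shown below; the choice of model, the use of training epochs as the budget, and all ranges are illustrative assumptions.

<syntaxhighlight lang="python">
# Sketch of successive halving (SHA): start many configurations on a small
# budget, repeatedly keep the better half and double the budget
# (model, budget definition, and ranges are illustrative assumptions).
import random
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

configs = [{"alpha": 10 ** random.uniform(-6, -1)} for _ in range(8)]
budget = 5  # here the budget is the number of SGD epochs (max_iter)

while len(configs) > 1:
    scores = []
    for c in configs:
        model = SGDClassifier(alpha=c["alpha"], max_iter=budget, tol=None)
        scores.append(cross_val_score(model, X, y, cv=3).mean())
    ranked = sorted(zip(scores, configs), key=lambda t: t[0], reverse=True)
    configs = [c for _, c in ranked[: len(configs) // 2]]  # prune lower half
    budget *= 2                                            # promote survivors

print(configs[0])
</syntaxhighlight>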
 
=== Others ===
[[Radial basis function|RBF]]<ref name=abs1705.08520>{{cite arXiv |eprint=1705.08520|last1=Diaz|first1=Gonzalo|title=An effective algorithm for hyperparameter optimization of neural networks|last2=Fokoue|first2=Achille|last3=Nannicini|first3=Giacomo|last4=Samulowitz|first4=Horst|class=cs.AI|year=2017}}</ref> and [[spectral method|spectral]]<ref name=abs1706.00764>{{cite arXiv |eprint=1706.00764|last1=Hazan|first1=Elad|title=Hyperparameter Optimization: A Spectral Approach|last2=Klivans|first2=Adam|last3=Yuan|first3=Yang|class=cs.LG|year=2017}}</ref> approaches have also been developed.
 
== Open-source software ==
 
===Grid search===
*[https://github.com/determined-ai/determined Determined] is a DL Training Platform that includes grid search for PyTorch and TensorFlow (Keras and Estimator) models.
*[http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html H2O AutoML] provides grid search over algorithms in the H2O open source machine learning library.
*[https://github.com/kubeflow/katib Katib] is a Kubernetes-native system that includes grid search.
*[[scikit-learn]] is a Python package that includes [http://scikit-learn.sourceforge.net/modules/grid_search.html grid] search.
*[https://github.com/autonomio/talos Talos] includes grid search for [[Keras]].
*[https://ray.readthedocs.io/en/latest/tune.html Tune] is a Python library for distributed hyperparameter tuning and supports grid search.
 
===Random search===
*[https://github.com/determined-ai/determined Determined] is a DL Training Platform that supports random search for PyTorch and TensorFlow (Keras and Estimator) models.
* [https://github.com/hyperopt/hyperopt hyperopt], also via [https://github.com/maxpumperla/hyperas hyperas] and [https://github.com/hyperopt/hyperopt-sklearn hyperopt-sklearn], are Python packages which include random search.
*[https://github.com/kubeflow/katib Katib] is a Kubernetes-native system that includes random search.
* [[scikit-learn]] is a Python package which includes [https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html random] search.
* [[caret]] is an R package which includes [http://topepo.github.io/caret/random-hyperparameter-search.html grid & random] search.
*[https://github.com/autonomio/talos Talos] includes a customizable random search for [[Keras]].
*[https://ray.readthedocs.io/en/latest/tune.html Tune] is a Python library for distributed hyperparameter tuning and supports random search over arbitrary parameter distributions.
 
===Bayesian===
* [https://github.com/automl/auto-sklearn Auto-sklearn]<ref name="autosklearn">{{cite journal | vauthors = Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F | year = 2015 | title = Efficient and Robust Automated Machine Learning | url = https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning | journal = Advances in Neural Information Processing Systems 28 (NIPS 2015) | pages = 2962–2970 }}</ref> is a Bayesian hyperparameter optimization layer on top of [[scikit-learn]].
* [https://github.com/facebook/Ax Ax]<ref name=AxBoTorch>{{cite web |url=https://ai.facebook.com/blog/open-sourcing-ax-and-botorch-new-ai-tools-for-adaptive-experimentation/ |title=Open-sourcing Ax and BoTorch: New AI tools for adaptive experimentation |year=2019}}</ref> is a Python-based experimentation platform that supports Bayesian optimization and bandit optimization as exploration strategies.
* [https://github.com/baptistar/BOCS BOCS] is a Matlab package which uses [[semidefinite programming]] for minimizing a black-box function over discrete inputs.<ref name="arXiv:1806.08838">{{cite arXiv |year=2018 |title=Bayesian Optimization of Combinatorial Structures |eprint=1806.08838|last1=Baptista |first1=Ricardo |last2=Poloczek |first2=Matthias |class=stat.ML }}</ref> A Python 3 implementation is also included.
* [https://github.com/automl/HpBandSter HpBandSter] is a Python package which combines Bayesian optimization with bandit-based methods.<ref name="arXiv:1807.01774">{{cite arXiv |year=2018 |title=BOHB: Robust and Efficient Hyperparameter Optimization at Scale |eprint=1807.01774|last1=Falkner |first1=Stefan |last2=Klein |first2=Aaron |last3=Hutter |first3=Frank |class=stat.ML }}</ref>
*[https://github.com/kubeflow/katib Katib] is a Kubernetes-native system which includes Bayesian optimization.
*[https://github.com/mlr-org/mlrMBO mlrMBO], also with [https://github.com/mlr-org/mlr mlr], is an [[R (programming language)|R]] package for model-based/Bayesian optimization of black-box functions.
*[https://optuna.readthedocs.io/en/latest/ optuna] is a Python package for black box optimization, compatible with arbitrary functions that need to be optimized.
* [https://github.com/scikit-optimize/scikit-optimize scikit-optimize] is a Python package for sequential model-based optimization with a scipy.optimize interface.<ref name=skopt>{{Cite web|url=https://scikit-optimize.github.io/|title=skopt API documentation|website=scikit-optimize.github.io}}</ref>
* [https://github.com/automl/SMAC3 SMAC] is a Python/Java library implementing Bayesian optimization.<ref name="SMAC">{{cite journal | vauthors = Hutter F, Hoos HH, Leyton-Brown K | title = Sequential Model-Based Optimization for General Algorithm Configuration | url = https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf | journal = Proceedings of the Conference on Learning and Intelligent OptimizatioN (LION 5)}}</ref>
* [https://github.com/PhilippPro/tuneRanger tuneRanger] is an R package for tuning random forests using model-based optimization.
 
===Gradient-based optimization===
* [https://github.com/lucfra/FAR-HO FAR-HO] is a Python package containing TensorFlow implementations and wrappers for gradient-based hyperparameter optimization with forward and reverse mode algorithmic differentiation.
* [https://github.com/dmlc/xgboost XGBoost] is an open-source software library that provides a gradient boosting framework for C++, Java, Python, R, and Julia.
 
===Evolutionary===
* [https://github.com/DEAP/deap deap] is a Python framework for general evolutionary computation which is flexible and integrates with parallelization packages like [https://github.com/soravux/scoop scoop] and [[PySpark|pyspark]], and other Python frameworks like [[Scikit-learn|sklearn]] via [https://github.com/rsteca/sklearn-deap sklearn-deap].
*[https://github.com/determined-ai/determined Determined] is a DL Training Platform that supports PBT for optimizing PyTorch and TensorFlow (Keras and Estimator) models.
* [https://github.com/joeddav/devol devol] is a Python package that performs Deep Neural Network architecture search using [[genetic programming]].
* [https://github.com/facebookresearch/nevergrad nevergrad]<ref name=nevergrad_issue1/> is a Python package which includes [[differential evolution]], [[evolution strategy|evolution strategies]], [[Bayesian optimization]], population control methods for the noisy case, and [[particle swarm optimization]].<ref name=nevergrad/>
*[https://ray.readthedocs.io/en/latest/tune.html Tune] is a Python library for distributed hyperparameter tuning and leverages [https://github.com/facebookresearch/nevergrad nevergrad] for evolutionary algorithm support.
 
===Early Stopping===
*[https://github.com/determined-ai/determined Determined] is a DL Training Platform that supports Hyperband for PyTorch and TensorFlow (Keras and Estimator) models.
* [https://iridia.ulb.ac.be/irace/ irace] is an R package that implements the iterated racing algorithm.<ref name="irace"/><ref name="race"/>
*[https://github.com/kubeflow/katib Katib] is a Kubernetes-native system that includes hyperband.
 
===Other===
*[https://github.com/determined-ai/determined Determined] is a DL Training Platform that supports random, grid, PBT, Hyperband and NAS approaches to hyperparameter optimization for PyTorch and TensorFlow (Keras and Estimator) models.
* [[dlib]]<ref name=dlib_github>{{Cite web|url=https://github.com/davisking/dlib|title=A toolkit for making real world machine learning and data analysis applications in C++: davisking/dlib|date=February 25, 2019|via=GitHub}}</ref> is a C++ package with a Python API which has a parameter-free optimizer based on [https://arxiv.org/abs/1703.02628 LIPO] and [[trust region]] optimizers working in tandem.<ref name=dlib_blog>{{cite web |last1=King |first1=Davis |title=A Global Optimization Algorithm Worth Using |url=http://blog.dlib.net/2017/12/a-global-optimization-algorithm-worth.html}}</ref>
* [https://github.com/callowbird/Harmonica Harmonica] is a Python package for spectral hyperparameter optimization.<ref name=abs1706.00764/>
* [https://github.com/hyperopt/hyperopt hyperopt], also via [https://github.com/maxpumperla/hyperas hyperas] and [https://github.com/hyperopt/hyperopt-sklearn hyperopt-sklearn], are Python packages which include [[kernel density estimation|Tree of Parzen Estimators]] based distributed hyperparameter optimization.
*[https://github.com/kubeflow/katib/ Katib] is a Kubernetes-native system which includes grid search, random search, Bayesian optimization, hyperband, and NAS based on reinforcement learning.
* [https://github.com/facebookresearch/nevergrad nevergrad]<ref name=nevergrad_issue1>{{Cite web|url=https://github.com/facebookresearch/nevergrad/issues/1|title=[QUESTION] How to use to optimize NN hyperparameters · Issue #1 · facebookresearch/nevergrad|website=GitHub}}</ref> is a Python package for gradient-free optimization using techniques such as differential evolution, sequential quadratic programming, fastGA, covariance matrix adaptation, population control methods, and particle swarm optimization.<ref name=nevergrad>{{Cite web|url=https://code.fb.com/ai-research/nevergrad/|title=Nevergrad: An open source tool for derivative-free optimization|date=December 20, 2018}}</ref>
* [[Neural Network Intelligence]] (NNI) is a Python package which includes hyperparameter tuning for neural networks in local and distributed environments. Its techniques include TPE, random, anneal, evolution, SMAC, batch, grid, and hyperband.
* [https://github.com/sherpa-ai/sherpa parameter-sherpa] is a similar Python package which includes several techniques such as grid search, Bayesian optimization, and genetic optimization.
* [https://github.com/wwu-mmll/photonai photonai] is a high-level Python API for designing and optimizing machine learning pipelines based on grid search, random search, and Bayesian optimization.
* [https://github.com/CMA-ES/pycma pycma] is a Python implementation of [[CMA-ES|Covariance Matrix Adaptation Evolution Strategy]].
* [https://github.com/coin-or/rbfopt rbfopt] is a Python package that uses a [[radial basis function]] model<ref name=abs1705.08520/>
*[https://ray.readthedocs.io/en/latest/tune.html Tune] is a Python library for hyperparameter tuning execution and integrates with/scales many existing hyperparameter optimization libraries such as [https://github.com/hyperopt/hyperopt hyperopt], [https://github.com/facebookresearch/nevergrad nevergrad], and [https://github.com/scikit-optimize/scikit-optimize scikit-optimize].
 
== Issues with hyperparameter optimization ==
When hyperparameter optimization is done, the set of hyperparameters is often fitted on a training set and selected based on the generalization performance, or score, of a validation set. However, this procedure risks overfitting the hyperparameters to the validation set. Therefore, the generalization performance score of the validation set (which can be several sets in the case of a cross-validation procedure) cannot be used to simultaneously estimate the generalization performance of the final model. In order to do so, the generalization performance has to be evaluated on a set independent of (i.e., having no intersection with) the set or sets used for the optimization of the hyperparameters; otherwise the estimated performance might be too optimistic (too large). This can be done on a second test set, or through an outer [[Cross-validation (statistics)|cross-validation]] procedure called nested cross-validation, which allows an unbiased estimation of the generalization performance of the model, taking into account the bias due to the hyperparameter optimization.
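For illustration, nested cross-validation can be sketched as follows, assuming [[scikit-learn]]; the model and grid are illustrative only. The inner search chooses hyperparameters, while the outer loop scores the tuned model on folds that were never used for tuning.

<syntaxhighlight lang="python">
# Sketch of nested cross-validation: an inner search chooses hyperparameters,
# while an outer loop estimates generalization on data never used for tuning
# (model and grid are illustrative assumptions).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                            cv=5)                         # inner loop: tuning
outer_scores = cross_val_score(inner_search, X, y, cv=5)  # outer loop: unbiased estimate
print(outer_scores.mean())
</syntaxhighlight>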
== Commercial services ==
* [https://aws.amazon.com/sagemaker/ Amazon Sagemaker] uses Gaussian processes to tune hyperparameters.
* [https://bigml.com/api/optimls BigML OptiML] supports mixed search domains.
* [https://cloud.google.com/ml-engine/docs/tensorflow/using-hyperparameter-tuning Google HyperTune] supports mixed search domains.
* [https://indiesolver.com Indie Solver] supports multiobjective, multifidelity and constraint optimization.
* [https://mindfoundry.ai/OPTaaS Mind Foundry OPTaaS] supports mixed search domains, multiobjective, constraints, parallel optimization and surrogate models.
* [https://sigopt.com SigOpt] supports mixed search domains, multiobjective, multisolution, multifidelity, constraint (linear and black-box), and parallel optimization.
 
== See also ==
* [[Self-tuning]]
* [[XGBoost]]
* [[Optuna]]
 
== References ==