Revision as of 21:19, 17 March 2018 edit Acyclic (talk \| contribs) Extended confirmed users 564 edits Revert commercial spam to last version by DMacks. Open source entries are required. ← Previous edit		Revision as of 11:19, 28 March 2018 edit undo 132.230.194.153 (talk) Simplified/clarified Grid and Random Search Tag: Visual edit Next edit →
Line 1: In [[machine learning]], '''hyperparameter optimization''' or tuning is the problem of choosing a set of optimal [[Hyperparameter (machine learning)\|hyperparameters]] for a learning algorithm. The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined [[loss function]] on given independent data.<ref name=abs1502.02127>{{cite arxiv \|eprint=1502.02127\|last1=Claesen\|first1=Marc\|title=Hyperparameter Search in Machine Learning\|author2=Bart De Moor\|class=cs.LG\|year=2015}}</ref> The objective function takes a tuple of hyperparameters and returns the associated loss.<ref name=abs1502.02127/> [[Cross-validation (statistics)\|Cross-validation]] is often used to estimate this generalization performance.<ref name="bergstra">{{cite journal\|last1=Bergstra\|first1=James\|last2=Bengio\|first2=Yoshua\|year=2012\|title=Random Search for Hyper-Parameter Optimization\|url=http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf\|journal=J. Machine Learning Research\|volume=13\|pages=281–305}}</ref> == Approaches == Line 33: {{main article\|Random search}} Random Search replaces the exhaustive enumeration of all combinations by selecting them randomly. This can be simply applied to the discrete setting described above, but also generalizes to continuous and mixed spaces. It can outperform Grid search, especially when only a small number of hyperparameters affects the final performance of the machine learning algorithm<ref name="bergstra" />. In this case, the optimization problem is said to have a low intrinsic dimensionality<ref>{{Cite journal\|last=Ziyu\|first=Wang,\|last2=Frank\|first2=Hutter,\|last3=Masrour\|first3=Zoghi,\|last4=David\|first4=Matheson,\|last5=Nando\|first5=de Feitas,\|date=2016\|title=Bayesian Optimization in a Billion Dimensions via Random Embeddings\|url=http://jair.org/papers/paper4806.html\|journal=Journal of Artificial Intelligence Research\|language=en\|volume=55\|doi=10.1613/jair.4806}}</ref>. Random Search is also [[embarrassingly parallel]], and additionally allows to include prior knowledge by specifying the distribution from which to sample. Since grid searching is an exhaustive and therefore potentially expensive method, several alternatives have been proposed. In particular, a randomized search that simply samples parameter settings a fixed number of times has been found to be more effective in high-dimensional spaces than exhaustive search. This is because oftentimes, it turns out some hyperparameters do not significantly affect the loss. Therefore, having randomly dispersed data gives more "textured" data than an exhaustive search over parameters that ultimately do not affect the loss.<ref name="bergstra">{{cite journal ~~\| title = Random Search for Hyper-Parameter Optimization~~ ~~\| first1 = James~~ ~~\| last1 = Bergstra~~ ~~\| first2 = Yoshua~~ ~~\| last2 = Bengio~~ ~~\| journal = J. Machine Learning Research~~ ~~\| volume = 13~~ ~~\| year = 2012~~ ~~\| pages = 281–305~~ ~~\| url = http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf~~ ~~}}</ref>~~ === Bayesian optimization === {{main article\|Bayesian optimization}} Bayesian optimization is a ~~methodology for the~~ global optimization ofmethod for noisy black-box functions. Applied to hyperparameter optimization, Bayesian optimization ~~consists of developing~~builds a ~~statistical~~probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By ~~Intuitively,~~iteratively ~~the~~evaluating ~~methodology~~a ~~assumes~~promising ~~that~~hyperparameter ~~there~~configuration isbased ~~some~~on ~~smooth~~the ~~but~~current ~~noisy~~model, ~~function~~and ~~that~~then ~~acts~~updating ~~as a mapping from hyperparameters to the objective. In~~it, Bayesian optimization, ~~one~~ aims to gather observations ~~in such a manner as to evaluate the machine learning model the least number of times while~~ revealing as much information as possible about this function and, in particular, the ___location of the optimum. It Bayesian optimization relies on assuming a very general prior over functions which when combined with observed hyperparameter values and corresponding outputs yields a distribution over functions. The methodology proceeds by iteratively picking hyperparameterstries to ~~observe (experiments to run) in a manner that trades off~~balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters ~~which~~expected ~~are expected~~close to ~~have~~the ~~a good outcome~~optimum). In practice, Bayesian optimization has been shown<ref name="hutter">{{Citation \| last = Hutter \| first = Frank Line 105 ⟶ 94: \| arxiv = 1208.3719 \| class = cs.LG }}</ref> to obtain better results in fewer ~~experiments~~evaluations ~~than~~compared to grid search and random search, due to the ability to reason about the quality of experiments before they are run. === Gradient-based optimization ===

Hyperparameter optimization: Difference between revisions