Bayesian optimization: Difference between revisions

Content deleted Content added
Rescuing 1 sources and tagging 0 as dead.) #IABot (v2.0.9.5
 
(40 intermediate revisions by 19 users not shown)
Line 1:
{{Short description|Statistical optimization technique}}
'''Bayesian optimization''' is a [[sequential analysis|sequential design]] strategy for [[global optimization]] of [[Black box|black-box]] functions,<ref name="Mockus1989"/><ref name="garnett_bayesopt_2022">{{Cite book |last=Garnett |first=Roman |url=https://bayesoptbook.com |title=Bayesian Optimization |date=2023 |publisher=Cambridge University Press |isbn=978-1-108-42578-0 }}</ref><ref name="HennigOsborneKersting2022">{{cite book | author1 = Hennig, P.
| author2 = Osborne, M. A.
| author3 = Kersting, H. P.
Line 9:
| isbn=978-1107163447
| url = https://www.probabilistic-numerics.org/assets/ProbabilisticNumerics.pdf
}}</ref> that does not assume any functional forms. It is usually employed to optimize expensive-to-evaluate functions. With the rise of [[artificial intelligence]] innovation in the 21st century, Bayesian optimizations have found prominent use in [[machine learning]] problems for [[Hyperparameter optimization|optimizing hyperparameter values]].<ref>{{cite journal |first=Jasper |last=Snoek |title=Practical Bayesian Optimization of Machine Learning Algorithms |journal=Advances in Neural Information Processing Systems 25 (NIPS 2012) |year=2012 |volume=25 |arxiv=1206.2944 |url=https://proceedings.neurips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html}}</ref><ref>{{cite journal |first=Aaron |last=Klein |title=Fast bayesian optimization of machine learning hyperparameters on large datasets |journal=Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR |year=2017 |pages=528–536 |arxiv=1605.07079 |url=https://proceedings.mlr.press/v54/klein17a.html}}</ref>
}}</ref> that does not assume any functional forms. It is usually employed to optimize expensive-to-evaluate functions.
 
==History==
The term is generally attributed to {{ill|Jonas Mockus|lt}} and is coined in his work from a series of publications on global optimization in the 1970s and 1980s.<ref>{{cite book |first=Jonas |last=Močkus |title=Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7, 1974 |chapter=On bayesian methods for seeking the extremum |doi=10.1007/3-540-07165-2_55 |series=Lecture Notes in Computer Science |date=1975 |volume=27 |pages=400–404 |isbn=978-3-540-07165-5 |doi-access=free }}</ref><ref>{{cite journal |first=Jonas |last=Močkus |title=On Bayesian Methods for Seeking the Extremum and their Application |journal=IFIP Congress |year=1977 |pages=195–200 }}</ref><ref name="Mockus1989">{{cite book |first=J. |last=Močkus |title=Bayesian Approach to Global Optimization |publisher=Kluwer Academic |___location=Dordrecht |year=1989 |isbn=0-7923-0115-3 }}</ref>
 
=== Early mathematics foundations ===
 
==== From 1960s to 1980s ====
The earliest idea of Bayesian optimization<ref>{{Cite book |last=GARNETT |first=ROMAN |title=BAYESIAN OPTIMIZATION |date=2023 |publisher=Cambridge University Press |isbn=978-1-108-42578-0 |edition=First published 2023}}</ref> sprang in 1964, from a paper by American applied mathematician Harold J. Kushner,<ref>{{Cite web|url=https://vivo.brown.edu/display/hkushner|title=Kushner, Harold|website=vivo.brown.edu}}</ref> [https://asmedigitalcollection.asme.org/fluidsengineering/article/86/1/97/392213/A-New-Method-of-Locating-the-Maximum-Point-of-an “A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise”]. Although not directly proposing Bayesian optimization, in this paper, he first proposed a new method of locating the maximum point of an arbitrary multipeak curve in a noisy environment. This method provided an important theoretical foundation for subsequent Bayesian optimization.
 
By the 1980s, the framework we now use for Bayesian optimization was explicitly established. In 1978, the Lithuanian scientist Jonas Mockus,<ref>{{Cite web |title=Jonas Mockus |url=https://en.ktu.edu/people/jonas-mockus/ |access-date=2025-03-06 |website=Kaunas University of Technology |language=en}}</ref> in his paper “The Application of Bayesian Methods for Seeking the Extremum”, discussed how to use Bayesian methods to find the extreme value of a function under various uncertain conditions. In his paper, Mockus first proposed the [https://schneppat.com/expected-improvement_ei.html Expected Improvement principle (EI)], which is one of the core sampling strategies of Bayesian optimization. This criterion balances exploration while optimizing the function efficiently by maximizing the expected improvement. Because of the usefulness and profound impact of this principle, Jonas Mockus is widely regarded as the founder of Bayesian optimization. Although Expected Improvement principle (EI) is one of the earliest proposed core sampling strategies for Bayesian optimization, it is not the only one, with the development of modern society, we also have Probability of Improvement (PI), or Upper Confidence Bound (UCB)<ref>{{Cite journal |last1=Kaufmann |first1=Emilie |last2=Cappe |first2=Olivier |last3=Garivier |first3=Aurelien |date=2012-03-21 |title=On Bayesian Upper Confidence Bounds for Bandit Problems |url=https://proceedings.mlr.press/v22/kaufmann12.html |journal=Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics |language=en |publisher=PMLR |pages=592–600}}</ref> and so on.
 
==== From theory to practice ====
In the 1990s, Bayesian optimization began to gradually transition from pure theory to real-world applications. In 1998, Donald R. Jones<ref>{{Cite web |title=Donald R. Jones |url=https://scholar.google.com/citations?user=CZhZ4MYAAAAJ&hl=en |access-date=2025-02-25 |website=scholar.google.com}}</ref> and his coworkers published a paper titled “Efficient Global Optimization of Expensive Black-Box Functions<ref>{{Cite book |last=Jones |first=Donald R. |url=https://link.springer.com/article/10.1023/A:1008306431147 |title=Efficient Global Optimization of Expensive Black-Box Functions |last2=Schonlau |first2=Matthias |last3=Welch |first3=William J. |year=1998}}</ref>”. In this paper, they proposed the Gaussian Process (GP) and elaborated on the Expected Improvement principle (EI) proposed by Jonas Mockus in 1978. Through the efforts of Donald R. Jones and his colleagues, Bayesian Optimization began to shine in the fields like computers science and engineering. However, the computational complexity of Bayesian optimization for the computing power at that time still affected its development to a large extent.
 
In the 21st century, with the gradual rise of artificial intelligence and bionic robots, Bayesian optimization has been widely used in machine learning and deep learning, and has become an important tool for [[Hyperparameter optimization|Hyperparameter Tuning]].<ref>T. T. Joy, S. Rana, S. Gupta and S. Venkatesh, "Hyperparameter tuning for big data using Bayesian optimisation," 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 2016, pp. 2574-2579, doi: 10.1109/ICPR.2016.7900023. keywords: {Big Data;Bayes methods;Optimization;Tuning;Data models;Gaussian processes;Noise measurement},</ref> Companies such as Google, Facebook and OpenAI have added Bayesian optimization to their deep learning frameworks to improve search efficiency. However, Bayesian optimization still faces many challenges, for example, because of the use of Gaussian Process<ref>{{Cite book|title=Neural Networks and Machine Learning|contribution=Introduction to Gaussian processes|first=D. J. C.|last=Mackay|editor-first=C. M.|editor-last=Bishop|contribution-url=https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e045b76dc5daf9f4656ac10b456c5d1d9de5bc84|archive-url=https://web.archive.org/web/20240423144014/https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e045b76dc5daf9f4656ac10b456c5d1d9de5bc84|archive-date=2024-04-23|access-date=2025-03-06|series=NATO ASI Series|volume=168|pages=133–165|year=1998|url-status=live}}</ref> as a proxy model for optimization, when there is a lot of data, the training of Gaussian Process will be very slow and the computational cost is very high. This makes it difficult for this optimization method to work well in more complex drug development and medical experiments.
 
==Strategy==
[[File:GpParBayesAnimationSmall.gif|thumb|440x330px|Bayesian optimization of a function (black) with Gaussian processes (purple). Three acquisition functions (blue) are shown at the bottom.<ref>{{Citation|last=Wilson|first=Samuel|title=ParBayesianOptimization R package|date=2019-11-22|url=https://github.com/AnotherSamWilson/ParBayesianOptimization|access-date=2019-12-12}}</ref>]]
Bayesian optimization is typically used on problems of the form <math display="inline">\max_{x \in AX } f(x)</math>, wherewith <math display="inline">AX</math> isbeing athe set of points,all possible parameters <math display="inline">x</math>, which relytypically uponwith less than or equal to 20 [[dimension]]s for optimal usage (<math display="inline">X \rightarrow \mathbb{R}^d, \mid d \le 20</math>), and whose membership can easily be evaluated. Bayesian optimization is particularly advantageous for problems where <math display="inline">f(x)</math> is difficult to evaluate due to its computational cost. The objective function, <math display="inline">f</math>, is continuous and takes the form of some unknown structure, referred to as a "black box". Upon its evaluation, only <math display="inline">f(x)</math> is observed and its [[derivative]]s are not evaluated.<ref name=":0">{{cite arXiv|last=Frazier|first=Peter I.|date=2018-07-08|title=A Tutorial on Bayesian Optimization|class=stat.ML|eprint=1807.02811}}</ref>
 
Since the objective function is unknown, the Bayesian strategy is to treat it as a random function and place a [[Prior distribution|prior]] over it. The prior captures beliefs about the behavior of the function. After gathering the function evaluations, which are treated as data, the prior is updated to form the [[posterior distribution]] over the objective function. The posterior distribution, in turn, is used to construct an acquisition function (often also referred to as infill sampling criteria) that determines the next query point.
Line 22 ⟶ 34:
There are several methods used to define the prior/posterior distribution over the objective function. The most common two methods use [[Gaussian process]]es in a method called [[kriging]]. Another less expensive method uses the [[Parzen-Tree Estimator]] to construct two distributions for 'high' and 'low' points, and then finds the ___location that maximizes the expected improvement.<ref>J. S. Bergstra, R. Bardenet, Y. Bengio, B. Kégl: [http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf Algorithms for Hyper-Parameter Optimization]. Advances in Neural Information Processing Systems: 2546–2554 (2011)</ref>
 
Standard Bayesian optimization relies upon each <math>x \in AX</math> being easy to evaluate, and problems that deviate from this assumption are known as ''exotic Bayesian optimization'' problems. Optimization problems can become exotic if it is known that there is noise, the evaluations are being done in parallel, the quality of evaluations relies upon a tradeoff between difficulty and accuracy, the presence of random environmental conditions, or if the evaluation involves derivatives.<ref name=":0" />
 
==Acquisition functions==
Examples of acquisition functions include
* [[probability of improvement]]
* [[expected improvement]]
* [[Bayesian expected losses]]
* [[upper confidence bound]]sbounds (UCB) or [[lower confidence bound]]sbounds
* [[Thompson sampling]]
and hybrids of these.<ref>Matthew W. Hoffman, Eric Brochu, [[Nando de Freitas]]: Portfolio Allocation for Bayesian Optimization. Uncertainty in Artificial Intelligence: 327–336 (2011)</ref> They all trade-off [[Exploration–exploitation dilemma|exploration and exploitation]] so as to minimize the number of function queries. As such, Bayesian optimization is well suited for functions that are expensive to evaluate.
 
==Solution methods==
The maximum of the acquisition function is typically found by resorting to discretization or by means of an auxiliary optimizer. Acquisition functions are typically well-behaved {{Citation needed|reason=doubtful since acquisition functions have many local maxima|date=November 2021}} and are
maximized using a [[Mathematical_optimizationMathematical optimization#Computational_optimization_techniquesComputational optimization techniques|numerical optimization technique]], such as [[Newton's method in optimization|Newton's Methodmethod]] or quasi-Newton methods like the [[Broyden–Fletcher–Goldfarb–Shanno algorithm]].
 
==Applications==
The approach has been applied to solve a wide range of problems,<ref>Eric Brochu, Vlad M. Cora, Nando de Freitas: [https://arxiv.org/abs/1012.2599 A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning]. CoRR abs/1012.2599 (2010)</ref> including [[learning to rank]],<ref>Eric Brochu, Nando de Freitas, Abhijeet Ghosh: [http://papers.nips.cc/paper/3219-active-preference-learning-with-discrete-choice-data.pdf Active Preference Learning with Discrete Choice Data]. Advances in Neural Information Processing Systems: 409-416 (2007)</ref> [[computer graphics]] and visual design,<ref>Eric Brochu, Tyson Brochu, Nando de Freitas: [http://haikufactory.com/files/sca2010.pdf A Bayesian Interactive Optimization Approach to Procedural Animation Design]. Symposium on Computer Animation 2010: 103–112</ref><ref>Yuki Koyama, Issei Sato, Daisuke Sakamoto, Takeo Igarashi: [https://koyama.xyz/project/sequential_line_search/download/preprint.pdf Sequential Line Search for Efficient Visual Design Optimization by Crowds]. ACM Transactions on Graphics, Volume 36, Issue 4, pp.48:1–48:11 (2017). DOI: https://doi.org/10.1145/3072959.3073598</ref><ref>Yuki Koyama, Issei Sato, Masataka Goto: [https://arxiv.org/abs/2005.04107 Sequential Gallery for Interactive Visual Design Optimization]. ACM Transactions on Graphics, Volume 39, Issue 4, pp.88:1–88:12 (2020). DOI: https://doi.org/10.1145/3386569.3392444</ref> [[robotics]],<ref>Daniel J. Lizotte, Tao Wang, Michael H. Bowling, Dale Schuurmans: [https://www.aaai.org/Papers/IJCAI/2007/IJCAI07-152.pdf Automatic Gait Optimization with Gaussian Process Regression] {{Webarchive|url=https://web.archive.org/web/20170812175417/http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-152.pdf |date=2017-08-12 }}. International Joint Conference on Artificial Intelligence: 944–949 (2007)</ref><ref>Ruben Martinez-Cantin, Nando de Freitas, Eric Brochu, Jose Castellanos and Arnaud Doucet. [https://link.springer.com/article/10.1007%2Fs10514-009-9130-2# A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot]. Autonomous Robots. Volume 27, Issue 2, pp 93–103 (2009)</ref><ref>Scott Kuindersma, Roderic Grupen, and Andrew Barto. [http://ijr.sagepub.com/content/32/7/806.abstract# Variable Risk Control via Stochastic Optimization]. International Journal of Robotics Research, volume 32, number 7, pp 806–825 (2013)</ref><ref>Roberto Calandra, André Seyfarth, Jan Peters, and Marc P. Deisenroth [https://link.springer.com/article/10.1007%2Fs10472-015-9463-9 Bayesian optimization for learning gaits under uncertainty]. Ann. Math. Artif. Intell. Volume 76, Issue 1, pp 5-23 (2016) DOI:10.1007/s10472-015-9463-9</ref> [[sensor networks]],<ref>Niranjan Srinivas, Andreas Krause, Sham M. Kakade, Matthias W. Seeger: [https://infoscience.epfl.ch/record/177246/files/srinivas_ieeeit2012.pdf Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting]. IEEE Transactions on Information Theory 58(5):3250–3265 (2012)</ref><ref>Roman Garnett, Michael A. Osborne, Stephen J. Roberts: [http://www.academia.edu/download/30681076/ipsn673-garnett.pdf Bayesian optimization for sensor set selection]{{deadcite linkconference
|date=July 2022|botlast1 =medic}}{{cbignore|bot=medic}}. ACM/IEEEGarnett International| Conferencefirst1 on= InformationRoman
| Processinglast2 in= SensorOsborne Networks:| 209–219first2 (2010)</ref>= automaticMichael algorithm configuration,<ref>Frank Hutter, Holger Hoos, and Kevin Leyton-Brown (2011)A.
| [http://www.cs.ubc.ca/labs/beta/Projects/SMAC/papers/11-LION5-SMAC.pdflast3 Sequential= model-basedRoberts optimization| forfirst3 general= algorithm configuration], Learning and IntelligentStephen Optimization</ref><ref>J.
| Snoek,editor1-last H.= Larochelle,Abdelzaher R.| P. Adams [https://papers.nips.cc/paper/4522editor1-practical-bayesian-optimization-of-machine-learning-algorithms.pdffirst Practical= BayesianTarek Optimization of Machine Learning Algorithms]F.
| Advances in Neural Information Processing Systems: 2951editor2-2959last (2012)</ref>= [[AutomatedVoigt machine learning|automatic machineeditor2-first learning]]= toolboxes,<ref>J. Bergstra, D. Yamins, D. D. Cox (2013).Thiemo
| editor3-last = Wolisz | editor3-first = Adam
| contribution = Bayesian optimization for sensor set selection
| contribution-url = https://scholar.archive.org/work/wfgahthp2vgcnaafcxa3isieom
| doi = 10.1145/1791212.1791238
| pages = 209–219
| publisher = ACM
| title = Proceedings of the 9th International Conference on Information Processing in Sensor Networks, IPSN 2010, April 12–16, 2010, Stockholm, Sweden
| year = 2010| isbn = 978-1-60558-988-6
}}</ref> automatic algorithm configuration,<ref>Frank Hutter, Holger Hoos, and Kevin Leyton-Brown (2011). [http://www.cs.ubc.ca/labs/beta/Projects/SMAC/papers/11-LION5-SMAC.pdf Sequential model-based optimization for general algorithm configuration], Learning and Intelligent Optimization</ref><ref>J. Snoek, H. Larochelle, R. P. Adams [https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf Practical Bayesian Optimization of Machine Learning Algorithms]. Advances in Neural Information Processing Systems: 2951-2959 (2012)</ref> [[Automated machine learning|automatic machine learning]] toolboxes,<ref>J. Bergstra, D. Yamins, D. D. Cox (2013).
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.704.3494&rep=rep1&type=pdf Hyperopt: A Python Library for Optimizing the Hyperparameters of Machine Learning Algorithms].
Proc. SciPy 2013.</ref><ref>Chris Thornton, Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown: [https://arxiv.org/abs/1208.3719 Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms]. KDD 2013: 847–855</ref><ref>Jasper Snoek, Hugo Larochelle and Ryan Prescott Adams. [https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf Practical Bayesian Optimization of Machine Learning Algorithms]. Advances in Neural Information Processing Systems, 2012</ref> [[reinforcement learning]],<ref>{{Cite thesis |title=Safe Exploration in Reinforcement Learning: Theory and Applications in Robotics |url=https://www.research-collection.ethz.ch/handle/20.500.11850/370833 |publisher=ETH Zurich |date=2019 |degree=Doctoral Thesis |doi=10.3929/ethz-b-000370833 |language=en |first=Felix |last=Berkenkamp|hdl=20.500.11850/370833 }}</ref> planning, visual attention, architecture configuration in [[deep learning]], static program analysis, experimental [[particle physics]],<ref>Philip Ilten, Mike Williams, Yunjie Yang. [https://arxiv.org/abs/1610.08328 Event generator tuning using Bayesian optimization]. 2017 JINST 12 P04028. DOI: 10.1088/1748-0221/12/04/P04028</ref><ref>Evaristo Cisbani et al. [https://iopscience.iop.org/article/10.1088/1748-0221/15/05/P05009 AI-optimized detector design for the future Electron-Ion Collider: the dual-radiator RICH case] 2020 JINST 15 P05009. DOI: 10.1088/1748-0221/15/05/P05009</ref> quality-diversity optimization,<ref>{{Cite arXiv |last1=Kent |first1=Paul |last2=Gaier |first2=Adam |last3=Mouret |first3=Jean-Baptiste |last4=Branke |first4=Juergen |date=2023-07-19 |title=BOP-Elites, a Bayesian Optimisation Approach to Quality Diversity Search with Black-Box descriptor functions |class=math.OC |eprint=2307.09326}} Preprint: Arxiv.</ref><ref>{{Cite book |last1=Kent |first1=Paul |last2=Branke |first2=Juergen |title=Proceedings of the Genetic and Evolutionary Computation Conference |chapter=Bayesian Quality Diversity Search with Interactive Illumination |date=2023-07-12 |chapter-url=https://dl.acm.org/doi/10.1145/3583131.3590486 |series=GECCO '23 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=1019–1026 |doi=10.1145/3583131.3590486 |isbn=979-8-4007-0119-1|s2cid=259833672 |url=https://wrap.warwick.ac.uk/175161/7/3583131.3590486.pdf }}</ref><ref>{{Cite journal |last1=Gaier |first1=Adam |last2=Asteroth |first2=Alexander |last3=Mouret |first3=Jean-Baptiste |date=2018-09-01 |title=Data-Efficient Design Exploration through Surrogate-Assisted Illumination |url=http://dx.doi.org/10.1162/evco_a_00231 |journal=Evolutionary Computation |volume=26 |issue=3 |pages=381–410 |doi=10.1162/evco_a_00231 |pmid=29883202 |s2cid=47003986 |issn=1063-6560|doi-access=free |arxiv=1806.05865 }}</ref> chemistry, material design, and drug development.<ref name=":0" /><ref>Gomez-Bombarelli et al. [https://pubs.acs.org/doi/10.1021/acscentsci.7b00572 Automatic Chemical Design using a Data-Driven Continuous Representation of Molecules]. ACS Central Science, Volume 4, Issue 2, 268-276 (2018)</ref><ref>Griffiths et al. [https://pubs.rsc.org/en/content/articlehtml/2020/sc/c9sc04026a Constrained Bayesian Optimization for Automatic Chemical Design using Variational Autoencoders] Chemical Science: 11, 577-586 (2020)</ref>
 
Bayesian Optimizationoptimization has been applied in the field of facial recognition.<ref name=“Bouchene2023”"Bouchene2023">Mohammed Mehdi Bouchene: Bayesian Optimization of Histogram of Oriented Gradients (Hog) Parameters for Facial Recognition. SSRN (2023)</ref>. The performance of the Histogram of Oriented Gradients (HOG) algorithm, a popular feature extraction method, heavily relies on its parameter settings. Optimizing these parameters can be challenging but crucial for achieving high accuracy.<ref name=“Bouchene2023”"Bouchene2023"/>. A novel approach to optimize the HOG algorithm parameters and image size for facial recognition using a Tree-structured Parzen Estimator (TPE) based Bayesian optimization technique has been proposed.<ref name=“Bouchene2023”"Bouchene2023"/>. This optimized approach has the potential to be adapted for other computer vision applications and contributes to the ongoing development of hand-crafted parameter-based feature extraction algorithms in computer vision.<ref name=“Bouchene2023”"Bouchene2023"/>.
 
==See also==
Line 52 ⟶ 78:
* [[Probabilistic numerics]]
* [[Pareto optimum]]
* [[Active learning (machine learning)]]
* [[Multi-objective optimization]]
 
==References==
Line 57 ⟶ 85:
 
{{Optimization algorithms}}
 
[[Category:Sequential methods]]
[[Category:Sequential experiments]]