{{Short description|Monte Carlo method for importance sampling and optimization}}
The '''cross-entropy''' ('''CE''') '''method''' is a [[Monte Carlo method|Monte Carlo]] method for [[importance sampling]] and [[Optimization (mathematics)|optimization]]. It is applicable to both [[Combinatorial optimization|combinatorial]] and [[Continuous optimization|continuous]] problems, with either a static or noisy objective.
 
The method approximates the optimal importance sampling estimator by repeating two phases:<ref>Rubinstein, R.Y. and Kroese, D.P. (2004), ''The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning'', Springer-Verlag, New York {{ISBN|978-0-387-21240-1}}.</ref>
 
#Draw a sample from a [[probability distribution]].
#Minimize the ''[[cross-entropy]]'' between this distribution and a target distribution to produce a "better" sample in the next iteration.
 
[[Reuven Rubinstein]] developed the method in the context of ''rare-event simulation'', where tiny probabilities must be estimated, for example in network reliability analysis, queueing models, or performance analysis of telecommunication systems. The method has also been applied to the [[traveling salesman problem|traveling salesman]], [[quadratic assignment problem|quadratic assignment]], [[Sequence alignment|DNA sequence alignment]], [[Maxcut|max-cut]] and buffer allocation problems.
 
==Estimation via importance sampling==
Consider the general problem of estimating the quantity
 
<math>\ell = \mathbb{E}_{\mathbf{u}}[H(\mathbf{X})] = \int H(\mathbf{x})\, f(\mathbf{x}; \mathbf{u})\, \textrm{d}\mathbf{x}</math>,
 
where <math>H</math> is some ''performance function'' and <math>f(\mathbf{x};\mathbf{u})</math> is a member of some [[parametric family]] of distributions. Using [[importance sampling]] this quantity can be estimated as
 
<math>\hat{\ell} = \frac{1}{N} \sum_{i=1}^N H(\mathbf{X}_i) \frac{f(\mathbf{X}_i; \mathbf{u})}{g(\mathbf{X}_i)}</math>,
 
where <math>\mathbf{X}_1,\dots,\mathbf{X}_N</math> is a random sample from <math>g\,</math>. For positive <math>H</math>, the theoretically ''optimal'' importance sampling [[probability density function|density]] (PDF) is given by
 
<math> g^*(\mathbf{x}) = H(\mathbf{x}) f(\mathbf{x};\mathbf{u})/\ell</math>.
 
This, however, depends on the unknown <math>\ell</math>. The CE method aims to approximate the optimal PDF by adaptively selecting members of the parametric family that are closest (in the [[Kullback–Leibler divergence|Kullback–Leibler]] sense) to the optimal PDF <math>g^*</math>.
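
For illustration, the importance sampling estimator <math>\hat{\ell}</math> above can be computed in a few lines of code. The following Python sketch uses an assumed rare-event performance function <math>H(x) = \mathrm{I}_{\{x > 3\}}</math>, a standard normal nominal density <math>f(\cdot;\mathbf{u})</math>, and a normal importance density <math>g</math> centred near the event; these particular choices are made only for demonstration.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 100_000

def H(x):
    # Performance function: indicator of the (assumed) rare event {x > 3}
    return (x > 3.0).astype(float)

# Nominal PDF f(x; u): standard normal.  Importance PDF g: normal centred at 3.
X = rng.normal(loc=3.0, scale=1.0, size=N)                              # sample from g
w = norm.pdf(X, loc=0.0, scale=1.0) / norm.pdf(X, loc=3.0, scale=1.0)   # f(X; u) / g(X)
ell_hat = np.mean(H(X) * w)        # importance-sampling estimate of P(X > 3) under f
print(ell_hat)                     # close to 1 - Phi(3), about 0.00135
</syntaxhighlight>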
 
==Generic CE algorithm==
# Choose initial parameter vector <math>\mathbf{v}^{(0)}</math>; set t = 1.
# Generate a random sample <math>\mathbf{X}_1,\dots,\mathbf{X}_N</math> from <math>f(\cdot;\mathbf{v}^{(t-1)})</math>
# Solve for <math>\mathbf{v}^{(t)}</math>, where<br><math>\mathbf{v}^{(t)} = \mathop{\textrm{argmax}}_{\mathbf{v}} \frac{1}{N} \sum_{i=1}^N H(\mathbf{X}_i) \frac{f(\mathbf{X}_i;\mathbf{u})}{f(\mathbf{X}_i;\mathbf{v}^{(t-1)})} \log f(\mathbf{X}_i;\mathbf{v})</math>
# If convergence is reached then '''stop'''; otherwise, increase t by 1 and reiterate from step 2.
 
In several cases, the solution to step 3 can be found ''analytically''. Situations in which this occurs are
* When <math>f\,</math> belongs to the [[Exponential family|natural exponential family]]
* When <math>f\,</math> is [[discrete space|discrete]] with finite [[Support (mathematics)|support]]
* When <math>H(\mathbf{X}) = \mathrm{I}_{\{\mathbf{x}\in A\}}</math> and <math>f(\mathbf{X}_i;\mathbf{u}) = f(\mathbf{X}_i;\mathbf{v}^{(t-1)})</math>, then <math>\mathbf{v}^{(t)}</math> corresponds to the [[Maximum likelihood|maximum likelihood estimator]] based on those <math>\mathbf{X}_k \in A</math>.
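
When <math>f\,</math> is, for instance, a one-dimensional Gaussian density parameterized by its mean and variance, the maximization in step 3 has the closed form of a weighted sample mean and variance. A minimal Python sketch of one iteration (steps 2–3), assuming for illustration the performance function <math>H(x) = \mathrm{I}_{\{x \geq 3\}}</math> and nominal parameter <math>\mathbf{u} = (0, 1)</math>, could read:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 10_000
mu_u, sigma_u = 0.0, 1.0        # nominal parameters u (assumed for the example)
mu_v, sigma_v = 2.0, 1.0        # current parameters v^(t-1)

# Step 2: draw a sample from f(.; v^(t-1))
X = rng.normal(mu_v, sigma_v, size=N)

# Weights H(X_i) f(X_i; u) / f(X_i; v^(t-1)),
# with H the indicator of the (illustrative) event {x >= 3}
H = (X >= 3.0).astype(float)
W = H * norm.pdf(X, mu_u, sigma_u) / norm.pdf(X, mu_v, sigma_v)

# Step 3: for the Gaussian family the argmax is the weighted sample mean and variance
mu_v = np.sum(W * X) / np.sum(W)
sigma_v = np.sqrt(np.sum(W * (X - mu_v) ** 2) / np.sum(W))
</syntaxhighlight>

Iterating these updates drives <math>f(\cdot;\mathbf{v}^{(t)})</math> towards the optimal importance sampling density <math>g^*</math>.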
 
== Continuous optimization&mdash;example==
The same CE algorithm can be used for optimization, rather than estimation.
Suppose the problem is to maximize some function <math>S(x)</math>, for example,
<math>S(x) = \textrm{e}^{-(x-2)^2} + 0.8\,\textrm{e}^{-(x+2)^2}</math>.
To apply CE, one considers first the ''associated stochastic problem'' of estimating
<math>\mathbb{P}_{\boldsymbol{\theta}}(S(X)\geq\gamma)</math> for a given ''level'' <math>\gamma\,</math>, and parametric family <math>\left\{f(\cdot;\boldsymbol{\theta})\right\}</math>, for example the 1-dimensional [[Gaussian distribution]], parameterized by its mean <math>\mu_t\,</math> and variance <math>\sigma_t^2\,</math> (so <math>\boldsymbol{\theta} = (\mu,\sigma^2)</math> here). Hence, for a given <math>\gamma\,</math>, the goal is to find <math>\boldsymbol{\theta}\,</math> so that <math>D_{\mathrm{KL}}(\textrm{I}_{\{S(x)\geq\gamma\}}\|f_{\boldsymbol{\theta}})</math> is minimized. This is done by solving the sample version (stochastic counterpart) of the KL divergence minimization problem, as in step 3 above. It turns out that parameters that minimize the stochastic counterpart for this choice of target distribution and
parametric family are the sample mean and sample variance corresponding to the ''elite samples'', which are those samples that have objective function value <math>\geq\gamma</math>.
The worst of the elite samples is then used as the level parameter for the next iteration.
This yields the following randomized algorithm that happens to coincide with the so-called Estimation of Multivariate Normal Algorithm (EMNA), an [[estimation of distribution algorithm]].
 
===Pseudocode===
 ''// Initialize parameters''
 &mu; := −6
 &sigma;2 := 100
 t := 0
 maxits := 100
 N := 100
 Ne := 10
 ''// While maxits not exceeded and not converged''
 '''while''' t < maxits '''and''' &sigma;2 > &epsilon; '''do'''
     ''// Obtain N samples from current sampling distribution''
     X := SampleGaussian(&mu;, &sigma;2, N)
     ''// Evaluate objective function at sampled points''
     S := exp(−(X − 2) ^ 2) + 0.8 exp(−(X + 2) ^ 2)
     ''// Sort X by objective function values in descending order''
     X := sort(X, S)
     ''// Update parameters of sampling distribution via elite samples''
     &mu; := mean(X(1:Ne))
     &sigma;2 := var(X(1:Ne))
     t := t + 1
 ''// Return mean of final sampling distribution as solution''
 '''return''' &mu;
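
A direct transcription of this pseudocode into Python (using NumPy; the convergence tolerance <math>\epsilon = 10^{-8}</math> and the random seed are arbitrary choices) could read as follows:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

mu, sigma2 = -6.0, 100.0      # initial mean and variance of the sampling distribution
t, maxits = 0, 100
N, Ne = 100, 10               # sample size and number of elite samples
epsilon = 1e-8                # convergence tolerance on the variance (arbitrary choice)

while t < maxits and sigma2 > epsilon:
    X = rng.normal(mu, np.sqrt(sigma2), size=N)               # N samples from current distribution
    S = np.exp(-(X - 2) ** 2) + 0.8 * np.exp(-(X + 2) ** 2)   # objective values
    elite = X[np.argsort(S)[::-1][:Ne]]                        # Ne samples with the highest S(x)
    mu, sigma2 = elite.mean(), elite.var()                     # refit the Gaussian to the elite set
    t += 1

print(mu)   # close to 2, the global maximizer of S(x)
</syntaxhighlight>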
 
==Related methods==
* [[Simulated annealing]]
* [[Genetic algorithms]]
* [[Harmony search]]
* [[Estimation of distribution algorithm]]
* [[Tabu search]]
* [[Natural Evolution Strategy]]
* [[Ant colony optimization algorithms]]
 
==See also==
* [[Cross entropy]]
* [[Kullback–Leibler divergence]]
* [[Randomized algorithm]]
* [[Importance sampling]]
 
== Journal papers ==
* De Boer, P.-T., Kroese, D.P., Mannor, S. and Rubinstein, R.Y. (2005). A Tutorial on the Cross-Entropy Method. ''Annals of Operations Research'', '''134''' (1), 19–67. [http://www.maths.uq.edu.au/~kroese/ps/aortut.pdf]
* Rubinstein, R.Y. (1997). Optimization of Computer Simulation Models with Rare Events, ''European Journal of Operational Research'', '''99''', 89–112.
 
==Software implementations==
* [https://ceopt.org '''CEopt''' Matlab package]
* [https://cran.r-project.org/web/packages/CEoptim/index.html '''CEoptim''' R package]
* [https://www.nuget.org/packages/Novacta.Analytics '''Novacta.Analytics''' .NET library]

==External links==
* [http://www.cemethod.org/ Homepage for the CE method]
 
==References==
{{reflist}}
 
[[Category:Heuristics]]
[[Category:Optimization algorithms and methods]]
[[Category:Monte Carlo methods]]
[[Category:Machine learning]]