Frank–Wolfe algorithm: Difference between revisions

Content deleted Content added
Mathmon (talk | contribs)
The direction-finding subproblem and the update rule did not comply with each other. Either x_k +s in D in the subproblem and x_k+1 <-- x_k + \alpha s in the update or s in D in the subproblem and x_k+1 <-- x_k + \alpha (s - x_k) in the update are used. See Jaggi (2013)
 
(45 intermediate revisions by 31 users not shown)
Line 1:
{{Short description|Optimization algorithm}}
The '''Frank–Wolfe algorithm''' is a simplean [[iterative method|iterative]] [[First-order approximation|first-order]] [[Mathematical optimization|optimization]] [[algorithm]] for [[constrained optimization|constrained]] [[convex optimization]]. Also known as the '''conditional gradient method''',<ref>{{Cite doijournal | last1 = Levitin | first1 = E. S. | last2 = Polyak | first2 = B. T. | doi = 10.1016/0041-5553(66)90114-5 |noedit title = Constrained minimization methods | journal = USSR Computational Mathematics and Mathematical Physics | volume = 6 | issue = 5 | pages = 1 | year = 1966 }}</ref> '''reduced gradient algorithm''' and the '''convex combination algorithm''', the method was originally proposed by [[Marguerite Frank]] and [[Philip Wolfe (mathematician)|Philip Wolfe]] in&nbsp;1956.<ref>{{citeCite journal doi| last1 = Frank | first1 = M. | last2 = Wolfe | first2 = P. | doi = 10.1002/nav.3800030109 |noedit title = An algorithm for quadratic programming | journal = Naval Research Logistics Quarterly | volume = 3 | issue = 1–2 | pages = 95–110 | year = 1956 }}</ref> In each iteration, the Frank–Wolfe algorithm considers a [[linear approximation]] of the objective function, and moves slightly towards a minimizer of this linear function (taken over the same ___domain).
 
==Problem statement==
 
Suppose <math>\mathcal{D}</math> is a [[compact space|compact]] [[convex set]] in a [[vector space]] and <math>f \colon \mathcal{D} \to \mathbb{R}</math> is a [[convex function|convex]], [[differentiable function|differentiable]] [[real-valued function]]. The Frank–Wolfe algorithm solves the [[optimization problem]]
:Minimize <math> f(\mathbf{x})</math>
:subject to <math> \mathbf{x} \in \mathcal{D}</math>.
Where the function f is [[Convex function|convex]] and [[differentiable function|differentiable]], and the ___domain / feasible set <math>\mathcal{D}</math> is [[Convex set|convex]] and bounded.
 
==Algorithm==
[[File:Frank-Wolfe- Algorithm.png|thumbnail|right|A step of the Frank-WolfeFrank–Wolfe algorithm]]
 
:''Initialization.:'' Let <math>k \leftarrow 0</math>, and let <math>\mathbf{x}_0 \!</math> be any point in <math>\mathcal{D}</math>.
 
:'''Step 1.''' ''Direction-finding subproblem.:'' Find <math>\mathbf{s}_k</math> solving
::Minimize <math> \mathbf{s}^T, \nabla f(\mathbf{x}_k)</math>
::Subject to <math> \mathbf{s} \in \mathcal{D}</math>
:''(Interpretation: Minimize the linear approximation of the problem given by the first-order [[Taylor series|Taylor approximation]] of <math>f</math> around <math>\mathbf{x}_k \!</math>. constrained to stay within <math>\mathcal{D}</math>.)''
 
:'''Step 2.''' ''Step size determination.:'' Set <math>\gammaalpha \leftarrow \frac{2}{k+2}</math>, or alternatively find <math>\gammaalpha</math> that minimizes <math> f(\mathbf{x}_k+\gammaalpha(\mathbf{s}_k -\mathbf{x}_k))</math> subject to <math>0 \le \gammaalpha \le 1</math> .
 
:'''Step 3.''' ''Update.:'' Let <math>\mathbf{x}_{k+1}\leftarrow \mathbf{x}_k+\gammaalpha(\mathbf{s}_k-\mathbf{x}_k)</math>, let <math>k \leftarrow k+1</math> and go back to Step 1.
 
==Properties==
While competing methods such as [[gradient descent]] for constrained optimization require a [[Projection (mathematics)|projection step]] back to the feasible set in each iteration, the Frank–Wolfe algorithm only needs the solution of a linearconvex problem over the same set in each iteration, and automatically stays in the feasible set.
 
The convergence of the Frank–Wolfe algorithm is sublinear in general: the error in the objective function to the optimum is <math>O(1/k)</math> after ''k'' iterations, so long as the gradient is [[Lipschitz continuity|Lipschitz continuous]] with respect to some norm. The same convergence rate can also be shown if the sub-problems are only solved approximately.<ref>{{citeCite journal doi| last1 = Dunn | first1 = J. C. | last2 = Harshbarger | first2 = S. | doi = 10.1016/0022-247X(78)90137-3 |noedit title = Conditional gradient algorithms with open loop step size rules | journal = Journal of Mathematical Analysis and Applications | volume = 62 | issue = 2 | pages = 432 | year = 1978 | doi-access = free }}</ref>
 
The iteratesiterations of the algorithm can always be represented as a sparse convex combination of the extreme points of the feasible set, which has helped to the popularity of the algorithm for sparse greedy optimization in [[machine learning]] and [[signal processing]] problems,<ref>{{citeCite journal doi| last1 = Clarkson | first1 = K. L. | title = Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm | doi = 10.1145/1824777.1824783 |noedit journal = ACM Transactions on Algorithms | volume = 6 | issue = 4 | pages = 1–30 | year = 2010 | citeseerx = 10.1.1.145.9299 }}</ref> as well as for example the optimization of [[flow network|minimum–cost flow]]s in [[Transport network|transportation network]]s.<ref>{{citeCite journal doi| last1 = Fukushima | first1 = M. | title = A modified Frank-Wolfe algorithm for solving the traffic assignment problem | doi = 10.1016/0191-2615(84)90029-8 |noedit journal = Transportation Research Part B: Methodological | volume = 18 | issue = 2 | pages = 169–177| year = 1984 }}</ref>
 
If the feasible set is given by a set of linear constraints, then the subproblem to be solved in each iteration becomes a [[linear programming|linear program]].
 
While the worst-case convergence rate with <math>O(1/k)</math> can not be improved in general, faster convergence can be obtained for special problem classes, such as some strongly convex problems.<ref>{{Cite book|title=Nonlinear Programming|first= Dimitri |last=Bertsekas|year= 20031999|page= 222215|publisher= Athena Scientific| isbn =978-1-886529-00-07}}</ref>
 
==Lower bounds on the solution value, and primal-dual analysis==
 
Since <math>f</math> is convex, <math>f(\mathbf{y})</math> is always above the [[TangentConvex function|tangent planeconvex]], offor <math>f</math>any at anytwo pointpoints <math>\mathbf{x}, \mathbf{y} \in \mathcal{D}</math> we have:
 
:<math>
Line 40 ⟶ 41:
</math>
 
This holdsalso in particularholds for the (unknown) optimal solution <math>\mathbf{x}^*</math>. That is, <math>f(\mathbf{x}^*) \ge f(\mathbf{x}) + (\mathbf{x}^* - \mathbf{x})^T \nabla f(\mathbf{x})</math>. The best lower bound with respect to a given point <math>\mathbf{x}</math> is given by
 
:<math>
\begin{align}
f(\mathbf{x}^*) \geq \min_{\mathbf{x} \in D} f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x}) = f(\mathbf{x}) - \mathbf{x}^T \nabla f(\mathbf{x}) + \min_{\mathbf{y} \in D} \mathbf{y}^T \nabla f(\mathbf{x})
f(\mathbf{x}^*)
& \ge f(\mathbf{x}) + (\mathbf{x}^* - \mathbf{x})^T \nabla f(\mathbf{x}) \\
&\geq \min_{\mathbf{y} \in D} \left\{ f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x}) \right\} \\
f(\mathbf{x}^*) \geq \min_{\mathbf{x} \in D} f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x}) &= f(\mathbf{x}) - \mathbf{x}^T \nabla f(\mathbf{x}) + \min_{\mathbf{y} \in D} \mathbf{y}^T \nabla f(\mathbf{x})
\end{align}
</math>
 
The latter optimization problem is solved in every iteration of the Frank-WolfeFrank–Wolfe algorithm, therefore the solution <math>\mathbf{s}_k</math> of the direction-finding subproblem of the <math>k</math>-th iteration can be used to determine increasing lower bounds <math>u_kl_k</math> during each iteration by setting <math>u_0l_0 = - \infty</math> and
 
:<math>
u_kl_k := \max (u_l_{k - 1}, f(\mathbf{x}_k) + (\mathbf{s}_k - \mathbf{x}_k)^T \nabla f(\mathbf{x}_k))
</math>
Such lower bounds on the unknown optimal value are important in practice because they can be used as a stopping criterion, and give an efficient certificate of the approximation quality in every iteration, since always <math>l_k \leq f(\mathbf{x}^*) \leq f(\mathbf{x}_k)</math>.
 
It has been shown that this corresponding [[duality gap]], that is the difference between <math>f(\mathbf{x}_k)</math> and the lower bound <math>l_k</math>, decreases with the same convergence rate, i.e.
<math>
f(\mathbf{x}_k) - l_k = O(1/k) .
</math>
 
Line 56 ⟶ 68:
 
==Bibliography==
*{{cite journal|last=Jaggi|first=Martin|title=Revisiting Frank-WolfeFrank–Wolfe: Projection-Free Sparse Convex Optimization|journal=Journal of Machine Learning Research: Workshop and Conference Proceedings |volume=28|issue=1|pages=427–435|year= 2013 |url=http://jmlr.csail.mit.edu/proceedings/papers/v28/jaggi13.html}} (Overview paper)
*[http://www.math.chalmers.se/Math/Grundutb/CTH/tma946/0203/fw_eng.pdf The Frank-WolfeFrank–Wolfe algorithm] description
* {{Cite book | last1=Nocedal | first1=Jorge | last2=Wright | first2=Stephen J. | title=Numerical Optimization | publisher=[[Springer-Verlag]] | ___location=Berlin, New York | edition=2nd | isbn=978-0-387-30303-1 | year=2006 }}.
 
==External links==
*https://conditional-gradients.org/: a survey of Frank–Wolfe algorithms.
*[https://www.youtube.com/watch?v=24e08AX9Eww Marguerite Frank giving a personal account of the history of the algorithm]
 
== See also ==
* [[Proximal Gradientgradient Methodsmethods]]
 
{{Optimization algorithms|convex}}