{{Short description|Optimization algorithm}}
The '''Frank–Wolfe algorithm''' is an [[iterative method|iterative]] [[First-order approximation|first-order]] [[Mathematical optimization|optimization]] [[algorithm]] for [[constrained optimization|constrained]] [[convex optimization]]. Also known as the '''conditional gradient method''',<ref>{{Cite journal | last1 = Levitin | first1 = E. S. | last2 = Polyak | first2 = B. T. | doi = 10.1016/0041-5553(66)90114-5 | title = Constrained minimization methods | journal = USSR Computational Mathematics and Mathematical Physics | volume = 6 | issue = 5 | pages = 1 | year = 1966 }}</ref> '''reduced gradient algorithm''' and the '''convex combination algorithm''', the method was originally proposed by [[Marguerite Frank]] and [[Philip Wolfe (mathematician)|Philip Wolfe]] in&nbsp;1956.<ref>{{Cite journal | last1 = Frank | first1 = M. | last2 = Wolfe | first2 = P. | doi = 10.1002/nav.3800030109 | title = An algorithm for quadratic programming | journal = Naval Research Logistics Quarterly | volume = 3 | issue = 1–2 | pages = 95–110 | year = 1956 }}</ref> In each iteration, the Frank–Wolfe algorithm considers a [[linear approximation]] of the objective function, and moves towards a minimizer of this linear function (taken over the same ___domain).
 
 
==Problem statement==
 
Suppose <math>\mathcal{D}</math> is a [[compact space|compact]] [[convex set]] in a [[vector space]] and <math>f \colon \mathcal{D} \to \mathbb{R}</math> is a [[convex function|convex]], [[differentiable function|differentiable]] [[real-valued function]]. The Frank–Wolfe algorithm solves the [[optimization problem]]
:Minimize <math> f(\mathbf{x})</math>
:subject to <math> \mathbf{x} \in \mathcal{D}</math>.
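
As a concrete illustration (the instance below is chosen only for exposition; the matrix <code>Q</code>, the vector <code>b</code>, and the choice of the probability simplex as <math>\mathcal{D}</math> are not part of the general formulation), a convex quadratic over the probability simplex fits this template:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative instance (assumed, not from the article): a convex quadratic
# objective over the probability simplex D = {x : x >= 0, sum(x) = 1}.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A.T @ A + np.eye(n)          # symmetric positive definite, so f is convex
b = rng.standard_normal(n)

def f(x):
    """Objective f(x) = 1/2 x^T Q x + b^T x."""
    return 0.5 * x @ Q @ x + b @ x

def grad_f(x):
    """Gradient of f."""
    return Q @ x + b

x0 = np.full(n, 1.0 / n)         # a feasible starting point in D
print(f(x0), grad_f(x0))
</syntaxhighlight>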
 
==Algorithm==
[[File:Frank-Wolfe Algorithm.png|thumbnail|right|A step of the Frank–Wolfe algorithm]]
 
:''Initialization:'' Let <math>k \leftarrow 0</math>, and let <math>\mathbf{x}_0 \!</math> be any point in <math>\mathcal{D}</math>.
 
:'''Step 1.''' ''Direction-finding subproblem:'' Find <math>\mathbf{s}_k</math> solving
::Minimize <math>\mathbf{s}^T \nabla f(\mathbf{x}_k)</math>
::Subject to <math> \mathbf{s} \in \mathcal{D}</math>
:''(Interpretation: Minimize the linear approximation of the problem given by the first-order [[Taylor series|Taylor approximation]] of <math>f</math> around <math>\mathbf{x}_k \!</math> constrained to stay within <math>\mathcal{D}</math>.)''
 
:'''Step 2.''' ''Step size determination:'' Set <math>\alpha \leftarrow \frac{2}{k+2}</math>, or alternatively find <math>\alpha</math> that minimizes <math> f(\mathbf{x}_k+\alpha(\mathbf{s}_k -\mathbf{x}_k))</math> subject to <math>0 \le \alpha \le 1</math>.
 
:'''Step 3.''' ''Update:'' Let <math>\mathbf{x}_{k+1}\leftarrow \mathbf{x}_k+\alpha(\mathbf{s}_k-\mathbf{x}_k)</math>, let <math>k \leftarrow k+1</math> and go back to Step 1.
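
A minimal sketch of these three steps in Python, using the illustrative quadratic-over-simplex instance from the problem statement above; for the probability simplex the direction-finding subproblem has the closed-form solution of picking the vertex with the smallest gradient entry, and the default step size <math>\alpha = \tfrac{2}{k+2}</math> is used:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative instance (assumed, as above): convex quadratic over the probability simplex.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A.T @ A + np.eye(n)
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ Q @ x + b @ x
grad_f = lambda x: Q @ x + b

def lmo_simplex(g):
    """Step 1 for the probability simplex: argmin_{s in D} s^T g is the
    vertex e_i with the smallest gradient entry g_i."""
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

def frank_wolfe(x0, num_iters=200):
    x = x0.copy()
    for k in range(num_iters):
        s = lmo_simplex(grad_f(x))   # Step 1: direction-finding subproblem
        alpha = 2.0 / (k + 2.0)      # Step 2: default open-loop step size
        x = x + alpha * (s - x)      # Step 3: update; x stays in D as a convex combination
    return x

x = frank_wolfe(np.full(n, 1.0 / n))
print("approximate minimizer:", x, "objective value:", f(x))
</syntaxhighlight>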
 
==Properties==
While competing methods such as [[gradient descent]] for constrained optimization require a [[Projection (mathematics)|projection step]] back to the feasible set in each iteration, the Frank–Wolfe algorithm only needs the solution of a convex problem over the same set in each iteration, and automatically stays in the feasible set.
 
The convergence of the Frank–Wolfe algorithm is sublinear in general: the error in the objective function to the optimum is <math>O(1/k)</math> after ''k'' iterations, so long as the gradient is [[Lipschitz continuity|Lipschitz continuous]] with respect to some norm. The same convergence rate can also be shown if the sub-problems are only solved approximately.<ref>{{Cite journal | last1 = Dunn | first1 = J. C. | last2 = Harshbarger | first2 = S. | doi = 10.1016/0022-247X(78)90137-3 | title = Conditional gradient algorithms with open loop step size rules | journal = Journal of Mathematical Analysis and Applications | volume = 62 | issue = 2 | pages = 432 | year = 1978 | doi-access = free }}</ref>
 
The iterations of the algorithm can always be represented as a sparse convex combination of the extreme points of the feasible set, which has contributed to the popularity of the algorithm for sparse greedy optimization in [[machine learning]] and [[signal processing]] problems,<ref>{{Cite journal | last1 = Clarkson | first1 = K. L. | title = Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm | doi = 10.1145/1824777.1824783 | journal = ACM Transactions on Algorithms | volume = 6 | issue = 4 | pages = 1–30 | year = 2010 | citeseerx = 10.1.1.145.9299 }}</ref> as well as, for example, the optimization of [[flow network|minimum–cost flow]]s in [[Transport network|transportation network]]s.<ref>{{Cite journal | last1 = Fukushima | first1 = M. | title = A modified Frank-Wolfe algorithm for solving the traffic assignment problem | doi = 10.1016/0191-2615(84)90029-8 | journal = Transportation Research Part B: Methodological | volume = 18 | issue = 2 | pages = 169–177| year = 1984 }}</ref>
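
The sparsity can be made explicit by tracking the convex-combination weights of the extreme points visited so far (again on the illustrative simplex instance used above, which is an assumption of this sketch rather than part of the algorithm); after <math>k</math> iterations the iterate is a combination of at most <math>k+1</math> atoms, regardless of the ambient dimension:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative instance (assumed): the same kind of quadratic objective, now in dimension 50.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
Q = A.T @ A + np.eye(n)
b = rng.standard_normal(n)
grad_f = lambda x: Q @ x + b

x = np.full(n, 1.0 / n)
weights = {"x0": 1.0}                    # convex-combination weights of x0 and visited vertices
for k in range(20):
    i = int(np.argmin(grad_f(x)))        # vertex e_i solves the direction-finding subproblem
    alpha = 2.0 / (k + 2.0)
    x = (1 - alpha) * x + alpha * np.eye(n)[i]
    weights = {key: (1 - alpha) * w for key, w in weights.items()}  # existing atoms shrink
    weights[i] = weights.get(i, 0.0) + alpha                        # new atom enters with weight alpha

# At most 21 atoms are tracked (x0 plus one vertex per iteration), although the dimension is 50.
print(len(weights), round(sum(weights.values()), 6))
</syntaxhighlight>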
 
If the feasible set is given by a set of linear constraints, then the subproblem to be solved in each iteration becomes a [[linear programming|linear program]].
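
For instance, if <math>\mathcal{D} = \{\mathbf{s} : A\mathbf{s} \le \mathbf{b}\}</math>, the direction-finding subproblem can be passed to any off-the-shelf LP solver; the sketch below uses <code>scipy.optimize.linprog</code> on an arbitrary illustrative box written as linear inequalities (the polyhedron and dimensions are assumptions of the example):

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import linprog

def lmo_polyhedron(grad, A_ub, b_ub):
    """Direction-finding subproblem for D = {s : A_ub s <= b_ub}:
    a linear program min_s grad^T s, solved with an off-the-shelf LP solver."""
    res = linprog(c=grad, A_ub=A_ub, b_ub=b_ub, bounds=(None, None), method="highs")
    if not res.success:
        raise RuntimeError("LP subproblem failed: " + res.message)
    return res.x

# Illustrative polyhedron (a box written as linear inequalities): -1 <= s_i <= 1.
n = 3
A_ub = np.vstack([np.eye(n), -np.eye(n)])
b_ub = np.ones(2 * n)
print(lmo_polyhedron(np.array([1.0, -2.0, 0.5]), A_ub, b_ub))  # returns a vertex of the box
</syntaxhighlight>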
 
While the worst-case convergence rate of <math>O(1/k)</math> cannot be improved in general, faster convergence can be obtained for special problem classes, such as some strongly convex problems.<ref>{{Cite book |title=Nonlinear Programming |first=Dimitri |last=Bertsekas |year=1999 |page=215 |publisher=Athena Scientific |isbn=978-1-886529-00-7}}</ref>
 
==Lower bounds on the solution value, and primal-dual analysis==
 
Since <math>f</math> is [[Convex function|convex]], for any two points <math>\mathbf{x}, \mathbf{y} \in \mathcal{D}</math> we have:
 
:<math>
f(\mathbf{y}) \geq f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x})
</math>
 
This also holds for the (unknown) optimal solution <math>\mathbf{x}^*</math>. That is, <math>f(\mathbf{x}^*) \ge f(\mathbf{x}) + (\mathbf{x}^* - \mathbf{x})^T \nabla f(\mathbf{x})</math>. The best lower bound with respect to a given point <math>\mathbf{x}</math> is given by
 
:<math>
\begin{align}
f(\mathbf{x}^*)
& \ge f(\mathbf{x}) + (\mathbf{x}^* - \mathbf{x})^T \nabla f(\mathbf{x}) \\
&\geq \min_{\mathbf{y} \in \mathcal{D}} \left\{ f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x}) \right\} \\
&= f(\mathbf{x}) - \mathbf{x}^T \nabla f(\mathbf{x}) + \min_{\mathbf{y} \in \mathcal{D}} \mathbf{y}^T \nabla f(\mathbf{x})
\end{align}
</math>
 
The latter optimization problem is solved in every iteration of the Frank–Wolfe algorithm; therefore, the solution <math>\mathbf{s}_k</math> of the direction-finding subproblem of the <math>k</math>-th iteration can be used to determine increasing lower bounds <math>l_k</math> during each iteration by setting <math>l_0 = - \infty</math> and
 
:<math>
l_k := \max (l_{k - 1}, f(\mathbf{x}_k) + (\mathbf{s}_k - \mathbf{x}_k)^T \nabla f(\mathbf{x}_k))
</math>
Such lower bounds on the unknown optimal value are important in practice because they can be used as a stopping criterion and give an efficient certificate of the approximation quality in every iteration, since <math>l_k \leq f(\mathbf{x}^*) \leq f(\mathbf{x}_k)</math> always holds.
 
It has been shown that the corresponding [[duality gap]], that is, the difference between <math>f(\mathbf{x}_k)</math> and the lower bound <math>l_k</math>, decreases with the same convergence rate, i.e.,
<math>
f(\mathbf{x}_k) - l_k = O(1/k) .
</math>
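
In code, this certificate is obtained essentially for free from the subproblem solution; the sketch below (on the same illustrative simplex instance assumed earlier) updates <math>l_k</math> in every iteration and stops once the gap <math>f(\mathbf{x}_k) - l_k</math> falls below a tolerance:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative instance (assumed): the same quadratic-over-simplex example used earlier.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A.T @ A + np.eye(n)
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ Q @ x + b @ x
grad_f = lambda x: Q @ x + b

x = np.full(n, 1.0 / n)
lower_bound = -np.inf                    # l_0 = -infinity
tol = 1e-3
for k in range(100_000):
    g = grad_f(x)
    s = np.zeros(n)
    s[np.argmin(g)] = 1.0                               # s_k from the direction-finding subproblem
    lower_bound = max(lower_bound, f(x) + (s - x) @ g)  # l_k
    gap = f(x) - lower_bound                            # certified gap, >= f(x_k) - f(x*)
    if gap <= tol:                                      # stopping criterion
        break
    x = x + (2.0 / (k + 2.0)) * (s - x)
print("iterations:", k, "objective:", f(x), "certified gap:", gap)
</syntaxhighlight>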
 
== Notes ==
{{Reflist}}
 
==Bibliography==
*{{cite journal|last=Jaggi|first=Martin|title=Revisiting Frank–Wolfe: Projection-Free Sparse Convex Optimization|journal=Journal of Machine Learning Research: Workshop and Conference Proceedings |volume=28|issue=1|pages=427–435|year= 2013 |url=http://jmlr.csail.mit.edu/proceedings/papers/v28/jaggi13.html}} (Overview paper)
*[http://www.math.chalmers.se/Math/Grundutb/CTH/tma946/0203/fw_eng.pdf The Frank–Wolfe algorithm] description
* {{Cite book | last1=Nocedal | first1=Jorge | last2=Wright | first2=Stephen J. | title=Numerical Optimization | publisher=[[Springer-Verlag]] | ___location=Berlin, New York | edition=2nd | isbn=978-0-387-30303-1 | year=2006 }}.
 
==External links==
*https://conditional-gradients.org/: a survey of Frank–Wolfe algorithms.
*[https://www.youtube.com/watch?v=24e08AX9Eww Marguerite Frank giving a personal account of the history of the algorithm]
 
== See also ==
* [[Proximal gradient methods]]
 
{{Optimization algorithms|convex}}
 
{{DEFAULTSORT:Frank-Wolfe algorithm}}
[[Category:Optimization algorithms and methods]]
[[Category:Iterative methods]]
[[Category:First order methods]]
[[Category:Gradient methods]]