Subgradient method: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 18:34, 14 October 2010 edit Kiefer.Wolfowitz (talk \| contribs) 39,688 edits →Subgradient-projection methods: Put related things together (Orwell's rule) ← Previous edit		Latest revision as of 20:07, 23 February 2025 edit undo 134.10.136.81 (talk) updated outdated term Tags: Mobile edit Mobile app edit iOS app edit App section source
(61 intermediate revisions by 34 users not shown)
Line 1: {{Short description\|Concept in convex optimization mathematics}} '''Subgradient methods''' are [[iterative method]]s for solving [[convex optimization\|convex minimization]] problems. Originally developed by [[Naum Z. Shor]] and others in the 1960s and 1970s, subgradient methods are convergent when applied even to a non-differentiable objective function. When the objective function is differentiable, subgradient methods for unconstrained problems use the same search direction as the method of [[gradient descent\|steepest descent]]. '''Subgradient methods''' are [[convex optimization]] methods which use [[Subderivative\|subderivatives]]. Originally developed by [[Naum Z. Shor]] and others in the 1960s and 1970s, subgradient methods are convergent when applied even to a non-differentiable objective function. When the objective function is differentiable, sub-gradient methods for unconstrained problems use the same search direction as the method of [[gradient descent\|gradient descent]]. Subgradient methods are slower than Newton's method when applied to minimize twice continuously differentiable convex functions. However, Newton's method fails to converge on problems that have non-differentiable kinks. In recent years, some [[interior-point methods]] have been suggested for convex minimization problems, but subgradient projection methods and related bundle methods of descent remain competitive. For convex minimization problems with ~~enormomous~~very large number of dimensions, subgradient-projection methods are suitable, because of they require little storage. Subgradient projection methods are often applied to large-scale problems with decomposition techniques. Such decomposition methods often allow a simple distributed method for a problem. Line 9 ⟶ 10: ==Classical subgradient rules== Let <math>f : \~~mathbb{R}~~Reals^n \to \~~mathbb{R}~~Reals</math> be a [[convex function]] with ___domain <math>\~~mathbb{R}~~Reals^n.</math>. A classical subgradient method iterates :<math display=block>x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)} \ </math> where <math>g^{(k)}</math> denotes a''any'' [[subgradient]] of <math> f \ </math> at <math>x^{(k)}, \ </math>. and If<math>x^{(k)}</math> is the <math>k^{th}</math> iterate of <math>x.</math> If <math>f \ </math> is differentiable, then its only subgradient is the gradient vector <math>\nabla f</math> itself. It may happen that <math>-g^{(k)}</math> is not a descent direction for <math>f \ </math> at <math>x^{(k)}.</math>. We therefore maintain a list <math>f_{\rm{best}} \ </math> that keeps track of the lowest objective function value found so far, i.e. :<math display=block>f_{\rm{best}}^{(k)} = \min\{f_{\rm{best}}^{(k-1)} , f(x^{(k)}) \}.</math> ===Step size rules=== Line 20 ⟶ 23: Constant step size, <math>\alpha_k = \alpha.</math> Constant step length, <math>\alpha_k = \gamma/\lVert g^{(k)} \rVert_2,</math>, which gives <math>\lVert x^{(k+1)} - x^{(k)} \rVert_2 = \gamma.</math> Square summable but not summable step size, i.e. any step sizes satisfying <math display="block">\alpha_k\geq0,\qquad\sum_{k=1}^\infty \alpha_k^2 < \infty,\qquad \sum_{k=1}^\infty \alpha_k = \infty.</math> :Nonsummable diminishing, i.e. any step sizes satisfying <math display="block">\alpha_k \~~geq0~~geq 0,\qquad \~~sum_~~lim_{k~~=1}^~~\to\infty} \alpha_k^2 <= ~~\infty~~0,\qquad \sum_{k=1}^\infty \alpha_k = \infty.</math> Nonsummable diminishing step lengths, i.e. <math>\alpha_k = \gamma_k/\lVert g^{(k)} \rVert_2,</math> where <math display="block">\gamma_k \geq 0,\qquad \lim_{k\to\infty} \gamma_k = 0,\qquad \sum_{k=1}^\infty \gamma_k = \infty.</math> Nonsummable diminishing, i.e. any step sizes satisfying For all five rules, the step-sizes are determined "off-line", before the method is iterated; the step-sizes do not depend on preceding iterations. This "off-line" property of subgradient methods differs from the "on-line" step-size rules used for descent methods for differentiable functions: Many methods for minimizing differentiable functions satisfy Wolfe's sufficient conditions for convergence, where step-sizes typically depend on the current point and the current search-direction. An extensive discussion of stepsize rules for subgradient methods, including incremental versions, is given in the books by Bertsekas<ref>{{cite book ~~:<math>\alpha_k\geq0,\qquad \lim_{k\to\infty} \alpha_k = 0,\qquad \sum_{k=1}^\infty \alpha_k = \infty.</math>~~ \| last = Bertsekas Nonsummable diminishing step lengths, i.e. <math>\alpha_k = \gamma_k/\lVert g^{(k)} \rVert_2</math>, where \| first = Dimitri P. ~~:<math>\gamma_k\geq0,\qquad \lim_{k\to\infty} \gamma_k = 0,\qquad \sum_{k=1}^\infty \gamma_k = \infty.</math>~~ \| author-link = Dimitri P. Bertsekas For all five rules, the step-sizes are determined "off-line", before the method is iterated; the step-sizes do not depend on preceding iterations. This "off-line" property of subgradient methods differs from the "on-line" step-size rules used for descent methods for differentiable functions: Many methods for minimizing differentiable functions satisfy Wolfe's sufficient conditions for convergence, where step-sizes typically depend on the current point and the current search-direction. \| title = Convex Optimization Algorithms \| edition = Second \| publisher = Athena Scientific \| year = 2015 \| ___location = Belmont, MA. \| isbn = 978-1-886529-28-1 }}</ref> and by Bertsekas, Nedic, and Ozdaglar.<ref>{{cite book \|last1=Bertsekas \|first1=Dimitri P. \|last2=Nedic \|first2=Angelia \|last3 = Ozdaglar \|first3 = Asuman \| title = Convex Analysis and Optimization \| edition = Second \| publisher = Athena Scientific \| year = 2003 \| ___location = Belmont, MA. \| isbn = 1-886529-45-0 }}</ref> ===Convergence results=== For constant step-length and scaled subgradients having [[Euclidean norm]] equal to one, the subgradient method converges to an arbitrarily close approximation to the minimum value, that is :<math>\lim_{k\to\infty} f_{\rm{best}}^{(k)} - f^ <\epsilon</math> by a result of [[Naum Z. Shor\|Shor]].<ref> The approximate convergence of the constant step-size (scaled) subgradient method is stated as Exercise 6.3.14(a) in [[Dimitri P. Bertsekas\|Bertsekas]] (page 636): {{cite book \| last = Bertsekas \| first = Dimitri P. \| ~~authorlink~~author-link = Dimitri P. Bertsekas \| title = Nonlinear Programming \| edition = Second \| publisher = Athena Scientific \| ~~date~~year = 1999 \| ___location = Cambridge, MA. \| isbn = 1-886529-00-0 }} On page 636, Bertsekas attributes this result to ~~[[Naum Z.~~ Shor~~\|Shor]]~~: {{cite book \| last = Shor \| first = Naum Z. \| ~~authorlink~~author-link = Naum Z. Shor \| title = Minimization Methods for Non-differentiable Functions \| publisher = [[Springer-Verlag]] \| isbn = 0-387-12763-1 \| ~~date~~year = 1985 }} </ref> These classical subgradient methods have poor performance and are no longer recommended for general use.<ref name="Lem"/><ref name="KLL"/> However, they are still used widely in specialized applications because they are simple and they can be easily adapted to take advantage of the special structure of the problem at hand. ~~==Subgradient-projection methods==~~ ==Subgradient-projection and bundle methods== During the 1970s, [[Claude Lemaréchal]] and Phil. Wolfe proposed "bundle methods" of descent for problems of convex minimization.<ref> {{cite book \| last = Bertsekas \| first = Dimitri P. \| ~~authorlink~~author-link = Dimitri P. Bertsekas \| title = Nonlinear Programming \| edition = Second \| publisher = Athena Scientific \| ~~date~~year = 1999 \| ___location = Cambridge, MA. \| isbn = 1-886529-00-0 }} </ref> The meaning of the term "bundle methods" has changed significantly since that time. Modern versions and full convergence analysis were provided by Kiwiel. </ref> Contemporary bundle-methods often use "[[level set\|level]] control" rules for choosing step-sizes, developing techniques from the "subgradient-projection" method of Boris T. Polyak (1969). However, there are problems on which bundle methods offer little advantage over subgradient-projection methods.<ref> <ref> {{cite book\| last=Lemaréchal\|first=Claude\|authorlink=Claude Lemaréchal\|chapter=Lagrangian relaxation\|pages=112–156\|url=http://dx.doi.org/10.1007/3-540-45586-8_4\|title=Computational combinatorial optimization: Papers from the Spring School held in Schloß Dagstuhl, May 15–19, 2000\|editor=Michael Jünger and Denis Naddef\|series=Lecture Notes in Computer Science\|volume=2241\|publisher=Springer-Verlag\| ___location=Berlin\|year=2001\|isbn=3-540-42877-1\|id={{MR\|1900016}}.{{doi\|10.1007/3-540-45586-8_4}}\|}} </ref><ref> {{cite book\|last=Kiwiel\|first=Krzysztof\|title=Methods of Descent for Nondifferentiable Optimization\|publisher=[[Springer Verlag]]\|___location=Berlin\|year=1985\|pages=362\|isbn=978-3540156420 \|mr=0797754}} {{cite article\| last1=Kiwiel \| first1=Krzysztof C. \| last2=Larsson \|first2=Torbjörn \| last3=Lindberg \| first3=P. O. \| title=Lagrangian relaxation via ballstep subgradient methods \| journal=Mathematics of Operations Research \| volume=32 \| year=2007 \| number=3\| pages=669–686 \| month=August \| id={{MR\|2348241}}.{{doi\|10.1287/moor.1070.0261}} \| }} </ref> Contemporary bundle-methods often use "[[level set\|level]] control" rules for choosing step-sizes, developing techniques from the "subgradient-projection" method of Boris T. Polyak (1969). However, there are problems on which bundle methods offer little advantage over subgradient-projection methods.<ref name="Lem"> {{cite book\| last=Lemaréchal\|first=Claude\|author-link=Claude Lemaréchal\|chapter=Lagrangian relaxation\|pages=112–156\|title=Computational combinatorial optimization: Papers from the Spring School held in Schloß Dagstuhl, May 15–19, 2000\|editor=Michael Jünger and Denis Naddef\|series=Lecture Notes in Computer Science\|volume=2241\|publisher=Springer-Verlag\| ___location=Berlin\|year=2001\|isbn=3-540-42877-1\|mr=1900016\|doi=10.1007/3-540-45586-8_4\|s2cid=9048698 }}</ref><ref name="KLL"> {{cite journal\|last1=Kiwiel\|first1=Krzysztof C.\|last2=Larsson \|first2=Torbjörn\|last3=Lindberg\|first3=P. O.\|title=Lagrangian relaxation via ballstep subgradient methods\|url=http://rcin.org.pl/Content/139438/PDF/RB-2002-76.pdf \|journal=[[Mathematics of Operations Research]]\|volume=32\|date=August 2007\|number=3\|pages=669–686\|mr=2348241\|doi=10.1287/moor.1070.0261}} </ref> ==Constrained optimization== ===Projected subgradient=== One extension of the subgradient method is the '''projected subgradient method''', which solves the constrained [[Mathematical optimization\|optimization]] problem :minimize <math>f(x) \ </math> subject to <math display=block>x \in \mathcal{C}</math> ~~:<math>x\in\mathcal{C}</math>~~ ~~where <math>\mathcal{C}</math> is a convex set. The projected subgradient method uses the iteration~~ ~~:<math>x^{(k+1)} = P \left(x^{(k)} - \alpha_k g^{(k)} \right) </math>~~ where <math>\mathcal{C}</math> is a [[convex set]]. The projected subgradient method uses the iteration <math display=block>x^{(k+1)} = P \left(x^{(k)} - \alpha_k g^{(k)}\right)</math> where <math>P</math> is projection on <math>\mathcal{C}</math> and <math>g^{(k)}</math> is any subgradient of <math>f \ </math> at <math>x^{(k)}.</math> Line 88 ⟶ 112: The subgradient method can be extended to solve the inequality constrained problem :minimize <math>f_0(x) \ </math> subject to <math display=block>f_i (x) \leq 0,\quad i = 1,\ldots,m</math> ~~:<math>f_i (x) \leq 0,\quad i = 1,\dots,m</math>~~ where <math>f_i</math> are convex. The algorithm takes the same form as the unconstrained case <math display=block>x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)} \ </math> ~~:<math>x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)} \ </math>~~ where <math>\alpha_k>0</math> is a step size, and <math>g^{(k)}</math> is a subgradient of the objective or one of the constraint functions at <math>x. \ </math> Take <math display=block>g^{(k)} = ~~:<math>g^{(k)} =~~ \begin{cases} \partial f_0 (x) & \text{ if } f_i(x) \leq 0 \; \forall i = 1 \dots m \\ \partial f_j (x) & \text{ for some } j \text{ such that } f_j(x) > 0 \end{cases}</math> where <math>\partial f</math> denotes the [[subdifferential]] of <math>f. \ </math> If the current point is feasible, the algorithm uses an objective subgradient; if the current point is infeasible, the algorithm chooses a subgradient of any violated constraint. ==See also== where <math>\partial f</math> denotes the subdifferential of <math>f \ </math>. If the current point is feasible, the algorithm uses an objective subgradient; if the current point is infeasible, the algorithm chooses a subgradient of any violated constraint. * {{annotated link\|Stochastic gradient descent}} ==References== ~~<references/>~~ {{reflist}} ==Further reading== * {{cite book \| last = Bertsekas \| first = Dimitri P. ~~\| authorlink = Dimitri P. Bertsekas~~ \| title = Nonlinear Programming \| publisher = Athena Scientific \| ~~date~~year = 1999 \| ___location = ~~Cambridge~~Belmont, MA. \| isbn = 1-886529-00-0 }} * {{cite book \|last1=Bertsekas \|first1=Dimitri P. \|last2=Nedic \|first2=Angelia \|last3 = Ozdaglar \|first3 = Asuman \| title = Convex Analysis and Optimization \| edition = Second \| publisher = Athena Scientific \| year = 2003 \| ___location = Belmont, MA. \| isbn = 1-886529-45-0 }} * {{cite book \| last = Bertsekas \| first = Dimitri P. \| title = Convex Optimization Algorithms \| publisher = Athena Scientific \| year = 2015 \| ___location = Belmont, MA. \| isbn = 978-1-886529-28-1 }} * {{cite book \| last = Shor \| first = Naum Z. ~~\| authorlink = Naum Z. Shor~~ \| title = Minimization Methods for Non-differentiable Functions \| publisher = [[Springer-Verlag]] \| isbn = 0-387-12763-1 \| ~~date~~year = 1985 }} * {{cite book\|last=Ruszczyński\|author-link=Andrzej Piotr Ruszczyński\|first=Andrzej\|title=Nonlinear Optimization\|publisher=[[Princeton University Press]]\|___location=Princeton, NJ\|year=2006\|pages=xii+454\|isbn=978-0691119151 \|mr=2199043}} ==External links== * [http://www.stanford.edu/class/ee364/ EE364] and [http://www.stanford.edu/class/ee364b/ EE364b], a Stanford course homepage * [http://www.stanford.edu/class/ee364a/ EE364A] and [http://www.stanford.edu/class/ee364b/ EE364B], Stanford's convex optimization course sequence. ~~[[Category:Mathematical optimization]]~~ ~~[[Category:Convex optimization]]~~ {{Convex analysis and variational analysis}} ~~[[zh:次梯度法]]~~ {{optimization algorithms\|convex}} [[Category:Convex analysis]] [[Category:Convex optimization]] [[Category:Mathematical optimization]] [[Category:Optimization algorithms and methods]]