{{Short description|Form of projection}}
{{more footnotes|date=November 2013}}
'''Proximal gradient methods''' are a generalized form of projection used to solve non-differentiable [[convex optimization]] problems.
[[File:Frank_Wolfe_vs_Projected_Gradient.webm|thumb|A comparison between the iterates of the projected gradient method (in red) and the [[Frank–Wolfe algorithm|Frank–Wolfe method]] (in green).]]
Many interesting problems can be formulated as convex optimization problems of the form
<math>
\min_{x \in \mathbb{R}^d} \sum_{i=1}^n f_i(x)
</math>
where <math>f_i: \mathbb{R}^d \rightarrow \mathbb{R},\ i = 1, \dots, n</math> are possibly non-differentiable [[convex functions]]. The lack of differentiability rules out conventional smooth optimization techniques like the [[Gradient descent|steepest descent method]] and the [[conjugate gradient method]], but proximal gradient methods can be used instead.
Proximal gradient methods start with a splitting step, in which the functions <math>f_1, \dots, f_n</math> are used individually so as to yield an easily [[wikt:implementable|implementable]] algorithm. They are called [[proximal]] because each non-differentiable function among <math>f_1, \dots, f_n</math> is involved via its [[Proximal operator|proximity operator]]. The iterative shrinkage-thresholding algorithm,<ref>
{{cite journal |last1=Daubechies |first1=I |last2=Defrise |first2=M |last3=De Mol |first3=C |author3-link=Christine De Mol |title=An iterative thresholding algorithm for linear inverse problems with a sparsity constraint |journal=Communications on Pure and Applied Mathematics |volume=57 |issue=11 |year=2004 |pages=1413–1457 |bibcode=2003math......7152D |arxiv=math/0307152 |doi=10.1002/cpa.20042}}</ref> [[Landweber iteration|projected Landweber]], projected gradient, [[alternating projection]]s, [[Alternating direction method of multipliers#Alternating direction method of multipliers|alternating-direction method of multipliers]], and alternating split [[Bregman method|Bregman]] are special instances of proximal algorithms. Details of proximal methods are discussed in Combettes and Pesquet.<ref>
{{cite arXiv |last1=Combettes |first1=Patrick L. |last2=Pesquet |first2=Jean-Christophe |title=Proximal Splitting Methods in Signal Processing |year=2009 |arxiv=0912.3522}}</ref> For the theory of proximal gradient methods from the perspective of and with applications to [[statistical learning theory]], see [[proximal gradient methods for learning]].
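As an illustrative sketch (not part of the original references), the iterative shrinkage-thresholding algorithm mentioned above alternates a gradient step on a smooth term with the proximity operator of the <math>\ell_1</math> norm, which is soft thresholding. The problem data, parameter values, and function names below are assumptions chosen for the example:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximity operator of t*||.||_1: shrink each coordinate toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, step, iters=500):
    # Proximal gradient (ISTA) for min 0.5*||Ax - b||^2 + lam*||x||_1:
    # a gradient step on the smooth least-squares term, followed by
    # the prox step (soft thresholding) on the non-differentiable l1 term.
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = np.array([1.0, 0.0, -2.0, 0.0, 0.0])
b = A @ x_true
step = 1.0 / np.linalg.norm(A, 2) ** 2   # step = 1/L, L = ||A||_2^2
x_hat = ista(A, b, lam=0.1, step=step)
```

With noiseless data, the iterates recover the sparse vector up to the small bias introduced by the <math>\ell_1</math> penalty.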
== Projection onto convex sets (POCS) ==
One widely used algorithm of this type is the projection onto convex sets (POCS) method, which recovers a point satisfying several convex constraints simultaneously, each constraint modeled by a non-empty closed [[convex set]] <math>C_i</math>. The method projects onto the constraint sets cyclically:
<math>
x_{k+1} = P_{C_1} P_{C_2} \cdots P_{C_n} x_k
</math>
where <math>P_{C_i}</math> denotes the projection onto <math>C_i</math>.
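As a hedged numeric sketch of the POCS iteration above, consider two sets with closed-form projections, chosen here purely for illustration: the unit ball and a hyperplane. Alternating the two projections converges to a point in their intersection:

```python
import numpy as np

def project_ball(x, r=1.0):
    # Euclidean projection onto the ball {x : ||x|| <= r}.
    n = np.linalg.norm(x)
    return x if n <= r else x * (r / n)

def project_hyperplane(x, a, c):
    # Euclidean projection onto the hyperplane {x : <a, x> = c}.
    return x - (a @ x - c) / (a @ a) * a

# POCS iteration x_{k+1} = P_{C1} P_{C2} x_k with C1 the unit ball
# and C2 = {x : x_0 = 0.8} (illustrative choice of constraint sets).
a = np.array([1.0, 0.0])
x = np.array([3.0, 4.0])
for _ in range(100):
    x = project_ball(project_hyperplane(x, a, 0.8))

in_ball = np.linalg.norm(x) <= 1.0 + 1e-9
on_plane = abs(x[0] - 0.8) < 1e-6
```

After a modest number of iterations the iterate lies (to numerical tolerance) in both sets simultaneously.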
However, beyond such problems [[projection operator]]s are not appropriate, and more general operators are required to tackle them. Among the various generalizations of the notion of a convex projection operator that exist, proximity operators are best suited for this purpose.
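The proximity operator <math>\operatorname{prox}_f(v) = \arg\min_x \, f(x) + \tfrac{1}{2}\|x - v\|^2</math> generalizes projection: when <math>f</math> is the indicator function of a convex set, it reduces exactly to the projection onto that set. The following sketch (example vectors and function names are assumptions) contrasts that case with the proximity operator of the <math>\ell_1</math> norm:

```python
import numpy as np

def prox_l1(v, t):
    # Closed form of argmin_x t*||x||_1 + 0.5*||x - v||^2: soft thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_indicator_ball(v, r=1.0):
    # For f = indicator of {x : ||x|| <= r}, the proximity operator is
    # exactly the Euclidean projection onto the ball.
    n = np.linalg.norm(v)
    return v if n <= r else v * (r / n)

v = np.array([3.0, -4.0])
p_ball = prox_indicator_ball(v)   # projection of (3, -4) onto the unit ball
p_l1 = prox_l1(v, 1.0)            # each coordinate shrunk toward 0 by 1
```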
== Examples ==
*[[Alternating projection]]
*[[Alternating direction method of multipliers#Alternating direction method of multipliers|Alternating-direction method of multipliers]]
== See also ==
* [[Frank–Wolfe algorithm]]
* [[Proximal operator]]
* [[Proximal gradient methods for learning]]
== Notes ==
{{Reflist}}

== References ==
* {{cite book |last1=Combettes |first1=Patrick L. |last2=Pesquet |first2=Jean-Christophe |chapter=Proximal Splitting Methods in Signal Processing |title=Fixed-Point Algorithms for Inverse Problems in Science and Engineering |series=Springer Optimization and Its Applications |volume=49 |publisher=Springer |year=2011}}
==External links==
* Stephen Boyd and Lieven Vandenberghe, ''Convex Optimization'' (book)
* [https://github.com/kul-forbes/ProximalOperators.jl ProximalOperators.jl]: a [[Julia (programming language)|Julia]] package implementing proximal operators.
* [https://github.com/kul-forbes/ProximalAlgorithms.jl ProximalAlgorithms.jl]: a [[Julia (programming language)|Julia]] package implementing algorithms based on the proximal operator, including the proximal gradient method.