Proximal gradient method
{{Short description|Form of projection}}
{{more footnotes|date=November 2013}}
 
'''Proximal gradient methods''' are a generalized form of projection used to solve non-differentiable [[convex optimization]] problems.
[[File:Frank_Wolfe_vs_Projected_Gradient.webm|thumb|A comparison between the iterates of the projected gradient method (in red) and the [[Frank–Wolfe algorithm|Frank-Wolfe method]] (in green).]]
 
Many interesting problems can be formulated as convex optimization problems of the form
 
<math>
\min_{\mathbf{x} \in \mathbb{R}^N} \sum_{i=1}^n f_i(\mathbf{x})
</math>
 
where <math>f_i: \mathbb{R}^N \rightarrow \mathbb{R},\ i = 1, \dots, n</math> are possibly non-differentiable [[convex functions]]. The lack of differentiability rules out conventional smooth optimization techniques like the [[Gradient descent|steepest descent method]] and the [[conjugate gradient method]], but proximal gradient methods can be used instead.
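As an illustration of how such a method proceeds, the following sketch (an assumed example, not taken from this article) minimizes a sum of a smooth term <math>\tfrac{1}{2}\|Ax-b\|_2^2</math> and a non-differentiable term <math>\lambda\|x\|_1</math> by alternating a gradient step on the smooth part with a proximity step on the non-smooth part. The names <code>A</code>, <code>b</code>, and <code>lam</code> are choices made for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, step, n_iter=500):
    """Minimize 0.5*||A x - b||^2 + lam*||x||_1 by proximal gradient steps."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                         # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)  # proximity step on the l1 part
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = A @ np.array([1.0, 0.0, -2.0, 0.0, 0.5])   # noiseless measurements of a sparse vector
step = 1.0 / np.linalg.norm(A, 2) ** 2         # 1/L, with L the Lipschitz constant of the gradient
x_hat = proximal_gradient(A, b, lam=0.1, step=step)
```

The step size <math>1/L</math>, where <math>L</math> is the Lipschitz constant of the gradient of the smooth term, is a standard choice guaranteeing convergence of such schemes.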
 
Proximal gradient methods start with a splitting step, in which the functions <math>f_1, \dots, f_n</math> are used individually so as to yield an easily [[wikt:implementable|implementable]] algorithm. They are called [[proximal]] because each non-differentiable function among <math>f_1, \dots, f_n</math> is involved via its [[Proximal operator|proximity operator]]. Well-known examples include the iterative shrinkage-thresholding algorithm, the projected [[Landweber iteration|Landweber]] method, projected gradient, and [[alternating projection]]s.
 
For the theory of proximal gradient methods from the perspective of and with applications to [[statistical learning theory]], see [[proximal gradient methods for learning]].
 
== Notations and terminology ==
Let <math>\mathbb{R}^N</math>, the <math>N</math>-dimensional [[Euclidean space]], be the ___domain of the function
<math> f: \mathbb{R}^N \rightarrow (-\infty,+\infty]</math>. Suppose <math>C</math> is a non-empty
convex subset of <math>\mathbb{R}^N</math>. Then, the indicator function of <math>C</math> is defined as
 
: <math> \iota_C : x \mapsto
\begin{cases}
0 & \text{if } x \in C \\
+ \infty & \text{if } x \notin C
\end{cases}
</math>
 
The <math>p</math>-norm <math>\| \cdot \|_p</math> is defined as
 
: <math>
\|x\|_p = ( |x_1|^p + |x_2|^p + \cdots + |x_N|^p )^{1/p}
</math>
 
The distance from <math>x \in \mathbb{R}^N</math> to <math>C</math> is defined as
 
: <math>
D_C(x) = \min_{y \in C} \|x - y\|_2
</math>
 
If <math>C</math> is closed and convex, the projection of <math>x \in \mathbb{R}^N</math> onto <math>C</math> is the unique point
<math>P_Cx \in C</math> such that <math>D_C(x) = \| x - P_Cx \|_2 </math>.
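As a small numeric illustration of these definitions (the choice of <math>C</math> here is an assumption made for the example), take <math>C</math> to be the closed unit ball in <math>\mathbb{R}^2</math>, for which the projection has the closed form <math>P_C(x) = x / \max(1, \|x\|_2)</math>.

```python
import numpy as np

def project_unit_ball(x):
    """Projection of x onto the closed unit ball {y : ||y||_2 <= 1}."""
    return x / max(1.0, np.linalg.norm(x))

def dist_to_unit_ball(x):
    """Distance D_C(x) from x to the unit ball."""
    return max(0.0, np.linalg.norm(x) - 1.0)

x = np.array([3.0, 4.0])      # ||x||_2 = 5, so x lies outside the ball
p = project_unit_ball(x)      # the unique closest point of C, on the boundary
# By definition, D_C(x) equals ||x - P_C(x)||_2
```

Here the distance <math>D_C(x) = 5 - 1 = 4</math> agrees with <math>\|x - P_C x\|_2</math>, as the definition requires.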
 
The [[subdifferential]] of <math>f</math> at <math>x</math> is given by
 
: <math>
\partial f(x) = \{ u \in \mathbb{R}^N \mid \forall y \in \mathbb{R}^N,\ (y-x)^\mathrm{T}u+f(x) \leq f(y) \}
</math>
 
== Projection onto convex sets (POCS) ==
One simple case is the problem of finding a point in the intersection of closed convex sets <math>C_1, \dots, C_n</math>; the method of projections onto convex sets generates iterates by applying the projections in sequence:

: <math>
x_{k+1} = P_{C_1} P_{C_2} \cdots P_{C_n} x_k
</math>
However, beyond such feasibility problems [[projection operator]]s are not appropriate, and more general operators are required. Among the various generalizations of the notion of a convex projection operator that exist, proximity operators are best suited for this purpose.
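The composed-projection iteration above can be sketched as follows; the two sets (a halfspace and a ball) are assumptions chosen for this illustration, not taken from the article.

```python
import numpy as np

# POCS sketch: find a point in the intersection of
# C1 = {x : x[0] >= 1} (a halfspace) and C2 = {x : ||x||_2 <= 2} (a ball),
# by composing the two projections at each iteration.

def P1(x):
    """Projection onto the halfspace {x : x[0] >= 1}."""
    y = x.copy()
    y[0] = max(y[0], 1.0)
    return y

def P2(x):
    """Projection onto the ball of radius 2 centered at the origin."""
    n = np.linalg.norm(x)
    return x if n <= 2.0 else 2.0 * x / n

x = np.array([-3.0, 5.0])     # starting point outside both sets
for _ in range(100):
    x = P1(P2(x))             # x_{k+1} = P_{C1} P_{C2} x_k
# x now lies (up to numerical tolerance) in both C1 and C2
```

Because the two sets here have a non-empty intersection, the composed projections settle at a common point of both sets; this is the standard convergence behavior of POCS for closed convex sets.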
 
== Definition ==
The [[proximal operator|proximity operator]] of a convex function <math>f</math> at <math>x</math> is defined as the unique solution to
:<math>
\operatorname{argmin}\limits_y \bigg( f(y) + \frac{1}{2} \left\| x-y \right\|_2^2 \bigg)
</math>
and is denoted <math>\operatorname{prox}_f(x)</math>. This defines an operator

: <math>
\operatorname{prox}_f : \mathbb{R}^N \rightarrow \mathbb{R}^N.
</math>
 
Note that in the specific case where <math>f</math> is the indicator function <math>\iota_C</math> of some convex set <math>C</math>
 
: <math>
\begin{align}
\operatorname{prox}_{\iota_C}(x)
&= \operatorname{argmin}\limits_y
\begin{cases}
\frac{1}{2} \left\| x-y \right\|_2^2 & \text{if } y \in C \\
+ \infty & \text{if } y \notin C
\end{cases} \\
&=\operatorname{argmin}\limits_{y \in C} \frac{1}{2} \left\| x-y \right\|_2^2 \\
&= P_C(x)
\end{align}
</math>
 
showing that the proximity operator is indeed a generalisation of the projection operator.
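This correspondence can be checked numerically. In the sketch below (an illustration with assumed choices, not from the article), <math>C = [0,1]</math> so that the prox of the indicator <math>\iota_C</math> is simply clipping, i.e. the projection onto <math>C</math>; by contrast, for the genuinely non-smooth <math>f(y) = \lambda |y|</math> the prox is the closed-form soft-thresholding map.

```python
import numpy as np

def prox_indicator_interval(x, lo=0.0, hi=1.0):
    """Prox of the indicator of [lo, hi]: argmin over y in C of 0.5*(x-y)^2,
    which is exactly the projection (clipping) onto the interval."""
    return float(np.clip(x, lo, hi))

def prox_abs(x, lam):
    """Prox of f(y) = lam*|y|: soft-thresholding, shrinking x toward 0 by lam."""
    return float(np.sign(x)) * max(abs(x) - lam, 0.0)

p1 = prox_indicator_interval(1.7)   # projection of 1.7 onto [0, 1]
p2 = prox_abs(1.7, 0.5)             # 1.7 shrunk toward zero by 0.5
```

The first operator ignores how far <math>x</math> is from the set and simply projects, while the second moves <math>x</math> a bounded distance toward the minimizer of <math>f</math>; both are instances of the single prox definition above.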
 
The proximity operator of <math>f</math> is characterized by the inclusion
 
: <math>
p=\operatorname{prox}_f(x) \Leftrightarrow x-p \in \partial f(p) \qquad (\forall(x,p) \in \mathbb{R}^N \times \mathbb{R}^N)
</math>
 
If <math>f</math> is differentiable, then the above inclusion reduces to the equation
 
: <math>
p=\operatorname{prox}_f(x) \Leftrightarrow x-p = \nabla f(p) \quad (\forall(x,p) \in \mathbb{R}^N \times \mathbb{R}^N)
</math>
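A worked check of the differentiable case (the choice of <math>f</math> is an assumption made for this example): for <math>f(p) = \tfrac{1}{2}p^2</math> we have <math>\nabla f(p) = p</math>, so the condition <math>x - p = \nabla f(p)</math> gives <math>p = x/2</math>; the prox simply halves its argument.

```python
def prox_half_square(x):
    """Prox of f(p) = 0.5*p**2, i.e. argmin_y 0.5*y**2 + 0.5*(x - y)**2.
    Setting the derivative y + (y - x) to zero gives y = x/2."""
    return x / 2.0

p = prox_half_square(3.0)
# optimality condition of the differentiable case: x - p == grad f(p) == p
residual = (3.0 - p) - p
```
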
 
== Examples ==
Special instances of proximal gradient methods include the projected [[Landweber iteration|Landweber]] method, the projected gradient method, [[alternating projection]]s, and the [[alternating direction method of multipliers]].
 
== See also ==
* [[Alternating projection]]
* [[Convex optimization]]
* [[Frank–Wolfe algorithm]]
* [[Proximal operator]]
* [[Proximal gradient methods for learning]]
 
== Notes ==