'''Proximal gradient''' (forward backward splitting) '''methods for learning''' is an area of research in [[optimization]] and [[statistical learning theory]] which studies algorithms for a general class of [[Convex_function#Definition|convex]] [[Regularization_(mathematics)|regularization]] problems where the regularization penalty may not be [[Differentiable_function|differentiable]]. One such example is <math>\ell_1</math> regularization (also known as Lasso) of the form
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, \quad \text{ where } x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application.<ref name=combettes>{{cite journal|last=Combettes|first=Patrick L.|coauthors=Wajs, Valérie R.|title=Signal Recovery by Proximal Forward-Backward Splitting|journal=Multiscale Model. Simul.|year=2005|volume=4|issue=4|pages=1168–1200|doi=10.1137/050626090}}</ref>
== Relevant background ==
[[Proximal gradient method]]s are applicable in a wide variety of scenarios for solving [[convex optimization]] problems of the form
:<math> \min_{x\in \mathcal{H}} F(x)+R(x),</math>
where <math>F</math> is [[Convex_function|convex]] and differentiable with [[Lipschitz continuity|Lipschitz continuous]] gradient, <math>R</math> is a convex, [[Semi-continuity|lower semicontinuous]] function which is possibly nondifferentiable, and <math>\mathcal{H}</math> is some set, typically a [[Hilbert space]]. The usual criterion that <math>x</math> minimizes <math>F(x)+R(x)</math> if and only if <math>\nabla (F+R)(x) = 0</math> in the convex, differentiable setting is now replaced by
:<math> 0\in \partial (F+R)(x), </math>
where <math>\partial \varphi</math> denotes the [[subdifferential]] of a real-valued, convex function <math> \varphi</math>.
Given a convex function <math>\varphi:\mathcal{H} \to \mathbb{R}</math>, an important operator to consider is its [[Proximal operator|proximity operator]] <math>\operatorname{prox}_{\varphi}:\mathcal{H}\to\mathcal{H}</math> defined by
:<math> \operatorname{prox}_{\varphi}(u) = \operatorname{arg}\min_{x\in\mathcal{H}} \varphi(x)+\frac{1}{2}\|u-x\|_2^2,</math>
which is well-defined because of the strict convexity of the squared <math>\ell_2</math> norm term. The proximity operator can be seen as a generalization of a [[Projection (mathematics)|projection]].<ref name=combettes />
We see that the proximity operator is important because <math> x^* </math> is a minimizer to the problem <math> \min_{x\in\mathcal{H}} F(x)+R(x)</math> if and only if
:<math>x^* = \operatorname{prox}_{\gamma R}\left(x^*-\gamma\nabla F(x^*)\right),</math> where <math>\gamma>0</math> is any positive real number.<ref name=combettes />
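As a quick illustration of this characterization, consider the toy problem <math>F(x)=\tfrac{1}{2}\|x-b\|_2^2</math> and <math>R(x)=\|x\|_1</math>, whose minimizer is <math>\operatorname{prox}_{R}(b)</math> by the very definition of the proximity operator. The following minimal NumPy sketch (the data vector <math>b</math> is illustrative) checks the fixed point equation numerically for several values of <math>\gamma</math>:
<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(x, gamma):
    # proximity operator of gamma * ||.||_1 (derived in the Lasso section below)
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

# toy problem: F(x) = 0.5 * ||x - b||_2^2 and R(x) = ||x||_1,
# so the minimizer of F + R is x_star = prox_R(b) by definition
b = np.array([2.0, 0.4, -3.0, -0.1])
x_star = soft_threshold(b, 1.0)

for gamma in [0.1, 0.5, 1.0]:
    grad_F = x_star - b  # gradient of F at the minimizer
    fixed_point = soft_threshold(x_star - gamma * grad_F, gamma)
    print(np.allclose(fixed_point, x_star))  # True for every gamma > 0
</syntaxhighlight>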
=== Moreau decomposition ===
One important technique related to proximal gradient methods is the '''Moreau decomposition,''' which decomposes the identity operator as the sum of two proximity operators.<ref name=combettes /> Namely, let <math>\varphi:\mathcal{X}\to\mathbb{R}</math> be a [[Semi-continuity|lower semicontinuous]], convex function on a vector space <math>\mathcal{X}</math>. We define its [[Convex conjugate|Fenchel conjugate]] <math>\varphi^*:\mathcal{X}\to\mathbb{R}</math> to be the function
:<math>\varphi^*(u) := \sup_{x\in\mathcal{X}} \langle x,u\rangle - \varphi(x).</math>
The general form of Moreau's decomposition states that for any <math>x\in\mathcal{X}</math> and any <math>\gamma>0</math>,
:<math>x = \operatorname{prox}_{\gamma \varphi}(x) + \gamma\operatorname{prox}_{\varphi^*/\gamma}(x/\gamma),</math>
which for <math>\gamma=1</math> implies that <math>x = \operatorname{prox}_{\varphi}(x)+\operatorname{prox}_{\varphi^*}(x)</math>.<ref name=combettes /><ref name=moreau>{{cite journal|last=Moreau|first=J.-J.|title=Fonctions convexes duales et points proximaux dans un espace hilbertien|journal=C. R. Acad. Sci. Paris Sér. A Math.|year=1962|volume=255|pages=2897–2899}}</ref>
In certain situations it may be easier to compute the proximity operator for the conjugate <math>\varphi^*</math> instead of the function <math>\varphi</math>, and therefore the Moreau decomposition can be applied. This is the case for [[#Group lasso|group lasso]].
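As a concrete illustration, the decomposition can be checked numerically for <math>\varphi = \|\cdot\|_1</math>, whose conjugate is the indicator function of the <math>\ell_\infty</math> unit ball, so that <math>\operatorname{prox}_{\varphi^*/\gamma}</math> is simply the Euclidean projection onto that ball. A minimal NumPy sketch (the helper names are illustrative, not from the cited references):
<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(x, gamma):
    # prox of gamma * ||.||_1: entrywise soft thresholding
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def project_linf_ball(z):
    # prox of the conjugate of ||.||_1 (an indicator function),
    # i.e. Euclidean projection onto {z : ||z||_inf <= 1}
    return np.clip(z, -1.0, 1.0)

rng = np.random.default_rng(0)
x = rng.normal(size=6)
gamma = 0.7

# Moreau decomposition: x = prox_{gamma*phi}(x) + gamma * prox_{phi*/gamma}(x/gamma)
reconstruction = soft_threshold(x, gamma) + gamma * project_linf_ball(x / gamma)
print(np.allclose(x, reconstruction))  # True
</syntaxhighlight>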
== Lasso regularization ==
Consider the [[Regularization (mathematics)|regularized]] [[empirical risk minimization]] problem with square loss and with the <math>\ell_1</math> norm as the regularization penalty:
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, </math>
where <math>x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math> The <math>\ell_1</math> regularization problem is sometimes referred to as ''lasso'' ([[Lasso (statistics)|least absolute shrinkage and selection operator]]).<ref name=tibshirani /> Such <math>\ell_1</math> regularization problems are interesting because they induce ''sparse'' solutions, that is, solutions <math>w</math> to the minimization problem that have relatively few nonzero components. Lasso can be seen to be a convex relaxation of the non-convex problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_0, </math>
where <math>\|w\|_0</math> denotes the <math>\ell_0</math> "norm", which is the number of nonzero entries of the vector <math>w</math>. Sparse solutions are of particular interest in learning theory for interpretability of results: a sparse solution can identify a small number of important factors.<ref name=tibshirani>{{cite journal|last=Tibshirani|first=R.|title=Regression shrinkage and selection via the lasso|journal=J. R. Stat. Soc. Ser. B|year=1996|volume=58|issue=1|pages=267–288}}</ref>
=== Solving for ''L''<sub>1</sub> proximity operator ===
For simplicity we restrict our attention to the problem where <math>\lambda=1</math>. To solve the problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \|w\|_1, </math>
we consider our objective function in two parts: a convex, differentiable term <math>F(w) = \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2</math> and a convex function <math>R(w) = \|w\|_1</math>. Note that <math>R</math> is not strictly convex.
To compute the proximity operator of <math>R</math>, we first derive a convenient characterization: for any <math>\gamma>0</math>,
:<math>
\begin{align}
u = \operatorname{prox}_{\gamma R}(x) \iff {} & 0\in \partial \left(\gamma R(u)+\tfrac{1}{2}\|u-x\|_2^2\right)\\
\iff {} & 0\in \gamma\partial R(u) + u-x\\
\iff {} & x-u\in \gamma\partial R(u).
\end{align}
</math>
For <math>R(w) = \|w\|_1</math> it is easy to compute <math>\partial R(w)</math>: the <math>i</math>th entry of <math>\partial R(w)</math> is precisely
:<math> \partial |w_i| = \begin{cases}
1,&w_i>0\\
-1,&w_i<0\\
\left[-1,1\right],&w_i = 0.
\end{cases}</math>
Using the recharacterization of the proximity operator given above, for the choice of <math>R(w) = \|w\|_1</math> and <math>\gamma>0</math> we have that <math>\operatorname{prox}_{\gamma R}(x)</math> is defined entrywise by
::<math>\left(\operatorname{prox}_{\gamma R}(x)\right)_i = \begin{cases}
x_i-\gamma,&x_i>\gamma\\
0,&|x_i|\leq\gamma\\
x_i+\gamma,&x_i<-\gamma,
\end{cases}</math>
which is known as the [[soft thresholding]] operator <math>S_{\gamma}(x)=\operatorname{prox}_{\gamma \|\cdot\|_1}(x)</math>.
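In code, this operator reduces to an entrywise computation; a minimal NumPy sketch (the function name is illustrative):
<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(x, gamma):
    # entrywise prox of gamma * ||.||_1:
    #   x_i - gamma  if x_i >  gamma
    #   0            if |x_i| <= gamma
    #   x_i + gamma  if x_i < -gamma
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

x = np.array([2.5, 0.3, -1.0, -0.05])
print(soft_threshold(x, 0.5))  # entries with |x_i| <= 0.5 are set to zero
</syntaxhighlight>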
=== Fixed point iterative schemes ===
To finally solve the lasso problem we consider the fixed point equation shown earlier:
:<math>x^* = \operatorname{prox}_{\gamma R}\left(x^*-\gamma\nabla F(x^*)\right).</math>
Given that we have computed the form of the proximity operator explicitly, we can define a standard fixed point iteration procedure. Namely, fix some initial <math>w^0\in\mathbb{R}^d</math>, and for <math>k=0,1,2,\ldots</math> define
:<math>w^{k+1} = S_{\gamma}\left(w^k - \gamma \nabla F\left(w^k\right)\right).</math>
Note here the effective trade-off between the empirical error term <math>F(w)</math> and the regularization penalty <math>R(w)</math>: the fixed point iteration decouples the two convex terms of the objective into a gradient descent step on <math>F</math> (computing <math>w^k - \gamma \nabla F\left(w^k\right)</math>) followed by a soft thresholding step (applying <math>S_{\gamma}</math>).
Convergence of this fixed point scheme is well-studied in the literature<ref name=combettes /><ref name=daubechies /> and is guaranteed under appropriate choice of step size <math>\gamma</math> and [[loss function]] (such as the square loss taken here). [[Gradient descent#Extensions|Accelerated methods]], introduced by Nesterov in 1983, improve the [[rate of convergence]] under certain regularity assumptions on <math>F</math>.
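For the square loss above, <math>\nabla F(w) = \tfrac{2}{n}X^\top(Xw - y)</math>, and any fixed step size <math>\gamma \le n/(2\|X\|_2^2)</math> (the reciprocal of the Lipschitz constant of <math>\nabla F</math>) guarantees convergence; the resulting scheme is commonly called the iterative soft-thresholding algorithm. A minimal NumPy sketch (the function names, iteration count and synthetic data are illustrative):
<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(x, gamma):
    # entrywise prox of gamma * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def ista_lasso(X, y, lam=1.0, n_iter=500):
    # proximal gradient (forward-backward) iteration for
    #   min_w (1/n) * ||y - X w||_2^2 + lam * ||w||_1
    n, d = X.shape
    # gradient of F(w) = (1/n)||y - Xw||^2 is (2/n) X^T (Xw - y);
    # its Lipschitz constant is (2/n)||X||_2^2, so this step size is safe
    gamma = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w = soft_threshold(w - gamma * grad, gamma * lam)
    return w

# small synthetic example with a sparse ground truth
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=100)
print(ista_lasso(X, y, lam=0.1))  # most entries come out exactly zero
</syntaxhighlight>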
For more general learning problems where the proximity operator cannot be computed explicitly for some regularization term <math>R</math>, such fixed point schemes can still be carried out using approximations to both the gradient and the proximity operator.<ref name=bauschke /><ref>{{cite journal|last=Villa|first=S.|coauthors=Salzo, S., Baldassarre, L., and Verri, A.|title=Accelerated and inexact forward-backward algorithms|journal=SIAM J. Optim.|year=2013|volume=23|issue=3|pages=1607–1633|doi=10.1137/110844805}}</ref>
== Practical considerations ==
There have been numerous developments within the past decade in [[convex optimization]] techniques which have influenced the application of proximal gradient methods in statistical learning theory. Here we survey a few important topics which can greatly improve practical algorithmic performance of these methods.<ref name=structSparse /><ref name=bach>{{cite journal|last=Bach|first=F.|coauthors=Jenatton, R., Mairal, J., and Obozinski, G.|title=Optimization with sparsity-inducing penalties|journal=Found. & Trends Mach. Learn.|year=2011|volume=4|issue=1|pages=1–106|doi=10.1561/2200000015}}</ref>
=== Adaptive step size ===
In the fixed point iteration scheme
:<math>w^{k+1} = \operatorname{prox}_{\gamma R}\left(w^k-\gamma \nabla F\left(w^k\right)\right),</math>
one can allow variable step size <math>\gamma_k</math> instead of a constant <math>\gamma</math>. Numerous adaptive step size schemes have been proposed throughout the literature.<ref name=combettes /><ref name=bauschke /> Such schemes can offer a substantial improvement in the number of iterations required for fixed point convergence.
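One common strategy is backtracking: a trial step size is shrunk until a sufficient-decrease (quadratic upper bound) condition on the smooth term holds at the candidate update. The sketch below shows one such rule for the lasso objective (the condition, the halving factor and the function names are one standard choice, not drawn from the cited references):
<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(x, gamma):
    # entrywise prox of gamma * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_grad_step_backtracking(w, F, grad_F, lam, gamma0=1.0, beta=0.5):
    """One proximal gradient step for F(w) + lam * ||w||_1 with backtracking.

    The trial step gamma is multiplied by beta until the candidate p satisfies
        F(p) <= F(w) + <grad_F(w), p - w> + ||p - w||^2 / (2 * gamma),
    which is guaranteed to happen once gamma is below 1/L for L-Lipschitz grad_F.
    """
    g = grad_F(w)
    gamma = gamma0
    while True:
        p = soft_threshold(w - gamma * g, gamma * lam)
        diff = p - w
        if F(p) <= F(w) + g @ diff + diff @ diff / (2.0 * gamma):
            return p, gamma
        gamma *= beta
</syntaxhighlight>
In practice the accepted step size from one iteration is often reused (or slightly enlarged) as the trial step for the next, so that little backtracking is needed once a good scale has been found.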
=== Elastic net (mixed norm regularization) ===
[[Elastic_net_regularization|Elastic net regularization]] offers an alternative to pure <math>\ell_1</math> regularization. The problem of lasso (<math>\ell_1</math>) regularization involves the penalty term <math>R(w) = \|w\|_1</math>, which is not strictly convex. Hence, solutions to <math>\min_w F(w) + R(w),</math> where <math>F</math> is some empirical loss function, need not be unique. This is often avoided by the inclusion of an additional strictly convex term, such as an <math>\ell_2</math> norm regularization penalty. For example, one can consider the problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2^2\right), </math>
where <math>x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
For <math>0<\mu\leq 1</math> the penalty term <math>\lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2^2\right)</math> is now strictly convex, and hence the minimization problem now admits a unique solution. It has been observed that for sufficiently small <math>\mu > 0</math>, the additional penalty term <math>\mu \|w\|_2^2</math> acts as a preconditioner and can substantially improve convergence while not adversely affecting the sparsity of solutions.<ref name=structSparse /><ref name=deMolElasticNet>{{cite journal|last=De Mol|first=C.|coauthors=De Vito, E., and Rosasco, L.|title=Elastic-net regularization in learning theory|journal=J. Complexity|year=2009|volume=25|issue=2|pages=201–230|doi=10.1016/j.jco.2009.01.002}}</ref>
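The proximity operator of the combined penalty still has a closed form: the same subdifferential argument as for lasso shows that one soft-thresholds at level <math>\gamma\lambda(1-\mu)</math> and then rescales by <math>1/(1+2\gamma\lambda\mu)</math>, so the fixed point scheme above applies essentially unchanged. A minimal NumPy sketch (function names illustrative):
<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(x, gamma):
    # entrywise prox of gamma * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_elastic_net(x, gamma, lam, mu):
    # prox of gamma * lam * ((1 - mu) * ||w||_1 + mu * ||w||_2^2):
    # soft-threshold at gamma*lam*(1-mu), then shrink by 1/(1 + 2*gamma*lam*mu)
    return soft_threshold(x, gamma * lam * (1.0 - mu)) / (1.0 + 2.0 * gamma * lam * mu)

x = np.array([2.5, 0.3, -1.0])
print(prox_elastic_net(x, gamma=1.0, lam=0.5, mu=0.2))
</syntaxhighlight>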
== Exploiting group structure ==
Proximal gradient methods provide a general framework which is applicable to a wide variety of problems in [[statistical learning theory]]. Certain problems in learning often involve data which has additional structure that is known ''a priori''. In the past several years there have been new developments which incorporate information about group structure to provide methods which are tailored to different applications. Here we survey a few such methods.
=== Group lasso ===
Group lasso is a generalization of the [[#Lasso regularization|lasso method]] when features are grouped into disjoint blocks.<ref name=groupLasso>{{cite journal|last=Yuan|first=M.|coauthors=Lin, Y.|title=Model selection and estimation in regression with grouped variables|journal=J. R. Stat. Soc. B|year=2006|volume=68|issue=1|pages=49–67|doi=10.1111/j.1467-9868.2005.00532.x}}</ref> Suppose the features are grouped into blocks <math>\{w_1,\ldots,w_G\}</math>. Here we take as a regularization penalty
:<math>R(w) =\sum_{g=1}^G \|w_g\|_2,</math>
which is the sum of the <math>\ell_2</math> norms of the feature vectors for the different groups. A proximity operator analysis similar to the one above can be used to compute the proximity operator for this penalty. Where the lasso penalty has a proximity operator which is soft thresholding on each individual component, the proximity operator for the group lasso is soft thresholding applied groupwise. For the group <math>w_g</math>, the proximity operator of <math>\lambda\gamma\left(\sum_{g=1}^G \|w_g\|_2\right)</math> is given by
:<math>\widetilde{S}_{\lambda\gamma}(w_g) = \begin{cases}
w_g-\lambda\gamma \dfrac{w_g}{\|w_g\|_2}, & \|w_g\|_2>\lambda\gamma\\
0, & \|w_g\|_2\leq \lambda\gamma,
\end{cases}</math>
where <math>w_g</math> is the <math>g</math>th group.
In contrast to lasso, the derivation of the proximity operator for group lasso relies on the [[#Moreau decomposition|Moreau decomposition]]: the proximity operator of the conjugate of the group lasso penalty becomes a projection onto the [[Ball (mathematics)|ball]] of a [[dual norm]].<ref name=structSparse />
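A minimal NumPy sketch of this blockwise soft thresholding, for features partitioned into a list of index groups (the partitioning interface and names are illustrative):
<syntaxhighlight lang="python">
import numpy as np

def prox_group_lasso(w, groups, threshold):
    # blockwise soft thresholding: prox of threshold * sum_g ||w_g||_2
    # for disjoint index groups; threshold plays the role of lambda * gamma
    out = np.zeros_like(w)
    for g in groups:
        norm_g = np.linalg.norm(w[g])
        if norm_g > threshold:
            out[g] = (1.0 - threshold / norm_g) * w[g]
        # otherwise the whole group is set exactly to zero
    return out

w = np.array([3.0, 4.0, 0.1, -0.2, 1.0])
groups = [[0, 1], [2, 3], [4]]
print(prox_group_lasso(w, groups, threshold=1.0))
</syntaxhighlight>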
=== Other group structures ===
In contrast to the group lasso problem, where features are grouped into disjoint blocks, it may be the case that grouped features are overlapping or have a nested structure. Such generalizations of group lasso have been considered in a variety of contexts.<ref name=structSparse /><ref name=bach />
== See also ==
* [[Proximal gradient method]]
* [[Regularization (mathematics)#Other uses of regularization in statistics and machine learning|Regularization]]
* [[:Category:First_order_methods|First order methods]]
* [[Convex analysis]]
== References ==
{{reflist}}
[[Category:Convex optimization]]
[[Category:Machine learning]]