Proximal gradient methods for learning: Difference between revisions

Content deleted Content added
Citation bot (talk | contribs)
Removed proxy/dead URL that duplicated identifier. | Use this bot. Report bugs. | #UCB_CommandLine
m Reverted edit by 113.184.80.74 (talk) to last version by Esg08
Tags: Rollback Mobile edit Mobile web edit
 
(3 intermediate revisions by 3 users not shown)
Line 24:
The general form of Moreau's decomposition states that for any <math>x\in\mathcal{X}</math> and any <math>\gamma>0</math> that
:<math>x = \operatorname{prox}_{\gamma \varphi}(x) + \gamma\operatorname{prox}_{\varphi^*/\gamma}(x/\gamma),</math>
which for <math>\gamma=1</math> implies that <math>x = \operatorname{prox}_{\varphi}(x)+\operatorname{prox}_{\varphi^*}(x)</math>.<ref name=combettes /><ref name=moreau>{{cite journal|last=Moreau|first=J.-J.|title=Fonctions convexes duales et points proximaux dans un espace hilbertien|journal=Comptes Rendus de l'Académie des Sciences, Série A|year=1962|volume=255|pages=2897–2899|mr=144188|zbl=0118.10502}}</ref> The Moreau decomposition can be seen to be a generalization of the usual orthogonal decomposition of a [[vector space]], analogous with the fact that proximity operators are generalizations of projections.<ref name=combettes />
 
In certain situations it may be easier to compute the proximity operator for the conjugate <math>\varphi^*</math> instead of the function <math>\varphi</math>, and therefore the Moreau decomposition can be applied. This is the case for [[Lasso (statistics)#Group LASSO|group lasso]].
Line 34:
where <math>x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math> The <math>\ell_1</math> regularization problem is sometimes referred to as ''lasso'' ([[Lasso (statistics)|least absolute shrinkage and selection operator]]).<ref name=tibshirani /> Such <math>\ell_1</math> regularization problems are interesting because they induce '' sparse'' solutions, that is, solutions <math>w</math> to the minimization problem have relatively few nonzero components. Lasso can be seen to be a convex relaxation of the non-convex problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_0, </math>
where <math>\|w\|_0</math> denotes the <math>\ell_0</math> "norm", which is the number of nonzero entries of the vector <math>w</math>. Sparse solutions are of particular interest in learning theory for interpretability of results: a sparse solution can identify a small number of important factors.<ref name=tibshirani>{{cite journal|last=Tibshirani|first=R.|title=Regression shrinkage and selection via the lasso|journal=J. R. Stat. Soc. Ser. B|year=1996|volume=58|series=1|issue=1|pages=267–288|doi=10.1111/j.2517-6161.1996.tb02080.x }}</ref>
 
=== Solving for L<sub>1</sub> proximity operator ===
Line 79:
Note here the effective trade-off between the empirical error term <math>F(w) </math> and the regularization penalty <math>R(w)</math>. This fixed point method has decoupled the effect of the two different convex functions which comprise the objective function into a gradient descent step (<math> w^k - \gamma \nabla F\left(w^k\right)</math>) and a soft thresholding step (via <math>S_\gamma</math>).
 
Convergence of this fixed point scheme is well-studied in the literature<ref name=combettes /><ref name=daubechies /> and is guaranteed under appropriate choice of step size <math>\gamma</math> and [[loss function]] (such as the square loss taken here). [[Gradient descent#Extensions|Accelerated methods]] were introduced by Nesterov in 1983 which improve the [[rate of convergence]] under certain regularity assumptions on <math>F</math>.<ref name=nesterov>{{cite journal|last=Nesterov|first=Yurii|title=A method of solving a convex programming problem with convergence rate <math>O(1/k^2)</math>|journal=Soviet Mathematics - Doklady|year=1983|volume=27|issue=2|pages=372–376}}</ref> Such methods have been studied extensively in previous years.<ref>{{cite book|last=Nesterov|first=Yurii|title=Introductory Lectures on Convex Optimization|year=2004|publisher=Kluwer Academic Publisher}}</ref>
For more general learning problems where the proximity operator cannot be computed explicitly for some regularization term <math>R</math>, such fixed point schemes can still be carried out using approximations to both the gradient and the proximity operator.<ref name=bauschke /><ref>{{cite journal|last=Villa|first=S.|author2=Salzo, S. |author3=Baldassarre, L. |author4=Verri, A. |title=Accelerated and inexact forward-backward algorithms|journal=SIAM J. Optim.|year=2013|volume=23|issue=3|pages=1607–1633|doi=10.1137/110844805|citeseerx=10.1.1.416.3633|s2cid=11379846 }}</ref>