Proximal gradient methods for learning

:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, \quad \text{ where } x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
 
Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific application.<ref name=combettes>{{cite journal|last=Combettes|first=Patrick L.|author2=Wajs, Valérie R. |title=Signal Recovery by Proximal Forward-Backward Splitting|journal=Multiscale Model. Simul.|year=2005|volume=4|issue=4|pages=1168–1200|doi=10.1137/050626090|s2cid=15064954|url=https://semanticscholar.org/paper/56974187b4d9a8757f4d8a6fd6facc8b4ad08240}}</ref><ref name=structSparse>{{cite book|last=Mosci|first=S.|author2=Rosasco, L. |author3=Matteo, S. |author4=Verri, A. |author5=Villa, S. |title=Machine Learning and Knowledge Discovery in Databases |chapter=Solving Structured Sparsity Regularization with Proximal Methods|year=2010|volume=6322|pages=418–433 |doi=10.1007/978-3-642-15883-4_27|series=Lecture Notes in Computer Science|isbn=978-3-642-15882-7|doi-access=free}}</ref> Such customized penalties can help to induce certain structure in problem solutions, such as ''sparsity'' (in the case of [[Lasso (statistics)|lasso]]) or ''group structure'' (in the case of [[Lasso (statistics)#Group LASSO|group lasso]]).
 
== Relevant background ==
The general form of Moreau's decomposition states that for any <math>x\in\mathcal{X}</math> and any <math>\gamma>0</math>,
:<math>x = \operatorname{prox}_{\gamma \varphi}(x) + \gamma\operatorname{prox}_{\varphi^*/\gamma}(x/\gamma),</math>
which for <math>\gamma=1</math> implies that <math>x = \operatorname{prox}_{\varphi}(x)+\operatorname{prox}_{\varphi^*}(x)</math>.<ref name=combettes /><ref name=moreau>{{cite journal|last=Moreau|first=J.-J.|title=Fonctions convexes duales et points proximaux dans un espace hilbertien|journal=Comptes Rendus de l'Académie des Sciences, Série A|year=1962|volume=255|pages=2897–2899|mr=144188|zbl=0118.10502}}</ref> The Moreau decomposition can be seen as a generalization of the usual orthogonal decomposition of a [[vector space]], analogous to the way proximity operators generalize projections.<ref name=combettes />
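For example, if <math>\varphi=\|\cdot\|_1</math>, then <math>\varphi^*</math> is the indicator function of the unit <math>\ell_\infty</math> ball, so <math>\operatorname{prox}_{\gamma\varphi}</math> is soft thresholding and <math>\operatorname{prox}_{\varphi^*/\gamma}</math> is the projection onto that ball. The following NumPy sketch (a minimal illustration; the function names are not taken from the literature) checks the decomposition numerically:

<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(v, tau):
    # prox of tau*||.||_1: component-wise soft thresholding
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def project_linf_ball(v):
    # prox of the conjugate of ||.||_1 (indicator of the unit l_inf ball),
    # i.e. Euclidean projection onto that ball
    return np.clip(v, -1.0, 1.0)

x = np.array([2.0, -0.3, 0.9, -1.5])
gamma = 0.7

# Moreau decomposition: x = prox_{gamma*phi}(x) + gamma * prox_{phi*/gamma}(x/gamma)
lhs = soft_threshold(x, gamma) + gamma * project_linf_ball(x / gamma)
print(np.allclose(lhs, x))  # True
</syntaxhighlight>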
 
In certain situations it may be easier to compute the proximity operator for the conjugate <math>\varphi^*</math> instead of the function <math>\varphi</math>, and therefore the Moreau decomposition can be applied. This is the case for [[Lasso (statistics)#Group LASSO|group lasso]].
where <math>x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math> The <math>\ell_1</math> regularization problem is sometimes referred to as ''lasso'' ([[Lasso (statistics)|least absolute shrinkage and selection operator]]).<ref name=tibshirani /> Such <math>\ell_1</math> regularization problems are interesting because they induce ''sparse'' solutions, that is, solutions <math>w</math> to the minimization problem have relatively few nonzero components. Lasso can be seen as a convex relaxation of the non-convex problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_0, </math>
where <math>\|w\|_0</math> denotes the <math>\ell_0</math> "norm", which is the number of nonzero entries of the vector <math>w</math>. Sparse solutions are of particular interest in learning theory for interpretability of results: a sparse solution can identify a small number of important factors.<ref name=tibshirani>{{cite journal|last=Tibshirani|first=R.|title=Regression shrinkage and selection via the lasso|journal=J. R. Stat. Soc. Ser. B|year=1996|volume=58|series=1|issue=1|pages=267–288|doi=10.1111/j.2517-6161.1996.tb02080.x }}</ref>
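The connection between the two penalties can also be seen at the level of their proximity operators: applied component-wise, the proximity operator of <math>\lambda\|\cdot\|_0</math> is ''hard'' thresholding (an entry is kept unchanged only if its magnitude exceeds <math>\sqrt{2\lambda}</math>), while that of <math>\lambda\|\cdot\|_1</math> is the ''soft'' thresholding derived below. A brief sketch (illustrative only, reusing <code>soft_threshold</code> from the sketch above):

<syntaxhighlight lang="python">
import numpy as np

def hard_threshold(v, lam):
    # prox of lam*||.||_0: keep v_i unchanged only when v_i^2 / 2 > lam
    return np.where(np.abs(v) > np.sqrt(2.0 * lam), v, 0.0)

v = np.array([1.5, 0.4, -0.05, -2.0])
print(hard_threshold(v, 0.1))  # small entries set to zero, large entries kept exactly
print(soft_threshold(v, 0.1))  # small entries set to zero, large entries shrunk by 0.1
</syntaxhighlight>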
 
=== Solving for L<sub>1</sub> proximity operator ===
Note here the effective trade-off between the empirical error term <math>F(w) </math> and the regularization penalty <math>R(w)</math>. This fixed point method has decoupled the effect of the two different convex functions which comprise the objective function into a gradient descent step (<math> w^k - \gamma \nabla F\left(w^k\right)</math>) and a soft thresholding step (via <math>S_\gamma</math>).
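Concretely, for the lasso objective above the iteration can be implemented in a few lines. The following NumPy sketch is illustrative only: it uses a constant step size <math>\gamma = 1/L</math>, where <math>L</math> is the Lipschitz constant of <math>\nabla F</math>, writes the soft-thresholding level explicitly as <math>\gamma\lambda</math>, and reuses <code>soft_threshold</code> from the earlier sketch.

<syntaxhighlight lang="python">
import numpy as np

def ista(X, y, lam, n_iter=500):
    # Proximal gradient (iterative soft thresholding) for
    # (1/n)*||Xw - y||^2 + lam*||w||_1.
    n, d = X.shape
    L = 2.0 / n * np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the gradient of F
    gamma = 1.0 / L                           # constant step size
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)                  # gradient step on the square loss F
        w = soft_threshold(w - gamma * grad, gamma * lam)   # prox step on lam*||.||_1
    return w
</syntaxhighlight>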
 
Convergence of this fixed point scheme is well-studied in the literature<ref name=combettes /><ref name=daubechies /> and is guaranteed under appropriate choice of step size <math>\gamma</math> and [[loss function]] (such as the square loss taken here). [[Gradient descent#Extensions|Accelerated methods]] that improve the [[rate of convergence]] under certain regularity assumptions on <math>F</math> were introduced by Nesterov in 1983.<ref name=nesterov>{{cite journal|last=Nesterov|first=Yurii|title=A method of solving a convex programming problem with convergence rate <math>O(1/k^2)</math>|journal=Soviet Mathematics - Doklady|year=1983|volume=27|issue=2|pages=372–376}}</ref> Such methods have since been studied extensively.<ref>{{cite book|last=Nesterov|first=Yurii|title=Introductory Lectures on Convex Optimization|year=2004|publisher=Kluwer Academic Publisher}}</ref>
For more general learning problems where the proximity operator cannot be computed explicitly for some regularization term <math>R</math>, such fixed point schemes can still be carried out using approximations to both the gradient and the proximity operator.<ref name=bauschke /><ref>{{cite journal|last=Villa|first=S.|author2=Salzo, S. |author3=Baldassarre, L. |author4=Verri, A. |title=Accelerated and inexact forward-backward algorithms|journal=SIAM J. Optim.|year=2013|volume=23|issue=3|pages=1607–1633|doi=10.1137/110844805|citeseerx=10.1.1.416.3633|s2cid=11379846 }}</ref>
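A sketch of one standard accelerated variant (Nesterov-style extrapolation added to the iteration above; this particular update rule is a common choice and is not spelled out in the references here), again reusing <code>soft_threshold</code>:

<syntaxhighlight lang="python">
import numpy as np

def accelerated_ista(X, y, lam, n_iter=500):
    # Proximal gradient with Nesterov-style momentum: the gradient and prox
    # steps are taken at an extrapolated point z rather than at the iterate w.
    n, d = X.shape
    L = 2.0 / n * np.linalg.norm(X, 2) ** 2
    gamma = 1.0 / L
    w = np.zeros(d)
    z = w.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ z - y)
        w_next = soft_threshold(z - gamma * grad, gamma * lam)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = w_next + (t - 1.0) / t_next * (w_next - w)   # momentum / extrapolation step
        w, t = w_next, t_next
    return w
</syntaxhighlight>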
 
== Practical considerations ==
In the fixed point iteration scheme
:<math>w^{k+1} = \operatorname{prox}_{\gamma R}\left(w^k-\gamma \nabla F\left(w^k\right)\right),</math>
one can allow variable step size <math>\gamma_k</math> instead of a constant <math>\gamma</math>. Numerous adaptive step size schemes have been proposed in the literature.<ref name=combettes /><ref name=bauschke /><ref>{{cite journal|last=Loris|first=I. |author2=Bertero, M. |author3=De Mol, C. |author4=Zanella, R. |author5=Zanni, L. |title=Accelerating gradient projection methods for <math>\ell_1</math>-constrained signal recovery by steplength selection rules|journal=Applied & Comp. Harmonic Analysis|volume=27|issue=2|pages=247–254|year=2009|doi=10.1016/j.acha.2009.02.003|arxiv=0902.4424 |s2cid=18093882 }}</ref><ref>{{cite journal|last=Wright|first=S.J.|author2=Nowak, R.D. |author3=Figueiredo, M.A.T. |title=Sparse reconstruction by separable approximation|journal=IEEE Trans. Signal Process.|year=2009|volume=57|issue=7|pages=2479–2493|doi=10.1109/TSP.2009.2016892|bibcode=2009ITSP...57.2479W|citeseerx=10.1.1.115.9334|s2cid=7399917 }}</ref> Applications of these schemes<ref name=structSparse /><ref>{{cite journal|last=Loris|first=Ignace|title=On the performance of algorithms for the minimization of <math>\ell_1</math>-penalized functionals|journal=Inverse Problems|year=2009|volume=25|issue=3|doi=10.1088/0266-5611/25/3/035008|page=035008|arxiv=0710.4082|bibcode=2009InvPr..25c5008L|s2cid=14213443}}</ref> suggest that such schemes can substantially reduce the number of iterations required for fixed point convergence.
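One simple variable-step-size rule is backtracking: at each iteration, <math>\gamma_k</math> is shrunk until the quadratic upper bound on <math>F</math> that underlies the proximal gradient step holds at the candidate point. This is only one of the many schemes cited above; a minimal sketch (illustrative, reusing <code>soft_threshold</code> from the earlier sketch):

<syntaxhighlight lang="python">
import numpy as np

def prox_grad_backtracking(X, y, lam, n_iter=200, gamma0=1.0, beta=0.5):
    # Proximal gradient with a backtracking choice of step size gamma_k:
    # gamma is multiplied by beta until
    # F(p) <= F(w) + <grad F(w), p - w> + ||p - w||^2 / (2*gamma).
    n, d = X.shape
    F = lambda w: np.sum((X @ w - y) ** 2) / n
    w = np.zeros(d)
    gamma = gamma0
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)
        while True:
            p = soft_threshold(w - gamma * grad, gamma * lam)
            diff = p - w
            if F(p) <= F(w) + grad @ diff + (diff @ diff) / (2.0 * gamma):
                break
            gamma *= beta   # step too large: backtrack
        w = p
    return w
</syntaxhighlight>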
 
=== Elastic net (mixed norm regularization) ===
=== Group lasso ===
 
Group lasso is a generalization of the [[Lasso (statistics)|lasso method]] when features are grouped into disjoint blocks.<ref name=groupLasso>{{cite journal|last=Yuan|first=M.|author2=Lin, Y. |title=Model selection and estimation in regression with grouped variables|journal=J. R. Stat. Soc. B|year=2006|volume=68|issue=1|pages=49–67|doi=10.1111/j.1467-9868.2005.00532.x|s2cid=6162124|doi-access=free}}</ref> Suppose the features are grouped into blocks <math>\{w_1,\ldots,w_G\}</math>. Here we take as a regularization penalty
 
:<math>R(w) =\sum_{g=1}^G \|w_g\|_2,</math>
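The proximity operator of this penalty acts block-wise: each group is scaled toward zero, and any group whose norm falls below the threshold is set exactly to zero, which is what produces group-level sparsity. A minimal NumPy sketch (illustrative only, assuming <code>groups</code> is a list of index arrays partitioning the coordinates):

<syntaxhighlight lang="python">
import numpy as np

def prox_group_lasso(w, groups, gamma):
    # Block-wise prox of gamma * sum_g ||w_g||_2: shrink each block's norm by
    # gamma, zeroing the whole block when its norm is at most gamma.
    w = w.copy()
    for g in groups:                      # g: index array selecting one block
        norm = np.linalg.norm(w[g])
        scale = max(0.0, 1.0 - gamma / norm) if norm > 0 else 0.0
        w[g] = scale * w[g]
    return w

# Example: the second block is zeroed out entirely, the first is only shrunk.
w = np.array([3.0, 4.0, 0.1, -0.2])
print(prox_group_lasso(w, [np.array([0, 1]), np.array([2, 3])], 1.0))
</syntaxhighlight>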
* [[Convex analysis]]
* [[Proximal gradient method]]
* [[Regularization (mathematics)#Other uses of regularization in statistics and machine learning|Regularization]]
* [[Statistical learning theory]]