:<math>\left(\operatorname{prox}_{\gamma \|\cdot\|_1}(x)\right)_i = \begin{cases} x_i-\gamma, & x_i>\gamma \\ 0, & |x_i|\leq\gamma \\ x_i+\gamma, & x_i<-\gamma, \end{cases}</math>
which is known as the [[Thresholding (image processing)|soft thresholding]] operator <math>S_{\gamma}(x)=\operatorname{prox}_{\gamma \|\cdot\|_1}(x)</math>.<ref name=combettes /><ref name=daubechies>{{cite journal|last=Daubechies|first=I. |author2=Defrise, M. |author3=De Mol, C.|author3-link= Christine De Mol |title=An iterative thresholding algorithm for linear inverse problems with a sparsity constraint|journal=Comm. Pure Appl. Math.|year=2004|volume=57|issue=11|pages=1413–1457|doi=10.1002/cpa.20042|arxiv=math/0307152|s2cid=1438417 }}</ref>
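The operator acts entrywise and can be implemented in a few lines; the following is a minimal NumPy sketch (the function name <code>soft_threshold</code> is illustrative, not a standard library routine):
<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(x, gamma):
    """Soft thresholding S_gamma(x) = prox_{gamma ||.||_1}(x), applied entrywise."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)
</syntaxhighlight>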
=== Fixed point iterative schemes ===
== Practical considerations ==
There have been numerous developments within the past decade in [[convex optimization]] techniques that have influenced the application of proximal gradient methods in statistical learning theory. Here we survey a few important topics that can greatly improve the practical performance of these methods.<ref name=structSparse /><ref name=bach>{{cite journal|last=Bach|first=F.|author2=Jenatton, R. |author3=Mairal, J. |author4=Obozinski, Gl. |title=Optimization with sparsity-inducing penalties|journal= Foundations and Trends in Machine Learning|year=2011|volume=4|issue=1|pages=1–106|doi=10.1561/2200000015|arxiv=1108.0775|bibcode=2011arXiv1108.0775B|s2cid=56356708}}</ref>
=== Adaptive step size ===
In the fixed point iteration scheme
:<math>w^{k+1} = \operatorname{prox}_{\gamma R}\left(w^k-\gamma \nabla F\left(w^k\right)\right),</math>
one can allow a variable step size <math>\gamma_k</math> instead of a constant <math>\gamma</math>. Numerous adaptive step size schemes have been proposed throughout the literature.<ref name=combettes /><ref name=bauschke /><ref>{{cite journal|last=Loris|first=I. |author2=Bertero, M. |author3=De Mol, C. |author4=Zanella, R. |author5=Zanni, L. |title=Accelerating gradient projection methods for <math>\ell_1</math>-constrained signal recovery by steplength selection rules|journal=Applied & Comp. Harmonic Analysis|volume=27|issue=2|pages=247–254|year=2009|doi=10.1016/j.acha.2009.02.003|arxiv=0902.4424 |s2cid=18093882 }}</ref><ref>{{cite journal|last=Wright|first=S.J.|author2=Nowak, R.D. |author3=Figueiredo, M.A.T. |title=Sparse reconstruction by separable approximation|journal=IEEE Trans. Image Process.|year=2009|volume=57|issue=7|pages=2479–2493|doi=10.1109/TSP.2009.2016892|bibcode=2009ITSP...57.2479W|citeseerx=10.1.1.115.9334}}</ref> Applications of these schemes<ref name=structSparse /><ref>{{cite journal|last=Loris|first=Ignace|title=On the performance of algorithms for the minimization of <math>\ell_1</math>-penalized functionals|journal=Inverse Problems|year=2009|volume=25|issue=3|doi=10.1088/0266-5611/25/3/035008|page=035008|arxiv=0710.4082|bibcode=2009InvPr..25c5008L|s2cid=14213443}}</ref> suggest that they can offer a substantial improvement in the number of iterations required for fixed point convergence.
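As a deliberately simplified example of a variable step size, the following sketch combines the fixed point iteration above with a backtracking search that shrinks <math>\gamma_k</math> until a standard sufficient-decrease condition holds; it illustrates the general idea rather than any particular rule from the references, and all names are illustrative:
<syntaxhighlight lang="python">
import numpy as np

def proximal_gradient_backtracking(F, grad_F, prox_R, w0,
                                   gamma0=1.0, beta=0.5, max_iter=500, tol=1e-8):
    """Iterate w <- prox_{gamma R}(w - gamma * grad F(w)), shrinking gamma
    by a factor beta until a sufficient-decrease condition holds."""
    w, gamma = np.asarray(w0, dtype=float), gamma0
    for _ in range(max_iter):
        g = grad_F(w)
        while True:
            p = prox_R(w - gamma * g, gamma)
            d = p - w
            # Accept gamma if F(p) <= F(w) + <grad F(w), d> + ||d||^2 / (2 gamma).
            if F(p) <= F(w) + g @ d + (d @ d) / (2.0 * gamma):
                break
            gamma *= beta
        if np.linalg.norm(d) <= tol * max(1.0, np.linalg.norm(w)):
            return p
        w = p
    return w
</syntaxhighlight>
Here <code>prox_R(v, gamma)</code> stands for <math>\operatorname{prox}_{\gamma R}(v)</math>; for the <math>\ell_1</math> penalty it is the soft thresholding operator above.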
=== Elastic net (mixed norm regularization) ===
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2^2\right), </math>
where <math>x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
For <math>0<\mu\leq 1</math> the penalty term <math>\lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2^2\right)</math> is now strictly convex, and hence the minimization problem now admits a unique solution. It has been observed that for sufficiently small <math>\mu > 0</math>, the additional penalty term <math>\mu \|w\|_2^2</math> acts as a preconditioner and can substantially improve convergence while not adversely affecting the sparsity of solutions.<ref name=structSparse /><ref name=deMolElasticNet>{{cite journal|last=De Mol|first=C. |author2=De Vito, E. |author3=Rosasco, L.|title=Elastic-net regularization in learning theory|journal=J. Complexity|year=2009|volume=25|issue=2|pages=201–230|doi=10.1016/j.jco.2009.01.002|arxiv=0807.3423|s2cid=7167292 }}</ref>
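The proximal map of this mixed penalty has a simple closed form, namely soft thresholding followed by a uniform rescaling, so the elastic net fits directly into the fixed point scheme above. A minimal NumPy sketch (parameter names illustrative):
<syntaxhighlight lang="python">
import numpy as np

def prox_elastic_net(v, gamma, lam, mu):
    """Proximal operator of R(w) = lam*((1 - mu)*||w||_1 + mu*||w||_2^2):
    soft threshold at gamma*lam*(1 - mu), then shrink by 1/(1 + 2*gamma*lam*mu)."""
    t = gamma * lam * (1.0 - mu)
    shrunk = np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    return shrunk / (1.0 + 2.0 * gamma * lam * mu)
</syntaxhighlight>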
== Exploiting group structure ==
=== Other group structures ===
In contrast to the group lasso problem, where features are grouped into disjoint blocks, grouped features may overlap or have a nested structure. Such generalizations of group lasso have been considered in a variety of contexts.<ref>{{cite journal|last=Chen|first=X.|author2=Lin, Q. |author3=Kim, S. |author4=Carbonell, J.G. |author5=Xing, E.P. |title=Smoothing proximal gradient method for general structured sparse regression|journal=Ann. Appl. Stat.|year=2012|volume=6|issue=2|pages=719–752|doi=10.1214/11-AOAS514|arxiv=1005.4717|s2cid=870800}}</ref><ref>{{cite journal|last=Mosci|first=S.|author2=Villa, S. |author3=Verri, A. |author4=Rosasco, L. |title=A primal-dual algorithm for group sparse regularization with overlapping groups|journal=NIPS|year=2010|volume=23|pages=2604–2612}}</ref><ref name=nest>{{cite journal|last=Jenatton|first=R. |author2=Audibert, J.-Y. |author3=Bach, F. |title=Structured variable selection with sparsity-inducing norms|journal=J. Mach. Learn. Res.|year=2011|volume=12|pages=2777–2824|bibcode=2009arXiv0904.3523J |arxiv=0904.3523 }}</ref><ref>{{cite journal|last=Zhao|first=P.|author2=Rocha, G. |author3=Yu, B. |title=The composite absolute penalties family for grouped and hierarchical variable selection|journal=Ann. Stat.|year=2009|volume=37|issue=6A|pages=3468–3497|doi=10.1214/07-AOS584|arxiv=0909.0411|bibcode=2009arXiv0909.0411Z|s2cid=9319285}}</ref> For overlapping groups, one common approach is the ''latent group lasso'', which introduces latent variables to account for overlap.<ref>{{cite
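One concrete way to realize the latent-variable construction is to replicate each feature once for every group containing it, so that the overlapping groups become disjoint blocks of an enlarged problem, and then apply the ordinary block soft thresholding (group lasso) proximal map to the replicated coefficients. The following rough NumPy sketch illustrates this reduction; it is a simplified illustration rather than the exact formulation of the cited works, and all names are illustrative:
<syntaxhighlight lang="python">
import numpy as np

def block_soft_threshold(v, groups, gamma):
    """Proximal map of gamma * sum_g ||v_g||_2 for disjoint groups of indices."""
    out = v.copy()
    for g in groups:
        norm = np.linalg.norm(v[g])
        out[g] = 0.0 if norm <= gamma else (1.0 - gamma / norm) * v[g]
    return out

def expand_overlapping_groups(X, groups):
    """Duplicate columns of X so that each (possibly overlapping) group
    becomes a disjoint block in the enlarged design matrix."""
    blocks, new_groups, start = [], [], 0
    for g in groups:
        blocks.append(X[:, g])
        new_groups.append(np.arange(start, start + len(g)))
        start += len(g)
    return np.hstack(blocks), new_groups
</syntaxhighlight>
A group lasso solved on the enlarged design yields the latent variables; the original coefficient vector is then recovered by summing, for each feature, the copies of that feature across the groups containing it.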
== See also ==