Proximal gradient methods for learning

An important example is <math>\ell_1</math>-regularized (lasso) least squares, which takes the form
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, \quad \text{ where } x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
 
Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application.<ref name=combettes>{{cite journal|last=Combettes|first=Patrick L.|author2=Wajs, Valérie R.|title=Signal Recovery by Proximal Forward-Backward Splitting|journal=Multiscale Model. Simul.|year=2005|volume=4|issue=4|pages=1168–1200|url=http://epubs.siam.org/doi/abs/10.1137/050626090}}</ref><ref name=structSparse>{{cite journal|last=Mosci|first=S.|author2=Rosasco, L.|author3=Santoro, M.|author4=Verri, A.|author5=Villa, S.|title=Solving Structured Sparsity Regularization with Proximal Methods|journal=Machine Learning and Knowledge Discovery in Databases|year=2010|volume=6322|pages=418–433}}</ref> Such customized penalties can help to induce certain structure in problem solutions, such as ''sparsity'' (in the case of [[#Lasso regularization|lasso]]) or ''group structure'' (in the case of [[#Exploiting group structure|group lasso]]).
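For example, the proximity operator associated with the <math>\ell_1</math> penalty <math>\lambda\|\cdot\|_1</math> is componentwise soft thresholding, which sets small coefficients exactly to zero and thereby produces sparse solutions:
:<math>\left(\operatorname{prox}_{\lambda \|\cdot\|_1}(v)\right)_i = \operatorname{sgn}(v_i)\max\left\{|v_i|-\lambda,\,0\right\}, \qquad i=1,\ldots,d.</math>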
 
== Relevant background ==
 
Convergence of this fixed point scheme is well-studied in the literature<ref name=combettes /><ref name=daubechies /> and is guaranteed under appropriate choice of step size <math>\gamma</math> and loss function (such as the square loss taken here). [[Gradient descent#Extensions|Accelerated methods]] were introduced by Nesterov in 1983, and they improve the rate of convergence under certain regularity assumptions on <math>F</math>.<ref name=nesterov>{{cite journal|last=Nesterov|first=Yurii|title=A method of solving a convex programming problem with convergence rate <math>O(1/k^2)</math>|journal=Soviet Math. Doklady|year=1983|volume=27|issue=2|pages=372–376}}</ref> Such methods have since been studied extensively.<ref>{{cite book|last=Nesterov|first=Yurii|title=Introductory Lectures on Convex Optimization|year=2004|publisher=Kluwer Academic Publishers}}</ref>
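One widely used accelerated scheme applies the proximal step at an extrapolated point rather than at the current iterate; a minimal sketch of such an update, in the notation of the fixed point iteration, is
:<math>w^{k+1} = \operatorname{prox}_{\gamma R}\left(v^k-\gamma \nabla F\left(v^k\right)\right), \qquad v^{k+1} = w^{k+1} + \frac{t_k-1}{t_{k+1}}\left(w^{k+1}-w^k\right),</math>
where <math>t_{k+1} = \tfrac{1}{2}\left(1+\sqrt{1+4t_k^2}\right)</math>, <math>t_1 = 1</math>, and <math>v^1 = w^0</math>.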
For more general learning problems where the proximity operator cannot be computed explicitly for some regularization term <math>R</math>, such fixed point schemes can still be carried out using approximations to both the gradient and the proximity operator.<ref name=bauschke /><ref>{{cite journal|last=Villa|first=S.|author2=Salzo, S.|author3=Baldassarre, L.|author4=Verri, A.|title=Accelerated and inexact forward-backward algorithms|journal=SIAM J. Optim.|year=2013|volume=23|issue=3|pages=1607–1633|doi=10.1137/110844805}}</ref>
 
== Practical considerations ==
 
There have been numerous developments within the past decade in [[convex optimization]] techniques which have influenced the application of proximal gradient methods in statistical learning theory. Here we survey a few important topics which can greatly improve practical algorithmic performance of these methods.<ref name=structSparse /><ref name=bach>{{cite journal|last=Bach|first=F.|author2=Jenatton, R.|author3=Mairal, J.|author4=Obozinski, G.|title=Optimization with sparsity-inducing penalties|journal=Found. & Trends Mach. Learn.|year=2011|volume=4|issue=1|pages=1–106|doi=10.1561/2200000015}}</ref>
 
=== Adaptive step size ===
In the fixed point iteration scheme
:<math>w^{k+1} = \operatorname{prox}_{\gamma R}\left(w^k-\gamma \nabla F\left(w^k\right)\right),</math>
one can allow variable step size <math>\gamma_k</math> instead of a constant <math>\gamma</math>. Numerous adaptive step size schemes have been proposed throughout the literature.<ref name=combettes /><ref name=bauschke /><ref>{{cite journal|last=Loris|first=I.|author2=Bertero, M.|author3=De Mol, C.|author4=Zanella, R.|author5=Zanni, L.|title=Accelerating gradient projection methods for <math>\ell_1</math>-constrained signal recovery by steplength selection rules|journal=Applied & Comp. Harmonic Analysis|volume=27|issue=2|pages=247–254|year=2009}}</ref><ref>{{cite journal|last=Wright|first=S.J.|author2=Nowak, R.D.|author3=Figueiredo, M.A.T.|title=Sparse reconstruction by separable approximation|journal=IEEE Trans. Signal Process.|year=2009|volume=57|issue=7|pages=2479–2493|doi=10.1109/TSP.2009.2016892}}</ref> Applications of these schemes<ref name=structSparse /><ref>{{cite journal|last=Loris|first=Ignace|title=On the performance of algorithms for the minimization of <math>\ell_1</math>-penalized functionals|journal=Inverse Problems|year=2009|volume=25|issue=3|doi=10.1088/0266-5611/25/3/035008}}</ref> suggest that they can offer substantial improvement in the number of iterations required for fixed point convergence.
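As an illustration, the following sketch applies the iteration above to the lasso problem with a simple backtracking rule that shrinks <math>\gamma_k</math> until a standard sufficient-decrease condition on <math>F</math> holds; the function and parameter names are illustrative rather than taken from any of the cited references.

<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t*||.||_1: componentwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_lasso(X, y, lam, gamma0=1.0, beta=0.5, max_iter=500, tol=1e-8):
    """Proximal gradient iteration for (1/n)||Xw - y||^2 + lam*||w||_1
    with a backtracking (adaptive) choice of the step size gamma_k."""
    n, d = X.shape
    w = np.zeros(d)
    gamma = gamma0
    F = lambda u: np.sum((X @ u - y) ** 2) / n  # smooth part of the objective
    for _ in range(max_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        Fw = F(w)
        while True:  # shrink gamma until the sufficient-decrease condition holds
            w_new = soft_threshold(w - gamma * grad, gamma * lam)
            diff = w_new - w
            if F(w_new) <= Fw + grad @ diff + (diff @ diff) / (2.0 * gamma):
                break
            gamma *= beta
        if np.linalg.norm(diff) <= tol * max(1.0, np.linalg.norm(w)):
            return w_new
        w = w_new
    return w
</syntaxhighlight>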
 
=== Elastic net (mixed norm regularization) ===
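One common parametrization of the mixed penalty combines the <math>\ell_1</math> norm with a squared <math>\ell_2</math> term,
:<math>R(w) = \lambda\left(\alpha\|w\|_1 + \frac{1-\alpha}{2}\|w\|_2^2\right), \qquad \alpha\in[0,1],</math>
whose proximity operator can still be computed in closed form, as soft thresholding followed by a rescaling.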
=== Other group structures ===
 
In contrast to the group lasso problem, where features are grouped into disjoint blocks, it may be the case that grouped features are overlapping or have a nested structure. Such generalizations of group lasso have been considered in a variety of contexts.<ref>{{cite journal|last=Chen|first=X.|author2=Lin, Q.|author3=Kim, S.|author4=Carbonell, J.G.|author5=Xing, E.P.|title=Smoothing proximal gradient method for general structured sparse regression|journal=Ann. Appl. Stat.|year=2012|volume=6|issue=2|pages=719–752|doi=10.1214/11-AOAS514}}</ref><ref>{{cite journal|last=Mosci|first=S.|author2=Villa, S.|author3=Verri, A.|author4=Rosasco, L.|title=A primal-dual algorithm for group sparse regularization with overlapping groups|journal=NIPS|year=2010|volume=23|pages=2604–2612}}</ref><ref name=nest>{{cite journal|last=Jenatton|first=R.|author2=Audibert, J.-Y.|author3=Bach, F.|title=Structured variable selection with sparsity-inducing norms|journal=J. Mach. Learn. Res.|year=2011|volume=12|pages=2777–2824}}</ref><ref>{{cite journal|last=Zhao|first=P.|author2=Rocha, G.|author3=Yu, B.|title=The composite absolute penalties family for grouped and hierarchical variable selection|journal=Ann. Statist.|year=2009|volume=37|issue=6A|pages=3468–3497|doi=10.1214/07-AOS584}}</ref> For overlapping groups one common approach is known as ''latent group lasso'', which introduces latent variables to account for overlap.<ref>{{cite journal|last=Obozinski|first=G.|author2=Jacob, L.|author3=Vert, J.-P.|title=Group lasso with overlaps: the latent group lasso approach|journal=INRIA Technical Report|year=2011|url=http://hal.inria.fr/inria-00628498/en/}}</ref><ref>{{cite journal|last=Villa|first=S.|author2=Rosasco, L.|author3=Mosci, S.|author4=Verri, A.|title=Proximal methods for the latent group lasso penalty|journal=preprint|year=2012|url=http://arxiv.org/abs/1209.0368}}</ref> Nested group structures are studied in ''hierarchical structure prediction'' and with [[directed acyclic graph]]s.<ref name=nest />
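In one common formulation, the latent group lasso penalty for a collection <math>\mathcal{G}</math> of possibly overlapping groups decomposes <math>w</math> into latent components, each supported on a single group:
:<math>\Omega(w) = \min\left\{\sum_{g\in\mathcal{G}}\|v_g\|_2 \;:\; \sum_{g\in\mathcal{G}} v_g = w,\ \operatorname{supp}(v_g)\subseteq g\right\}.</math>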
 
== See also ==