Proximal gradient methods for learning

An important example is <math>\ell_1</math>-regularized (lasso) least squares, which takes the form
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, \quad \text{ where } x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
 
Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application.<ref name=combettes>{{cite journal|last=Combettes|first=Patrick L.|author2=Wajs, Valérie R.|title=Signal Recovery by Proximal Forward-Backward Splitting|journal=Multiscale Model. Simul.|year=2005|volume=4|issue=4|pages=1168–1200|url=http://epubs.siam.org/doi/abs/10.1137/050626090}}</ref><ref name=structSparse>{{cite journal|last=Mosci|first=S.|author2=Rosasco, L.|author3=Santoro, M.|author4=Verri, A.|author5=Villa, S.|title=Solving Structured Sparsity Regularization with Proximal Methods|journal=Machine Learning and Knowledge Discovery in Databases|year=2010|volume=6322|pages=418–433}}</ref> Such customized penalties can help to induce certain structure in problem solutions, such as ''sparsity'' (in the case of [[#Lasso regularization|lasso]]) or ''group structure'' (in the case of [[#Exploiting group structure|group lasso]]).
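For example, the proximity operator associated with the <math>\ell_1</math> penalty <math>\lambda\|\cdot\|_1</math> is componentwise soft thresholding, which sets small coefficients exactly to zero and thereby produces sparse solutions:
:<math>\left(\operatorname{prox}_{\lambda \|\cdot\|_1}(v)\right)_i = \operatorname{sgn}(v_i)\max\left\{|v_i|-\lambda,\,0\right\}, \qquad i=1,\ldots,d.</math>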
 
== Relevant background ==
 
Convergence of this fixed point scheme is well-studied in the literature<ref name=combettes /><ref name=daubechies /> and is guaranteed under appropriate choice of step size <math>\gamma</math> and loss function (such as the square loss taken here). [[Gradient descent#Extensions|Accelerated methods]] were introduced by Nesterov in 1983, and they improve the rate of convergence under certain regularity assumptions on <math>F</math>.<ref name=nesterov>{{cite journal|last=Nesterov|first=Yurii|title=A method of solving a convex programming problem with convergence rate <math>O(1/k^2)</math>|journal=Soviet Math. Doklady|year=1983|volume=27|issue=2|pages=372–376}}</ref> Such methods have since been studied extensively.<ref>{{cite book|last=Nesterov|first=Yurii|title=Introductory Lectures on Convex Optimization|year=2004|publisher=Kluwer Academic Publishers}}</ref>
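One widely used accelerated scheme applies the proximal step at an extrapolated point rather than at the current iterate; a minimal sketch of such an update, in the notation of the fixed point iteration, is
:<math>w^{k+1} = \operatorname{prox}_{\gamma R}\left(v^k-\gamma \nabla F\left(v^k\right)\right), \qquad v^{k+1} = w^{k+1} + \frac{t_k-1}{t_{k+1}}\left(w^{k+1}-w^k\right),</math>
where <math>t_{k+1} = \tfrac{1}{2}\left(1+\sqrt{1+4t_k^2}\right)</math>, <math>t_1 = 1</math>, and <math>v^1 = w^0</math>.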
For more general learning problems where the proximity operator cannot be computed explicitly for some regularization term <math>R</math>, such fixed point schemes can still be carried out using approximations to both the gradient and the proximity operator.<ref name=bauschke /><ref>{{cite journal|last=Villa|first=S.|author2=Salzo, S.|author3=Baldassarre, L.|author4=Verri, A.|title=Accelerated and inexact forward-backward algorithms|journal=SIAM J. Optim.|year=2013|volume=23|issue=3|pages=1607–1633|doi=10.1137/110844805}}</ref>
 
== Practical considerations ==
 
There have been numerous developments within the past decade in [[convex optimization]] techniques which have influenced the application of proximal gradient methods in statistical learning theory. Here we survey a few important topics which can greatly improve practical algorithmic performance of these methods.<ref name=structSparse /><ref name=bach>{{cite journal|last=Bach|first=F.|author2=Jenatton, R.|author3=Mairal, J.|author4=Obozinski, G.|title=Optimization with sparsity-inducing penalties|journal=Found. & Trends Mach. Learn.|year=2011|volume=4|issue=1|pages=1–106|doi=10.1561/2200000015}}</ref>
 
=== Adaptive step size ===
In the fixed point iteration scheme
:<math>w^{k+1} = \operatorname{prox}_{\gamma R}\left(w^k-\gamma \nabla F\left(w^k\right)\right),</math>
one can allow variable step size <math>\gamma_k</math> instead of a constant <math>\gamma</math>. Numerous adaptive step size schemes have been proposed throughout the literature.<ref name=combettes /><ref name=bauschke /><ref>{{cite journal|last=Loris|first=I.|author2=Bertero, M.|author3=De Mol, C.|author4=Zanella, R.|author5=Zanni, L.|title=Accelerating gradient projection methods for <math>\ell_1</math>-constrained signal recovery by steplength selection rules|journal=Applied & Comp. Harmonic Analysis|volume=27|issue=2|pages=247–254|year=2009}}</ref><ref>{{cite journal|last=Wright|first=S.J.|author2=Nowak, R.D.|author3=Figueiredo, M.A.T.|title=Sparse reconstruction by separable approximation|journal=IEEE Trans. Signal Process.|year=2009|volume=57|issue=7|pages=2479–2493|doi=10.1109/TSP.2009.2016892}}</ref> Applications of these schemes<ref name=structSparse /><ref>{{cite journal|last=Loris|first=Ignace|title=On the performance of algorithms for the minimization of <math>\ell_1</math>-penalized functionals|journal=Inverse Problems|year=2009|volume=25|issue=3|doi=10.1088/0266-5611/25/3/035008}}</ref> suggest that they can offer substantial improvement in the number of iterations required for fixed point convergence.
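As an illustration, the following sketch applies the iteration above to the lasso problem with a simple backtracking rule that shrinks <math>\gamma_k</math> until a standard sufficient-decrease condition on <math>F</math> holds; the function and parameter names are illustrative rather than taken from any of the cited references.

<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t*||.||_1: componentwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_lasso(X, y, lam, gamma0=1.0, beta=0.5, max_iter=500, tol=1e-8):
    """Proximal gradient iteration for (1/n)||Xw - y||^2 + lam*||w||_1
    with a backtracking (adaptive) choice of the step size gamma_k."""
    n, d = X.shape
    w = np.zeros(d)
    gamma = gamma0
    F = lambda u: np.sum((X @ u - y) ** 2) / n  # smooth part of the objective
    for _ in range(max_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        Fw = F(w)
        while True:  # shrink gamma until the sufficient-decrease condition holds
            w_new = soft_threshold(w - gamma * grad, gamma * lam)
            diff = w_new - w
            if F(w_new) <= Fw + grad @ diff + (diff @ diff) / (2.0 * gamma):
                break
            gamma *= beta
        if np.linalg.norm(diff) <= tol * max(1.0, np.linalg.norm(w)):
            return w_new
        w = w_new
    return w
</syntaxhighlight>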
 
=== Elastic net (mixed norm regularization) ===
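One common parametrization of the mixed penalty combines the <math>\ell_1</math> norm with a squared <math>\ell_2</math> term,
:<math>R(w) = \lambda\left(\alpha\|w\|_1 + \frac{1-\alpha}{2}\|w\|_2^2\right), \qquad \alpha\in[0,1],</math>
whose proximity operator can still be computed in closed form, as soft thresholding followed by a rescaling.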
=== Other group structures ===
 
In contrast to the group lasso problem, where features are grouped into disjoint blocks, it may be the case that grouped features are overlapping or have a nested structure. Such generalizations of group lasso have been considered in a variety of contexts.<ref>{{cite journal|last=Chen|first=X.|author2=Lin, Q.|author3=Kim, S.|author4=Carbonell, J.G.|author5=Xing, E.P.|title=Smoothing proximal gradient method for general structured sparse regression|journal=Ann. Appl. Stat.|year=2012|volume=6|issue=2|pages=719–752|doi=10.1214/11-AOAS514}}</ref><ref>{{cite journal|last=Mosci|first=S.|author2=Villa, S.|author3=Verri, A.|author4=Rosasco, L.|title=A primal-dual algorithm for group sparse regularization with overlapping groups|journal=NIPS|year=2010|volume=23|pages=2604–2612}}</ref><ref name=nest>{{cite journal|last=Jenatton|first=R.|author2=Audibert, J.-Y.|author3=Bach, F.|title=Structured variable selection with sparsity-inducing norms|journal=J. Mach. Learn. Res.|year=2011|volume=12|pages=2777–2824}}</ref><ref>{{cite journal|last=Zhao|first=P.|author2=Rocha, G.|author3=Yu, B.|title=The composite absolute penalties family for grouped and hierarchical variable selection|journal=Ann. Statist.|year=2009|volume=37|issue=6A|pages=3468–3497|doi=10.1214/07-AOS584}}</ref> For overlapping groups one common approach is known as ''latent group lasso'', which introduces latent variables to account for overlap.<ref>{{cite journal|last=Obozinski|first=G.|author2=Jacob, L.|author3=Vert, J.-P.|title=Group lasso with overlaps: the latent group lasso approach|journal=INRIA Technical Report|year=2011|url=http://hal.inria.fr/inria-00628498/en/}}</ref><ref>{{cite journal|last=Villa|first=S.|author2=Rosasco, L.|author3=Mosci, S.|author4=Verri, A.|title=Proximal methods for the latent group lasso penalty|journal=preprint|year=2012|url=http://arxiv.org/abs/1209.0368}}</ref> Nested group structures are studied in ''hierarchical structure prediction'' and with [[directed acyclic graph]]s.<ref name=nest />
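In one common formulation, the latent group lasso penalty for a collection <math>\mathcal{G}</math> of possibly overlapping groups decomposes <math>w</math> into latent components, each supported on a single group:
:<math>\Omega(w) = \min\left\{\sum_{g\in\mathcal{G}}\|v_g\|_2 \;:\; \sum_{g\in\mathcal{G}} v_g = w,\ \operatorname{supp}(v_g)\subseteq g\right\}.</math>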
 
== See also ==