Proximal gradient methods for learning

One common example is <math>\ell_1</math>-regularized (lasso) least squares:
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, \quad \text{ where } x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
 
Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties tailored to a specific problem application.<ref name=combettes>{{cite journal|last=Combettes|first=Patrick L.|author2=Wajs, Valérie R. |title=Signal Recovery by Proximal Forward-Backward Splitting|journal=Multiscale Model. Simul.|year=2005|volume=4|issue=4|pages=1168–1200|doi=10.1137/050626090|url=https://semanticscholar.org/paper/56974187b4d9a8757f4d8a6fd6facc8b4ad08240}}</ref><ref name=structSparse>{{cite journal|last=Mosci|first=S.|author2=Rosasco, L. |author3=Santoro, M. |author4=Verri, A. |author5=Villa, S. |title=Solving Structured Sparsity Regularization with Proximal Methods|journal=Machine Learning and Knowledge Discovery in Databases|year=2010|volume=6322|pages=418–433|doi=10.1007/978-3-642-15883-4_27|series=Lecture Notes in Computer Science|isbn=978-3-642-15882-7}}</ref> Such customized penalties can help to induce certain structure in problem solutions, such as ''sparsity'' (in the case of [[Lasso (statistics)|lasso]]) or ''group structure'' (in the case of [[Lasso (statistics)#Group LASSO|group lasso]]).
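
For illustration, a basic proximal gradient (forward–backward, ISTA-style) iteration for the lasso problem above alternates a gradient step on the smooth least-squares term with soft-thresholding, the proximal step for the <math>\ell_1</math> penalty. The sketch below is a minimal example, assuming NumPy; the function names, the fixed step size derived from a Lipschitz constant, and the iteration count are illustrative choices, not part of any particular reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(v, kappa):
    # Proximal operator of kappa * ||.||_1: component-wise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def ista_lasso(X, y, lam, n_iter=500):
    # Proximal gradient for (1/n) * ||X w - y||^2 + lam * ||w||_1.
    n, d = X.shape
    w = np.zeros(d)
    # Fixed step size 1/L, where L is a Lipschitz constant of the gradient
    # of the smooth least-squares term.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    t = 1.0 / L
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)        # gradient step on the smooth term
        w = soft_threshold(w - t * grad, t * lam)   # proximal step on the l1 term
    return w
</syntaxhighlight>

Accelerated and adaptive step-size variants, discussed below, follow the same gradient-then-proximal template.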
 
== Relevant background ==
== Practical considerations ==
 
There have been numerous developments within the past decade in [[convex optimization]] techniques which have influenced the application of proximal gradient methods in statistical learning theory. Here we survey a few important topics which can greatly improve practical algorithmic performance of these methods.<ref name=structSparse /><ref name=bach>{{cite journal|last=Bach|first=F.|author2=Jenatton, R. |author3=Mairal, J. |author4=Obozinski, G. |title=Optimization with sparsity-inducing penalties|journal=Foundations and Trends in Machine Learning|year=2011|volume=4|issue=1|pages=1–106|doi=10.1561/2200000015|arxiv=1108.0775|bibcode=2011arXiv1108.0775B}}</ref>
 
=== Adaptive step size ===
=== Group lasso ===
 
Group lasso is a generalization of the [[Lasso (statistics)|lasso method]] when features are grouped into disjoint blocks.<ref name=groupLasso>{{cite journal|last=Yuan|first=M.|author2=Lin, Y. |title=Model selection and estimation in regression with grouped variables|journal=J. R. Stat. Soc. B|year=2006|volume=68|issue=1|pages=49–67|doi=10.1111/j.1467-9868.2005.00532.x|url=https://semanticscholar.org/paper/d98ef875e2cbde3e2cc8fad521e3cbfe1bddbd69}}</ref> Suppose the features are grouped into blocks <math>\{w_1,\ldots,w_G\}</math>. Here we take as a regularization penalty
 
:<math>R(w) =\sum_{g=1}^G \|w_g\|_2,</math>
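
The proximal operator of this penalty acts on each block separately, shrinking the whole block toward zero and setting it exactly to zero when its norm falls below the threshold (block-wise soft-thresholding). A minimal sketch, assuming NumPy and an illustrative list <code>groups</code> of index arrays describing the disjoint blocks:

<syntaxhighlight lang="python">
import numpy as np

def prox_group_lasso(v, groups, kappa):
    # Proximal operator of kappa * sum_g ||w_g||_2 for disjoint blocks:
    # each block is scaled toward zero, and zeroed out if its norm is below kappa.
    w = np.zeros_like(v)
    for g in groups:
        norm_g = np.linalg.norm(v[g])
        if norm_g > kappa:
            w[g] = (1.0 - kappa / norm_g) * v[g]
    return w
</syntaxhighlight>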
=== Other group structures ===
 
In contrast to the group lasso problem, where features are grouped into disjoint blocks, it may be the case that grouped features are overlapping or have a nested structure. Such generalizations of group lasso have been considered in a variety of contexts.<ref>{{cite journal|last=Chen|first=X.|author2=Lin, Q. |author3=Kim, S. |author4=Carbonell, J.G. |author5=Xing, E.P. |title=Smoothing proximal gradient method for general structured sparse regression|journal=Ann. Appl. Stat.|year=2012|volume=6|issue=2|pages=719–752|doi=10.1214/11-AOAS514|arxiv=1005.4717}}</ref><ref>{{cite journal|last=Mosci|first=S.|author2=Villa, S. |author3=Verri, A. |author4=Rosasco, L. |title=A primal-dual algorithm for group sparse regularization with overlapping groups|journal=NIPS|year=2010|volume=23|pages=2604–2612}}</ref><ref name=nest>{{cite journal|last=Jenatton|first=R. |author2=Audibert, J.-Y. |author3=Bach, F. |title=Structured variable selection with sparsity-inducing norms|journal=J. Mach. Learn. Res.|year=2011|volume=12|pages=2777–2824|bibcode=2009arXiv0904.3523J |arxiv=0904.3523 }}</ref><ref>{{cite journal|last=Zhao|first=P.|author2=Rocha, G. |author3=Yu, B. |title=The composite absolute penalties family for grouped and hierarchical variable selection|journal=Ann. Stat.|year=2009|volume=37|issue=6A|pages=3468–3497|doi=10.1214/07-AOS584|arxiv=0909.0411|bibcode=2009arXiv0909.0411Z}}</ref> For overlapping groups one common approach is known as ''latent group lasso'', which introduces latent variables to account for overlap.<ref>{{cite journal|last=Obozinski|first=G. |author2=Jacob, L. |author3=Vert, J.-P. |title=Group lasso with overlaps: the latent group lasso approach|journal=INRIA Technical Report|year=2011|url=http://hal.inria.fr/inria-00628498/en/|bibcode=2011arXiv1110.0413O |arxiv=1110.0413 }}</ref><ref>{{cite journal|last=Villa|first=S.|author2=Rosasco, L. |author3=Mosci, S. |author4=Verri, A. |title=Proximal methods for the latent group lasso penalty|journal=Preprint|year=2012|arxiv=1209.0368|bibcode=2012arXiv1209.0368V}}</ref> Nested group structures are studied in ''hierarchical structure prediction'' and with [[directed acyclic graph]]s.<ref name=nest />
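
As an illustration of the overlapping case, the latent group lasso penalty can be written (in one standard formulation; the notation below is a sketch rather than a quotation from the cited papers) as a minimum over decompositions of <math>w</math> into latent vectors, each supported on a single group:

:<math>\Omega(w) = \min_{v_1,\ldots,v_G}\ \sum_{g=1}^G \|v_g\|_2 \quad \text{ subject to } \quad \sum_{g=1}^G v_g = w \ \text{ and } \ \operatorname{supp}(v_g)\subseteq g \text{ for each } g.</math>

When the groups are disjoint, this reduces to the group lasso penalty above.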
 
== See also ==