Proximal gradient methods for learning
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, \quad \text{ where } x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
 
Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application.<ref name=combettes>{{cite journal|last=Combettes|first=Patrick L.|author2=Wajs, Valérie R. |title=Signal Recovery by Proximal Forward-Backward Splitting|journal=Multiscale Model. Simul.|year=2005|volume=4|issue=4|pages=1168–1200|url=http://epubs.siam.org/doi/abs/10.1137/050626090|doi=10.1137/050626090}}</ref><ref name=structSparse>{{cite journal|last=Mosci|first=S.|author2=Rosasco, L. |author3=Matteo, S. |author4=Verri, A. |author5=Villa, S. |title=Solving Structured Sparsity Regularization with Proximal Methods|journal=Machine Learning and Knowledge Discovery in Databases|year=2010|volume=6322|pages=418–433|doi=10.1007/978-3-642-15883-4_27}}</ref> Such customized penalties can help to induce certain structure in problem solutions, such as ''sparsity'' (in the case of [[#Lasso regularization|lasso]]) or ''group structure'' (in the case of [[#Group lasso|group lasso]]).
 
== Relevant background ==
which for <math>\gamma=1</math> implies that <math>x = \operatorname{prox}_{\varphi}(x)+\operatorname{prox}_{\varphi^*}(x)</math>.<ref name=combettes /><ref name=moreau>{{cite journal|last=Moreau|first=J.-J.|title=Fonctions convexes duales et points proximaux dans un espace hilbertien|journal=C. R. Acad. Sci. Paris Ser. A Math.|year=1962|volume=255|pages=2897–2899}}</ref> The Moreau decomposition can be seen as a generalization of the usual orthogonal decomposition of a vector space, analogous to the fact that proximity operators are generalizations of projections.<ref name=combettes />
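As a concrete illustration of this identity (a minimal numerical sketch, not part of the article's development): for <math>\varphi = \lambda\|\cdot\|_1</math>, the proximity operator is componentwise soft thresholding, while the conjugate <math>\varphi^*</math> is the indicator function of the <math>\ell_\infty</math> ball of radius <math>\lambda</math>, whose proximity operator is the projection onto that ball.

<syntaxhighlight lang="python">
import numpy as np

def prox_l1(x, lam):
    # prox of lam * ||.||_1: componentwise soft thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l1_conj(x, lam):
    # The conjugate of lam * ||.||_1 is the indicator of the l-infinity
    # ball of radius lam; its prox is the projection onto that ball.
    return np.clip(x, -lam, lam)

x = np.array([3.0, -0.5, 1.2, -2.7])
lam = 1.0
# Moreau decomposition with gamma = 1: x = prox_phi(x) + prox_phi*(x)
print(np.allclose(x, prox_l1(x, lam) + prox_l1_conj(x, lam)))  # True
</syntaxhighlight>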
 
In certain situations it may be easier to compute the proximity operator for the conjugate <math>\varphi^*</math> than for the function <math>\varphi</math> itself, in which case the Moreau decomposition can be applied. This is the case for [[#Group lasso|group lasso]].
 
== Lasso regularization ==
Consider the [[Regularization (mathematics)|regularized]] [[empirical risk minimization]] problem with square loss and with the [[L1-norm|<math>\ell_1</math> norm]] as the regularization penalty:
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, </math>
where <math>x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math> The <math>\ell_1</math> regularization problem is sometimes referred to as ''lasso'' ([[Lasso (statistics)|least absolute shrinkage and selection operator]]).<ref name=tibshirani /> Such <math>\ell_1</math> regularization problems are interesting because they induce ''sparse'' solutions, that is, solutions <math>w</math> to the minimization problem have relatively few nonzero components. Lasso can be seen as a convex relaxation of the non-convex problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_0, </math>
where <math>\|w\|_0</math> denotes the <math>\ell_0</math> "norm", which is the number of nonzero entries of the vector <math>w</math>. Sparse solutions are of particular interest in learning theory because of the interpretability of the results: a sparse solution can identify a small number of important factors.<ref name=tibshirani>{{cite journal|last=Tibshirani|first=R.|title=Regression shrinkage and selection via the lasso|journal=J. R. Stat. Soc., Ser. B|year=1996|volume=58|issue=1|pages=267–288}}</ref>
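The proximity operator of the <math>\ell_1</math> penalty is the well-known soft-thresholding operator, so the proximal gradient iteration for lasso alternates a gradient step on the squared loss with soft thresholding; this is the iterative soft-thresholding algorithm (ISTA). A minimal NumPy sketch (the function name, step-size choice, and example data are illustrative, not from the article):

<syntaxhighlight lang="python">
import numpy as np

def ista_lasso(X, y, lam, n_iter=1000):
    # Proximal gradient descent (ISTA) for
    #   min_w (1/n) * ||X w - y||_2^2 + lam * ||w||_1
    n, d = X.shape
    # Constant step size 1/L, where L is the Lipschitz constant of the
    # gradient of the smooth squared-loss term.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    step = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)   # gradient step on the smooth part
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft thresholding
    return w

# Example: the recovered w is sparse, with most entries exactly zero.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.standard_normal(50)
print(ista_lasso(X, y, lam=0.1))
</syntaxhighlight>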
=== Group lasso ===
 
Group lasso is a generalization of the [[#Lasso regularization|lasso method]] when features are grouped into disjoint blocks.<ref name=groupLasso>{{cite journal|last=Yuan|first=M.|author2=Lin, Y. |title=Model selection and estimation in regression with grouped variables|journal=J. R. Stat. Soc. B|year=2006|volume=68|issue=1|pages=49–67|doi=10.1111/j.1467-9868.2005.00532.x}}</ref> Suppose the weight vector <math>w</math> is partitioned into blocks <math>\{w_1,\ldots,w_G\}</math>. Here we take as a regularization penalty
 
:<math>R(w) =\sum_{g=1}^G \|w_g\|_2,</math>
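The proximity operator of <math>\lambda R</math> acts blockwise: each block <math>w_g</math> is either zeroed out (when <math>\|w_g\|_2\leq\lambda</math>) or shrunk toward zero by <math>\lambda</math> in its <math>\ell_2</math> norm, a "block soft thresholding" that can be derived via the Moreau decomposition discussed above. A minimal NumPy sketch (the group-index representation is an illustrative choice):

<syntaxhighlight lang="python">
import numpy as np

def prox_group_lasso(w, groups, lam):
    # prox of lam * sum_g ||w_g||_2 over disjoint blocks:
    # blockwise ("vector-valued") soft thresholding.
    out = w.copy()
    for idx in groups:                 # idx: index array of one block
        norm_g = np.linalg.norm(w[idx])
        if norm_g <= lam:
            out[idx] = 0.0             # the whole block is zeroed out
        else:
            out[idx] *= 1.0 - lam / norm_g  # shrink the block toward zero
    return out

w = np.array([0.3, -0.4, 2.0, 1.0])
groups = [np.array([0, 1]), np.array([2, 3])]
# First block (norm 0.5 <= 1) is zeroed; second (norm ~2.24) is shrunk.
print(prox_group_lasso(w, groups, lam=1.0))
</syntaxhighlight>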