'''Proximal gradient methods for learning''' is an area of research in [[statistical learning theory]] which studies algorithms for a general class of convex [[Regularization_(mathematics)|regularization]] problems where the regularization penalty may not be [[Differentiable function|differentiable]]. One such example is <math>\ell_1</math> regularization (also known as Lasso) of the form
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, \quad \text{ where } x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
 
Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application.<ref name=combettes>{{cite journal|last=Combettes|first=Patrick L.|coauthors=Wajs, Valérie R.|title=Signal Recovery by Proximal Forward-Backward Splitting|journal=Multiscale Model. Simul.|year=2005|volume=4|issue=4|pages=1168-1200|url=http://epubs.siam.org/doi/abs/10.1137/050626090}}</ref><ref name=structSparse>{{cite journal|last=Mosci|first=S.|coauthors=Rosasco, L., Santoro, M., Verri, A., and Villa, S.|title=Solving Structured Sparsity Regularization with Proximal Methods|journal=Machine Learning and Knowledge Discovery in Databases|year=2010|volume=6322|pages=418-433}}</ref> Such customized penalties can help to induce certain structure in problem solutions, such as ''sparsity'' (in the case of [[#Lasso_regularization|lasso]]) or ''group structure'' (in the case of [[#Exploiting_group_structure| group lasso]]).
 
Consider the [[Regularization_(mathematics)|regularized]] [[Empirical_risk_minimization|empirical risk minimization]] problem with square loss and with the [[L1-norm|<math>\ell_1</math> norm]] as the regularization penalty:
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, </math>
where <math>x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math> The <math>\ell_1</math> regularization penalty is sometimes referred to as ''lasso'' ([[Least_squares#Lasso_method|least absolute shrinkage and selection operator]]).<ref name=tibshirani /> Such <math>\ell_1</math> regularization problems are interesting because they induce ''sparse'' solutions, that is, solutions <math>w</math> to the minimization problem have relatively few nonzero components. Lasso can be seen as a convex relaxation of the non-convex problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_0, </math>
where <math>\|w\|_0</math> denotes the <math>\ell_0</math> "norm", which is the number of nonzero entries of the vector <math>w</math>. Sparse solutions are of particular interest in learning theory for interpretability of results: a sparse solution can identify a small number of important factors.<ref name=tibshirani>{{cite journal|last=Tibshirani|first=R.|title=Regression shrinkage and selection via the lasso|journal=J. R. Stat. Soc., Ser. B|year=1996|volume=58|series=1|issue=1|pages=267-288}}</ref>
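
As a concrete illustration of the two objectives, the following sketch evaluates the lasso objective and the <math>\ell_0</math> count it relaxes for a candidate weight vector; the array names <code>X</code>, <code>y</code>, and <code>w</code> and the tolerance used to count nonzeros are illustrative assumptions, not notation taken from the references above.
<syntaxhighlight lang="python">
import numpy as np

def lasso_objective(w, X, y, lam):
    """Evaluate (1/n) * sum_i (y_i - <w, x_i>)^2 + lam * ||w||_1."""
    n = X.shape[0]
    residual = y - X @ w                      # entries y_i - <w, x_i>
    return residual @ residual / n + lam * np.abs(w).sum()

def l0_count(w, tol=1e-12):
    """Number of (numerically) nonzero entries of w, i.e. the l0 'norm'."""
    return int(np.count_nonzero(np.abs(w) > tol))
</syntaxhighlight>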
 
For simplicity, we restrict our attention to the problem where <math>\lambda=1</math>. To solve the problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \|w\|_1, </math>
we split our objective function into two parts: a convex, differentiable term <math>F(w) = \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2</math> and a convex but non-differentiable function <math>R(w) = \|w\|_1</math>. Note that <math>R</math> is not strictly convex.
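
In matrix notation (writing <math>X\in\mathbb{R}^{n\times d}</math> for the matrix whose rows are the <math>x_i</math> and <math>y\in\mathbb{R}^n</math> for the vector of outputs, a convention introduced here only for convenience), the differentiable term is <math>F(w)=\frac{1}{n}\|y-Xw\|_2^2</math>, and its gradient, which the gradient descent step below requires, is
:<math>\nabla F(w) = \frac{2}{n}X^\top(Xw-y).</math>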
 
A minimizer <math>w^*</math> of <math>F + R</math> is characterized by the fixed point equation
:<math>w^* = \operatorname{prox}_{\gamma R}\left(w^*-\gamma\nabla F(w^*)\right).</math>
 
Having computed the form of the proximity operator explicitly, we can define a standard fixed point iteration procedure. Namely, fix some initial <math>w^0\in\mathbb{R}^d</math>, and for <math>k=0,1,2,\ldots</math> define
:<math>w^{k+1} = S_{\gamma}\left(w^k - \gamma \nabla F\left(w^k\right)\right).</math>
Note here the effective trade-off between the empirical error term <math>F(w) </math> and the regularization penalty <math>R(w)</math>. This fixed point method has decoupled the effect of the two different convex functions which comprise the objective function into a gradient descent step (<math> w^k - \gamma \nabla F\left(w^k\right)</math>) and a soft thresholding step (via <math>S_\gamma</math>).
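
A minimal NumPy sketch of this iteration (often called the iterative soft-thresholding algorithm) is given below. It assumes the standard componentwise form of the soft thresholding operator, <math>(S_\gamma(v))_i = \operatorname{sgn}(v_i)\max(|v_i|-\gamma,0)</math>, and a fixed step size <math>\gamma = 1/L</math> with <math>L = \tfrac{2}{n}\|X\|_2^2</math> a Lipschitz constant of <math>\nabla F</math>; the function and variable names are illustrative rather than taken from the references.
<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(v, gamma):
    """Componentwise soft thresholding: sgn(v_i) * max(|v_i| - gamma, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

def ista(X, y, lam=1.0, n_iter=500):
    """Proximal gradient iteration for (1/n) * ||y - X w||^2 + lam * ||w||_1."""
    n, d = X.shape
    L = 2.0 / n * np.linalg.norm(X, 2) ** 2       # Lipschitz constant of grad F
    gamma = 1.0 / L                               # fixed step size
    w = np.zeros(d)                               # initial point w^0
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)        # gradient descent step on F
        w = soft_threshold(w - gamma * grad, gamma * lam)  # soft thresholding step
    return w
</syntaxhighlight>
With <code>lam=1.0</code> this corresponds to the simplified problem with <math>\lambda=1</math> considered above.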
 
[[Elastic_net_regularization | Elastic net regularization]] offers an alternative to pure <math>\ell_1</math> regularization. The problem of lasso (<math>\ell_1</math>) regularization involves the penalty term <math>R(w) = \|w\|_1</math>, which is not strictly convex. Hence, solutions to <math>\min_w F(w) + R(w),</math> where <math>F</math> is some empirical loss function, need not be unique. This is often avoided by the inclusion of an additional strictly convex term, such as a squared <math>\ell_2</math> norm regularization penalty. For example, one can consider the problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2^2\right), </math>
where <math>x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
For <math>0<\mu\leq 1</math> the penalty term <math>\lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2^2\right)</math> is now strictly convex, and hence the minimization problem now admits a unique solution. It has been observed that for sufficiently small <math>\mu > 0</math>, the additional penalty term <math>\mu \|w\|_2^2</math> acts as a preconditioner and can substantially improve convergence while not adversely affecting the sparsity of solutions.<ref name=structSparse /><ref name=deMolElasticNet>{{cite journal|last=De Mol|first=C.|coauthors=De Vito, E., and Rosasco, L.|title=Elastic-net regularization in learning theory|journal=J. Complexity|year=2009|volume=25|issue=2|pages=201-230|doi=10.1016/j.jco.2009.01.002}}</ref>
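
Under this penalty the proximity operator still has a simple closed form, namely componentwise soft thresholding followed by a multiplicative shrinkage, <math>\operatorname{prox}_{\gamma R}(v) = \tfrac{1}{1+2\gamma\lambda\mu}S_{\gamma\lambda(1-\mu)}(v)</math>. The sketch below plugs this operator into the same fixed point iteration used for lasso; as before, the step-size choice and all names are illustrative assumptions rather than details taken from the cited references.
<syntaxhighlight lang="python">
import numpy as np

def prox_elastic_net(v, gamma, lam, mu):
    """Proximity operator of gamma*lam*((1-mu)*||w||_1 + mu*||w||_2^2):
    soft thresholding followed by a multiplicative shrinkage."""
    threshold = gamma * lam * (1.0 - mu)
    shrink = 1.0 / (1.0 + 2.0 * gamma * lam * mu)
    return shrink * np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def elastic_net_proximal_gradient(X, y, lam=1.0, mu=0.1, n_iter=500):
    """Proximal gradient iteration for (1/n)||y - Xw||^2 + lam*((1-mu)||w||_1 + mu*||w||_2^2)."""
    n, d = X.shape
    L = 2.0 / n * np.linalg.norm(X, 2) ** 2       # Lipschitz constant of the loss gradient
    gamma = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)
        w = prox_elastic_net(w - gamma * grad, gamma, lam, mu)
    return w
</syntaxhighlight>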