'''Proximal gradient methods for learning''' is an area of research in [[statistical learning theory]] which studies algorithms for a general class of convex [[Regularization_(mathematics)|regularization]] problems where the regularization penalty may not be [[Differentiable function|differentiable]]. One such example is <math>\ell_1</math> regularization (also known as Lasso) of the form
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, \quad \text{ where } x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
 
Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application.<ref name=combettes>{{cite journal|last=Combettes|first=Patrick L.|coauthors=Wajs, Valérie R.|title=Signal Recovery by Proximal Forward-Backward Splitting|journal=Multiscale Model. Simul.|year=2005|volume=4|issue=4|pages=1168-1200|url=http://epubs.siam.org/doi/abs/10.1137/050626090}}</ref><ref name=structSparse>{{cite journal|last=Mosci|first=S.|coauthors=Rosasco, L., Santoro, M., Verri, A., and Villa, S.|title=Solving Structured Sparsity Regularization with Proximal Methods|journal=Machine Learning and Knowledge Discovery in Databases|year=2010|volume=6322|pages=418-433}}</ref> Such customized penalties can help to induce certain structure in problem solutions, such as ''sparsity'' (in the case of [[#Lasso_regularization|lasso]]) or ''group structure'' (in the case of [[#Exploiting_group_structure| group lasso]]).
 
Consider the [[Regularization_(mathematics)|regularized]] [[Empirical_risk_minimization|empirical risk minimization]] problem with square loss and with the [[L1-norm|<math>\ell_1</math> norm]] as the regularization penalty:
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, </math>
where <math>x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math> The <math>\ell_1</math> regularization penalty is sometimes referred to as ''lasso'' ([[Least_squares#Lasso_method|least absolute shrinkage and selection operator]]).<ref name=tibshirani /> Such <math>\ell_1</math> regularization problems are interesting because they induce ''sparse'' solutions, that is, solutions <math>w</math> to the minimization problem have relatively few nonzero components. Lasso can be seen as a convex relaxation of the non-convex problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_0, </math>
where <math>\|w\|_0</math> denotes the <math>\ell_0</math> "norm", which is the number of nonzero entries of the vector <math>w</math>. Sparse solutions are of particular interest in learning theory for interpretability of results: a sparse solution can identify a small number of important factors.<ref name=tibshirani>{{cite journal|last=Tibshirani|first=R.|title=Regression shrinkage and selection via the lasso|journal=J. R. Stat. Soc., Ser. B|year=1996|volume=58|series=1|issue=1|pages=267-288}}</ref>
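
As a concrete illustration of the two objectives, the following sketch evaluates the lasso objective and the <math>\ell_0</math> count it relaxes for a candidate weight vector; the array names <code>X</code>, <code>y</code>, and <code>w</code> and the tolerance used to count nonzeros are illustrative assumptions, not notation taken from the references above.
<syntaxhighlight lang="python">
import numpy as np

def lasso_objective(w, X, y, lam):
    """Evaluate (1/n) * sum_i (y_i - <w, x_i>)^2 + lam * ||w||_1."""
    n = X.shape[0]
    residual = y - X @ w                      # entries y_i - <w, x_i>
    return residual @ residual / n + lam * np.abs(w).sum()

def l0_count(w, tol=1e-12):
    """Number of (numerically) nonzero entries of w, i.e. the l0 'norm'."""
    return int(np.count_nonzero(np.abs(w) > tol))
</syntaxhighlight>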
 
For simplicity, we restrict our attention to the problem where <math>\lambda=1</math>. To solve the problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \|w\|_1, </math>
we split our objective function into two parts: a convex, differentiable term <math>F(w) = \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2</math> and a convex but non-differentiable function <math>R(w) = \|w\|_1</math>. Note that <math>R</math> is not strictly convex.
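
In matrix notation (writing <math>X\in\mathbb{R}^{n\times d}</math> for the matrix whose rows are the <math>x_i</math> and <math>y\in\mathbb{R}^n</math> for the vector of outputs, a convention introduced here only for convenience), the differentiable term is <math>F(w)=\frac{1}{n}\|y-Xw\|_2^2</math>, and its gradient, which the gradient descent step below requires, is
:<math>\nabla F(w) = \frac{2}{n}X^\top(Xw-y).</math>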
 
A minimizer <math>w^*</math> of <math>F + R</math> is characterized by the fixed point equation
:<math>w^* = \operatorname{prox}_{\gamma R}\left(w^*-\gamma\nabla F(w^*)\right).</math>
 
Having computed the form of the proximity operator explicitly, we can define a standard fixed point iteration procedure. Namely, fix some initial <math>w^0\in\mathbb{R}^d</math>, and for <math>k=0,1,2,\ldots</math> define
:<math>w^{k+1} = S_{\gamma}\left(w^k - \gamma \nabla F\left(w^k\right)\right).</math>
Note here the effective trade-off between the empirical error term <math>F(w) </math> and the regularization penalty <math>R(w)</math>. This fixed point method has decoupled the effect of the two different convex functions which comprise the objective function into a gradient descent step (<math> w^k - \gamma \nabla F\left(w^k\right)</math>) and a soft thresholding step (via <math>S_\gamma</math>).
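
A minimal NumPy sketch of this iteration (often called the iterative soft-thresholding algorithm) is given below. It assumes the standard componentwise form of the soft thresholding operator, <math>(S_\gamma(v))_i = \operatorname{sgn}(v_i)\max(|v_i|-\gamma,0)</math>, and a fixed step size <math>\gamma = 1/L</math> with <math>L = \tfrac{2}{n}\|X\|_2^2</math> a Lipschitz constant of <math>\nabla F</math>; the function and variable names are illustrative rather than taken from the references.
<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(v, gamma):
    """Componentwise soft thresholding: sgn(v_i) * max(|v_i| - gamma, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

def ista(X, y, lam=1.0, n_iter=500):
    """Proximal gradient iteration for (1/n) * ||y - X w||^2 + lam * ||w||_1."""
    n, d = X.shape
    L = 2.0 / n * np.linalg.norm(X, 2) ** 2       # Lipschitz constant of grad F
    gamma = 1.0 / L                               # fixed step size
    w = np.zeros(d)                               # initial point w^0
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)        # gradient descent step on F
        w = soft_threshold(w - gamma * grad, gamma * lam)  # soft thresholding step
    return w
</syntaxhighlight>
With <code>lam=1.0</code> this corresponds to the simplified problem with <math>\lambda=1</math> considered above.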
 
[[Elastic_net_regularization | Elastic net regularization]] offers an alternative to pure <math>\ell_1</math> regularization. The problem of lasso (<math>\ell_1</math>) regularization involves the penalty term <math>R(w) = \|w\|_1</math>, which is not strictly convex. Hence, solutions to <math>\min_w F(w) + R(w),</math> where <math>F</math> is some empirical loss function, need not be unique. This is often avoided by the inclusion of an additional strictly convex term, such as a squared <math>\ell_2</math> norm regularization penalty. For example, one can consider the problem
:<math>\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2^2\right), </math>
where <math>x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.</math>
For <math>0<\mu\leq 1</math> the penalty term <math>\lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2^2\right)</math> is now strictly convex, and hence the minimization problem now admits a unique solution. It has been observed that for sufficiently small <math>\mu > 0</math>, the additional penalty term <math>\mu \|w\|_2^2</math> acts as a preconditioner and can substantially improve convergence while not adversely affecting the sparsity of solutions.<ref name=structSparse /><ref name=deMolElasticNet>{{cite journal|last=De Mol|first=C.|coauthors=De Vito, E., and Rosasco, L.|title=Elastic-net regularization in learning theory|journal=J. Complexity|year=2009|volume=25|issue=2|pages=201-230|doi=10.1016/j.jco.2009.01.002}}</ref>
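
Under this penalty the proximity operator still has a simple closed form, namely componentwise soft thresholding followed by a multiplicative shrinkage, <math>\operatorname{prox}_{\gamma R}(v) = \tfrac{1}{1+2\gamma\lambda\mu}S_{\gamma\lambda(1-\mu)}(v)</math>. The sketch below plugs this operator into the same fixed point iteration used for lasso; as before, the step-size choice and all names are illustrative assumptions rather than details taken from the cited references.
<syntaxhighlight lang="python">
import numpy as np

def prox_elastic_net(v, gamma, lam, mu):
    """Proximity operator of gamma*lam*((1-mu)*||w||_1 + mu*||w||_2^2):
    soft thresholding followed by a multiplicative shrinkage."""
    threshold = gamma * lam * (1.0 - mu)
    shrink = 1.0 / (1.0 + 2.0 * gamma * lam * mu)
    return shrink * np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def elastic_net_proximal_gradient(X, y, lam=1.0, mu=0.1, n_iter=500):
    """Proximal gradient iteration for (1/n)||y - Xw||^2 + lam*((1-mu)||w||_1 + mu*||w||_2^2)."""
    n, d = X.shape
    L = 2.0 / n * np.linalg.norm(X, 2) ** 2       # Lipschitz constant of the loss gradient
    gamma = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)
        w = prox_elastic_net(w - gamma * grad, gamma, lam, mu)
    return w
</syntaxhighlight>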