Proximal gradient methods for learning

Content deleted Content added
Citation bot (talk | contribs)
Added doi. | Use this bot. Report bugs. | Suggested by Dominic3203 | Category:Machine learning | #UCB_Category 154/230
m Reverted edit by 113.184.80.74 (talk) to last version by Esg08
Tags: Rollback Mobile edit Mobile web edit
 
(2 intermediate revisions by 2 users not shown)
Line 24:
The general form of Moreau's decomposition states that, for any <math>x\in\mathcal{X}</math> and any <math>\gamma>0</math>,
:<math>x = \operatorname{prox}_{\gamma \varphi}(x) + \gamma\operatorname{prox}_{\varphi^*/\gamma}(x/\gamma),</math>
which for <math>\gamma=1</math> implies that <math>x = \operatorname{prox}_{\varphi}(x)+\operatorname{prox}_{\varphi^*}(x)</math>.<ref name=combettes /><ref name=moreau>{{cite journal|last=Moreau|first=J.-J.|title=Fonctions convexes duales et points proximaux dans un espace hilbertien|journal=Comptes Rendus de l'Académie des Sciences, Série A|year=1962|volume=255|pages=2897–2899|mr=144188|zbl=0118.10502}}</ref> The Moreau decomposition can be seen as a generalization of the usual orthogonal decomposition of a [[vector space]], analogous to the way proximity operators generalize projections.<ref name=combettes />
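
As an illustrative check (a minimal sketch, not drawn from the cited sources), take <math>\varphi = \|\cdot\|_1</math>: its proximity operator is elementwise soft thresholding, its conjugate <math>\varphi^*</math> is the indicator function of the <math>\ell_\infty</math> unit ball, and <math>\operatorname{prox}_{\varphi^*/\gamma}</math> therefore reduces to clipping. The decomposition can then be verified numerically:

<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(x, gamma):
    # Proximity operator of gamma * ||.||_1: elementwise soft thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def project_linf_ball(z):
    # Proximity operator of the conjugate (indicator of the l_inf unit ball),
    # i.e. the projection onto [-1, 1] in each coordinate.
    return np.clip(z, -1.0, 1.0)

rng = np.random.default_rng(0)
x = rng.normal(size=5)
gamma = 0.7

# Moreau's decomposition: x = prox_{gamma*phi}(x) + gamma * prox_{phi*/gamma}(x/gamma)
rhs = soft_threshold(x, gamma) + gamma * project_linf_ball(x / gamma)
print(np.allclose(x, rhs))  # True
</syntaxhighlight>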
 
In certain situations it may be easier to compute the proximity operator for the conjugate <math>\varphi^*</math> than for the function <math>\varphi</math> itself, in which case the Moreau decomposition can be applied. This is the case for [[Lasso (statistics)#Group LASSO|group lasso]].
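
A minimal sketch of this use of the decomposition, assuming non-overlapping groups and the penalty <math>\varphi(w) = \sum_g \|w_g\|_2</math>: the conjugate <math>\varphi^*</math> is the indicator of the set <math>\{v : \|v_g\|_2 \le 1 \text{ for all } g\}</math>, whose proximity operator is a blockwise projection, so Moreau's decomposition yields the group lasso proximity operator (blockwise soft thresholding) without computing it directly:

<syntaxhighlight lang="python">
import numpy as np

def project_group_ball(z, groups):
    # Proximity operator of the conjugate penalty: projection onto
    # {v : ||v_g||_2 <= 1 for every group g}, applied block by block.
    out = z.copy()
    for g in groups:
        norm = np.linalg.norm(z[g])
        if norm > 1.0:
            out[g] = z[g] / norm
    return out

def prox_group_lasso(x, gamma, groups):
    # Moreau's decomposition: prox_{gamma*phi}(x) = x - gamma * prox_{phi*/gamma}(x/gamma).
    return x - gamma * project_group_ball(x / gamma, groups)

groups = [np.array([0, 1, 2]), np.array([3, 4])]
x = np.array([0.3, -0.2, 0.1, 2.0, -1.5])
print(prox_group_lasso(x, gamma=0.5, groups=groups))
# The small first group is set to zero; the second group is shrunk toward zero.
</syntaxhighlight>
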
Note here the effective trade-off between the empirical error term <math>F(w)</math> and the regularization penalty <math>R(w)</math>. This fixed point method decouples the effect of the two convex functions that make up the objective function into a gradient descent step (<math> w^k - \gamma \nabla F\left(w^k\right)</math>) and a soft thresholding step (via <math>S_\gamma</math>).
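
A minimal sketch of this iteration, taking the square loss for <math>F</math> with an <math>\ell_1</math> penalty <math>R(w)=\lambda\|w\|_1</math> (the step-size choice and variable names below are illustrative, not prescribed by the cited sources):

<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(v, gamma):
    # S_gamma: proximity operator of gamma * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

def proximal_gradient(X, y, lam, n_iter=500):
    # Fixed point iteration  w <- S_{gamma*lam}(w - gamma * grad F(w))
    # with F(w) = (1/2n) ||Xw - y||^2 and constant step size gamma = 1/L,
    # where L is the Lipschitz constant of grad F.
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    gamma = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n                        # gradient descent step on F
        w = soft_threshold(w - gamma * grad, gamma * lam)   # soft thresholding step
    return w
</syntaxhighlight>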
 
Convergence of this fixed point scheme is well-studied in the literature<ref name=combettes /><ref name=daubechies /> and is guaranteed under an appropriate choice of step size <math>\gamma</math> and [[loss function]] (such as the square loss taken here). [[Gradient descent#Extensions|Accelerated methods]], introduced by Nesterov in 1983, improve the [[rate of convergence]] under certain regularity assumptions on <math>F</math>.<ref name=nesterov>{{cite journal|last=Nesterov|first=Yurii|title=A method of solving a convex programming problem with convergence rate <math>O(1/k^2)</math>|journal=Soviet Mathematics - Doklady|year=1983|volume=27|issue=2|pages=372–376}}</ref> Such methods have since been studied extensively.<ref>{{cite book|last=Nesterov|first=Yurii|title=Introductory Lectures on Convex Optimization|year=2004|publisher=Kluwer Academic Publishers}}</ref>
For more general learning problems where the proximity operator cannot be computed explicitly for some regularization term <math>R</math>, such fixed point schemes can still be carried out using approximations to both the gradient and the proximity operator.<ref name=bauschke /><ref>{{cite journal|last=Villa|first=S.|author2=Salzo, S. |author3=Baldassarre, L. |author4=Verri, A. |title=Accelerated and inexact forward-backward algorithms|journal=SIAM J. Optim.|year=2013|volume=23|issue=3|pages=1607–1633|doi=10.1137/110844805|citeseerx=10.1.1.416.3633|s2cid=11379846 }}</ref>
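
A sketch of the accelerated variant in the same illustrative square-loss, <math>\ell_1</math>-penalized setting as above: a Nesterov-style extrapolation (momentum) step is added to the plain forward-backward iteration, as in the FISTA algorithm.

<syntaxhighlight lang="python">
import numpy as np

def accelerated_proximal_gradient(X, y, lam, n_iter=500):
    # FISTA-style acceleration: the proximal gradient step is applied at an
    # extrapolated point z, and z is updated with momentum coefficient (t-1)/t_next.
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of grad F
    gamma = 1.0 / L
    w = np.zeros(d)
    z = w.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y) / n
        u = z - gamma * grad
        w_next = np.sign(u) * np.maximum(np.abs(u) - gamma * lam, 0.0)  # soft thresholding
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = w_next + ((t - 1.0) / t_next) * (w_next - w)                # extrapolation step
        w, t = w_next, t_next
    return w
</syntaxhighlight>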