{{technical}}
 
'''Sharpness Aware Minimization''' ('''SAM''') is an [[optimization algorithm]] used in [[machine learning]] that aims to improve model [[generalization (machine learning)|generalization performance]]. The method seeks to find model parameters that are located in regions of the loss landscape with uniformly low [[loss function|loss]] values, rather than parameters that only achieve a minimal loss value at a single point. This approach is described as finding "flat" minima instead of "sharp" ones. The rationale is that models trained this way are less sensitive to variations between training and test [[data set|data distributions]], which can lead to better performance on unseen data.<ref name="Foret2021">{{cite conference |last1=Foret |first1=Pierre |last2=Kleiner |first2=Ariel |last3=Mobahi |first3=Hossein |last4=Neyshabur |first4=Behnam |title=Sharpness-Aware Minimization for Efficiently Improving Generalization |book-title=International Conference on Learning Representations (ICLR) 2021 |year=2021 |arxiv=2010.01412 |url=https://openreview.net/forum?id=6Tm1m_rRrwY}}</ref>
 
The algorithm was introduced in a 2020 paper by Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur.<ref name="Foret2021"/>
 
== Underlying Principle ==
SAM modifies the standard training objective by minimizing a "sharpness-aware" loss. This is formulated as a minimax problem where the inner objective seeks to find the highest loss value in the immediate neighborhood of the current model weights, and the outer objective minimizes this value:<ref name="Foret2021"/>
 
<math>\min_{w} \max_{\|\epsilon\|_p \le \rho} L_{\text{train}}(w + \epsilon) + \lambda \|w\|_2^2</math>
 
In this formulation:
* <math>w</math> represents the model's parameters (weights).
* <math>L_{\text{train}}</math> is the [[loss function|loss]] calculated on the training data.
* <math>\epsilon</math> is a perturbation applied to the weights.
* <math>\rho</math> is a [[hyperparameter (machine learning)|hyperparameter]] that defines the radius of the neighborhood (an <math>L_p</math> ball) within which the highest loss is sought.
* An optional [[Regularization (mathematics)|L2 regularization]] term, scaled by <math>\lambda</math>, can be included.
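The maximized loss in this objective can be decomposed into the ordinary training loss plus a term quantifying how quickly the loss rises around <math>w</math>; this bracketed quantity is the "sharpness" that the method implicitly penalizes:<ref name="Foret2021"/>

<math>\max_{\|\epsilon\|_p \le \rho} L_{\text{train}}(w + \epsilon) = L_{\text{train}}(w) + \underbrace{\left[ \max_{\|\epsilon\|_p \le \rho} L_{\text{train}}(w + \epsilon) - L_{\text{train}}(w) \right]}_{\text{sharpness}}</math>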
 
A direct solution to the inner maximization problem is computationally expensive. SAM approximates it by taking a single [[gradient ascent]] step to find the perturbation <math>\epsilon</math>. This is calculated as:<ref name="Foret2021"/>
 
<math>\epsilon(w) = \rho \frac{\nabla L_{\text{train}}(w)}{\|\nabla L_{\text{train}}(w)\|_2}</math>
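This closed form arises from a first-order [[Taylor series|Taylor approximation]] of the training loss around the current weights: for the common choice <math>p = 2</math>, the perturbation that maximizes the linearized loss within the <math>\rho</math>-ball points in the direction of the gradient,

<math>\underset{\|\epsilon\|_2 \le \rho}{\arg\max} \left( L_{\text{train}}(w) + \epsilon^{\top} \nabla L_{\text{train}}(w) \right) = \rho \, \frac{\nabla L_{\text{train}}(w)}{\|\nabla L_{\text{train}}(w)\|_2}.</math>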
 
The optimization process for each training step involves two stages. First, an "ascent step" computes a perturbed set of weights, <math>w_{\text{adv}} = w + \epsilon(w)</math>, by moving in the direction of the highest local loss. Second, a "descent step" updates the original weights <math>w</math> using the gradient calculated at these perturbed weights, <math>\nabla L_{\text{train}}(w_{\text{adv}})</math>. This update is typically performed using a standard optimizer like [[Stochastic gradient descent|SGD]] or [[Adam (optimization algorithm)|Adam]].<ref name="Foret2021"/> This process encourages the model to converge to regions where the loss remains low even when small perturbations are applied to the weights.
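The two-stage update can be illustrated with a short code sketch. The following is a minimal, illustrative implementation of a single SAM update written with [[PyTorch]]; the function name <code>sam_step</code>, the default value of <code>rho</code>, and the commented usage are assumptions made for the example rather than code from the original paper.

<syntaxhighlight lang="python">
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One sharpness-aware update: ascend to w + epsilon(w), then descend from there."""
    # Ascent step: gradient of the loss at the current weights w.
    loss = loss_fn(model(x), y)
    loss.backward()

    with torch.no_grad():
        # Global L2 norm of the gradient, used to scale the perturbation.
        grad_norm = torch.norm(
            torch.stack([p.grad.norm(p=2) for p in model.parameters() if p.grad is not None])
        )
        eps = []
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)  # epsilon(w) = rho * grad / ||grad||
            p.add_(e)                               # move to w_adv = w + epsilon(w)
            eps.append(e)
    model.zero_grad()

    # Descent step: gradient of the loss at the perturbed weights w_adv.
    loss_adv = loss_fn(model(x), y)
    loss_adv.backward()

    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                           # restore the original weights w
    base_optimizer.step()                           # update w using the gradient taken at w_adv
    base_optimizer.zero_grad()
    return loss_adv.item()

# Illustrative usage with a base optimizer (model and data loader not shown):
# base_opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# for x, y in loader:
#     sam_step(model, torch.nn.functional.cross_entropy, x, y, base_opt, rho=0.05)
</syntaxhighlight>

Each call performs two forward and backward passes, which is the source of the roughly doubled training cost discussed below.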
 
== Application and Performance ==
SAM has been applied in various machine learning settings, primarily in [[computer vision]]. Research has shown that it can improve the generalization performance of models such as [[Convolutional Neural Network|Convolutional Neural Networks (CNNs)]] and [[Transformer (machine learning model)|Vision Transformers (ViTs)]] on image datasets including [[ImageNet]], [[CIFAR-10]], and [[CIFAR-100]], and it has contributed to [[state-of-the-art]] results on several image classification benchmarks.<ref name="Foret2021"/>
 
The algorithm has also been found to be effective in training models with [[Label noise|noisy labels]], where it performs comparably to methods designed specifically for this problem.<ref name="Wen2021Mitigating">{{cite arXiv |last1=Wen |first1=Yulei |last2=Liu |first2=Zhen |last3=Zhang |first3=Zhe |last4=Zhang |first4=Yilong |last5=Wang |first5=Linmi |last6=Zhang |first6=Tiantian |title=Mitigating Memorization in Sample Selection for Learning with Noisy Labels |eprint=2110.08529 |year=2021 |class=cs.LG}}</ref><ref name="Zhuang2022Surrogate">{{cite conference |last1=Zhuang |first1=Juntang |last2=Gong |first2=Ming |last3=Liu |first3=Tong |title=Surrogate Gap Minimization Improves Sharpness-Aware Training |book-title=International Conference on Machine Learning (ICML) 2022 |year=2022 |pages=27098–27115 |publisher=PMLR |url=https://proceedings.mlr.press/v162/zhuang22d.html}}</ref> Some studies indicate that SAM and its variants can improve [[Out-of-distribution generalization|out-of-distribution (OOD) generalization]], which is a model's ability to perform well on data from distributions not seen during training.<ref name="Croce2021SAMBayes">{{cite arXiv |last1=Croce |first1=Francesco |last2=Hein |first2=Matthias |title=SAM as an Optimal Relaxation of Bayes |eprint=2110.11214 |year=2021 |class=cs.LG}}</ref><ref name="Kim2022Slicing">{{cite conference |last1=Kim |first1=Daehyeon |last2=Kim |first2=Seungone |last3=Kim |first3=Kwangrok |last4=Kim |first4=Sejun |last5=Kim |first5=Jangho |title=Slicing Aided Hyper-dimensional Inference and Fine-tuning for Improved OOD Generalization |book-title=Conference on Neural Information Processing Systems (NeurIPS) 2022 |year=2022 |url=https://openreview.net/forum?id=fN0K3jtnQG_}}</ref> Other areas where it has been applied include gradual [[___domain adaptation]] and mitigating [[overfitting]] in scenarios with repeated exposure to training examples.<ref name="Liu2021Delving">{{cite arXiv |last1=Liu |first1=Sitong |last2=Zhou |first2=Pan |last3=Zhang |first3=Xingchao |last4=Xu |first4=Zhi |last5=Wang |first5=Guang |last6=Zhao |first6=Hao |title=Delving into SAM: An Analytical Study of Sharpness Aware Minimization |eprint=2111.00905 |year=2021 |class=cs.LG}}</ref><ref name="Foret2021"/>
 
== Limitations ==
A primary limitation of SAM is its computational cost. By requiring two gradient computations (one for the ascent and one for the descent) per optimization step, it approximately doubles the training time compared to standard optimizers.<ref name="Foret2021"/>
 
The theoretical [[Convergence of an algorithm|convergence properties]] of SAM are still under investigation. Some research suggests that with a constant step size, SAM may not converge to a stationary point.<ref name="Andriushchenko2022Understanding">{{cite conference |last1=Andriushchenko |first1=Maksym |last2=Flammarion |first2=Nicolas |title=Towards Understanding Sharpness-Aware Minimization |book-title=International Conference on Machine Learning (ICML) 2022 |year=2022 |pages=612–639 |publisher=PMLR |url=https://proceedings.mlr.press/v162/andriushchenko22a.html}}</ref> The accuracy of the single gradient step approximation for finding the worst-case perturbation may also decrease during the training process.<ref name="Kwon2021ASAM">{{cite conference |last1=Kwon |first1=Jungmin |last2=Kim |first2=Jeongseop |last3=Park |first3=Hyunseo |last4=Choi |first4=Il-Chul |title=ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks |book-title=International Conference on Machine Learning (ICML) 2021 |year=2021 |pages=5919–5929 |publisher=PMLR |url=https://proceedings.mlr.press/v139/kwon21a.html}}</ref> Performing multiple ascent steps can yield a more accurate approximation, but further increases the computational cost.
 
The effectiveness of SAM can also be ___domain-dependent. While it has shown benefits for computer vision tasks, its impact on other areas, such as [[GPT model|GPT-style language models]] where each training example is seen only once, has been reported as limited in some studies.<ref name="Chen2023SAMLLM">{{cite arXiv |last1=Chen |first1=Xian |last2=Zhai |first2=Saining |last3=Chan |first3=Crucian |last4=Le |first4=Quoc V. |last5=Houlsby |first5=Graham |title=When is Sharpness-Aware Minimization (SAM) Effective for Large Language Models? |eprint=2308.04932 |year=2023 |class=cs.LG}}</ref> Furthermore, while SAM seeks flat minima, some research suggests that not all flat minima necessarily lead to good generalization.<ref name="Liu2023SAMOOD">{{cite conference |last1=Liu |first1=Kai |last2=Li |first2=Yifan |last3=Wang |first3=Hao |last4=Liu |first4=Zhen |last5=Zhao |first5=Jindong |title=When Sharpness-Aware Minimization Meets Data Augmentation: Connect the Dots for OOD Generalization |book-title=International Conference on Learning Representations (ICLR) 2023 |year=2023 |url=https://openreview.net/forum?id=Nc0e196NhF}}</ref> The algorithm also introduces the neighborhood size <math>\rho</math> as a new hyperparameter, which requires tuning.<ref name="Foret2021"/>
== Current Open Problems and Future Directions ==
Several open questions and challenges remain:
* '''Bridging the Efficiency Gap:''' Developing SAM variants that achieve comparable generalization improvements with computational costs close to standard optimizers remains a primary goal.<ref name="Dou2022SAMPa"/><ref name="Chen2022SSAM"/><ref name="Zhuang2022S2SAM"/><ref name="He2021MomentumSAM"/><ref name="Liu2022LookaheadSAM"/>
* '''Deepening Theoretical Understanding:'''
** Providing tighter generalization bounds that fully explain SAM's empirical success.<ref name="Foret2021"/><ref name="Andriushchenko2022Understanding"/><ref name="Neyshabur2021WhatTransferred">{{cite arXiv |last1=Neyshabur |first1=Behnam |last2=Sedghi |first2=Hanie |last3=Zhang |first3=Chiyuan |title=What is being Transferred in Transfer Learning? |eprint=2008.11687 |year=2020 |class=cs.LG}}</ref>
** Establishing more comprehensive convergence guarantees for SAM and its variants under diverse conditions.<ref name="Andriushchenko2022Understanding"/><ref name="Mi2022ConvergenceSAM">{{cite arXiv |last1=Mi |first1=Guanlong |last2=Lyu |first2=Lijun |last3=Wang |first3=Yuan |last4=Wang |first4=Lili |title=On the Convergence of Sharpness-Aware Minimization: A Trajectory and Landscape Analysis |eprint=2206.03046 |year=2022 |class=cs.LG}}</ref>
** Understanding the interplay between sharpness, flatness, and generalization, and why SAM-found minima often generalize well.<ref name="Liu2023SAMOOD"/><ref name="Jiang2020FantasticMeasures">{{cite conference |last1=Jiang |first1=Yiding |last2=Neyshabur |first2=Behnam |last3=Mobahi |first3=Hossein |last4=Krishnan |first4=Dilip |last5=Bengio |first5=Samy |title=Fantastic Generalization Measures and Where to Find Them |book-title=International Conference on Learning Representations (ICLR) 2020 |year=2020 |url=https://openreview.net/forum?id=SJgMfnR9Y7}}</ref>
* '''Improved Sharpness Approximation:''' Designing more sophisticated and computationally feasible methods to find or approximate the "worst-case" loss in a neighborhood.<ref name="Kwon2021ASAM"/>
* '''Hyperparameter Optimization and Robustness:''' Developing adaptive methods for setting SAM's hyperparameters (like <math>\rho</math>) or reducing its sensitivity to them.<ref name="Kwon2021ASAM"/>
* '''Applicability Across Diverse Domains:''' Further exploring and optimizing SAM for a wider range of machine learning tasks and model architectures beyond computer vision, including [[large language model]]s,<ref name="Chen2023SAMLLM"/> [[reinforcement learning]], and [[graph neural network]]s.
* '''Distinguishing Generalizing vs. Non-Generalizing Flat Minima:''' Investigating how SAM navigates the loss landscape to select flat minima that are genuinely good for generalization, and avoiding those that might be flat but still lead to poor out-of-sample performance.<ref name="Liu2023SAMOOD"/><ref name="Jiang2020FantasticMeasures"/>
* '''Interaction with Other Techniques:''' Understanding how SAM interacts with other regularization techniques, [[Data augmentation|data augmentation]] methods, and architectural choices.
 
== Research, Variants, and Enhancements ==
Active research on SAM focuses on reducing its computational overhead and improving its performance. Several variants have been proposed to make the algorithm more efficient. These include methods that attempt to parallelize the two gradient computations, apply the perturbation to only a subset of parameters, or reduce the number of computation steps required.<ref name="Dou2022SAMPa">{{cite arXiv |last1=Dou |first1=Yong |last2=Zhou |first2=Cong |last3=Zhao |first3=Peng |last4=Zhang |first4=Tong |title=SAMPa: A Parallelized Version of Sharpness-Aware Minimization |eprint=2202.02081 |year=2022 |class=cs.LG}}</ref><ref name="Chen2022SSAM">{{cite arXiv |last1=Chen |first1=Wenlong |last2=Liu |first2=Xiaoyu |last3=Yin |first3=Huan |last4=Yang |first4=Tianlong |title=Sparse SAM: Squeezing Sharpness-aware Minimization into a Single Forward-backward Pass |eprint=2205.13516 |year=2022 |class=cs.LG}}</ref><ref name="Zhuang2022S2SAM">{{cite arXiv |last1=Zhuang |first1=Juntang |last2=Liu |first2=Tong |last3=Tao |first3=Dacheng |title=S2-SAM: A Single-Step, Zero-Extra-Cost Approach to Sharpness-Aware Training |eprint=2206.08307 |year=2022 |class=cs.LG}}</ref> Other approaches lower the computational burden by using historical gradient information, applying SAM steps intermittently, or using lookahead-style updates.<ref name="He2021MomentumSAM">{{cite arXiv |last1=He |first1=Zequn |last2=Liu |first2=Sitong |last3=Zhang |first3=Xingchao |last4=Zhou |first4=Pan |last5=Zhang |first5=Cong |last6=Xu |first6=Zhi |last7=Zhao |first7=Hao |title=Momentum Sharpness-Aware Minimization |eprint=2110.03265 |year=2021 |class=cs.LG}}</ref><ref name="Liu2022LookaheadSAM">{{cite conference |last1=Liu |first1=Sitong |last2=He |first2=Zequn |last3=Zhang |first3=Xingchao |last4=Zhou |first4=Pan |last5=Xu |first5=Zhi |last6=Zhang |first6=Cong |last7=Zhao |first7=Hao |title=Lookahead Sharpness-aware Minimization |book-title=International Conference on Learning Representations (ICLR) 2022 |year=2022 |url=https://openreview.net/forum?id=7s38W2293F}}</ref>
 
To improve performance and robustness, variants have been developed that adapt the neighborhood size based on model parameter scales (Adaptive SAM or ASAM)<ref name="Kwon2021ASAM"/> or incorporate information about the curvature of the loss landscape (Curvature Regularized SAM or CR-SAM).<ref name="Kim2022CRSAM">{{cite arXiv |last1=Kim |first1=Minhwan |last2=Lee |first2=Suyeon |last3=Shin |first3=Jonghyun |title=CR-SAM: Curvature Regularized Sharpness-Aware Minimization |eprint=2210.01011 |year=2022 |class=cs.LG}}</ref> Other research explores refining the perturbation step by focusing on specific components of the gradient or combining SAM with techniques like random smoothing.<ref name="Liu2023FriendlySAM">{{cite conference |last1=Liu |first1=Kai |last2=Wang |first2=Hao |last3=Li |first3=Yifan |last4=Liu |first4=Zhen |last5=Zhang |first5=Runpeng |last6=Zhao |first6=Jindong |title=Friendly Sharpness-Aware Minimization |book-title=International Conference on Learning Representations (ICLR) 2023 |year=2023 |url=https://openreview.net/forum?id=RndGzfJl4y}}</ref><ref name="Singh2021RSAM">{{cite arXiv |last1=Singh |first1=Sandeep Kumar |last2=Ahn |first2=Kyungsu |last3=Oh |first3=Songhwai |title=R-SAM: Random Structure-Aware Minimization for Generalization and Robustness |eprint=2110.07486 |year=2021 |class=cs.LG}}</ref>
 
Theoretical work continues to analyze the algorithm's behavior; studies of its implicit bias towards flatter minima have found that even applying SAM for only a few epochs late in training can yield significant generalization benefits.<ref name="Wen2022SAMLandscape">{{cite arXiv |last1=Wen |first1=Yulei |last2=Zhang |first2=Zhe |last3=Liu |first3=Zhen |last4=Li |first4=Yue |last5=Zhang |first5=Tiantian |title=How Does SAM Influence the Loss Landscape? |eprint=2203.08065 |year=2022 |class=cs.LG}}</ref> Broader frameworks for sharpness-aware optimization have also been proposed that use measures of sharpness other than the one used in the original algorithm.<ref name="Zhou2023SAMUnified">{{cite arXiv |last1=Zhou |first1=Kaizheng |last2=Zhang |first2=Yulai |last3=Tao |first3=Dacheng |title=Sharpness-Aware Minimization: A Unified View and A New Theory |eprint=2305.10276 |year=2023 |class=cs.LG}}</ref>
 
== References ==
{{reflist}}
 