{{technical}}
'''Sharpness Aware Minimization''' ('''SAM''') is an [[optimization algorithm]] used in training [[machine learning]] models. It modifies the training objective so that the optimizer seeks parameters whose entire neighborhood has uniformly low training loss, which is intended to improve [[generalization error|generalization]].<ref name="Foret2021"/>
== Underlying Principle ==
SAM modifies the standard training objective by minimizing a "sharpness-aware" loss. This is formulated as a minimax problem where the inner objective seeks to find the highest loss value in the immediate neighborhood of the current model weights, and the outer objective minimizes this value:<ref name="Foret2021"/>
<math>\min_{w} \max_{\|\epsilon\|_p \le \rho} L_{\text{train}}(w + \epsilon) + \lambda \|w\|_2^2</math>
In this formulation:
* <math>w</math> are the model parameters.
* <math>L_{\text{train}}</math> is the training loss.
* <math>\epsilon</math> is an adversarial perturbation.
* <math>\rho</math> is a [[hyperparameter (machine learning)|hyperparameter]] defining the size of the neighborhood (<math>L_p</math> ball) around <math>w</math>.
* An optional [[Regularization (mathematics)|L2 regularization]] term with weight <math>\lambda</math> can also be included.

In practice, solving the inner maximization problem exactly is often intractable. SAM approximates the solution by performing a single [[gradient ascent]] step to find the adversarial perturbation <math>\epsilon</math>:<ref name="Foret2021"/>
<math>\epsilon(w) = \rho \frac{\nabla L_{\text{train}}(w)}{\|\nabla L_{\text{train}}(w)\|_2}</math>
The optimization process for each training step involves two stages. First, an "ascent step" computes a perturbed set of weights, <math>w_{\text{adv}} = w + \epsilon(w)</math>, by moving towards the direction of the highest local loss. Second, a "descent step" updates the original weights <math>w</math> using the gradient calculated at these perturbed weights, <math>\nabla L_{\text{train}}(w_{\text{adv}})</math>. This update is typically performed using a standard optimizer like [[Stochastic gradient descent|SGD]] or [[Adam (optimization algorithm)|Adam]].<ref name="Foret2021"/>
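The two-stage update can be illustrated with the following schematic [[PyTorch]]-style sketch of a single SAM training step. The code is illustrative rather than a reference implementation; the names <code>sam_step</code>, <code>base_optimizer</code>, and <code>loss_fn</code> are placeholders.
<syntaxhighlight lang="python">
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One sharpness-aware update: ascent to w + epsilon, then descent at w."""
    # Ascent step: gradient at the current weights w.
    loss = loss_fn(model(x), y)
    loss.backward()
    with torch.no_grad():
        params = [p for p in model.parameters() if p.grad is not None]
        # Global gradient norm ||grad L(w)||_2 across all parameters.
        grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in params]))
        eps_list = []
        for p in params:
            eps = rho * p.grad / (grad_norm + 1e-12)  # epsilon(w)
            p.add_(eps)                               # w_adv = w + epsilon
            eps_list.append(eps)
    model.zero_grad()

    # Descent step: gradient evaluated at the perturbed weights w_adv.
    loss_adv = loss_fn(model(x), y)
    loss_adv.backward()
    with torch.no_grad():
        for p, eps in zip(params, eps_list):
            p.sub_(eps)                               # restore the original w
    base_optimizer.step()                             # apply grad(w_adv) to w
    base_optimizer.zero_grad()
    return loss_adv.item()
</syntaxhighlight>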
== Application and Performance ==
SAM has been reported to improve generalization performance across a wide range of deep learning models, especially [[Convolutional neural network|convolutional neural networks (CNNs)]] and [[Vision transformer|Vision Transformers (ViTs)]], and across datasets such as [[ImageNet]], [[CIFAR-10]], and [[CIFAR-100]].<ref name="Foret2021"/> The algorithm has also been found to be effective in training models with [[Label noise|noisy labels]], where it performs comparably to methods designed specifically for this problem.<ref name="Wen2021Mitigating">{{cite arXiv |last1=Wen |first1=Yulei |last2=Liu |first2=Zhen |last3=Zhang |first3=Zhe |last4=Zhang |first4=Yilong |last5=Wang |first5=Linmi |last6=Zhang |first6=Tiantian |title=Mitigating Memorization in Sample Selection for Learning with Noisy Labels |eprint=2110.08529 |year=2021 |class=cs.LG}}</ref><ref name="Zhuang2022Surrogate">{{cite conference |last1=Zhuang |first1=Juntang |last2=Gong |first2=Ming |last3=Liu |first3=Tong |title=Surrogate Gap Minimization Improves Sharpness-Aware Training |book-title=International Conference on Machine Learning (ICML) 2022 |year=2022 |pages=27098–27115 |publisher=PMLR |url=https://proceedings.mlr.press/v162/zhuang22d.html}}</ref> Some studies indicate that SAM and its variants can improve [[Out-of-distribution generalization|out-of-distribution (OOD) generalization]], which is a model's ability to perform well on data from distributions not seen during training.<ref name="Croce2021SAMBayes">{{cite arXiv |last1=Croce |first1=Francesco |last2=Hein |first2=Matthias |title=SAM as an Optimal Relaxation of Bayes |eprint=2110.11214 |year=2021 |class=cs.LG}}</ref><ref name="Kim2022Slicing">{{cite conference |last1=Kim |first1=Daehyeon |last2=Kim |first2=Seungone |last3=Kim |first3=Kwangrok |last4=Kim |first4=Sejun |last5=Kim |first5=Jangho |title=Slicing Aided Hyper-dimensional Inference and Fine-tuning for Improved OOD Generalization |book-title=Conference on Neural Information Processing Systems (NeurIPS) 2022 |year=2022 |url=https://openreview.net/forum?id=fN0K3jtnQG_}}</ref> Other areas where it has been applied include gradual [[___domain adaptation]] and mitigating [[overfitting]] in scenarios with repeated exposure to training examples.<ref name="Liu2021Delving">{{cite arXiv |last1=Liu |first1=Sitong |last2=Zhou |first2=Pan |last3=Zhang |first3=Xingchao |last4=Xu |first4=Zhi |last5=Wang |first5=Guang |last6=Zhao |first6=Hao |title=Delving into SAM: An Analytical Study of Sharpness Aware Minimization |eprint=2111.00905 |year=2021 |class=cs.LG}}</ref><ref name="Foret2021"/>
== Limitations ==
The most significant drawback of SAM is its computational overhead. Since it requires two forward and backward passes per optimization step, it roughly doubles the training time compared to standard optimizers.<ref name="Foret2021"/>
The theoretical [[Convergence of an algorithm|convergence properties]] of SAM are still under investigation. Some research suggests that with a constant step size, SAM may not converge to a stationary point.<ref name="Andriushchenko2022Understanding">{{cite conference |last1=Andriushchenko |first1=Maksym |last2=Flammarion |first2=Nicolas |title=Towards Understanding Sharpness-Aware Minimization |book-title=International Conference on Machine Learning (ICML) 2022 |year=2022 |pages=612–639 |publisher=PMLR |url=https://proceedings.mlr.press/v162/andriushchenko22a.html}}</ref> The accuracy of the single gradient step approximation for finding the worst-case perturbation may also decrease during the training process.<ref name="Kwon2021ASAM">{{cite conference |last1=Kwon |first1=Jungmin |last2=Kim |first2=Jeongseop |last3=Park |first3=Hyunseo |last4=Choi |first4=Il-Chul |title=ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks |book-title=International Conference on Machine Learning (ICML) 2021 |year=2021 |pages=5919–5929 |publisher=PMLR |url=https://proceedings.mlr.press/v139/kwon21a.html}}</ref>
The effectiveness of SAM can also be ___domain-dependent. While it has shown benefits for computer vision tasks, its impact on other areas, such as [[GPT model|GPT-style language models]] where each training example is seen only once, has been reported as limited in some studies.<ref name="Chen2023SAMLLM">{{cite arXiv |last1=Chen |first1=Xian |last2=Zhai |first2=Saining |last3=Chan |first3=Crucian |last4=Le |first4=Quoc V. |last5=Houlsby |first5=Graham |title=When is Sharpness-Aware Minimization (SAM) Effective for Large Language Models? |eprint=2308.04932 |year=2023 |class=cs.LG}}</ref> Furthermore, while SAM seeks flat minima, some research suggests that not all flat minima necessarily lead to good generalization.<ref name="Liu2023SAMOOD">{{cite conference |last1=Liu |first1=Kai |last2=Li |first2=Yifan |last3=Wang |first3=Hao |last4=Liu |first4=Zhen |last5=Zhao |first5=Jindong |title=When Sharpness-Aware Minimization Meets Data Augmentation: Connect the Dots for OOD Generalization |book-title=International Conference on Learning Representations (ICLR) 2023 |year=2023 |url=https://openreview.net/forum?id=Nc0e196NhF}}</ref> The algorithm also introduces the neighborhood size <math>\rho</math> as a new hyperparameter, which requires tuning.<ref name="Foret2021"/>

== Research, Variants, and Enhancements ==
To improve performance and robustness, variants have been developed that adapt the neighborhood size based on model parameter scales (Adaptive SAM or ASAM; see the sketch below)<ref name="Kwon2021ASAM"/> or incorporate information about the curvature of the loss landscape (Curvature Regularized SAM or CR-SAM).<ref name="Kim2022CRSAM">{{cite arXiv |last1=Kim |first1=Minhwan |last2=Lee |first2=Suyeon |last3=Shin |first3=Jonghyun |title=CR-SAM: Curvature Regularized Sharpness-Aware Minimization |eprint=2210.01011 |year=2022 |class=cs.LG}}</ref> Other research explores refining the perturbation step by focusing on specific components of the gradient or combining SAM with techniques like random smoothing.<ref name="Liu2023FriendlySAM">{{cite conference |last1=Liu |first1=Kai |last2=Wang |first2=Hao |last3=Li |first3=Yifan |last4=Liu |first4=Zhen |last5=Zhang |first5=Runpeng |last6=Zhao |first6=Jindong |title=Friendly Sharpness-Aware Minimization |book-title=International Conference on Learning Representations (ICLR) 2023 |year=2023 |url=https://openreview.net/forum?id=RndGzfJl4y}}</ref><ref name="Singh2021RSAM">{{cite arXiv |last1=Singh |first1=Sandeep Kumar |last2=Ahn |first2=Kyungsu |last3=Oh |first3=Songhwai |title=R-SAM: Random Structure-Aware Minimization for Generalization and Robustness |eprint=2110.07486 |year=2021 |class=cs.LG}}</ref>

Other lines of work aim to reduce SAM's computational overhead or to analyze its behavior:
* '''Single-step/reduced-step SAM:''' Variants that approximate the sharpness-aware update with fewer computations, sometimes using historical gradient information (e.g., S2-SAM,<ref name="Zhuang2022S2SAM">{{cite arXiv |last1=Zhuang |first1=Juntang |last2=Liu |first2=Tong |last3=Tao |first3=Dacheng |title=S2-SAM: A Single-Step, Zero-Extra-Cost Approach to Sharpness-Aware Training |eprint=2206.08307 |year=2022 |class=cs.LG}}</ref> Momentum-SAM<ref name="He2021MomentumSAM">{{cite arXiv |last1=He |first1=Zequn |last2=Liu |first2=Sitong |last3=Zhang |first3=Xingchao |last4=Zhou |first4=Pan |last5=Zhang |first5=Cong |last6=Xu |first6=Zhi |last7=Zhao |first7=Hao |title=Momentum Sharpness-Aware Minimization |eprint=2110.03265 |year=2021 |class=cs.LG}}</ref>) or applying SAM steps intermittently. Lookahead SAM<ref name="Liu2022LookaheadSAM">{{cite conference |last1=Liu |first1=Sitong |last2=He |first2=Zequn |last3=Zhang |first3=Xingchao |last4=Zhou |first4=Pan |last5=Xu |first5=Zhi |last6=Zhang |first6=Cong |last7=Zhao |first7=Hao |title=Lookahead Sharpness-aware Minimization |book-title=International Conference on Learning Representations (ICLR) 2022 |year=2022 |url=https://openreview.net/forum?id=7s38W2293F}}</ref> also aims to reduce overhead.
* '''Implicit bias studies:''' Research has shown that SAM has an implicit bias towards flatter minima, and even applying SAM for only a few epochs late in training can yield significant generalization benefits.<ref name="Wen2022SAMLandscape">{{cite arXiv |last1=Wen |first1=Yulei |last2=Zhang |first2=Zhe |last3=Liu |first3=Zhen |last4=Li |first4=Yue |last5=Zhang |first5=Tiantian |title=How Does SAM Influence the Loss Landscape? |eprint=2203.08065 |year=2022 |class=cs.LG}}</ref>
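The adaptive-neighborhood idea behind ASAM-style variants can be sketched as follows. This is a simplified, schematic illustration that rescales the perturbation element-wise by the parameter magnitudes; the published formulation differs in details such as the choice of scaling operator and stabilizing constants.
<syntaxhighlight lang="python">
import numpy as np

def adaptive_perturbation(w, grad, rho=0.5):
    """Schematic, parameter-scale-adaptive ascent step: the neighborhood is
    stretched element-wise by |w| before the perturbation is normalized."""
    t_w = np.abs(w)                       # per-parameter scale
    scaled_grad = t_w * grad              # scaled gradient T_w * grad L(w)
    eps = rho * t_w * scaled_grad / (np.linalg.norm(scaled_grad) + 1e-12)
    return eps                            # used to form w_adv = w + eps
</syntaxhighlight>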
== References ==
{{reflist}}