{{Short description|Machine learning optimization algorithm}}
{{Multiple issues|
{{technical|date=June 2025}}
{{AI-generated|date=June 2025}}
{{Orphan|date=June 2025}}
}}
 
'''Sharpness Aware Minimization''' ('''SAM''') is an [[optimization algorithm]] used in [[machine learning]] that aims to improve model [[generalization (machine learning)|generalization]]. The method seeks parameters that lie in neighborhoods of the loss landscape where the loss is uniformly low, rather than parameters that merely attain a low loss value at a single point; this is often described as preferring "flat" minima over "sharp" ones. The rationale is that models trained this way are less sensitive to differences between training and test [[data set|data]], which can lead to better performance on unseen data.<ref name="Foret2021">{{cite conference |last1=Foret |first1=Pierre |last2=Kleiner |first2=Ariel |last3=Mobahi |first3=Hossein |last4=Neyshabur |first4=Behnam |year=2021 |title=Sharpness-Aware Minimization for Efficiently Improving Generalization |book-title=International Conference on Learning Representations (ICLR) 2021 |arxiv=2010.01412 |url=https://openreview.net/forum?id=6Tm1mposlrM}}</ref>
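
Formally, the method is often stated as the min-max problem
<math display="block">\min_{w}\; \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon),</math>
where <math>w</math> denotes the model parameters, <math>L</math> the training loss, and <math>\rho</math> the radius of the neighborhood considered; in practice the inner maximization is approximated by a single gradient step, <math>\hat{\epsilon} = \rho\, \nabla L(w) / \|\nabla L(w)\|_2</math>.<ref name="Foret2021"/>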
 
The algorithm was introduced in a 2020 paper by Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur.<ref name="Foret2021"/>

== Applications ==
SAM has been applied in various machine learning contexts, primarily in [[computer vision]]. Research has shown it can improve generalization performance in models such as [[Convolutional Neural Network|Convolutional Neural Networks (CNNs)]] and [[Transformer (machine learning model)|Vision Transformers (ViTs)]] on image datasets including [[ImageNet]], [[CIFAR-10]], and [[CIFAR-100]].<ref name="Foret2021"/>
 
The algorithm has also been found to be effective in training models with [[Label noise|noisy labels]], where it performs comparably to methods designed specifically for this problem.<ref name="Zhuang2022Surrogate">{{cite conference |last1=Zhuang |first1=Juntang |last2=Gong |first2=Ming |last3=Liu |first3=Tong |year=2022 |title=Surrogate Gap Minimization Improves Sharpness-Aware Training |book-title=International Conference on Learning Representations (ICLR) 2022 |url=https://openreview.net/forum?id=edONMAnhLu-}}</ref> Some studies indicate that SAM and its variants can improve [[Out-of-distribution generalization|out-of-distribution (OOD) generalization]], which is a model's ability to perform well on data from distributions not seen during training.<ref name="Croce2021SAMBayes">{{cite journal |last1=Croce |first1=Francesco |last2=Hein |first2=Matthias |title=High-Resolution "Magic"-Field Spectroscopy on Trapped Polyatomic Molecules |journal=Physical Review Letters |arxiv=2110.11214 |year=2021 |volume=127 |issue=17 |page=173602 |doi=10.1103/PhysRevLett.127.173602 |pmid=34739278 |bibcode=2021PhRvL.127q3602P }}</ref><ref name="Kim2022Slicing">{{cite conference |last1=Kim |first1=Daehyeon |last2=Kim |first2=Seungone |last3=Kim |first3=Kwangrok |last4=Kim |first4=Sejun |last5=Kim |first5=Jangho |title=Slicing Aided Hyper-dimensional Inference and Fine-tuning for Improved OOD Generalization |book-title=Conference on Neural Information Processing Systems (NeurIPS) 2022 |year=2022 |url=https://openreview.net/forum?id=fN0K3jtnQG_}}</ref> Other areas where it has been applied include gradual [[___domain adaptation]] and mitigating [[overfitting]] in scenarios with repeated exposure to training examples.<ref name="Liu2021Delving">{{cite arXiv |last1=Liu |first1=Sitong |last2=Zhou |first2=Pan |last3=Zhang |first3=Xingchao |last4=Xu |first4=Zhi |last5=Wang |first5=Guang |last6=Zhao |first6=Hao |title=Delving into SAM: An Analytical Study of Sharpness Aware Minimization |eprint=2111.00905 |year=2021 |class=cs.LG}}</ref><ref name="Foret2021"/>
 
== Limitations ==
A primary limitation of SAM is its computational cost. By requiring two gradient computations (one for the ascent and one for the descent) per optimization step, it approximately doubles the training time compared to standard optimizers.<ref name="Foret2021"/>
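
The doubled cost corresponds to the two gradient evaluations in each update: one at the current parameters to construct the worst-case perturbation, and one at the perturbed parameters to obtain the descent direction. The following minimal sketch illustrates a single SAM step; the function <code>loss_grad</code>, the radius <code>rho</code>, and the learning rate <code>lr</code> are illustrative placeholders rather than part of a reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def sam_step(w, loss_grad, rho=0.05, lr=0.1):
    """One illustrative SAM update for parameters w (a NumPy array)."""
    g = loss_grad(w)                              # first gradient evaluation (ascent direction)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # approximate worst-case perturbation
    g_adv = loss_grad(w + eps)                    # second gradient evaluation at the perturbed point
    return w - lr * g_adv                         # descend from the original parameters

# Example with the quadratic loss L(w) = ||w||^2 / 2, whose gradient is w.
w = np.array([1.0, -2.0])
w = sam_step(w, loss_grad=lambda w: w)
</syntaxhighlight>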
 
The theoretical [[Convergence of an algorithm|convergence properties]] of SAM are still under investigation. Some research suggests that with a constant step size, SAM may not converge to a stationary point.<ref name="Andriushchenko2022Understanding">{{cite conference |last1=Andriushchenko |first1=Maksym |last2=Flammarion |first2=Nicolas |title=Towards Understanding Sharpness-Aware Minimization |book-title=International Conference on Machine Learning (ICML) 2022 |year=2022 |pages=612–639 |publisher=PMLR |url=https://proceedings.mlr.press/v162/andriushchenko22a.html}}</ref> The accuracy of the single gradient step approximation for finding the worst-case perturbation may also decrease during the training process.<ref name="Kwon2021ASAM">{{cite conference |last1=Kwon |first1=Jungmin |last2=Kim |first2=Jeongseop |last3=Park |first3=Hyunseo |last4=Choi |first4=Il-Chul |year=2021 |title=ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks |book-title=International Conference on Machine Learning (ICML) 2021 |publisher=PMLR |pages=5919–5929 |url=https://proceedings.mlr.press/v139/kwon21b.html}}</ref>
 
The effectiveness of SAM can also be ___domain-dependent. While it has shown benefits for computer vision tasks, its impact on other areas, such as [[GPT model|GPT-style language models]] where each training example is seen only once, has been reported as limited in some studies.<ref name="Chen2023SAMLLM">{{cite arXiv |last1=Chen |first1=Xian |last2=Zhai |first2=Saining |last3=Chan |first3=Crucian |last4=Le |first4=Quoc V. |last5=Houlsby |first5=Graham |title=When is Sharpness-Aware Minimization (SAM) Effective for Large Language Models? |eprint=2308.04932 |year=2023 |class=cs.LG}}</ref> Furthermore, while SAM seeks flat minima, some research suggests that not all flat minima necessarily lead to good generalization.<ref name="Liu2023SAMOOD">{{cite conference |last1=Liu |first1=Kai |last2=Li |first2=Yifan |last3=Wang |first3=Hao |last4=Liu |first4=Zhen |last5=Zhao |first5=Jindong |title=When Sharpness-Aware Minimization Meets Data Augmentation: Connect the Dots for OOD Generalization |book-title=International Conference on Learning Representations (ICLR) 2023 |year=2023 |url=https://openreview.net/forum?id=Nc0e196NhF}}</ref> The algorithm also introduces the neighborhood size <math>\rho</math> as a new hyperparameter, which requires tuning.<ref name="Foret2021"/>
 
== Research, variants, and enhancements ==
Active research on SAM focuses on reducing its computational overhead and improving its performance. Several variants have been proposed to make the algorithm more efficient. These include methods that attempt to parallelize the two gradient computations, apply the perturbation to only a subset of parameters, or reduce the number of computation steps required.<ref name="Xie2024SAMPa">{{cite arXiv |last1=Xie |first1=Wanyun |last2=Pethick |first2=Thomas |last3=Cevher |first3=Volkan |title=SAMPa: Sharpness-aware Minimization Parallelized |eprint=2410.10683 |year=2024 |class=cs.LG}}</ref><ref name="Mi2022SSAM">{{cite arXiv |last1=Mi |first1=Peng |last2=Shen |first2=Li |last3=Ren |first3=Tianhe |last4=Zhou |first4=Yiyi |last5=Sun |first5=Xiaoshuai |last6=Ji |first6=Rongrong |last7=Tao |first7=Dacheng |title=Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach |eprint=2210.05177 |year=2022}}</ref> Other approaches use historical gradient information or apply SAM steps intermittently to lower the computational burden.<ref name="Ji2025NeurIPS">{{cite conference |last1=Ji |first1=Jie |last2=Li |first2=Gen |last3=Fu |first3=Jingjing |last4=Afghah |first4=Fatemeh |last5=Guo |first5=Linke |last6=Yuan |first6=Xiaoyong |last7=Ma |first7=Xiaolong |date=2025-06-05 |title=Proceedings of the 38th International Conference on Neural Information Processing Systems |url=https://dl.acm.org/doi/10.5555/3737916.3739321 |publisher=Curran Associates Inc. |publication-place=Red Hook, NY, USA |volume=37 |pages=44269–44290 |isbn=979-8-3313-1438-5 |access-date=2025-06-26}}</ref><ref name="Yu2024Lookahead">{{cite conference |last1=Yu |first1=Runsheng |last2=Zhang |first2=Youzhi |last3=Kwok |first3=James |year=2024 |title=Improving Sharpness-Aware Minimization by Lookahead |book-title=International Conference on Machine Learning (ICML) 2024 |publisher=PMLR |url=https://proceedings.mlr.press/v235/yu24q.html}}</ref>
 
To improve performance and robustness, variants have been developed that adapt the neighborhood size based on model parameter scales (Adaptive SAM or ASAM)<ref name="Kwon2021ASAM"/> or incorporate information about the curvature of the loss landscape (Curvature Regularized SAM or CR-SAM).<ref name="Kim2022CRSAM">{{cite journal |last1=Kim |first1=Minhwan |last2=Lee |first2=Suyeon |last3=Shin |first3=Jonghyun |title=MRChem Multiresolution Analysis Code for Molecular Electronic Structure Calculations: Performance and Scaling Properties |journal=Journal of Chemical Theory and Computation |arxiv=2210.01011 |year=2023 |volume=19 |issue=1 |pages=137–146 |doi=10.1021/acs.jctc.2c00982 |pmid=36410396 |pmc=9835826 }}</ref> Other research explores refining the perturbation step by focusing on specific components of the gradient or combining SAM with techniques like random smoothing.<ref name="Li2024FriendlySAM">{{cite conference |last1=Li |first1=Tao |last2=Zhou |first2=Pan |last3=He |first3=Zhengbao |last4=Cheng |first4=Xinwen |last5=Huang |first5=Xiaolin |year=2024 |title=Friendly Sharpness-Aware Minimization |book-title=2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |publisher=IEEE |pages=5631–5640 |doi=10.1109/CVPR52733.2024.00538 |isbn=979-8-3503-5300-6}}</ref><ref name="Liu2022RSAM">{{cite journal |last1=Liu |first1=Yong |last2=Mai |first2=Siqi |last3=Cheng |first3=Minhao |last4=Chen |first4=Xiangning |last5=Hsieh |first5=Cho-Jui |last6=You |first6=Yang |date=2022-12-06 |title=Random Sharpness-Aware Minimization |url=https://papers.nips.cc/paper_files/paper/2022/hash/9b79416c0dc4b09feaa169ed5cdd63d4-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |volume=35 |pages=24543–24556 |access-date=2025-06-26}}</ref>
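
For instance, ASAM replaces the fixed neighborhood <math>\{\epsilon : \|\epsilon\|_2 \le \rho\}</math> used by SAM with a parameter-scaled neighborhood, roughly of the form <math>\{\epsilon : \|T_w^{-1}\epsilon\|_2 \le \rho\}</math>, where <math>T_w</math> is a normalization operator (for example, element-wise scaling by the magnitude of each weight), so that the resulting sharpness measure is invariant to rescalings of the parameters.<ref name="Kwon2021ASAM"/>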
 
Theoretical work continues to analyze the algorithm's behavior, including its implicit bias towards flatter minima and the development of broader frameworks for sharpness-aware optimization that use different measures of sharpness.<ref name="Wen2022SAMLandscape">{{cite arXiv |last1=Wen |first1=Yulei |last2=Zhang |first2=Zhe |last3=Liu |first3=Zhen |last4=Li |first4=Yue |last5=Zhang |first5=Tiantian |title=How Does SAM Influence the Loss Landscape? |eprint=2203.08065 |year=2022 |class=cs.LG}}</ref><ref name="Zhou2023SAMUnified">{{cite arXiv |last1=Zhou |first1=Kaizheng |last2=Zhang |first2=Yulai |last3=Tao |first3=Dacheng |title=Sharpness-Aware Minimization: A Unified View and A New Theory |eprint=2305.10276 |year=2023 |class=cs.LG}}</ref>
 
== References ==
{{reflist}}
 
 
 
[[Category:Machine learning algorithms]]
[[Category:Optimization algorithms and methods]]