{{technical}}
 
'''Sharpness Aware Minimization''' ('''SAM''') is an [[optimization algorithm]] designed to improve the [[generalization (machine learning)|generalization performance]] of [[machine learning]] models, particularly [[deep neural network]]s. Instead of merely seeking parameters that achieve low training [[loss function|loss]], SAM aims to find parameters that reside in ''neighborhoods'' of uniformly low loss, effectively favoring "flat" minima in the loss landscape over "sharp" ones. The intuition is that models converging to flatter minima are more robust to variations between training and test [[data set|data distributions]], leading to better generalization.<ref name="Foret2021">{{cite conference |last1=Foret |first1=Pierre |last2=Kleiner |first2=Ariel |last3=Mobahi |first3=Hossein |last4=Neyshabur |first4=Behnam |title=Sharpness-Aware Minimization for Efficiently Improving Generalization |book-title=International Conference on Learning Representations (ICLR) 2021 |year=2021 |arxiv=2010.01412 |url=https://openreview.net/forum?id=6Tm1m_rRrwY}}</ref>
 
SAM was introduced by Foret et al. in 2020 in the paper "Sharpness-Aware Minimization for Efficiently Improving Generalization".<ref name="Foret2021"/>
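
In simplified form (omitting minibatching and the more general choice of norm considered in the original paper), SAM replaces the usual objective of minimizing a training loss <math>L(w)</math> over parameters <math>w</math> with a min–max problem over a neighborhood of radius <math>\rho</math>:
<math display="block">\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon).</math>
Because the inner maximization is intractable in general, the worst-case perturbation is approximated by a single step of gradient ascent,
<math display="block">\hat{\epsilon}(w) = \rho \, \frac{\nabla_w L(w)}{\|\nabla_w L(w)\|_2},</math>
and the parameters are then updated using the gradient of the loss evaluated at the perturbed point <math>w + \hat{\epsilon}(w)</math>.<ref name="Foret2021"/>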

Reported benefits of SAM include:
* '''Improved Generalization:''' SAM has been reported to improve generalization performance across a wide range of deep learning models (especially [[Convolutional Neural Network|Convolutional Neural Networks (CNNs)]] and [[Transformer (machine learning model)|Vision Transformers (ViTs)]]) and datasets (e.g., [[ImageNet]], [[CIFAR-10]], and [[CIFAR-100]]).<ref name="Foret2021"/>
* '''State-of-the-Art Results:''' It has helped achieve [[state-of-the-art]] or near state-of-the-art performance on several benchmark image classification tasks.<ref name="Foret2021"/>
* '''Robustness to Label Noise:''' SAM inherently provides robustness to [[Label noise|noisy labels]] in training data, performing comparably to methods specifically designed for this purpose.<ref name="Wen2021Mitigating">{{cite arXiv |last1=Wen |first1=Yulei |last2=Liu |first2=Zhen |last3=Zhang |first3=Zhe |last4=Zhang |first4=Yilong |last5=Wang |first5=Linmi |last6=Zhang |first6=Tiantian |title=Mitigating Memorization in Sample Selection for Learning with Noisy Labels |eprint=2110.08529 |year=2021 |class=cs.LG}}</ref><ref name="Zhuang2022Surrogate">{{cite conference |last1=Zhuang |first1=Juntang |last2=Gong |first2=Ming |last3=Liu |first3=Tong |title=Surrogate Gap Minimization Improves Sharpness-Aware Training |book-title=International Conference on Machine Learning (ICML) 2022 |year=2022 |pages=27098–27115 |publisher=PMLR |url=https://proceedings.mlr.press/v162/zhuang22d.html}}</ref>
* '''[[Out-of-distribution generalization|Out-of-Distribution (OOD) Generalization]]:''' Studies have shown that SAM and its variants can improve a model's ability to generalize to data distributions different from the training distribution.<ref name="Croce2021SAMBayes">{{cite arXiv |last1=Croce |first1=Francesco |last2=Hein |first2=Matthias |title=SAM as an Optimal Relaxation of Bayes |eprint=2110.11214 |year=2021 |class=cs.LG}}</ref><ref name="Kim2022Slicing">{{cite conference |last1=Kim |first1=Daehyeon |last2=Kim |first2=Seungone |last3=Kim |first3=Kwangrok |last4=Kim |first4=Sejun |last5=Kim |first5=Jangho |title=Slicing Aided Hyper-dimensional Inference and Fine-tuning for Improved OOD Generalization |book-title=Conference on Neural Information Processing Systems (NeurIPS) 2022 |year=2022 |url=https://openreview.net/forum?id=fN0K3jtnQG_}}</ref>
* '''Gradual [[Domain adaptation|Domain Adaptation]]:''' SAM has shown benefits in settings where models are adapted incrementally across changing data domains.<ref name="Liu2021Delving">{{cite arXiv |last1=Liu |first1=Sitong |last2=Zhou |first2=Pan |last3=Zhang |first3=Xingchao |last4=Xu |first4=Zhi |last5=Wang |first5=Guang |last6=Zhao |first6=Hao |title=Delving into SAM: An Analytical Study of Sharpness Aware Minimization |eprint=2111.00905 |year=2021 |class=cs.LG}}</ref>
* '''[[Overfitting]] Mitigation:''' It is particularly effective in scenarios where models might overfit due to seeing training examples multiple times.<ref name="Foret2021"/>
Despite its strengths, SAM also has limitations:
* '''Increased Computational Cost:''' The most significant drawback of SAM is its computational overhead: each optimization step requires two forward and backward passes, so training takes roughly twice as long as with a standard optimizer (see the sketch following this list).<ref name="Foret2021"/>
* '''Convergence Guarantees:''' While empirically successful, theoretical understanding of SAM's [[Convergence of an algorithm|convergence properties]] is still evolving. Some works suggest SAM might have limited capability to converge to global minima or precise stationary points with constant step sizes.<ref name="Andriushchenko2022Understanding">{{cite conference |last1=Andriushchenko |first1=Maksym |last2=Flammarion |first2=Nicolas |title=Towards Understanding Sharpness-Aware Minimization |book-title=International Conference on Machine Learning (ICML) 2022 |year=2022 |pages=612–639 |publisher=PMLR |url=https://proceedings.mlr.press/v162/andriushchenko22a.html}}</ref>
* '''Effectiveness of Sharpness Approximation:''' The one-step gradient ascent used to approximate the worst-case perturbation <math>\epsilon</math> might become less accurate as training progresses.<ref name="Kwon2021ASAM">{{cite conference |last1=Kwon |first1=Jungmin |last2=Kim |first2=Jeongseop |last3=Park |first3=Hyunseo |last4=Choi |first4=Il-Chul |title=ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks |book-title=International Conference on Machine Learning (ICML) 2021 |year=2021 |pages=5919–5929 |publisher=PMLR |url=https://proceedings.mlr.press/v139/kwon21a.html}}</ref> Multi-step ascent could be more accurate but would further increase computational costs.
* '''Domain-Specific Efficacy:''' While highly effective in [[computer vision]], its benefits might be less pronounced or require careful tuning in other domains. For instance, some studies found limited or no improvement for [[GPT model|GPT-style language models]] that process each training example only once.<ref name="Chen2023SAMLLM">{{cite arXiv |last1=Chen |first1=Xian |last2=Zhai |first2=Saining |last3=Chan |first3=Crucian |last4=Le |first4=Quoc V. |last5=Houlsby |first5=Graham |title=When is Sharpness-Aware Minimization (SAM) Effective for Large Language Models? |eprint=2308.04932 |year=2023 |class=cs.LG}}</ref>
* '''Potential for Finding "Poor" Flat Minima:''' While the goal is to find generalizing flat minima, some research indicates that in specific settings, sharpness minimization algorithms might converge to flat minima that do not generalize well.<ref name="Liu2023SAMOOD">{{cite conference |last1=Liu |first1=Kai |last2=Li |first2=Yifan |last3=Wang |first3=Hao |last4=Liu |first4=Zhen |last5=Zhao |first5=Jindong |title=When Sharpness-Aware Minimization Meets Data Augmentation: Connect the Dots for OOD Generalization |book-title=International Conference on Learning Representations (ICLR) 2023 |year=2023 |url=https://openreview.net/forum?id=Nc0e196NhF}}</ref>
* '''Hyperparameter Sensitivity:''' SAM introduces new hyperparameters, such as the neighborhood size <math>\rho</math>, which may require careful tuning for optimal performance.<ref name="Foret2021"/>
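
The two-pass structure behind this overhead can be illustrated with a schematic training step. The sketch below is not taken from any of the cited implementations; it assumes a generic [[PyTorch]]-style <code>model</code>, <code>loss_fn</code>, and base optimizer, and shows only the ascent-then-descent pattern, with the perturbation formed from the first gradient and the update taken from the second.

<syntaxhighlight lang="python">
import torch

def sam_training_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """Illustrative SAM step: perturb weights toward higher loss, then descend.

    `model`, `loss_fn`, `x`, `y`, and `base_optimizer` are assumed to be
    ordinary PyTorch objects; this sketch omits many practical details.
    """
    # First forward/backward pass: gradients at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()

    # Scale factor for the ascent step: epsilon = rho * g / ||g||.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    scale = rho / (grad_norm + 1e-12)

    # Move the parameters to the (approximate) worst case inside the rho-ball.
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e = p.grad * scale
            p.add_(e)
            perturbations.append(e)
    model.zero_grad()

    # Second forward/backward pass: gradients at the perturbed weights.
    loss_fn(model(x), y).backward()

    # Undo the perturbation, then apply the base optimizer (e.g. SGD) update
    # using the gradients computed at the perturbed point.
    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
</syntaxhighlight>

In this sketch the perturbation is removed before <code>base_optimizer.step()</code>, so the base optimizer updates the original weights using the gradient taken at the perturbed point; both forward/backward passes are needed for every update, which is the source of the overhead discussed above.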
 
Subsequent research has produced many variants and analyses of SAM, including:
* '''Reducing Computational Overhead:''' Several variants aim to lower SAM's per-step cost:
** '''SAMPa (SAM Parallelized):''' Modifies SAM to allow the two gradient computations to be performed in parallel.<ref name="Dou2022SAMPa">{{cite arXiv |last1=Dou |first1=Yong |last2=Zhou |first2=Cong |last3=Zhao |first3=Peng |last4=Zhang |first4=Tong |title=SAMPa: A Parallelized Version of Sharpness-Aware Minimization |eprint=2202.02081 |year=2022 |class=cs.LG}}</ref>
** '''Sparse SAM (SSAM):''' Applies the adversarial perturbation to only a subset of the model parameters.<ref name="Chen2022SSAM">{{cite arXiv |last1=Chen |first1=Wenlong |last2=Liu |first2=Xiaoyu |last3=Yin |first3=Huan |last4=Yang |first4=Tianlong |title=Sparse SAM: Squeezing Sharpness-aware Minimization into a Single Forward-backward Pass |eprint=2205.13516 |year=2022 |class=cs.LG}}</ref>
** '''Single-Step/Reduced-Step SAM:''' Variants that approximate the sharpness-aware update with fewer computations, sometimes using historical gradient information (e.g., S2-SAM,<ref name="Zhuang2022S2SAM">{{cite arXiv |last1=Zhuang |first1=Juntang |last2=Liu |first2=Tong |last3=Tao |first3=Dacheng |title=S2-SAM: A Single-Step, Zero-Extra-Cost Approach to Sharpness-Aware Training |eprint=2206.08307 |year=2022 |class=cs.LG}}</ref> Momentum-SAM<ref name="He2021MomentumSAM">{{cite arXiv |last1=He |first1=Zequn |last2=Liu |first2=Sitong |last3=Zhang |first3=Xingchao |last4=Zhou |first4=Pan |last5=Zhang |first5=Cong |last6=Xu |first6=Zhi |last7=Zhao |first7=Hao |title=Momentum Sharpness-Aware Minimization |eprint=2110.03265 |year=2021 |class=cs.LG}}</ref>) or applying SAM steps intermittently. Lookahead SAM<ref name="Liu2022LookaheadSAM">{{cite conference |last1=Liu |first1=Sitong |last2=He |first2=Zequn |last3=Zhang |first3=Xingchao |last4=Zhou |first4=Pan |last5=Xu |first5=Zhi |last6=Zhang |first6=Cong |last7=Zhao |first7=Hao |title=Lookahead Sharpness-aware Minimization |book-title=International Conference on Learning Representations (ICLR) 2022 |year=2022 |url=https://openreview.net/forum?id=7s38W2293F}}</ref> also aims to reduce overhead.
* '''Understanding SAM's Behavior:'''
** '''Implicit Bias Studies:''' Research has shown that SAM has an implicit bias towards flatter minima, and even applying SAM for only a few epochs late in training can yield significant generalization benefits.<ref name="Wen2022SAMLandscape">{{cite arXiv |last1=Wen |first1=Yulei |last2=Zhang |first2=Zhe |last3=Liu |first3=Zhen |last4=Li |first4=Yue |last5=Zhang |first5=Tiantian |title=How Does SAM Influence the Loss Landscape? |eprint=2203.08065 |year=2022 |class=cs.LG}}</ref>
** '''Component Analysis:''' Research has investigated which components of the gradient contribute most to SAM's effectiveness in the perturbation step.<ref name="Liu2023FriendlySAM">{{cite conference |last1=Liu |first1=Kai |last2=Wang |first2=Hao |last3=Li |first3=Yifan |last4=Liu |first4=Zhen |last5=Zhang |first5=Runpeng |last6=Zhao |first6=Jindong |title=Friendly Sharpness-Aware Minimization |book-title=International Conference on Learning Representations (ICLR) 2023 |year=2023 |url=https://openreview.net/forum?id=RndGzfJl4y}}</ref>
* '''Performance and Robustness Enhancements:'''
** '''Adaptive SAM (ASAM):''' Introduces adaptive neighborhood sizes, making the method scale-invariant with respect to the parameters.<ref name="Kwon2021ASAM"/>
Open directions for further research include:
* '''Deepening Theoretical Understanding:'''
** Providing tighter generalization bounds that fully explain SAM's empirical success.<ref name="Foret2021"/><ref name="Andriushchenko2022Understanding"/><ref name="Neyshabur2021WhatTransferred">{{cite arXiv |last1=Neyshabur |first1=Behnam |last2=Sedghi |first2=Hanie |last3=Zhang |first3=Chiyuan |title=What is being Transferred in Transfer Learning? |eprint=2008.11687 |year=2020 |class=cs.LG}}</ref>
** Establishing more comprehensive convergence guarantees for SAM and its variants under diverse conditions.<ref name="Andriushchenko2022Understanding"/><ref name="Mi2022ConvergenceSAM">{{cite arXiv |last1=Mi |first1=Guanlong |last2=Lyu |first2=Lijun |last3=Wang |first3=Yuan |last4=Wang |first4=Lili |title=On the Convergence of Sharpness-Aware Minimization: A Trajectory and Landscape Analysis |eprint=2206.03046 |year=2022 |class=cs.LG}}</ref>
** Understanding the interplay between sharpness, flatness, and generalization, and why SAM-found minima often generalize well.<ref name="Liu2023SAMOOD"/><ref name="Jiang2020FantasticMeasures">{{cite conference |last1=Jiang |first1=Yiding |last2=Neyshabur |first2=Behnam |last3=Mobahi |first3=Hossein |last4=Krishnan |first4=Dilip |last5=Bengio |first5=Samy |title=Fantastic Generalization Measures and Where to Find Them |book-title=International Conference on Learning Representations (ICLR) 2020 |year=2020 |url=https://openreview.net/forum?id=SJgMfnR9Y7}}</ref>
* '''Improved Sharpness Approximation:''' Designing more sophisticated and computationally feasible methods to find or approximate the "worst-case" loss in a neighborhood.<ref name="Kwon2021ASAM"/>
* '''Hyperparameter Optimization and Robustness:''' Developing adaptive methods for setting SAM's hyperparameters (like <math>\rho</math>) or reducing its sensitivity to them.<ref name="Kwon2021ASAM"/>