Revision as of 14:25, 5 February 2021 edit DavidAugustat (talk \| contribs) 12 edits →Use: Improved poor writing style ← Previous edit		Revision as of 10:02, 12 March 2021 edit undo Martin.monperrus (talk \| contribs) 437 edits →Limitations of automatic bug-fixing: add structure and headers to better understand the limitations Tag: 2017 wikitext editor Next edit →
Line 59: == Limitations of automatic bug-fixing == ===== Overfitting ===== Automatic bug-fixing techniques that rely on a test suite do not provide patch correctness guarantees, because the test suite is incomplete and does not cover all cases.<ref name="kali" /> A weak test suite may cause generate-and-validate techniques to produce validated but incorrect patches that have negative effects such as eliminating desirable functionalities, causing memory leaks, and introducing security vulnerabilities.<ref name="kali" /> One possible approach is to amplify the failing test suite by automatically generating further test cases that are then labelled as passing or failing. To minimize the human labelling effort, an automatic [[test oracle]] can be trained that gradually learns to automatically classify test cases as passing or failing and only engages the bug-reporting user for uncertain cases.<ref name="learn2fix">{{cite book\|title=Proceedings of the 13th International Conference on Software Testing, Validation and Verification\|series=ICST 2020\|___location=Porto, Portugal\|publisher=IEEE\|doi=10.1109/ICST46399.2020.00036\|last1=Böhme\|first1=Marcel\|last2=Geethal\|first2=Charaka\|last3=Pham\|first3=Van-Thuan\|date=2020\|pages=274–285\|chapter=Human-In-The-Loop Automatic Program Repair\|arxiv=1912.07758\|isbn=978-1-7281-5778-8\|s2cid=209386817}}</ref> Line 64 ⟶ 66: Sometimes, in test-suite based program repair, tools generate patches that pass the test suite, yet are actually incorrect, this is known as the "overfitting" problem.<ref name=overfitting>{{cite book \|last1=Smith \|first1=Edward K. \|last2=Barr \|first2=Earl T. \|last3=Le Goues \|first3=Claire \|last4=Brun \|first4=Yuriy \|date=2015 \|chapter=Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair \|title=Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering \|series=ESEC/FSE 2015 \|___location=New York, New York \|publisher=ACM \|pages=532–543 \|isbn=978-1-4503-3675-8 \|doi=10.1145/2786805.2786825 \|s2cid=6300790 }}</ref> "Overfitting" in this context refers to the fact that the patch overfits to the test inputs. There are different kinds of overfitting:<ref name="Yu2018">{{cite journal\|last1=Yu\|first1=Zhongxing\|last2=Martinez\|first2=Matias\|last3=Danglot\|first3=Benjamin\|last4=Durieux\|first4=Thomas\|last5=Monperrus\|first5=Martin\|title=Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system\|journal=Empirical Software Engineering\|volume=24\|pages=33–67\|year=2018\|issn=1382-3256\|doi=10.1007/s10664-018-9619-4\|arxiv=1810.10614\|bibcode=2018arXiv181010614Y\|s2cid=21659819}}</ref> incomplete fixing means that only some buggy inputs are fixed, regression introduction means some previously working features are broken after the patch (because they were poorly tested). Early prototypes for automatic repair suffered a lot from overfitting: on the Manybugs C benchmark, Qi et al.<ref name="kali" /> reported that 104/110 of plausible GenProg patches were overfitting; on the Defects4J Java benchmark, Martinez et al.<ref name="martinezdefects4j">{{Cite journal\|last1=Martinez\|first1=Matias\|last2=Durieux\|first2=Thomas\|last3=Sommerard\|first3=Romain\|last4=Xuan\|first4=Jifeng\|last5=Monperrus\|first5=Martin\|date=2016-10-25\|title=Automatic repair of real bugs in java: a large-scale experiment on the defects4j dataset\|url=https://hal.archives-ouvertes.fr/hal-01387556/document\|journal=Empirical Software Engineering\|language=en\|volume=22\|issue=4\|pages=1936–1964\|doi=10.1007/s10664-016-9470-4\|issn=1382-3256\|arxiv=1811.02429\|s2cid=24538587}}</ref> reported that 73/84 plausible patches as overfitting. In the context of synthesis-based repair, Le et al.<ref>{{Cite journal\|last1=Le\|first1=Xuan Bach D.\|last2=Thung\|first2=Ferdian\|last3=Lo\|first3=David\|last4=Goues\|first4=Claire Le\|date=2018-03-02\|title=Overfitting in semantics-based automated program repair\|journal=Empirical Software Engineering\|language=en\|volume=23\|issue=5\|pages=3007–3033\|doi=10.1007/s10664-017-9577-2\|s2cid=3635768\|issn=1382-3256\|url=https://ink.library.smu.edu.sg/sis_research/3986}}</ref> obtained more than 80% of overfitting patches. ===== Search Space Explosion ===== Another limitation of generate-and-validate systems is the search space explosion.<ref name=spaceanalysis>{{cite book \|last1=Long \|first1=Fan \|last2=Rinard \|first2=Martin \|date=2016 \|chapter=An Analysis of the Search Spaces for Generate and Validate Patch Generation Systems \|title=Proceedings of the 38th International Conference on Software Engineering \|series=ICSE '16 \|___location=New York, New York \|publisher=ACM \|pages=702–713 \|isbn=978-1-4503-3900-1 \|doi=10.1145/2884781.2884872 \|hdl=1721.1/113656 \|arxiv=1602.05643 \|s2cid=7426809 }}</ref> For a program, there are a large number of statements to change and for each statement there are a large number of possible modifications. State-of-the-art systems address this problem by assuming that a small modification is enough for fixing a bug, resulting in a search space reduction.▼ ▲~~Another~~A limitation of generate-and-validate repair systems is the search space explosion.<ref name=spaceanalysis>{{cite book \|last1=Long \|first1=Fan \|last2=Rinard \|first2=Martin \|date=2016 \|chapter=An Analysis of the Search Spaces for Generate and Validate Patch Generation Systems \|title=Proceedings of the 38th International Conference on Software Engineering \|series=ICSE '16 \|___location=New York, New York \|publisher=ACM \|pages=702–713 \|isbn=978-1-4503-3900-1 \|doi=10.1145/2884781.2884872 \|hdl=1721.1/113656 \|arxiv=1602.05643 \|s2cid=7426809 }}</ref> For a program, there are a large number of statements to change and for each statement there are a large number of possible modifications. State-of-the-art systems address this problem by assuming that a small modification is enough for fixing a bug, resulting in a search space reduction. The limitation of approaches based on symbolic analysis<ref name="semfix" /><ref name="angelix" /> is that real world programs are often converted to intractably large formulas especially for modifying statements with [[side effect (computer science)\|side effects]].

Automatic bug fixing: Difference between revisions