Automatic bug fixing
 
If one explicitly enumerates all possible variants in a repair algorithm, this defines a design space for program repair.<ref name="Martinez2019">{{Cite journal |last=Martinez |first=Matias |last2=Monperrus |first2=Martin |date=2019 |title=Astor: Exploring the design space of generate-and-validate program repair beyond GenProg |journal=Journal of Systems and Software |volume=151 |pages=65–80 |arxiv=1802.03365 |doi=10.1016/j.jss.2019.01.069 |s2cid=3619320}}</ref> Each variation point corresponds to the choice of an algorithm used at some stage of the repair process (e.g., the fault localization algorithm) or of a specific heuristic, and different choices yield different patches. For instance, in the design space of generate-and-validate program repair, one variation point is the granularity of the program elements to be modified: an expression, a statement, a block, etc.<ref name="Martinez2019" />
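As a minimal sketch, a design space can be viewed as the Cartesian product of its variation points. The options below are hypothetical placeholders chosen for illustration, not the actual configuration of any particular repair tool:

<syntaxhighlight lang="python">
# Illustrative sketch: enumerating variation points of a generate-and-validate
# repair loop as a Cartesian product. All option names are placeholders.
from itertools import product

fault_localization = ["ochiai", "tarantula"]               # how suspicious statements are ranked
granularity = ["expression", "statement", "block"]         # program elements to modify
modification = ["replace", "insert", "delete"]             # how an element is changed

design_space = list(product(fault_localization, granularity, modification))
print(len(design_space))  # 2 * 3 * 3 = 18 distinct repair-algorithm variants
</syntaxhighlight>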
 
== Overfitting ==
 
In test-suite based program repair, tools sometimes generate patches that pass the test suite yet are actually incorrect; this is known as the "overfitting" problem.<ref name="overfitting">{{Cite book |last=Smith |first=Edward K. |title=Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering |last2=Barr |first2=Earl T. |last3=Le Goues |first3=Claire |last4=Brun |first4=Yuriy |date=2015 |publisher=ACM |isbn=978-1-4503-3675-8 |series=ESEC/FSE 2015 |___location=New York, New York |pages=532–543 |chapter=Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair |doi=10.1145/2786805.2786825 |s2cid=6300790}}</ref> "Overfitting" in this context means that the patch fits the test inputs rather than the intended behavior. There are different kinds of overfitting:<ref name="Yu2018">{{Cite journal |last=Yu |first=Zhongxing |last2=Martinez |first2=Matias |last3=Danglot |first3=Benjamin |last4=Durieux |first4=Thomas |last5=Monperrus |first5=Martin |year=2018 |title=Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system |journal=Empirical Software Engineering |volume=24 |pages=33–67 |arxiv=1810.10614 |bibcode=2018arXiv181010614Y |doi=10.1007/s10664-018-9619-4 |issn=1382-3256 |s2cid=21659819}}</ref> incomplete fixing, where only some buggy inputs are fixed, and regression introduction, where previously working but poorly tested features are broken by the patch. Early prototypes for automatic repair suffered heavily from overfitting: on the ManyBugs C benchmark, Qi et al.<ref name="kali" /> reported that 104 of 110 plausible GenProg patches were overfitting; on the Defects4J Java benchmark, Martinez et al.<ref name="martinezdefects4j">{{Cite journal |last=Martinez |first=Matias |last2=Durieux |first2=Thomas |last3=Sommerard |first3=Romain |last4=Xuan |first4=Jifeng |last5=Monperrus |first5=Martin |date=2016-10-25 |title=Automatic repair of real bugs in java: a large-scale experiment on the defects4j dataset |url=https://hal.archives-ouvertes.fr/hal-01387556/document |journal=Empirical Software Engineering |language=en |volume=22 |issue=4 |pages=1936–1964 |arxiv=1811.02429 |doi=10.1007/s10664-016-9470-4 |issn=1382-3256 |s2cid=24538587}}</ref> reported 73 of 84 plausible patches as overfitting. In the context of synthesis-based repair, Le et al.<ref>{{Cite journal |last=Le |first=Xuan Bach D. |last2=Thung |first2=Ferdian |last3=Lo |first3=David |last4=Goues |first4=Claire Le |date=2018-03-02 |title=Overfitting in semantics-based automated program repair |url=https://ink.library.smu.edu.sg/sis_research/3986 |journal=Empirical Software Engineering |language=en |volume=23 |issue=5 |pages=3007–3033 |doi=10.1007/s10664-017-9577-2 |issn=1382-3256 |s2cid=3635768}}</ref> found that more than 80% of the generated patches were overfitting.
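The following hypothetical sketch (not taken from the cited studies) shows how a patch can pass a weak test suite while remaining incorrect: the "patch" hardcodes the only failing test input, so it is plausible with respect to the tests but overfits to them.

<syntaxhighlight lang="python">
def abs_value_patched(x):
    # "Patch" that hardcodes the single failing test input instead of fixing the logic
    if x == -3:
        return 3
    return x          # still wrong for every other negative input

# Weak test suite: the patch passes, so it is "plausible", yet it is incorrect
assert abs_value_patched(-3) == 3   # the previously failing test now passes
assert abs_value_patched(5) == 5    # existing passing tests still pass
# No test exercises other negative inputs, so the overfitting goes undetected
</syntaxhighlight>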
 
One way to mitigate overfitting is to filter out incorrect patches after they are generated. This can be done based on dynamic analysis<ref>{{Cite journal|last=Xin|first=Qi|last2=Reiss|first2=Steven P.|date=2017-07-10|title=Identifying test-suite-overfitted patches through test case generation|url=http://dx.doi.org/10.1145/3092703.3092718|journal=Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis|___location=New York, NY, USA|publisher=ACM|doi=10.1145/3092703.3092718|isbn=978-1-4503-5076-1}}</ref> or on static code analysis of the generated patches.<ref>{{Cite journal|last=Ye|first=He|last2=Gu|first2=Jian|last3=Martinez|first3=Matias|last4=Durieux|first4=Thomas|last5=Monperrus|first5=Martin|date=2021|title=Automated Classification of Overfitting Patches with Statically Extracted Code Features|url=https://arxiv.org/abs/1910.12057|journal=IEEE Transactions on Software Engineering|pages=1–1|doi=10.1109/tse.2021.3071750|issn=0098-5589}}</ref>
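Below is a simplified sketch of the dynamic-analysis idea: extra inputs are generated and the candidate patch is compared against an oracle. For illustration it assumes a trusted reference implementation is available; the cited approaches obtain their oracles differently and are considerably more sophisticated.

<syntaxhighlight lang="python">
import random

def reference_abs(x):
    # Trusted oracle, assumed only for this illustration
    return x if x >= 0 else -x

def patched_abs(x):
    # The overfitting patch from the previous sketch
    return 3 if x == -3 else x

def likely_overfitting(candidate, oracle, trials=1000):
    for _ in range(trials):
        x = random.randint(-100, 100)      # automatically generated test input
        if candidate(x) != oracle(x):
            return True                    # behavior diverges outside the test suite
    return False

print(likely_overfitting(patched_abs, reference_abs))  # almost certainly True: patch is filtered out
</syntaxhighlight>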
 
== Limitations of automatic bug-fixing ==
 
Automatic bug-fixing techniques that rely on a test suite do not provide patch correctness guarantees, because the test suite is incomplete and does not cover all cases.<ref name="kali" /> A weak test suite may cause generate-and-validate techniques to produce validated but incorrect patches that have negative effects such as eliminating desirable functionality, causing memory leaks, or introducing security vulnerabilities.<ref name="kali" /> One possible approach is to amplify the failing test suite by automatically generating further test cases that are then labelled as passing or failing. To minimize the human labelling effort, an automatic [[test oracle]] can be trained that gradually learns to classify test cases as passing or failing on its own and engages the bug-reporting user only for uncertain cases.<ref name="learn2fix">{{Cite book |last=Böhme |first=Marcel |title=Proceedings of the 13th International Conference on Software Testing, Validation and Verification |last2=Geethal |first2=Charaka |last3=Pham |first3=Van-Thuan |date=2020 |publisher=IEEE |isbn=978-1-7281-5778-8 |series=ICST 2020 |___location=Porto, Portugal |pages=274–285 |chapter=Human-In-The-Loop Automatic Program Repair |arxiv=1912.07758 |doi=10.1109/ICST46399.2020.00036 |s2cid=209386817}}</ref>
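The sketch below illustrates the general human-in-the-loop idea under simplified assumptions; all names are hypothetical, the oracle is a fixed stand-in rather than a trained model, and the retraining step is omitted, so this is not the algorithm of the cited work.

<syntaxhighlight lang="python">
import random

def generate_input():
    return random.randint(-100, 100)            # hypothetical test input generator

def learned_oracle(x):
    # Stand-in for a trained classifier: returns (predicted_label, confidence).
    # A real oracle would be retrained as new labels arrive; that step is omitted here.
    score = 0.9 if x >= 0 else 0.55
    return score >= 0.5, abs(score - 0.5) * 2

def ask_user(x):
    # Placeholder for engaging the bug-reporting user
    return x >= 0

def amplify_tests(budget=20, threshold=0.5):
    labelled = []
    for _ in range(budget):
        x = generate_input()
        label, confidence = learned_oracle(x)
        if confidence < threshold:
            label = ask_user(x)                  # query the human only for uncertain cases
        labelled.append((x, label))              # labelled inputs extend the test suite
    return labelled

amplify_tests()
</syntaxhighlight>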
 
A limitation of generate-and-validate repair systems is search space explosion.<ref name="spaceanalysis">{{Cite book |last=Long |first=Fan |title=Proceedings of the 38th International Conference on Software Engineering |last2=Rinard |first2=Martin |date=2016 |publisher=ACM |isbn=978-1-4503-3900-1 |series=ICSE '16 |___location=New York, New York |pages=702–713 |chapter=An Analysis of the Search Spaces for Generate and Validate Patch Generation Systems |arxiv=1602.05643 |doi=10.1145/2884781.2884872 |hdl=1721.1/113656 |s2cid=7426809}}</ref> A program contains many statements that could be changed, and each statement admits many possible modifications. State-of-the-art systems address this problem by assuming that a small modification is enough to fix a bug, which substantially reduces the search space.
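As a back-of-the-envelope illustration with made-up numbers, the number of candidate patches grows multiplicatively with the number of edits allowed, which is why restricting repairs to small, often single-edit, modifications keeps the search tractable:

<syntaxhighlight lang="python">
# Hypothetical numbers, for illustration only
n_statements = 500                   # statements flagged as suspicious by fault localization
n_mods_per_statement = 30            # candidate modifications per statement

single_edit = n_statements * n_mods_per_statement   # 15,000 one-edit candidate patches
double_edit = single_edit ** 2                      # 225,000,000 two-edit combinations
print(single_edit, double_edit)
</syntaxhighlight>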