Symbolic regression: Difference between revisions

Revision as of 03:54, 7 July 2025 by MrsCandyN: →Non-standard methods: Introducing highly cited criticism on the topic by state-of-the-art literature.
Latest revision as of 12:14, 27 August 2025 by Citation bot: Altered pages. Add: article-number, arxiv, bibcode. Removed URL that duplicated identifier. Removed parameters. Formatted dashes. Some additions/deletions were parameter name changes. Suggested by Headbomb. Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox. #UCB_webform_linked 605/967
(One intermediate revision by one other user not shown)
Line 9:
By not requiring ''a priori'' specification of a model, symbolic regression isn't affected by human bias, or unknown gaps in [[domain knowledge]]. It attempts to uncover the intrinsic relationships of the dataset, by letting the patterns in the data itself reveal the appropriate models, rather than imposing a model structure that is deemed mathematically tractable from a human perspective. The [[fitness function]] that drives the evolution of the models takes into account not only [[Residual (numerical analysis)\|error metrics]] (to ensure the models accurately predict the data), but also special complexity measures,<ref name="complexity"/> thus ensuring that the resulting models reveal the data's underlying structure in a way that's understandable from a human perspective. This facilitates reasoning and favors the odds of getting insights about the data-generating system, as well as improving generalisability and extrapolation behaviour by preventing [[overfitting]]. Accuracy and simplicity may be left as two separate objectives of the regression—in which case the optimum solutions form a [[Pareto front]]—or they may be combined into a single objective by means of a model selection principle such as [[minimum description length]]. It has been proven that symbolic regression is an [[NP-hardness\|NP-hard]] problem~~, in the sense that one cannot always find the best possible mathematical expression to fit to a given dataset in [[Polynomial-time\|polynomial time]]~~.<ref>{{cite journal \|last1=Virgolin \|first1=Marco \|last2=Pissis \|first2=Solon P. 
\|journal=Transactions on Machine Learning Research \|date=2022 \|title=Symbolic Regression is NP-hard \|arxiv=2207.01018 \|url=https://openreview.net/forum?id=LTiaPxqe2e }}</ref> Nevertheless, if the sought-for equation is not too complex it is possible to solve the symbolic regression problem exactly by generating every possible function (built from some predefined set of operators) and evaluating them on the dataset in question.<ref>{{cite journal \|last1=Bartlett\|first1=Deaglan\|last2=Desmond\|first2=Harry\|last3=Ferreira\|first3=Pedro\|title=Exhaustive Symbolic Regression\|journal=IEEE Transactions on Evolutionary Computation \|year=2023 \|volume=28 \|issue=4 \|page=1 \|doi=10.1109/TEVC.2023.3280250 \|arxiv=2211.11461\|s2cid=253735380 }}</ref>

== Difference from classical regression ==

Line 47:
Silviu-Marian Udrescu and [[Max Tegmark]] developed the "AI Feynman" algorithm,<ref>{{Cite journal \|last1=Udrescu \|first1=Silviu-Marian \|last2=Tegmark \|first2=Max \|date=2020-04-17 \|title=AI Feynman: A physics-inspired method for symbolic regression \|journal=Science Advances \|language=en \|volume=6 \|issue=16 \|pages=eaay2631 \|doi=10.1126/sciadv.aay2631 \|issn=2375-2548 \|pmc=7159912 \|pmid=32426452\|arxiv=1905.11481 \|bibcode=2020SciA....6.2631U }}</ref><ref>{{cite arXiv \|last1=Udrescu \|first1=Silviu-Marian \|last2=Tan \|first2=Andrew \|last3=Feng \|first3=Jiahai \|last4=Neto \|first4=Orisvaldo \|last5=Wu \|first5=Tailin \|last6=Tegmark \|first6=Max \|date=2020-12-16 \|title=AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity \|class=cs.LG \|eprint=2006.10782 }}</ref> which attempts symbolic regression by training a neural network to represent the mystery function, then runs tests against the neural network to attempt to break up the problem into smaller parts. 
For example, if <math>f(x_1, ..., x_i, x_{i+1}, ..., x_n) = g(x_1,..., x_i) + h(x_{i+1},..., x_n)</math>, tests against the neural network can recognize the separation and proceed to solve for <math>g</math> and <math>h</math> separately and with different variables as inputs. This is an example of [[Divide-and-conquer algorithm\|divide and conquer]], which reduces the size of the problem to be more manageable. AI Feynman also transforms the inputs and outputs of the mystery function in order to produce a new function which can be solved with other techniques, and performs [[dimensional analysis]] to reduce the number of independent variables involved. The algorithm was able to "discover" 100 equations from [[The Feynman Lectures on Physics]], while a leading software using evolutionary algorithms, [[Eureqa]], solved only 71. AI Feynman, in contrast to classic symbolic regression methods, requires a very large dataset in order to first train the neural network and is naturally biased towards equations that are common in elementary physics. Some researchers have pointed out that conventional symbolic regression techniques may struggle to generalize in systems with complex causal dependencies or non-explicit governing equations.<ref>{{cite journal \|last1=Zenil \|first1=Hector \|last2=Kiani \|first2=Narsis A. \|last3=Zea \|first3=Allan A. 
\|last4=Tegnér \|first4=Jesper \|title=Causal deconvolution by algorithmic generative models \|journal=Nature Machine Intelligence \|volume=1 \|issue=1 \|year=2019 \|pages=~~58-66~~58–66 \|doi=10.1038/s42256-018-0005-0 }}</ref> A more general approach is a conceptual framework, based on Algorithmic Information Theory (AIT), for extracting generative rules from complex dynamical systems.<ref>{{cite journal \| last=Zenil \| first=Hector \| title=Algorithmic Information Dynamics \| journal=Scholarpedia \| date=25 July 2020 \| volume=15 \| issue=7 \| doi=10.4249/scholarpedia.53143 \| doi-access=free \| bibcode=2020SchpJ..1553143Z \| hdl=10754/666314 \| hdl-access=free }}</ref> This framework, called Algorithmic Information Dynamics (AID), applies perturbation analysis to quantify the algorithmic complexity of system components and reconstruct phase spaces and causal mechanisms, including for discrete systems such as cellular automata. Unlike traditional symbolic regression, AID enables the inference of generative rules without requiring explicit kinetic equations, offering insights into the causal structure and reprogrammability of complex systems.<ref> {{cite book \| last1=Zenil \| first1=Hector \| last2=Kiani \| first2=Narsis A. 
\| last3=Tegner \| first3=Jesper \| title=Algorithmic Information Dynamics: A Computational Approach to Causality with Applications to Living Systems \| publisher=Cambridge University Press \| year=2023 \| doi=10.1017/9781108596619 \| isbn=978-1-108-59661-9 \| url=https://doi.org/10.1017/9781108596619}}</ref>

== Software ==

Line 53:
=== End-user software ===
* [[QLattice]] is a quantum-inspired simulation and machine learning technology that helps search through an infinite list of potential mathematical models to solve a problem.<ref>{{Cite web\|url=https://docs.abzu.ai\|title=Feyn is a Python module for running the QLattice\|date=June 22, 2022}}</ref><ref name="srfeyn" />
* [https://github.com/hengzhe-zhang/EvolutionaryForest Evolutionary Forest] is a Genetic Programming-based automated feature construction algorithm for symbolic regression.<ref>{{Cite journal \|last1=Zhang \|first1=Hengzhe \|last2=Zhou \|first2=Aimin \|last3=Zhang \|first3=Hu \|date=August 2022 \|title=An Evolutionary Forest for Regression ~~\|url=https://ieeexplore.ieee.org/document/9656554~~ \|journal=IEEE Transactions on Evolutionary Computation \|volume=26 \|issue=4 \|pages=735–749 \|doi=10.1109/TEVC.2021.3136667 \|bibcode=2022ITEC...26..735Z \|issn=1089-778X~~\|url-access=subscription~~ }}</ref><ref>{{Cite journal \|last1=Zhang \|first1=Hengzhe \|last2=Zhou \|first2=Aimin \|last3=Chen \|first3=Qi \|last4=Xue \|first4=Bing \|last5=Zhang \|first5=Mengjie \|date=2023 \|title=SR-Forest: A Genetic Programming based Heterogeneous Ensemble Learning Method ~~\|url=https://ieeexplore.ieee.org/document/10040601~~ \|journal=IEEE Transactions on Evolutionary Computation \|volume=28 \|issue=5 \|pages=1484–1498 \|doi=10.1109/TEVC.2023.3243172 \|issn=1089-778X~~\|url-access=subscription~~ }}</ref>
* [https://github.com/brendenpetersen/deep-symbolic-optimization uDSR] is a deep learning framework for symbolic optimization tasks<ref>{{Cite 
web\|url=https://github.com/brendenpetersen/deep-symbolic-optimization\|title=Deep symbolic optimization\|website=[[GitHub]] \|date=June 22, 2022}}</ref>
* [https://github.com/darioizzo/dcgp/ dCGP], differentiable Cartesian Genetic Programming in python (free, open source) <ref>{{Cite web\|url=https://darioizzo.github.io/dcgp/\|title=Differentiable Cartesian Genetic Programming, v1.6 Documentation\|date=June 10, 2022}}</ref><ref>{{Cite journal\|title=Differentiable genetic programming\|first1=Dario\|last1=Izzo\|first2=Francesco\|last2=Biscani\|first3=Alessio\|last3=Mereta\|journal=Proceedings of the European Conference on Genetic Programming\|year=2016 \|arxiv=1611.04766 }}</ref>

Line 116:
\| pmid = 32426452
\| pmc = 7159912
\| arxiv = 1905.11481
\| bibcode = 2020SciA....6.2631U
}}</ref><ref name="ufo">{{cite journal

Line 124 ⟶ 125:
\| volume = 94
\| year = 2020
\| ~~pages~~article-number = 106417
\| issn = 1568-4946
\| url = https://www.sciencedirect.com/science/article/pii/S1568494620303574

Line 143 ⟶ 144:
\| url = http://symbolicregression.com/sites/SRDocuments/NonlinearityPreprint.pdf
\| doi=10.1109/tevc.2008.926486
\| ~~s2cid~~bibcode = ~~12072764~~2009ITEC...13..333V
\| s2cid = 12072764
}}</ref><ref name="exactlearning">{{cite web
\| title = A Natural Representation of Functions for Exact Learning