No particular model is provided as a starting point for symbolic regression. Instead, initial expressions are formed by randomly combining mathematical building blocks such as [[Operation (mathematics)|mathematical operators]], [[analytic function]]s, [[Constant (mathematics)|constants]], and [[state variable]]s. Usually, a subset of these primitives is specified by the person operating the algorithm, but this is not a requirement of the technique. The symbolic regression problem for mathematical functions has been tackled with a variety of methods, most commonly by recombining equations using [[genetic programming]],<ref name="schmidt2009distilling"/> as well as with more recent methods based on [[Bayesian statistics#Outline of Bayesian methods|Bayesian methods]]<ref name="bayesian"/> and [[Artificial neural network|neural networks]].<ref name="aifeynman"/> Another non-classical alternative is the Universal Functions Originator (UFO), which uses a different mechanism, search space, and building strategy.<ref name="ufo"/> Further methods such as Exact Learning attempt to transform the fitting problem into a [[Method of moments (statistics)|moments problem]] in a natural function space, usually built around generalizations of the [[Meijer G-function|Meijer-G function]].<ref name="exactlearning"/>
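
For illustration, the following is a minimal Python sketch of how initial expressions can be formed by randomly combining such building blocks into expression trees. The particular primitive set (a few arithmetic operators, two trigonometric functions, random constants, and a single variable <code>x</code>) is an arbitrary choice for this example, not one prescribed by any of the cited methods:

<syntaxhighlight lang="python">
import math
import random

# Illustrative primitive set: binary operators, unary analytic functions,
# random constants, and one state variable x. (This particular set is an
# assumption made for the example, not taken from any cited method.)
BINARY = [('+', lambda a, b: a + b),
          ('-', lambda a, b: a - b),
          ('*', lambda a, b: a * b)]
UNARY = [('sin', math.sin), ('cos', math.cos)]

def random_expression(depth=3):
    """Randomly combine primitives into an expression tree,
    returned as a (string, callable) pair."""
    if depth == 0 or random.random() < 0.3:
        if random.random() < 0.5:
            c = round(random.uniform(-5, 5), 2)
            return str(c), lambda x, c=c: c      # a random constant
        return 'x', lambda x: x                  # the state variable
    if random.random() < 0.5:
        name, f = random.choice(UNARY)
        s, g = random_expression(depth - 1)
        return f'{name}({s})', lambda x, f=f, g=g: f(g(x))
    name, f = random.choice(BINARY)
    ls, lf = random_expression(depth - 1)
    rs, rf = random_expression(depth - 1)
    return (f'({ls} {name} {rs})',
            lambda x, f=f, lf=lf, rf=rf: f(lf(x), rf(x)))

# Generate a small random initial population, as a genetic
# programming run would before any selection takes place.
for expression, _ in [random_expression() for _ in range(5)]:
    print(expression)
</syntaxhighlight>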
 
By not requiring ''a priori'' specification of a model, symbolic regression is not affected by human bias or by unknown gaps in [[___domain knowledge]]. It attempts to uncover the intrinsic relationships of the dataset by letting the patterns in the data reveal the appropriate models, rather than imposing a model structure that is deemed mathematically tractable from a human perspective. The [[fitness function]] that drives the evolution of the models takes into account not only [[Residual (numerical analysis)|error metrics]] (to ensure the models accurately predict the data), but also special complexity measures,<ref name="complexity"/> thus ensuring that the resulting models reveal the data's underlying structure in a way that is understandable from a human perspective. This facilitates reasoning and improves the odds of gaining insight into the data-generating system, and it improves generalisability and extrapolation behaviour by preventing [[overfitting]]. Accuracy and simplicity may be left as two separate objectives of the regression, in which case the optimal solutions form a [[Pareto front]], or they may be combined into a single objective by means of a model selection principle such as [[minimum description length]].
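
As a rough sketch of how error and complexity can be combined into a single objective, the following Python example penalizes mean squared error with a node-count complexity term. The node counts and the <code>parsimony</code> weight are illustrative assumptions for this example, standing in for the complexity measures and model selection principles discussed above:

<syntaxhighlight lang="python">
def fitness(model, size, xs, ys, parsimony=0.01):
    """Penalized fitness: mean squared error plus a complexity term.
    Here 'size' is a plain node count and 'parsimony' a hand-picked
    weight; both are illustrative stand-ins for the complexity
    measures and model selection principles used in practice."""
    mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + parsimony * size

# Roughly y = 2x, with noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.1, 3.9, 6.2, 7.9]
simple_model = (lambda x: 2.0 * x, 3)                 # "(2 * x)": 3 nodes
bigger_model = (lambda x: 2.0 * x + 0.05 * x * x, 9)  # 9 nodes
print(fitness(*simple_model, xs, ys))  # lower (better) fitness here
print(fitness(*bigger_model, xs, ys))  # larger error and larger penalty
</syntaxhighlight>

Keeping error and complexity as separate objectives instead would yield a set of non-dominated candidates forming the Pareto front mentioned above.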
 
It has been proven that symbolic regression is an [[NP-hardness|NP-hard]] problem, in the sense that one cannot always find the best possible mathematical expression to fit a given dataset in [[Polynomial-time|polynomial time]].<ref>{{Cite journal |last1=Virgolin |first1=Marco |last2=Pissis |first2=Solon P. |date=2022-07-05 |title=Symbolic Regression is NP-hard |arxiv=2207.01018 }}</ref> Nevertheless, if the sought-for equation is not too complex, it is possible to solve the symbolic regression problem exactly by generating every possible function (built from some predefined set of operators) and evaluating each of them on the dataset in question.<ref>{{cite journal |last1=Bartlett |first1=Deaglan |last2=Desmond |first2=Harry |last3=Ferreira |first3=Pedro |title=Exhaustive Symbolic Regression |date=November 2022 |arxiv=2211.11461}}</ref>
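
As a sketch of this exhaustive approach, the following Python example enumerates every expression up to a fixed depth over a tiny predefined primitive set and selects the candidate with the lowest squared error on the dataset. The primitive set and depth bound are illustrative choices, not those of the cited work:

<syntaxhighlight lang="python">
import itertools

OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}   # division omitted to avoid zero division

def expressions(depth):
    """Yield every (string, callable) expression of at most the given
    depth over a tiny predefined primitive set. Equivalent expressions
    are not pruned, so candidates may repeat."""
    yield 'x', lambda x: x
    yield '1', lambda x: 1.0
    if depth == 0:
        return
    subs = list(expressions(depth - 1))
    for (ls, lf), (rs, rf) in itertools.product(subs, subs):
        for name, op in OPS.items():
            yield (f'({ls} {name} {rs})',
                   lambda x, op=op, lf=lf, rf=rf: op(lf(x), rf(x)))

# Dataset generated by y = x*x + 1; depth 2 is enough to recover it.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 5.0, 10.0]
best = min(expressions(2),
           key=lambda cand: sum((cand[1](x) - y) ** 2
                                for x, y in zip(xs, ys)))
print(best[0])
</syntaxhighlight>

Because the number of candidate expressions grows combinatorially with the depth bound and the size of the operator set, such exhaustive search is only feasible when the sought-for equation is simple.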
 
== Difference from classical regression ==