{{Short description|Mathematical concept}}
{{Use dmy dates}}
'''Probability distribution fitting''' or simply '''distribution fitting''' is the fitting of a [[probability distribution]] to a series of data concerning the repeated measurement of a variable phenomenon.
The aim of distribution fitting is to [[prediction|predict]] the [[probability]] or to [[forecasting|forecast]] the [[Frequency (statistics)|frequency]] of occurrence of the magnitude of the phenomenon in a certain interval.
There are many probability distributions (see [[list of probability distributions]]) of which some can be fitted more closely to the observed frequency of the data than others, depending on the characteristics of the phenomenon and of the distribution. The distribution giving a close fit is supposed to lead to good predictions.
In distribution fitting, therefore, one needs to select a distribution that suits the data well.
[[File:Normal Distribution PDF.svg|thumb|Different shapes of the symmetrical normal distribution depending on mean ''μ'' and variance ''σ''<sup> 2</sup>]]
The selection of the appropriate distribution depends on the presence or absence of symmetry of the data set with respect to the [[mean]].
''Symmetrical distributions''
When the data are symmetrically distributed around the mean while the frequency of occurrence of data farther away from the mean diminishes, one may for example select the [[normal distribution]], the [[logistic distribution]], or the Student's [[t-distribution]].
''Skew distributions to the right''
[[File:Negative and positive skew diagrams (English).svg|thumb|220px|Skewness to left and right]]
When the larger values tend to be farther away from the mean than the smaller values, one has a skew distribution to the right (i.e. there is positive [[skewness]]). In that case one may for example select the [[lognormal distribution|log-normal distribution]] (i.e. the log values of the data are [[normal distribution|normally distributed]]), the [[loglogistic distribution|log-logistic distribution]] (i.e. the log values of the data follow a [[logistic distribution]]), the [[Gumbel distribution]], the [[exponential distribution]], the [[Pareto distribution]], the [[Weibull distribution]], the [[Burr distribution]], or the [[Fréchet distribution]]. The last four distributions are bounded to the left.
''Skew distributions to the left''
When the smaller values tend to be farther away from the mean than the larger values, one has a skew distribution to the left (i.e. there is negative skewness). In that case one may for example select the ''square-normal distribution'' (i.e. the normal distribution applied to the square of the data values),<ref name="skew">Left (negatively) skewed frequency histograms can be fitted to square Normal or mirrored Gumbel probability functions. On line: [https://www.researchgate.net/publication/338633570_Left_negatively_skewed_frequency_histograms_can_be_fitted_to_square_Normal_or_mirrored_Gumbel_probability_functions]</ref> the inverted (mirrored) Gumbel distribution,<ref name=skew/> the [[Dagum distribution]] (mirrored Burr distribution), or the [[Gompertz distribution]], which is bounded to the left.
== Techniques of fitting ==
The following techniques of distribution fitting exist:<ref>''Frequency and Regression Analysis''. Chapter 6 in: H.P. Ritzema (ed., 1994), ''Drainage Principles and Applications'', Publ. 16, pp. 175–224, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands.</ref>
*''Parametric methods'', by which the [[parameter]]s of the distribution are calculated from the data series.<ref>H. Cramér, ''Mathematical Methods of Statistics'', Princeton University Press, 1946.</ref>
**[[Method of moments (statistics)|Method of moments]]
**[[Maximum spacing estimation]]
**Method of [[L-moment]]s
**[[Maximum likelihood]] method<ref>{{cite journal | last = Aldrich | first = John | title = R. A. Fisher and the making of maximum likelihood 1912–1922 | year = 1997 | journal = Statistical Science | volume = 12 | issue = 3 | pages = 162–176 | doi = 10.1214/ss/1030037906 | mr = 1617519 | ref = citeref Aldrich1997| doi-access = free }}</ref>
::{| class="wikitable"
| bgcolor="white" | ''For example, the parameter <math>\mu</math> (the'' ''[[expected value|expectation]]) can be estimated by the [[Arithmetic mean|mean]] of the data and the parameter <math>\sigma^2</math> (the [[variance]]) can be estimated by the square of the [[standard deviation]] of the data. The mean is found as <math display="inline">m=\sum{X}/n</math>, where <math>X</math> is the data value and <math>n</math> the number of data, while the standard deviation is calculated as <math display="inline">s = \sqrt{\frac{1}{n-1} \sum{(X-m)^2}}</math>. With these parameters many distributions, e.g. the normal distribution, are completely defined.''
|}
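As a sketch of the parametric approach described above, the following pure-Python snippet (illustrative, not part of the cited software; the function name is hypothetical) estimates the two parameters of a normal distribution from a data series using the formulas for <math>m</math> and <math>s</math> given in the box:

```python
import math

def fit_normal(data):
    """Estimate normal-distribution parameters from a data series
    using the sample mean and the sample standard deviation."""
    n = len(data)
    m = sum(data) / n  # sample mean -> estimate of mu
    # sample standard deviation (n - 1 in the denominator) -> estimate of sigma
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return m, s

mu, sigma = fit_normal([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

With these two numbers the fitted normal distribution is completely defined and can be used to compute probabilities for any interval.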
[[File:FitGumbelDistr.tif|thumb|220px|Cumulative Gumbel distribution fitted to maximum one-day October rainfalls in [[Suriname|Surinam]]]]
*''Regression methods'', using a transformation of the [[cumulative distribution function]] so that a linear relation is found between the cumulative probability and the values of the data, which may also need to be transformed, depending on the selected probability distribution. The distribution parameters then follow from the slope and intercept of the [[linear regression]].
::{| class="wikitable"
| bgcolor="white" |For example, the cumulative [[Gumbel distribution]] can be linearized to <math>Y=aX+b</math>, where <math>X</math> is the data variable and <math>Y=-\ln(-\ln P)</math>, with <math>P</math> being the cumulative probability, i.e. the probability that the data value is less than <math>X</math>. Thus, using the [[plotting position]] for <math>P</math>, one finds the parameters <math>a</math> and <math>b</math> from a linear regression of <math>Y</math> on <math>X</math>, and the Gumbel distribution is fully defined.
To fit a symmetrical distribution to data obeying a negatively skewed distribution (i.e. skewed to the left, with [[mean]] < [[mode (statistics)|mode]], and with a right-hand tail that is shorter than the left-hand tail) one could use the squared values of the data to accomplish the fit.
More generally one can raise the data to a power ''p'' in order to fit symmetrical distributions to data obeying a distribution of any skewness, whereby ''p'' < 1 when the skewness is positive and ''p'' > 1 when the skewness is negative. The optimal value of ''p'' is to be found by a [[numerical method]]. The numerical method may consist of assuming a range of ''p'' values, then applying the distribution fitting procedure repeatedly for all the assumed ''p'' values, and finally selecting the value of ''p'' for which the sum of squares of deviations of calculated probabilities from measured frequencies ([[Chi-squared test|chi squared]]) is a minimum.
The generalization enhances the flexibility of probability distributions and increases their applicability in distribution fitting.<ref name="gen"/>
The versatility of generalization makes it possible, for example, to fit approximately normally distributed data sets to a large number of different probability distributions,<ref>Example of an approximately normally distributed data set to which a large number of different probability distributions can be fitted. [https://www.waterlog.info/pdf/Multiple%20fit.pdf]</ref> while negatively skewed distributions can be fitted to square normal and mirrored Gumbel distributions.<ref>Left (negatively) skewed frequency histograms can be fitted to square normal or mirrored Gumbel probability functions. [https://www.waterlog.info/pdf/LeftSkew.pdf]</ref>
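The numerical search over ''p'' values described above can be sketched as follows (a minimal pure-Python illustration; the function names, the sample data, and the bin boundaries are invented for the example, and a normal distribution is used as the symmetrical target):

```python
import math

def normal_cdf(x, m, s):
    """Cumulative probability of a normal distribution with mean m and std s."""
    return 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2))))

def chi_squared_for_power(data, p, bins):
    """Raise the (positive) data to power p, fit a normal distribution to
    the transformed values, and return the sum of squared deviations of
    calculated probabilities from observed bin frequencies."""
    t = [x ** p for x in data]
    n = len(t)
    m = sum(t) / n
    s = math.sqrt(sum((v - m) ** 2 for v in t) / (n - 1))
    total = 0.0
    for lo, hi in bins:
        observed = sum(lo <= x < hi for x in data) / n
        calculated = normal_cdf(hi ** p, m, s) - normal_cdf(lo ** p, m, s)
        total += (observed - calculated) ** 2
    return total

# Assume a range of p values and keep the one with the smallest deviation.
data = [0.5, 0.8, 1.1, 1.3, 1.6, 2.0, 2.9, 4.5]  # positively skewed sample
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 5.0)]
best_p = min((0.2, 0.4, 0.6, 0.8, 1.0),
             key=lambda p: chi_squared_for_power(data, p, bins))
```

A finer grid of candidate ''p'' values, or a one-dimensional minimizer, would refine the result; the loop above only illustrates the repeated-fitting idea.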
== Inversion of skewness ==
== Shifting of distributions ==
Some probability distributions, like the [[exponential distribution|exponential]], do not support negative data values (''X''). Yet, when negative data are present, such a distribution can still be used by shifting the data: replacing ''X'' with ''Y'' = ''X'' − ''X<sub>min</sub>'', where ''X<sub>min</sub>'' is the minimum value of ''X'', makes all values non-negative, and the shift is reversed when making predictions.
The technique of distribution shifting augments the chance to find a properly fitting probability distribution.
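A minimal sketch of distribution shifting, assuming an exponential distribution as the target (the function name is hypothetical, and the simple moment estimate of the rate after shifting is used only for illustration):

```python
import math

def fit_shifted_exponential(data):
    """Shift the data so the smallest value maps to zero, fit an
    exponential distribution to the shifted values, and return a CDF
    that applies the shift in reverse when evaluating probabilities."""
    x_min = min(data)
    shifted = [x - x_min for x in data]          # all values now >= 0
    rate = 1.0 / (sum(shifted) / len(shifted))   # rate = 1 / mean of shifted data

    def cdf(x):
        """P(X <= x) under the shifted exponential fit."""
        if x < x_min:
            return 0.0
        return 1.0 - math.exp(-rate * (x - x_min))

    return x_min, rate, cdf

x_min, rate, cdf = fit_shifted_exponential([-2.0, -1.0, 0.0, 2.0, 6.0])
```

The fitted distribution assigns zero probability below ''X''<sub>min</sub>; whether that boundary behaviour is acceptable depends on the phenomenon studied.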
==Composite distributions==
[[File:SampleFreqCurves.tif|thumb|Variations of nine ''[[return period]]'' curves of 50-year samples from a theoretical 1000 year record (base line), data from Benson <ref>Benson, M.A. 1960. Characteristics of frequency curves based on a theoretical 1000 year record. In: T.Dalrymple (Ed.), Flood frequency analysis. U.S. Geological Survey Water Supply Paper, 1543-A, pp. 51-71.</ref>]]
[[File:SanLor.jpg|thumb|left|Composite (discontinuous) distribution with confidence belt<ref>[https://www.waterlog.info/composite.htm Intro to composite probability distributions]</ref> ]]
The option exists to use two different probability distributions, one for the lower data range and one for the higher, separated by a break-point, as for example in the [[Laplace distribution]]. The use of such composite (discontinuous) probability distributions can be opportune when the data of the phenomenon studied were obtained under two different sets of conditions.<ref name=gen/>
== Uncertainty of prediction ==
[[File:BinomialConfBelts.jpg|thumb|<small>Uncertainty analysis with confidence belts using the binomial distribution</small>]]
Predictions of occurrence based on fitted probability distributions are subject to [[uncertainty]], which arises from the following conditions:
* The occurrence of events in another situation or in the future may deviate from the fitted distribution as this occurrence can also be subject to random error
* A change of environmental conditions may cause a change in the probability of occurrence of the phenomenon
An estimate of the uncertainty in the first and second case can be obtained with the [[Binomial distribution|binomial probability distribution]] using for example the probability of exceedance ''Pe'' (i.e. the chance that the event ''X'' is larger than a reference value ''Xr'' of ''X'') and the probability of non-exceedance ''Pn'' (i.e. the chance that the event ''X'' is smaller than or equal to the reference value ''Xr'', this is also called [[cumulative probability]]). In this case there are only two possibilities: either there is exceedance or there is non-exceedance. This duality is the reason that the binomial distribution is applicable.
With the binomial distribution one can obtain a [[prediction interval]] for the number of exceedances in a series of future events.
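The binomial reasoning above can be made concrete with a short sketch (illustrative values; the function name is invented). Given an exceedance probability ''Pe'', the probability of observing ''k'' exceedances of the reference value ''Xr'' in ''n'' independent future events follows directly from the binomial distribution:

```python
from math import comb

def exceedance_counts_pmf(n, pe):
    """Binomial probabilities of observing k = 0..n exceedances of a
    reference value in n independent future events, given exceedance
    chance pe per event (non-exceedance chance is 1 - pe)."""
    return [comb(n, k) * pe ** k * (1 - pe) ** (n - k) for k in range(n + 1)]

pmf = exceedance_counts_pmf(10, 0.1)  # e.g. a 10-event horizon with Pe = 0.1
p_at_least_one = 1 - pmf[0]           # chance of at least one exceedance
```

From such probabilities one can read off, for example, how likely it is that a design value is exceeded at least once within a planning horizon.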
=== [[Variance]] of [[Bayesian inference|Bayesian]] fitted probability functions ===
A Bayesian approach can be used for fitting a model <math>P(x|\theta)</math> having a prior distribution <math>P(\theta)</math> for the parameter <math>\theta</math>. When one has samples <math>X</math> that are independently drawn from the underlying distribution, one can derive the so-called posterior distribution <math>P(\theta|X)</math>. This posterior can be used to update the probability mass function for a new sample <math>x</math> given the observations <math>X</math>; one obtains
<math display="block">P_\theta (x | X) := \int d\theta\ P(x|\theta)\ P(\theta|X) .</math>
The variance of the newly obtained probability mass function can also be determined. The variance for a Bayesian probability mass function can be defined as
<math display="block">\sigma_{P_\theta(x|X)}^2 := \int d\theta\ \left[ P(x|\theta) - P_\theta(x|X) \right]^2\ P(\theta|X).</math>
This expression for the variance can be substantially simplified (assuming independently drawn samples). Defining the "self probability mass function" as
<math display="block">P_\theta(x|\left\{X,x\right\}) = \int d\theta\ P(x|\theta)\ P(\theta|\left\{X, x\right\}),</math>
one obtains for the variance<ref>{{Cite journal |last1=Pijlman |last2=Linnartz |date=2023 |title=Variance of Likelihood of data |url=https://sitb2023.ulb.be/proceedings/ |journal=SITB 2023 Proceedings |pages=34}}</ref>
<math display="block">\sigma_{P_\theta(x|X)}^2 = P_\theta(x|X) \left[ P_\theta(x|\left\{X,x\right\}) - P_\theta(x|X) \right].</math>
The expression for variance involves an additional fit that includes the sample <math>x</math> of interest.[[File:CumList.png|thumb|left|List of probability distributions ranked by goodness of fit<ref>[https://www.waterlog.info/cumfreq.htm Software for probability distribution fitting]</ref>]]
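The variance identity above can be checked numerically with a small discrete example (a finite grid of Bernoulli parameters stands in for the integral over <math>\theta</math>; all names and values are invented for illustration):

```python
import math

def posterior(thetas, prior, X):
    """Posterior over a finite grid of Bernoulli parameters theta,
    given i.i.d. 0/1 observations X (a discrete stand-in for P(theta|X))."""
    w = [pr * math.prod(th if x else 1 - th for x in X)
         for th, pr in zip(thetas, prior)]
    z = sum(w)
    return [v / z for v in w]

def predictive(thetas, post, x):
    """P_theta(x|X): posterior-averaged probability of a new sample x."""
    return sum((th if x else 1 - th) * p for th, p in zip(thetas, post))

thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
X = [1, 1, 0, 1]

post = posterior(thetas, prior, X)
p_new = predictive(thetas, post, 1)   # P_theta(x|X) for x = 1
# "Self probability mass function": refit including the sample x itself.
p_self = predictive(thetas, posterior(thetas, prior, X + [1]), 1)
variance = p_new * (p_self - p_new)   # the simplified variance expression
```

The quantity `variance` agrees with the defining integral, here the posterior-weighted sum of squared deviations of <math>P(x|\theta)</math> from <math>P_\theta(x|X)</math>.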
[[File:GEVdistrHistogr+Density.png|thumb|220px|Histogram and probability density of a data set fitting the [[GEV distribution]] ]]
==Goodness of fit==
By ranking the [[goodness of fit]] of various distributions one can get an impression of which distribution is acceptable and which is not.
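One simple way to rank candidate distributions, sketched here with the [[Kolmogorov–Smirnov test|Kolmogorov–Smirnov]] distance (the candidate set, the sample data, and the function names are invented for the example; real fitting software typically offers several goodness-of-fit criteria):

```python
import math

def ks_statistic(data, cdf):
    """Kolmogorov-Smirnov distance between the empirical distribution
    of the data and a candidate fitted CDF (smaller = better fit)."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        c = cdf(x)
        d = max(d, abs((i + 1) / n - c), abs(i / n - c))
    return d

data = [1.2, 1.9, 2.4, 3.1, 3.3, 4.0, 5.2, 7.5]
m = sum(data) / len(data)
s = math.sqrt(sum((x - m) ** 2 for x in data) / (len(data) - 1))
candidates = {
    "normal": lambda x: 0.5 * (1 + math.erf((x - m) / (s * math.sqrt(2)))),
    "exponential": lambda x: 1 - math.exp(-x / m) if x > 0 else 0.0,
}
ranking = sorted(candidates, key=lambda name: ks_statistic(data, candidates[name]))
```

The first entry of `ranking` is the candidate whose fitted CDF stays closest to the empirical distribution; a formal test would additionally compare the statistic against critical values.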
==Histogram and density function==
From the [[cumulative distribution function]] (CDF) one can derive a [[histogram]] and the [[probability density function]] (PDF).
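Numerically, the density can be recovered from a fitted CDF by differentiation, for example with a central difference (a minimal sketch; the function name and step size are illustrative):

```python
import math

def density_from_cdf(cdf, x, h=1e-5):
    """Approximate the probability density at x as the central-difference
    derivative of the cumulative distribution function."""
    return (cdf(x + h) - cdf(x - h)) / (2 * h)

# Example: the standard normal CDF yields its well-known density at 0.
normal_cdf = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
pdf0 = density_from_cdf(normal_cdf, 0.0)
```

Evaluating such a derivative on a grid of ''x'' values gives a smooth density curve that can be compared with the histogram of the data.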
== See also ==
* [[Curve fitting]]
* [[Density estimation]]
* [[Mixture distribution]]
* [[Product distribution]]
{{clear}}
== References ==