Probability distribution fitting: Difference between revisions

Revision as of 04:10, 8 November 2022 by Plurm (talk | contribs): Undid revision 1120200324 by 91.186.232.206 (talk). Tag: Undo
Latest revision as of 07:45, 17 April 2025 by Helper201 (talk | contribs): Stopped an image intruding on the references section.
(10 intermediate revisions by 9 users not shown)
Line 1:
{{Short description|Mathematical concept}}
{{Use dmy dates|date=October 2020}}
'''Probability distribution fitting''' or simply '''distribution fitting''' is the fitting of a [[probability distribution]] to a series of data concerning the repeated measurement of a variable phenomenon.

Line 27 ⟶ 28:
The following techniques of distribution fitting exist:<ref>''Frequency and Regression Analysis''. Chapter 6 in: H.P.Ritzema (ed., 1994), ''Drainage Principles and Applications'', Publ. 16, pp. 175–224, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. {{ISBN|9070754339}}. Free download from the webpage [http://www.waterlog.info/articles.htm] under nr. 12, or directly as PDF : [http://www.waterlog.info/pdf/freqtxt.pdf]</ref>
* ''Parametric methods'', by which the [[parameter]]s of the distribution are calculated from the data series.<ref>H. Cramér, "Mathematical methods of statistics" , Princeton Univ. Press (1946)</ref> The parametric methods are:
** [[Method of moments (statistics)|Method of moments]]
** [[Maximum spacing estimation]]
** Method of [[L-moment]]s<ref>{{cite journal | last=Hosking | first=J.R.M. | year=1990 | title=L-moments: analysis and estimation of distributions using linear combinations of order statistics | journal=Journal of the Royal Statistical Society, Series B | volume=52 | issue=1 | pages=105–124 | jstor=2345653}}</ref>
** [[Maximum likelihood]] method<ref>{{cite journal | last = Aldrich | first = John | title = R. A. Fisher and the making of maximum likelihood 1912–1922 | year = 1997 | journal = Statistical Science | volume = 12 | issue = 3 | pages = 162–176 | doi = 10.1214/ss/1030037906 | mr = 1617519 | ref = citeref Aldrich1997| doi-access = free }}</ref>
::{| class="wikitable"
| bgcolor="white" | ''For example, the parameter <math>\mu</math> (the [[expected value|expectation]]) can be estimated by the [[Arithmetic mean|mean]] of the data and the parameter <math>\sigma^2</math> (the [[variance]]) can be estimated from the square of the [[standard deviation]] of the data. The mean is found as <math display="inline">m=\sum{X}/n</math>, where <math>X</math> is the data value and <math>n</math> the number of data values, while the standard deviation is calculated as <math display="inline">s = \sqrt{\frac{1}{n-1} \sum{(X-m)^2}}</math>. With these parameters many distributions, e.g. the normal distribution, are completely defined.''
|}
[[File:FitGumbelDistr.tif|thumb|220px|Cumulative Gumbel distribution fitted to maximum one-day October rainfalls in [[Suriname]] by the regression method with added '''[[confidence band]]''' using [[CumFreq|cumfreq]] ]]
* [[Plotting position]] plus [[Regression analysis]], using a transformation of the [[cumulative distribution function]] so that a [[linear relation]] is found between the [[cumulative probability]] and the values of the data, which may also need to be transformed, depending on the selected probability distribution. In this method the cumulative probability needs to be estimated by the [[plotting position]]<ref name="gen">Software for Generalized and Composite Probability Distributions. International Journal of Mathematical and Computational Methods, 4, 1-9 [https://www.iaras.org/iaras/home/caijmcm/software-for-generalized-and-composite-probability-distributions] or [https://www.waterlog.info/pdf/MathJournal.pdf]</ref>

Line 68 ⟶ 69:
== Shifting of distributions ==
Some probability distributions, like the [[exponential distribution|exponential]], do not support negative data values (''X''). Yet, when negative data are present, such distributions can still be used by replacing ''X'' with ''Y''=''X''-''Xm'', where ''Xm'' is the minimum value of ''X''. This replacement represents a shift of the probability distribution in positive direction, i.e. to the right, because ''Xm'' is negative. After completing the distribution fitting of ''Y'', the corresponding ''X''-values are found from ''X''=''Y''+''Xm'', which represents a back-shift of the distribution in negative direction, i.e. to the left.<br>
The technique of distribution shifting increases the chance of finding a properly fitting probability distribution.

Line 77 ⟶ 78:
== Uncertainty of prediction ==
[[File:BinomialConfBelts.jpg|thumb|<small>Uncertainty analysis with confidence belts using the binomial distribution </small><ref>Frequency predictions and their binomial confidence limits. In: International Commission on Irrigation and Drainage, Special Technical Session: Economic Aspects of Flood Control and non-Structural Measures, Dubrovnik, Yugoslavia, 1988. [http://www.waterlog.info/pdf/binomial.pdf On line]</ref>]]
Predictions of occurrence based on fitted probability distributions are subject to [[uncertainty]], which arises from the following conditions:

Line 90 ⟶ 91:
With the binomial distribution one can obtain a [[prediction interval]]. Such an interval also estimates the risk of failure, i.e. the chance that the predicted event still remains outside the confidence interval.
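The text does not say which binomial construction is meant. As a minimal sketch, the example below assumes the Clopper-Pearson ("exact") interval for the probability of an event that was observed ''k'' times in ''n'' independent records. The function names, the bisection solver and the 90% default level are illustrative choices and are not taken from the cited software.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(K <= k) for K ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target, iters=100):
    """Bisection on [0, 1] for f(p) = target, where f decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, conf=0.90):
    """Two-sided Clopper-Pearson limits for a binomial probability.

    k: number of records in which the event occurred
    n: total number of independent records
    conf: confidence level (assumed default, not from the source)
    """
    alpha = 1 - conf
    # Lower limit: the p at which P(K >= k) equals alpha/2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: the p at which P(K <= k) equals alpha/2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


if __name__ == "__main__":
    k, n = 3, 30  # hypothetical record: event seen in 3 of 30 years
    low, high = clopper_pearson(k, n)
    print(f"estimate {k / n:.3f}, 90% limits ({low:.3f}, {high:.3f})")
```

With more records at the same observed frequency the limits move closer together, which is the sense in which a longer data series reduces the uncertainty of the prediction.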
The confidence or risk analysis may include the [[return period]] ''T=1/Pe'' as is done in [[hydrology]].

=== [[Variance]] of [[Bayesian inference|Bayesian]] fitted probability functions ===
A Bayesian approach can be used for fitting a model <math>P(x|\theta)</math> having a prior distribution <math>P(\theta)</math> for the parameter <math>\theta</math>. When one has samples <math>X</math> that are independently drawn from the underlying distribution, one can derive the so-called posterior distribution <math>P(\theta|X)</math>. This posterior can be used to update the probability mass function for a new sample <math>x</math> given the observations <math>X</math>; one obtains
<math display="block">P_\theta (x | X) := \int d\theta\ P(x|\theta)\ P(\theta|X) .</math>
The variance of the newly obtained probability mass function can also be determined. The variance for a Bayesian probability mass function can be defined as
<math display="block">\sigma_{P_\theta(x|X)}^2 := \int d\theta\ \left[ P(x|\theta) - P_\theta(x|X) \right]^2\ P(\theta|X).</math>
This expression for the variance can be substantially simplified (assuming independently drawn samples). Defining the "self probability mass function" as
<math display="block">P_\theta(x|\left\{X,x\right\}) = \int d\theta\ P(x|\theta)\ P(\theta|\left\{X, x\right\}),</math>
one obtains for the variance<ref>{{Cite journal |last1=Pijlman |last2=Linnartz |date=2023 |title=Variance of Likelihood of data |url=https://sitb2023.ulb.be/proceedings/ |journal=SITB 2023 Proceedings |pages=34}}</ref>
<math display="block">\sigma_{P_\theta(x|X)}^2 = P_\theta(x|X) \left[ P_\theta(x|\left\{X,x\right\}) - P_\theta(x|X) \right].</math>
The expression for variance involves an additional fit that includes the sample <math>x</math> of interest.
[[File:CumList.png|thumb|left|List of probability distributions ranked by goodness of fit.<ref>[https://www.waterlog.info/cumfreq.htm Software for probability distribution fitting]</ref>]]
[[File:GEVdistrHistogr+Density.png|thumb|220px|Histogram and probability density of a data set fitting the [[GEV distribution]] ]]

Line 107 ⟶ 125:
* [[Mixture distribution]]
* [[Product distribution]]
{{clear}}

== References ==