Stopped an image intruding on the references section.
(17 intermediate revisions by 14 users not shown)
Line 1:
{{Short description|Mathematical concept}}
{{Use dmy dates|date=October 2020}}
'''Probability distribution fitting''' or simply '''distribution fitting''' is the fitting of a [[probability distribution]] to a series of data concerning the repeated measurement of a variable phenomenon.
The aim of distribution fitting is to [[prediction|predict]] the [[probability]] or to [[forecasting|forecast]] the [[Frequency (statistics)|frequency]] of occurrence of the magnitude of the phenomenon in a certain interval.
There are many probability distributions (see [[list of probability distributions]]) of which some can be fitted more closely to the observed frequency of the data than others, depending on the characteristics of the phenomenon and of the distribution. A distribution that fits the data closely is expected to lead to good predictions.
In distribution fitting, therefore, one needs to select a distribution that suits the data well.
Line 29 ⟶ 28:
The following techniques of distribution fitting exist:<ref>''Frequency and Regression Analysis''. Chapter 6 in: H.P. Ritzema (ed., 1994), ''Drainage Principles and Applications'', Publ. 16, pp. 175–224, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. {{ISBN|9070754339}}. Free download from the webpage [http://www.waterlog.info/articles.htm] under nr. 12, or directly as PDF: [http://www.waterlog.info/pdf/freqtxt.pdf]</ref>
*''Parametric methods'', by which the [[parameter]]s of the distribution are calculated from the data series.<ref>H. Cramér, "Mathematical methods of statistics"
**[[
**[[
**
**[[Maximum likelihood]] method<ref>{{cite journal | last = Aldrich | first = John | title = R. A. Fisher and the making of maximum likelihood 1912–1922 | year = 1997 | journal = Statistical Science | volume = 12 | issue = 3 | pages = 162–176 | doi = 10.1214/ss/1030037906 | mr = 1617519 | ref = citeref Aldrich1997| doi-access = free }}</ref>
::{| class="wikitable"
| bgcolor="white" | ''For example, the parameter <math>\mu</math> (the [[expected value|expectation]]) can be estimated by the [[Arithmetic mean|mean]] of the data, and the parameter <math>\sigma^2</math> (the [[variance]]) can be estimated by the square of the [[standard deviation]] of the data. The mean is found as <math display="inline">m=\sum{X}/n</math>, where <math>X</math> is the data value and <math>n</math> the number of data values, while the standard deviation is calculated as <math display="inline">s = \sqrt{\frac{1}{n-1} \sum{(X-m)^2}}</math>. With these parameters many distributions, e.g. the normal distribution, are completely defined.''
|}
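The estimates in the example above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the sample values are hypothetical, and the normal distribution is used because the example names it as one that is completely defined by these two parameters.

```python
import math

def fit_normal_moments(data):
    """Return (m, s): the mean and the (n-1)-based standard deviation of the data."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data values")
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return m, s

def normal_cdf(x, m, s):
    """Cumulative probability of the normal distribution defined by m and s."""
    return 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))

# Hypothetical data series, for illustration only.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
m, s = fit_normal_moments(sample)
print("mean:", m, "standard deviation:", s)
print("estimated probability that X < 6:", normal_cdf(6.0, m, s))
```

Once <code>m</code> and <code>s</code> are known, any probability of non-exceedance follows from the fitted cumulative distribution, as the last line shows.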
[[File:FitGumbelDistr.tif|thumb|220px|Cumulative Gumbel distribution fitted to maximum one-day October rainfalls in [[
*
::{| class="wikitable"
| bgcolor="white" |For example, the cumulative [[Gumbel distribution]] can be linearized to <math>Y=aX+b</math>, where <math>X</math> is the data variable and <math>Y=-\ln(-\ln P)</math>, with <math>P</math> being the cumulative probability, i.e. the probability that the data value is less than <math>X</math>. Thus, using the [[plotting position]] for <math>P</math>, one finds the parameters <math>a</math> and <math>b</math> from a linear regression of <math>Y</math> on <math>X</math>, and the Gumbel distribution is fully defined.
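The linearization described above can be sketched as follows. The plotting position <math>P = i/(n+1)</math> used here is an assumption (the text does not say which formula to use), and the function names and rainfall values are hypothetical.

```python
import math

def fit_gumbel_regression(data):
    """Fit Y = a*X + b with Y = -ln(-ln P) by least squares of Y on X."""
    xs = sorted(data)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data values")
    # Assumed plotting position: P = i/(n+1) for the i-th smallest value.
    ps = [(i + 1) / (n + 1) for i in range(n)]
    ys = [-math.log(-math.log(p)) for p in ps]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("data values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def gumbel_cdf(x, a, b):
    """Cumulative probability implied by the fitted line: P = exp(-exp(-(a*x + b)))."""
    return math.exp(-math.exp(-(a * x + b)))

# Hypothetical maximum one-day rainfalls (mm), for illustration only.
rainfall = [12.0, 18.5, 22.0, 25.5, 31.0, 37.5, 44.0, 58.0, 73.5]
a, b = fit_gumbel_regression(rainfall)
print("a:", a, "b:", b)
print("estimated P(X < 50):", gumbel_cdf(50.0, a, b))
```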
Line 51:
More generally, one can raise the data to a power ''p'' in order to fit symmetrical distributions to data obeying a distribution of any skewness, whereby ''p'' < 1 when the skewness is positive and ''p'' > 1 when the skewness is negative. The optimal value of ''p'' is found by a [[numerical method]]: one assumes a range of ''p'' values, applies the distribution fitting procedure for each assumed ''p'' value, and selects the value of ''p'' for which the sum of squares of deviations of calculated probabilities from measured frequencies ([[Chi-squared test|chi squared]]) is at a minimum, as is done in [[CumFreq]].
The generalization enhances the flexibility of probability distributions and increases their applicability in distribution fitting.
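The search over ''p'' can be sketched as below. This is not the CumFreq implementation: the normal distribution stands in for the symmetrical distribution, the measured frequencies are taken as the assumed plotting position <math>i/(n+1)</math>, the data are assumed to be positive, and all names and values are hypothetical.

```python
import math

def normal_cdf(z, m, s):
    return 0.5 * (1.0 + math.erf((z - m) / (s * math.sqrt(2.0))))

def fit_power_normal(data, p_values):
    """Try each exponent p > 0 and return (p, sse, m, s) with the smallest
    sum of squared deviations between calculated probabilities and
    measured frequencies."""
    xs = sorted(data)
    if xs[0] <= 0.0:
        raise ValueError("this sketch assumes positive data values")
    n = len(xs)
    freqs = [(i + 1) / (n + 1) for i in range(n)]  # assumed plotting position
    best = None
    for p in p_values:
        if p <= 0.0:
            raise ValueError("this sketch assumes p > 0 so the data order is kept")
        zs = [x ** p for x in xs]  # transformed data
        m = sum(zs) / n
        s = math.sqrt(sum((z - m) ** 2 for z in zs) / (n - 1))
        if s == 0.0:
            continue  # degenerate fit, skip this exponent
        sse = sum((normal_cdf(z, m, s) - f) ** 2 for z, f in zip(zs, freqs))
        if best is None or sse < best[1]:
            best = (p, sse, m, s)
    return best

# Hypothetical positively skewed data, for illustration only.
skewed = [1.0, 2.0, 3.0, 4.0, 5.0, 8.0, 13.0, 25.0, 60.0]
best_p, best_sse, m, s = fit_power_normal(skewed, [0.25, 0.5, 1.0, 2.0])
print("selected p:", best_p, "sum of squared deviations:", best_sse)
```

For this positively skewed series the selected exponent is below 1, in line with the rule stated above.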
The versatility of generalization makes it possible, for example, to fit approximately normally distributed data sets to a large number of different probability distributions,<ref>An approximately normally distributed data set to which a large number of different probability distributions can be fitted, [https://www.waterlog.info/pdf/Multiple%20fit.pdf]</ref> while negatively skewed data sets can be fitted to square normal and mirrored Gumbel distributions.<ref>Negatively (left) skewed data fitted to square normal or mirrored Gumbel probability functions, [https://www.waterlog.info/pdf/LeftSkew.pdf]</ref>
Line 69:
== Shifting of distributions ==
Some probability distributions, like the [[exponential distribution|exponential]], do not support negative data values (''X'')
The technique of distribution shifting increases the chance of finding a properly fitting probability distribution.
Line 78:
== Uncertainty of prediction ==
[[File:BinomialConfBelts.jpg|thumb|<small>Uncertainty analysis with confidence belts using the binomial distribution
Predictions of occurrence based on fitted probability distributions are subject to [[uncertainty]], which arises from the following conditions:
Line 91:
With the binomial distribution one can obtain a [[prediction interval]]. Such an interval also estimates the risk of failure, i.e. the chance that the predicted event still falls outside the confidence interval. The confidence or risk analysis may include the [[return period]] ''T'' = 1/''Pe'', where ''Pe'' is the probability of exceedance, as is done in [[hydrology]].
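One way to obtain such an interval is sketched below. It assumes the exact (Clopper–Pearson) binomial confidence interval for the exceedance probability, which is a common choice but is not stated in the text. The function names and the record of 5 exceedances in 50 years are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(K <= k) for K following a binomial distribution with n trials."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def _solve_decreasing(f, target, iterations=100):
    """Bisection on [0, 1] for f(p) = target, where f decreases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exceedance_interval(k, n, confidence=0.90):
    """Exact binomial interval for Pe, given k exceedances in n observations."""
    alpha = 1.0 - confidence
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1.0 - alpha / 2.0)
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2.0)
    return lower, upper

def return_period_interval(k, n, confidence=0.90):
    """Interval for T = 1/Pe, obtained by inverting the limits of Pe."""
    lower, upper = exceedance_interval(k, n, confidence)
    t_low = 1.0 / upper
    t_high = math.inf if lower == 0.0 else 1.0 / lower
    return t_low, t_high

# Hypothetical record: a threshold exceeded in 5 out of 50 years.
k, n = 5, 50
pe = k / n
pe_low, pe_high = exceedance_interval(k, n)
t_low, t_high = return_period_interval(k, n)
print("Pe:", pe, "interval:", (pe_low, pe_high))
print("T:", 1.0 / pe, "interval:", (t_low, t_high))
```

The interval for ''T'' is wide for short records, which illustrates why return periods estimated from few observations are uncertain.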
=== [[Variance]] of [[Bayesian inference|Bayesian]] fitted probability functions ===
[[File:CumList.png|thumb|left|List of probability distributions ranked by goodness of fit.<ref>[https://www.waterlog.info/cumfreq.htm Software for probability distribution fitting]</ref>]]
A Bayesian approach can be used for fitting a model <math>P(x|\theta)</math> having a prior distribution <math>P(\theta)</math> for the parameter <math>\theta</math>. When one has samples <math>X</math> that are independently drawn from the underlying distribution, one can derive the so-called posterior distribution <math>P(\theta|X)</math>. This posterior can be used to update the probability mass function for a new sample <math>x</math> given the observations <math>X</math>, which gives
<math display="block">P_\theta (x | X) := \int d\theta\ P(x|\theta)\ P(\theta|X) .</math>
The variance of the newly obtained probability mass function can also be determined. The variance for a Bayesian probability mass function can be defined as
<math display="block">\sigma_{P_\theta(x|X)}^2 := \int d\theta\ \left[ P(x|\theta) - P_\theta(x|X) \right]^2\ P(\theta|X).</math>
This expression for the variance can be substantially simplified (assuming independently drawn samples). Defining the "self probability mass function" as
<math display="block">P_\theta(x|\left\{X,x\right\}) = \int d\theta\ P(x|\theta)\ P(\theta|\left\{X, x\right\}),</math>
one obtains for the variance<ref>{{Cite journal |last1=Pijlman |last2=Linnartz |date=2023 |title=Variance of Likelihood of data |url=https://sitb2023.ulb.be/proceedings/ |journal=SITB 2023 Proceedings |pages=34}}</ref>
<math display="block">\sigma_{P_\theta(x|X)}^2 = P_\theta(x|X) \left[ P_\theta(x|\left\{X,x\right\}) - P_\theta(x|X) \right].</math>
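A brief sketch of why the simplification holds, assuming as above that <math>x</math> is drawn independently of <math>X</math> given <math>\theta</math>: Bayes' rule gives
<math display="block">P(\theta|\left\{X,x\right\}) = \frac{P(x|\theta)\ P(\theta|X)}{P_\theta(x|X)},</math>
so that the self probability mass function satisfies
<math display="block">\int d\theta\ P(x|\theta)^2\ P(\theta|X) = P_\theta(x|X)\ P_\theta(x|\left\{X,x\right\}).</math>
Expanding the square in the definition of the variance, and using <math display="inline">\int d\theta\ P(x|\theta)\ P(\theta|X) = P_\theta(x|X)</math>, leaves
<math display="block">\sigma_{P_\theta(x|X)}^2 = \int d\theta\ P(x|\theta)^2\ P(\theta|X) - P_\theta(x|X)^2 ,</math>
and substituting the previous identity yields the product form.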
The expression for variance involves an additional fit that includes the sample <math>x</math> of interest.
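The role of the additional fit can be checked numerically on a simple case. The sketch below assumes a Bernoulli model with a Beta prior, chosen only because its posterior is available in closed form; the model, the function names and the observations are illustrative assumptions and are not taken from the cited paper.

```python
import math

def posterior_params(a, b, observations):
    """Beta(a, b) prior updated with 0/1 observations gives Beta(alpha, beta)."""
    k = sum(observations)
    n = len(observations)
    return a + k, b + n - k

def predictive(x, alpha, beta):
    """P(x | X) for a Bernoulli model with posterior Beta(alpha, beta)."""
    return (alpha if x == 1 else beta) / (alpha + beta)

def variance_via_self_fit(x, alpha, beta):
    """Variance from the simplified expression, using a fit that includes x."""
    p = predictive(x, alpha, beta)
    # Additional fit: the posterior updated once more with the sample x itself.
    alpha2, beta2 = (alpha + 1, beta) if x == 1 else (alpha, beta + 1)
    p_self = predictive(x, alpha2, beta2)
    return p * (p_self - p)

def variance_direct(x, alpha, beta, steps=20000):
    """Variance from its definition, by midpoint-rule integration over theta."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    p = predictive(x, alpha, beta)
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps
        density = math.exp(log_norm + (alpha - 1) * math.log(t)
                           + (beta - 1) * math.log(1.0 - t))
        likelihood = t if x == 1 else 1.0 - t
        total += (likelihood - p) ** 2 * density
    return total / steps

# Hypothetical observations with a uniform Beta(1, 1) prior.
observations = [1, 0, 1, 1, 0, 1, 0, 1]
alpha, beta = posterior_params(1, 1, observations)
for x in (0, 1):
    print(x, variance_via_self_fit(x, alpha, beta), variance_direct(x, alpha, beta))
```

In this example both routes give the same variance up to integration error, while the simplified expression needs only two fits and no integral over <math>\theta</math>.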
[[File:GEVdistrHistogr+Density.png|thumb|220px|Histogram and probability density of a data set fitting the [[GEV distribution]] ]]
Line 108 ⟶ 125:
* [[Mixture distribution]]
* [[Product distribution]]
{{clear}}
== References ==