Quantile-parameterized distribution: Difference between revisions

Riskanal (talk | contribs)
Moments: fixed some links
OAbot (talk | contribs)
m Open access bot: url-access=subscription updated in citation with #oabot.
 
(11 intermediate revisions by 5 users not shown)
Line 1:
A '''quantile-parameterized distribution (QPD)''' is a probability distribution that is directly parameterized by data. QPDs were created to meet the need for easy-to-use continuous probability distributions flexible enough to represent a wide range of uncertainties, such as those commonly encountered in business, economics, engineering, and science. Because QPDs are directly parameterized by data, they have the practical advantage of avoiding the intermediate step of [[Estimation theory|parameter estimation]], a time-consuming process that typically requires non-linear iterative methods to estimate probability-distribution parameters from data. Some QPDs have virtually unlimited shape flexibility and closed-form moments as well.
 
== History ==
The development of quantile-parameterized distributions was inspired by the practical need for flexible continuous probability distributions that are easy to fit to data. Historically, the [[Pearson distribution|Pearson]]<ref>Johnson NL, Kotz S, Balakrishnan N. Continuous univariate distributions, Vol 1, Second Edition, John Wiley & Sons, Ltd, 1994, pp. 15–25.</ref> and [[Norman Lloyd Johnson|Johnson]]<ref>{{cite journal | url=https://www.jstor.org/stable/2332539?seq=1 | jstor=2332539 | title=Systems of Frequency Curves Generated by Methods of Translation | last1=Johnson | first1=N. L. | journal=Biometrika | year=1949 | volume=36 | issue=1/2 | pages=149–176 | doi=10.2307/2332539 | pmid=18132090 | url-access=subscription }}</ref><ref>{{cite journal | url=https://www.jstor.org/stable/2335422 | jstor=2335422 | title=Systems of Frequency Curves Generated by Transformations of Logistic Variables | last1=Tadikamalla | first1=Pandu R. | last2=Johnson | first2=Norman L. | journal=Biometrika | year=1982 | volume=69 | issue=2 | pages=461–465 | doi=10.1093/biomet/69.2.461 | url-access=subscription }}</ref> families of distributions have been used when shape flexibility is needed. That is because both families can match the first four moments (mean, variance, skewness, and kurtosis) of any data set. In many cases, however, these distributions are either difficult to fit to data or not flexible enough to fit the data appropriately.
 
For example, the [[beta distribution]] is a flexible Pearson distribution that is frequently used to model percentages of a population. However, if the characteristics of this population are such that the desired [[cumulative distribution function]] (CDF) should run through certain specific CDF points, there may be no beta distribution that meets this need. Because the beta distribution has only two shape parameters, it cannot, in general, match even three specified CDF points. Moreover, the beta parameters that best fit such data can be found only by nonlinear iterative methods.
 
Practitioners of [[decision analysis]], needing distributions easily parameterized by three or more CDF points (e.g., because such points were specified as the result of an [[Expert elicitation|expert-elicitation process]]), originally invented quantile-parameterized distributions for this purpose. Keelin and Powley (2011)<ref name="KeelinPowley">{{Cite journal |doi=10.1287/deca.1110.0213 |title=Quantile-Parameterized Distributions |year=2011 |last1=Keelin |first1=Thomas W. |last2=Powley |first2=Bradford W. |journal=Decision Analysis |volume=8 |issue=3 |pages=206–219 }}</ref> provided the original definition. Subsequently, Keelin (2016)<ref name="Keelin2016">{{Cite journal |doi=10.1287/deca.2016.0338 |title=The Metalog Distributions |year=2016 |last1=Keelin |first1=Thomas W. |journal=Decision Analysis |volume=13 |issue=4 |pages=243–277 }}</ref> developed the [[metalog distribution]]s, a family of quantile-parameterized distributions that has virtually unlimited shape flexibility, simple equations, and closed-form moments.
 
== Definition ==
Line 60:
 
=== Fitting to data ===
The coefficients <math>\boldsymbol a</math> can be determined from data by [[linear least squares]]. Given <math>m</math> data points <math>(x_i,y_i)</math> that are intended to characterize the CDF of a QPD, let <math>\boldsymbol Y</math> be the <math>m \times n</math> matrix whose elements are <math>g_j (y_i)</math>. Then, so long as <math>\boldsymbol Y^T \boldsymbol Y</math> is invertible, the column vector of coefficients <math>\boldsymbol a</math> can be determined as <math>\boldsymbol a=(\boldsymbol Y^T \boldsymbol Y)^{-1} \boldsymbol Y^T \boldsymbol x</math>, where <math>m\geq n</math> and <math>\boldsymbol x=(x_1,\ldots,x_m)</math> is a column vector. If <math>m=n</math>, this equation reduces to <math>\boldsymbol a=\boldsymbol Y^{-1} \boldsymbol x</math>, and the resulting CDF runs through all data points exactly. An alternative method, implemented as a linear program, determines the coefficients by minimizing the sum of absolute distances between the CDF and the data subject to feasibility constraints.<ref name="Faber">{{Cite thesis |url=https://searchworks.stanford.edu/view/13257318 |title=Cyber risk management: AI-generated warnings of threats |year=2019 |publisher=Stanford University |last1=Faber |first1=Isaac Justin |last2=Paté-Cornell |first2=M. Elisabeth |last3=Lin |first3=Herbert |last4=Shachter |first4=Ross D. }}</ref>
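
The least-squares step above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the basis functions <math>g_j</math> are assumed here to be the first three terms of the unbounded metalog distribution, and the names <code>basis_matrix</code>, <code>fit_qpd</code>, and <code>quantile</code>, as well as the sample data points, are hypothetical.

```python
import numpy as np

def basis_matrix(y):
    """Build the m x n matrix Y with entries g_j(y_i).

    Assumed basis (illustrative only): g_1 = 1, g_2 = ln(y/(1-y)),
    g_3 = (y - 0.5) * ln(y/(1-y)).
    """
    y = np.asarray(y, dtype=float)
    logit = np.log(y / (1.0 - y))
    return np.column_stack([np.ones_like(y), logit, (y - 0.5) * logit])

def fit_qpd(x, y):
    """Solve the normal equations (Y^T Y) a = Y^T x for the coefficients a."""
    Y = basis_matrix(y)
    x = np.asarray(x, dtype=float)
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(Y.T @ Y, Y.T @ x)

def quantile(a, y):
    """Evaluate the fitted quantile function x = sum_j a_j g_j(y)."""
    return basis_matrix(y) @ a

# Hypothetical data: three (x_i, y_i) CDF points, so m = n = 3.
x = np.array([2.0, 5.0, 11.0])
y = np.array([0.1, 0.5, 0.9])
a = fit_qpd(x, y)
```

Because <math>m=n</math> in this example, the fitted quantile function reproduces the three data points exactly (up to floating-point error). For larger or ill-conditioned problems, a routine such as <code>numpy.linalg.lstsq</code> is usually a more numerically stable way to obtain the same least-squares solution. The sketch does not check feasibility of the resulting coefficients.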
 
=== Shape flexibility ===
A QPD with <math>n</math> terms, where <math>n\ge 2</math>, has <math>n-2</math> shape parameters. Thus, QPDs can be far more flexible than the [[Pearson distribution]]s, which have at most two shape parameters. For example, ten-term [[metalog distribution]]s parameterized by 105 CDF points from 30 traditional source distributions (including normal, student-t, lognormal, gamma, beta, and extreme value) have been shown to approximate each such source distribution within a [[Kolmogorov–Smirnov test|K–S]] distance of 0.001 or less.<ref>{{Cite journal |doi=10.1287/deca.2016.0338 |at=Table 8 |title=The Metalog Distributions |year=2016 |last1=Keelin |first1=Thomas W. |journal=Decision Analysis |volume=13 |issue=4 }}</ref>
 
=== Transformations ===
QPD transformations are governed by a general property of quantile functions: for any [[quantile function]] <math>x=Q(y)</math> and increasing function <math>t(x)</math>, <math>x=t^{-1} (Q(y))</math> is a [[quantile function]].<ref>Gilchrist, W., 2000. Statistical modelling with quantile functions. CRC Press.</ref> For example, the [[quantile function]] of the [[normal distribution]], <math>x=\mu+\sigma \Phi^{-1} (y)</math>, is a QPD by the Keelin and Powley definition. The natural logarithm, <math>t(x)=\ln(x-b_l)</math>, is an increasing function, so <math>x=b_l+e^{\mu+\sigma \Phi^{-1} (y)}</math> is the [[quantile function]] of the [[Log-normal distribution|lognormal distribution]] with lower bound <math>b_l</math>. Importantly, this transformation converts an unbounded QPD into a semi-bounded QPD. Similarly, applying this log transformation to the [[Metalog distribution#Unbounded,_semibounded,_and_bounded_metalog_distributions|unbounded metalog distribution]]<ref name="UnboundedMetalog">{{Cite journal |doi=10.1287/deca.2016.0338 |at=Section 3, pp. 249–257 |title=The Metalog Distributions |year=2016 |last1=Keelin |first1=Thomas W. |journal=Decision Analysis |volume=13 |issue=4 }}</ref> yields the [[Metalog distribution#Unbounded,_semibounded,_and_bounded_metalog_distributions|semi-bounded (log) metalog distribution]];<ref name="KeelinSec4">{{Cite journal |doi=10.1287/deca.2016.0338 |at=Section 4 |title=The Metalog Distributions |year=2016 |last1=Keelin |first1=Thomas W. |journal=Decision Analysis |volume=13 |issue=4 }}</ref> likewise, applying the logit transformation, <math>t(x)=\ln((x-b_l)/(b_u-x))</math>, yields the [[Metalog distribution#Unbounded,_semibounded,_and_bounded_metalog_distributions|bounded (logit) metalog distribution]]<ref name="KeelinSec4" /> with lower and upper bounds <math>b_l</math> and <math>b_u</math>, respectively. Moreover, by considering <math>t(x)</math> to be <math>F^{-1} (y)</math> distributed, where <math>F^{-1} (y)</math> is any QPD that meets Keelin and Powley’s definition, the transformed variable maintains the above properties of feasibility, convexity, and fitting to data. Such transformed QPDs have greater shape flexibility than the underlying <math>F^{-1} (y)</math>, which has <math>n-2</math> shape parameters; the log-transformed QPD has <math>n-1</math> shape parameters, and the logit-transformed QPD has <math>n</math> shape parameters. Moreover, such transformed QPDs share the same set of feasible coefficients as the underlying untransformed QPD.<ref>[http://metalogdistributions.com/images/Powley_Dissertation_2013-augmented.pdf Powley, B.W. (2013). “Quantile Function Methods For Decision Analysis”. Corollary 12, p 30. PhD Dissertation, Stanford University]</ref>
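
The log and logit transformations described above can be illustrated with a minimal Python sketch that uses only the standard library. The base quantile function is taken to be the normal quantile function from the text; the helper names <code>log_transformed</code> and <code>logit_transformed</code> and the chosen bounds are hypothetical, introduced only for this example.

```python
import math
from statistics import NormalDist

def normal_quantile(y, mu=0.0, sigma=1.0):
    """Unbounded base quantile function x = mu + sigma * Phi^{-1}(y)."""
    return mu + sigma * NormalDist().inv_cdf(y)

def log_transformed(q, b_l):
    """Invert t(x) = ln(x - b_l): returns x = b_l + exp(Q(y)), bounded below by b_l."""
    return lambda y: b_l + math.exp(q(y))

def logit_transformed(q, b_l, b_u):
    """Invert t(x) = ln((x - b_l)/(b_u - x)): returns x in the open interval (b_l, b_u)."""
    def x_of_y(y):
        e = math.exp(q(y))
        return (b_l + b_u * e) / (1.0 + e)
    return x_of_y

# Hypothetical bounds chosen for illustration.
semi_bounded = log_transformed(normal_quantile, b_l=2.0)
bounded = logit_transformed(normal_quantile, b_l=0.0, b_u=10.0)

probs = [0.05, 0.25, 0.5, 0.75, 0.95]
semi_vals = [semi_bounded(p) for p in probs]
bounded_vals = [bounded(p) for p in probs]
```

Both transformed functions remain increasing in <math>y</math>, so they are valid quantile functions; the first is the lognormal quantile function with lower bound <math>b_l</math>, and the second stays strictly between <math>b_l</math> and <math>b_u</math>.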
 
 
Line 87:
* The quantile function of the [[logistic distribution]], <math>x=\mu+s \ln(y/(1-y) )</math>.
* The unbounded [[metalog distribution]], which is a power series expansion of the <math>\mu</math> and <math>s</math> parameters of the logistic quantile function.
* The [[Metalog distribution#Unbounded,_semibounded,_and_bounded_metalog_distributions|semi-bounded and bounded metalog distributions]], which are the log and logit transforms, respectively, of the unbounded metalog distribution.
* The [[Metalog distribution#SPT_metalog_distributions|SPT (symmetric-percentile triplet) unbounded, semi-bounded, and bounded metalog distributions]], which are parameterized by three CDF points and optional upper and lower bounds.
* The Simple Q-Normal distribution<ref>{{Cite journal |doi=10.1287/deca.1110.0213 |at=pp. 208–210 |title=Quantile-Parameterized Distributions |year=2011 |last1=Keelin |first1=Thomas W. |last2=Powley |first2=Bradford W. |journal=Decision Analysis |volume=8 |issue=3 }}</ref>
* The metadistributions, including the meta-normal<ref>{{Cite journal |page=253 |doi=10.1287/deca.2016.0338 |title=The Metalog Distributions |year=2016 |last1=Keelin |first1=Thomas W. |journal=Decision Analysis |volume=13 |issue=4 }}</ref>
* Quantile functions expressed as [[polynomial]] functions of cumulative probability <math>y</math>, including [[Chebyshev polynomial]] functions.
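
As a rough illustration of how the unbounded metalog in the list above extends the logistic quantile function, the following Python sketch expands <math>\mu</math> and <math>s</math> as linear functions of <math>y-0.5</math>. The four-term form and the coefficient ordering <code>a1</code> to <code>a4</code> are assumptions made for this example, and the function names are hypothetical.

```python
import math

def logistic_quantile(y, mu, s):
    """Quantile function of the logistic distribution: x = mu + s * ln(y/(1-y))."""
    return mu + s * math.log(y / (1.0 - y))

def metalog4_quantile(y, a1, a2, a3, a4):
    """Assumed four-term unbounded metalog.

    mu and s of the logistic quantile function are each replaced by a
    first-order series in (y - 0.5):
        mu(y) = a1 + a4 * (y - 0.5)
        s(y)  = a2 + a3 * (y - 0.5)
    """
    mu = a1 + a4 * (y - 0.5)
    s = a2 + a3 * (y - 0.5)
    return logistic_quantile(y, mu, s)

# With a3 = a4 = 0 the expansion collapses back to the logistic quantile function.
x_logistic = logistic_quantile(0.9, mu=5.0, s=2.0)
x_metalog = metalog4_quantile(0.9, a1=5.0, a2=2.0, a3=0.0, a4=0.0)
```

Nonzero <code>a3</code> and <code>a4</code> add skewness and tail-weight adjustments; not every coefficient choice yields an increasing (feasible) quantile function, and this sketch does not check feasibility.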
 
Like the SPT metalog distributions, the Johnson Quantile-Parameterized Distributions<ref>{{cite journal | url=https://pubsonline.informs.org/doi/abs/10.1287/deca.2016.0343 | doi=10.1287/deca.2016.0343 | title=Johnson Quantile-Parameterized Distributions | year=2017 | last1=Hadlock | first1=Christopher C. | last2=Bickel | first2=J. Eric | journal=Decision Analysis | volume=14 | issue=1 | pages=35–64 | url-access=subscription }}</ref><ref>{{cite journal | url=https://pubsonline.informs.org/doi/abs/10.1287/deca.2018.0376 | doi=10.1287/deca.2018.0376 | title=The Generalized Johnson Quantile-Parameterized Distribution System | year=2019 | last1=Hadlock | first1=Christopher C. | last2=Bickel | first2=J. Eric | journal=Decision Analysis | volume=16 | pages=67–85 | s2cid=159339224 | url-access=subscription }}</ref> (JQPDs) are parameterized by three quantiles. JQPDs do not meet Keelin and Powley’s QPD definition, but rather have their own properties. JQPDs are feasible for all SPT parameter sets that are consistent with the [[Probability theory|rules of probability]].
 
== Applications ==
The original applications of QPDs were by decision analysts wishing to conveniently convert expert-assessed quantiles (e.g., 10th, 50th, and 90th quantiles) into smooth continuous probability distributions. QPDs have also been used to fit output data from simulations in order to represent those outputs (both CDFs and PDFs) as closed-form continuous distributions.<ref>[[doi:10.1287/deca.2016.0338|Keelin, T.W. (2016), Section 6.2.2, pp. 271–274.]]</ref> Used in this way, they are typically more stable and smoother than histograms. Similarly, since QPDs can impose fewer shape constraints than traditional distributions, they have been used to fit a wide range of empirical data in order to represent those data sets as continuous distributions (e.g., reflecting bimodality that may exist in the data in a straightforward manner<ref>[[doi:10.1287/deca.2016.0338|Keelin, T.W. (2016), Section 6.1.1, Figure 10, pp 266–267.]]</ref>). Quantile parameterization enables a closed-form QPD representation of known distributions whose CDFs otherwise have no closed-form expression. Keelin et al. (2019)<ref>{{cite book | url=https://dl.acm.org/doi/abs/10.5555/3400397.3400643 | isbn=9781728132839 | last1=Keelin | first1=T.W. | last2=Chrisman | first2=L. | last3=Savage | first3=S.L. | chapter=The metalog distributions and extremely accurate sums of lognormals in closed form | title=WSC '19: Proceedings of the Winter Simulation Conference | year=2019 | pages=3074–3085 | publisher=Institute of Electrical and Electronics Engineers (IEEE) }}</ref> apply this to the sum of independent identically distributed lognormal distributions, where quantiles of the sum can be determined by a large number of simulations. Nine such quantiles are used to parameterize a semi-bounded metalog distribution that runs through each of these nine quantiles exactly. 
QPDs have also been applied to assess the risk of asteroid impact,<ref>{{cite journal | url=https://doi.org/10.1111/risa.12453 | doi=10.1111/risa.12453 | title=Asteroid Risk Assessment: A Probabilistic Approach | year=2016 | last1=Reinhardt | first1=Jason C. | last2=Chen | first2=Xi | last3=Liu | first3=Wenhao | last4=Manchev | first4=Petar | last5=Paté-Cornell | first5=M. Elisabeth | journal=Risk Analysis | volume=36 | issue=2 | pages=244–261 | pmid=26215051 | bibcode=2016RiskA..36..244R | s2cid=23308354 | url-access=subscription }}</ref> to model cybersecurity risk,<ref name="Faber" /><ref>{{cite journal | url=https://www.sciencedirect.com/science/article/pii/S0167404819300604 | doi=10.1016/j.cose.2019.101659 | title=A Bayesian network approach for cybersecurity risk assessment implementing and extending the FAIR model | year=2020 | last1=Wang | first1=Jiali | last2=Neil | first2=Martin | last3=Fenton | first3=Norman | journal=Computers & Security | volume=89 | page=101659 | s2cid=209099797 | url-access=subscription }}</ref> to quantify biases in projections of oil-field production when compared to production observed after the fact,<ref>{{Cite journal |url=https://www.onepetro.org/journal-paper/SPE-195914-PA |doi=10.2118/195914-PA |title=Production Forecasting: Optimistic and Overconfident—Over and Over Again |year=2020 |last1=Bratvold |first1=Reidar B. |last2=Mohus |first2=Erlend |last3=Petutschnig |first3=David |last4=Bickel |first4=Eric |journal=SPE Reservoir Evaluation & Engineering |volume=23 |issue=3 |pages=0799–0810 |s2cid=219661316 |url-access=subscription }}</ref> and to project the future Canadian population by combining the probabilistic views of multiple experts.<ref>{{Cite book |url=https://library.oapen.org/bitstream/handle/20.500.12657/42565/2020_Book_DevelopmentsInDemographicForec.pdf?sequence=1#page=51 |last1=Dion |first1=P. |last2=Galbraith |first2=N. |last3=Sirag |first3=E. |chapter=Using expert elicitation to build long-term projection assumptions |title=Developments in Demographic Forecasting |year=2020 |isbn=978-3-030-42471-8 |series=The Springer Series on Demographic Methods and Population Analysis |volume=49 |pages=43–62 |doi=10.1007/978-3-030-42472-5 |hdl=20.500.12657/42565 |s2cid=226615299}}</ref> See [[Metalog distribution#Applications|metalog distributions]] and Keelin (2016)<ref name="Keelin2016" /> for additional applications of the metalog distribution.