Quantile-parameterized distribution: Difference between revisions

 
== History ==
The development of quantile-parameterized distributions was inspired by the practical need for flexible continuous probability distributions that are easy to fit to data. Historically, the [[Pearson distribution|Pearson]]<ref>Johnson NL, Kotz S, Balakrishnan N. Continuous univariate distributions, Vol 1, Second Edition, John Wiley & Sons, Ltd, 1994, pp. 15–25.</ref> and [[Norman Lloyd Johnson|Johnson]]<ref>[https://www.jstor.org/stable/2332539?seq=1 Johnson, N. L. (1949). “Systems of frequency curves generated by methods of translation.” Biometrika. 36 (1/2): 149–176. doi:10.2307/2332539.]</ref><ref>[https://www.jstor.org/stable/2335422 Tadikamalla, P. R. and Johnson, N. L. (1982). “Systems of frequency curves generated by transformations of logistic variables.” Biometrika. 69 (2): 461–465.]</ref> families of distributions have been used when shape flexibility is needed. That is because both families can match the first four moments (mean, variance, skewness, and kurtosis) of any data set. In many cases, however, these distributions are either difficult to fit to data, or not flexible enough to fit the data appropriately.
 
For example, the [[beta distribution]] is a flexible Pearson distribution that is frequently used to model percentages of a population. However, if the characteristics of this population are such that the desired [[cumulative distribution function]] (CDF) should run through certain specific CDF points, there may be no beta distribution that meets this need. Because the beta distribution has only two shape parameters, it cannot, in general, match even three specified CDF points. Moreover, the beta parameters that best fit such data can be found only by nonlinear iterative methods.
F^{-1} (y)= \left\{
\begin{array}{cl}
L_0 & \text{for } y=0\\
\sum_{i=1}^n a_i g_i(y) & \text{for } 0<y<1 \\
L_1 & \text{for } y=1
\end{array}\right.
</math>
 
and the functions <math>g_i(y)</math> are continuously differentiable and linearly independent basis functions. Here, essentially, <math>L_0</math> and <math>L_1</math> are the lower and upper bounds (if they exist) of a random variable with quantile function <math>F^{-1}(y)</math>. These distributions are called quantile-parameterized because for a given set of quantile pairs <math>\{(x_i, y_i) \mid i=1,\ldots,n\}</math>, where <math>x_i=F^{-1}(y_i)</math>, and a set of <math>n</math> basis functions <math>g_i(y)</math>, the coefficients <math>a_i</math> can be determined by solving a set of linear equations<ref name="KeelinPowley" />. If one desires to use more quantile pairs than basis functions, then the coefficients <math>a_i</math> can be chosen to minimize the sum of squared errors between the stated quantiles <math>x_i</math> and <math>F^{-1}(y_i)</math>. Keelin and Powley<ref name="KeelinPowley" /> illustrate this concept for a specific choice of basis functions that is a generalization of the quantile function of the [[normal distribution]], <math>x=\mu+\sigma \phi^{-1} (y)</math>, for which the mean <math>\mu</math> and standard deviation <math>\sigma</math> are linear functions of cumulative probability <math>y</math>:
 
: <math>\mu(y)=a_1+a_4 y</math>
QPDs that meet Keelin and Powley’s definition have the following properties.
 
=== Probability density function ===
Differentiating <math>x=F^{-1} (y)=\sum_{i=1}^n a_i g_i (y)</math> with respect to <math>y</math> yields <math>dx/dy</math>. The reciprocal of this quantity, <math>dy/dx</math>, is the [[probability density function]] (PDF)
 
 
=== Feasibility ===
A function of the form of <math>F^{-1} (y)</math> is a feasible probability distribution if and only if <math>f(y)>0</math> for all <math>y \in (0,1)</math><ref name="KeelinPowley" />. This implies a feasibility constraint on the set of coefficients <math>\boldsymbol a=(a_1,\ldots,a_n) \in \R^n</math>:
 
: <math>\sum_{i=1}^n a_i {{d g_i(y)}\over{dy}} >0</math> for all <math>y \in (0,1)</math>
A QPD’s set of feasible coefficients <math>S_\boldsymbol a=\{\boldsymbol a\in\R^n |\sum_{i=1}^n a_i d g_i (y)/dy > 0</math> for all <math>y\in (0,1)\}</math> is [[Convex set|convex]]. Because [[convex optimization]] problems require convex feasible sets, this property simplifies optimization problems involving QPDs.
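
The feasibility constraint and the convexity of the feasible set can be illustrated numerically. The following is a minimal sketch, not a definitive test: it assumes the three basis functions <math>g_1(y)=1</math>, <math>g_2(y)=\ln(y/(1-y))</math>, and <math>g_3(y)=(y-0.5)\ln(y/(1-y))</math> (a three-term metalog form chosen here for illustration), and it checks the constraint only on a finite grid, so violations between grid points could go undetected. The names <code>dbasis</code> and <code>is_feasible_on_grid</code> and all coefficient values are hypothetical.

```python
import math


def dbasis(y):
    """Derivatives dg_i/dy of the assumed basis 1, logit(y), (y - 0.5) * logit(y)."""
    logit = math.log(y / (1.0 - y))
    dlogit = 1.0 / (y * (1.0 - y))
    return [0.0, dlogit, logit + (y - 0.5) * dlogit]


def is_feasible_on_grid(a, num_points=999):
    """Check sum_i a_i * g_i'(y) > 0 at evenly spaced y in (0, 1).

    This is a grid approximation of the feasibility constraint:
    it is necessary for feasibility but does not prove it.
    """
    for k in range(1, num_points + 1):
        y = k / (num_points + 1)
        slope = sum(ai * di for ai, di in zip(a, dbasis(y)))
        if slope <= 0.0:
            return False
    return True


# Illustrative coefficient vectors (assumed values, not taken from any reference).
a_first = [20.0, 6.8, 5.7]
a_second = [0.0, 1.0, 0.0]   # standard logistic quantile function
a_bad = [0.0, -1.0, 0.0]     # decreasing in y, so not a quantile function

# A convex combination of two feasible vectors stays inside the feasible set.
weight = 0.3
mix = [weight * p + (1.0 - weight) * q for p, q in zip(a_first, a_second)]

print(is_feasible_on_grid(a_first), is_feasible_on_grid(a_second))
print(is_feasible_on_grid(a_bad), is_feasible_on_grid(mix))
```

Because the constraint is linear in <math>\boldsymbol a</math> for each fixed <math>y</math>, any weighted average of two feasible coefficient vectors satisfies it as well, which is what the <code>mix</code> check illustrates.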
 
=== Fitting to data ===
The coefficients <math>\boldsymbol a</math> can be determined from data by [[linear least squares]]. Given <math>m</math> data points <math>(x_i,y_i)</math> that are intended to characterize the CDF of a QPD, and the <math>m \times n</math> matrix <math>\boldsymbol Y</math> whose elements are <math>g_j (y_i)</math>, then, so long as <math>\boldsymbol Y^T \boldsymbol Y</math> is invertible, the column vector of coefficients <math>\boldsymbol a</math> can be determined as <math>\boldsymbol a=(\boldsymbol Y^T \boldsymbol Y)^{-1} \boldsymbol Y^T \boldsymbol x</math>, where <math>m\geq n</math> and <math>\boldsymbol x=(x_1,\ldots,x_m)</math> is a column vector. If <math>m=n</math>, this equation reduces to <math>\boldsymbol a=\boldsymbol Y^{-1} \boldsymbol x</math>, and the resulting CDF runs through all data points exactly. An alternate method, implemented as a linear program, determines the coefficients by minimizing the sum of absolute distances between the CDF and the data subject to feasibility constraints.<ref name="Faber">[https://searchworks.stanford.edu/view/13257318 Faber, I.J. (2019). Cyber Risk Management: AI-generated Warnings of Threats (doctoral dissertation, Stanford University).]</ref>
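
The least-squares formula can be sketched in a few lines of code. The example below is an illustration under stated assumptions rather than a reference implementation: it assumes the three-term basis <math>1</math>, <math>\ln(y/(1-y))</math>, <math>(y-0.5)\ln(y/(1-y))</math>, uses made-up quantile pairs, and solves the normal equations with a small hand-written Gaussian elimination so that it needs only the standard library. The function names <code>basis</code>, <code>solve</code>, <code>fit_qpd</code>, and <code>quantile</code> are hypothetical.

```python
import math


def basis(y):
    """Assumed basis functions g_1..g_3 evaluated at cumulative probability y."""
    logit = math.log(y / (1.0 - y))
    return [1.0, logit, (y - 0.5) * logit]


def solve(A, b):
    """Solve the square system A a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    a = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * a[j] for j in range(i + 1, n))
        a[i] = (M[i][n] - tail) / M[i][i]
    return a


def fit_qpd(xs, ys):
    """Coefficients a = (Y^T Y)^{-1} Y^T x, computed from the normal equations."""
    Y = [basis(y) for y in ys]  # m x n matrix with entries g_j(y_i)
    n = len(Y[0])
    YtY = [[sum(row[i] * row[j] for row in Y) for j in range(n)] for i in range(n)]
    Ytx = [sum(row[i] * x for row, x in zip(Y, xs)) for i in range(n)]
    return solve(YtY, Ytx)


def quantile(a, y):
    """Evaluate F^{-1}(y) = sum_i a_i g_i(y) for 0 < y < 1."""
    return sum(ai * gi for ai, gi in zip(a, basis(y)))


# Illustrative quantile pairs (assumed values): 10th, 50th, and 90th percentiles.
xs = [10.0, 20.0, 40.0]
ys = [0.1, 0.5, 0.9]
a = fit_qpd(xs, ys)
print(a)
print([quantile(a, y) for y in ys])
```

With <math>m=n=3</math> the fitted quantile function reproduces the three input quantiles up to floating-point error. Passing more pairs than basis functions to the same <code>fit_qpd</code> would give the least-squares fit instead; in either case the result should still be checked against the feasibility constraint.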
 
=== Shape flexibility ===
A QPD with <math>n</math> terms, where <math>n\ge 2</math>, has <math>n-2</math> shape parameters. Thus, QPDs can be far more flexible than the [[Pearson distribution|Pearson distributions]], which have at most two shape parameters. For example, ten-term [http://www.metalogs.org metalog] distributions parameterized by 105 CDF points from 30 traditional source distributions (including normal, student-t, lognormal, gamma, beta, and extreme value) have been shown to approximate each such source distribution within a [[Kolmogorov–Smirnov test|K-S]] distance of 0.001 or less<ref>[[doi:10.1287/deca.2016.0338|Keelin, T.W. (2016), Table 8]]</ref>.
 
=== Transformations ===
QPD transformations are governed by a general property of quantile functions: for any [[quantile function]] <math>x=Q(y)</math> and increasing function <math>t(x), x=t^{-1} (Q(y))</math> is a [[quantile function]]<ref>Gilchrist, W., 2000. Statistical modelling with quantile functions. CRC Press.</ref>. For example, the [[quantile function]] of the [[normal distribution]], <math>x=\mu+\sigma \phi^{-1} (y)</math>, is a QPD by the Keelin and Powley definition. The natural logarithm, <math>t(x)=\ln(x-b_l)</math>, is an increasing function, so <math>x=b_l+e^{\mu+\sigma \phi^{-1} (y)}</math> is the [[quantile function]] of the [[Log-normal distribution|lognormal distribution]] with lower bound <math>b_l</math>. Importantly, this transformation converts an unbounded QPD into a semi-bounded QPD. Similarly, applying this log transformation to the unbounded metalog distribution<ref name="UnboundedMetalog">[[doi:10.1287/deca.2016.0338|Keelin, T.W. (2016), Section 3, pp. 249–257.]]</ref> yields the semi-bounded (log) metalog distribution<ref name="KeelinSec4">[[doi:10.1287/deca.2016.0338|Keelin, T.W. (2016), Section 4.]]</ref>; likewise, applying the logit transformation, <math>t(x)=\ln((x-b_l)/(b_u-x))</math>, yields the bounded (logit) metalog distribution<ref name="KeelinSec4" /> with lower and upper bounds <math>b_l</math> and <math>b_u</math>, respectively. Moreover, by considering <math>t(x)</math> to be <math>F^{-1} (y)</math> distributed, where <math>F^{-1} (y)</math> is any QPD that meets Keelin and Powley’s definition, the transformed variable maintains the above properties of feasibility, convexity, and fitting to data. Such transformed QPDs have greater shape flexibility than the underlying <math>F^{-1} (y)</math>, which has <math>n-2</math> shape parameters; the log transformation has <math>n-1</math> shape parameters, and the logit transformation has <math>n</math> shape parameters.
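
The log and logit transformations can be sketched directly from the formulas above. The following is a minimal illustration, assuming the normal quantile function as the underlying QPD and using arbitrary parameter values; the function names are hypothetical, and the standard normal inverse CDF is taken from Python's <code>statistics.NormalDist</code>.

```python
import math
from statistics import NormalDist


def normal_quantile(y, mu, sigma):
    """Unbounded quantile function x = mu + sigma * Phi^{-1}(y)."""
    return mu + sigma * NormalDist().inv_cdf(y)


def log_transformed_quantile(y, mu, sigma, lower):
    """Semi-bounded case: invert t(x) = ln(x - b_l), giving x = b_l + exp(z)."""
    z = normal_quantile(y, mu, sigma)
    return lower + math.exp(z)


def logit_transformed_quantile(y, mu, sigma, lower, upper):
    """Bounded case: invert t(x) = ln((x - b_l) / (b_u - x)).

    Solving for x gives x = (b_l + b_u * exp(z)) / (1 + exp(z)).
    """
    z = normal_quantile(y, mu, sigma)
    ez = math.exp(z)
    return (lower + upper * ez) / (1.0 + ez)


# Illustrative parameters (assumed values).
mu, sigma = 1.0, 0.5
b_l, b_u = 2.0, 10.0

for y in (0.05, 0.5, 0.95):
    print(y,
          log_transformed_quantile(y, mu, sigma, b_l),
          logit_transformed_quantile(y, mu, sigma, b_l, b_u))
```

Both transformed functions remain increasing in <math>y</math> because the inverse of an increasing <math>t(x)</math> is itself increasing; the first never falls below <math>b_l</math>, and the second stays strictly between <math>b_l</math> and <math>b_u</math>.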
 
=== Moments ===
* The unbounded metalog (meta-logistic) distribution<ref name="UnboundedMetalog" />, which is a power series expansion of the <math>\mu</math> and <math>s</math> parameters of the logistic quantile function.
* The semi-bounded and bounded metalog distributions<ref name="KeelinSec4" />, which are the log and logit transforms, respectively, of the unbounded metalog distribution.
* The SPT (symmetric-percentile triplet) unbounded, semi-bounded, and bounded metalog distributions<ref name="SPT">[[doi:10.1287/deca.2016.0338|Keelin, T.W. (2016), pp. 269–271.]]</ref>, which are parameterized by three CDF points and optional upper and lower bounds.
* The Simple Q-Normal distribution<ref>[[doi:10.1287/deca.1110.0213|Keelin, T.W., and Powley, B.W. (2011), pp. 208–210]]</ref>
* The metadistributions, including the meta-normal<ref>[[doi:10.1287/deca.2016.0338|Keelin, T.W. (2016), p. 253.]]</ref>
* Quantile functions expressed as [[polynomial]] functions of cumulative probability <math>y</math>, including [[Chebyshev polynomial]] functions.
 
Like the SPT metalog distributions<ref name="SPT" />, the Johnson Quantile-Parameterized Distributions<ref>[https://pubsonline.informs.org/doi/abs/10.1287/deca.2016.0343 Hadlock, C.C. and Bickel, J.E., 2017. Johnson quantile-parameterized distributions. Decision Analysis, 14(1), pp. 35–64.]</ref><ref>[https://pubsonline.informs.org/doi/abs/10.1287/deca.2018.0376 Hadlock, C.C. and Bickel, J.E., 2019. The generalized Johnson quantile-parameterized distribution system. Decision Analysis, 14(1), pp. 333.]</ref> (JQPDs) are parameterized by three quantiles. JQPDs do not meet Keelin and Powley’s QPD definition, and thus have their own properties.
 
== Applications ==
The original applications of QPDs were by decision analysts wishing to conveniently convert expert-assessed quantiles (e.g., 10th, 50th, and 90th quantiles) into smooth continuous probability distributions. QPDs have also been used to fit output data from simulations in order to represent those outputs (both CDFs and PDFs) as closed-form continuous distributions<ref>[[doi:10.1287/deca.2016.0338|Keelin, T.W. (2016), Section 6.2.2, pp. 271–274.]]</ref>. Used in this way, they are typically more stable and smoother than histograms. Similarly, since QPDs can impose fewer shape constraints than traditional distributions, they have been used to fit a wide range of empirical data in order to represent those data sets as continuous distributions (e.g., reflecting bimodality that may exist in the data in a straightforward manner<ref>[[doi:10.1287/deca.2016.0338|Keelin, T.W. (2016), Section 6.1.1, Figure 10, pp. 266–267.]]</ref>). Quantile parameterization enables a closed-form QPD representation of known distributions whose CDFs otherwise have no closed-form expression. Keelin et al. (2019)<ref>[https://dl.acm.org/doi/abs/10.5555/3400397.3400643 Keelin, T.W., Chrisman, L. and Savage, S.L. (2019). “The metalog distributions and extremely accurate sums of lognormals in closed form.” WSC '19: Proceedings of the Winter Simulation Conference. 3074–3085.]</ref> apply this to the sum of independent identically distributed lognormal distributions, where quantiles of the sum can be determined by a large number of simulations. Nine such quantiles are used to parameterize a semi-bounded metalog distribution that runs through each of these nine quantiles exactly. QPDs have also been applied to assess the risks of asteroid impact<ref>[[doi:10.1111/risa.12453|Reinhardt, J.D., Chen, X., Liu, W., Manchev, P. and Pate-Cornell, M.E. (2016). “Asteroid risk assessment: A probabilistic approach.” Risk Analysis. 36 (2): 244–261]]</ref>, cybersecurity<ref name="Faber" /><ref>[https://www.sciencedirect.com/science/article/pii/S0167404819300604 Wang, J., Neil, M. and Fenton, N. (2020). “A Bayesian network approach for cybersecurity risk assessment implementing and extending the FAIR model.” Computers & Security. 89: 101659.]</ref>, biases in projections of oil-field production when compared to observed production after the fact<ref>[https://www.onepetro.org/journal-paper/SPE-195914-PA Bratvold, R.B., Mohus, E., Petutschnig, D. and Bickel, E. (2020). “Production forecasting: Optimistic and overconfident—Over and over again.” Society of Petroleum Engineers. doi:10.2118/195914-PA.]</ref>, and future Canadian population projections based on combining the probabilistic views of multiple experts<ref>[https://library.oapen.org/bitstream/handle/20.500.12657/42565/2020_Book_DevelopmentsInDemographicForec.pdf?sequence=1#page=51 Dion, P., Galbraith, N., Sirag, E. (2020). “Using expert elicitation to build long-term projection assumptions.” In Developments in Demographic Forecasting, Chapter 3, pp. 43–62. Springer]</ref>. See Keelin (2016)<ref name="Keelin2016" /> for additional applications of the metalog distribution.
 
== External links ==
 
[[Category:Continuous distributions]]
 
== Quantile-parameterized distribution ==