'''Distributional data analysis''' is a branch of [[nonparametric statistics]] related to [[functional data analysis]]. It is concerned with random objects that are probability distributions, i.e., the statistical analysis of samples of random distributions where each atom of a sample is a distribution. One of the main challenges in distributional data analysis is that the space of probability distributions, while convex, is not a [[vector space]].
 
== Notation ==
Let <math>\nu</math> be a probability measure on <math>D</math>, where <math>D \subset \R^p</math> with <math>p \ge 1</math>. The probability measure <math>\nu</math> can equivalently be characterized by its [[cumulative distribution function]] <math>F</math> or, if it exists, its [[probability density function]] <math>f</math>. For univariate distributions with <math>p = 1</math>, the [[quantile function]] <math>Q=F^{-1}</math> can also be used.
 
For univariate distributions <math>\nu_1, \nu_2</math> with quantile functions <math>Q_1, Q_2</math>, the <math>p</math>-Wasserstein metric admits the closed form
<math display="block">d_{W_p}(\nu_1, \nu_2) = \left( \int_0^1 |Q_1(s) - Q_2(s)|^p \, ds \right)^{1/p}.</math>
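The quantile-function form of the metric lends itself to direct computation from samples. The following is a minimal illustration in Python; the function name, grid size, and use of empirical quantiles are illustrative choices, not taken from the cited literature.

```python
import numpy as np

def wasserstein_p(sample1, sample2, p=2, n_grid=1000):
    """Approximate the p-Wasserstein distance between two univariate
    distributions from samples, via their empirical quantile functions."""
    s = (np.arange(n_grid) + 0.5) / n_grid   # evaluation grid on (0, 1)
    q1 = np.quantile(sample1, s)             # empirical Q_1(s)
    q2 = np.quantile(sample2, s)             # empirical Q_2(s)
    # Riemann-sum approximation of the integral in the closed form
    return (np.mean(np.abs(q1 - q2) ** p)) ** (1.0 / p)
```

For instance, shifting a sample by a constant <math>c</math> yields a 2-Wasserstein distance of exactly <math>|c|</math>, since the quantile functions differ by the shift.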
 
== Mean and variance ==
 
For a probability measure <math>\nu \in \mathcal{F}</math>, consider a [[stochastic process|random process]] <math>\mathfrak{F}</math> such that <math>\nu \sim \mathfrak{F}</math>. One way to define mean and variance of <math>\nu</math> is to introduce the [[Fréchet mean]] and the Fréchet variance. With respect to the metric <math>d</math> on <math>\mathcal{F}</math>, the ''Fréchet mean'' <math>\mu_\oplus</math>, also known as the [[barycenter]], and the ''Fréchet variance'' <math>V_\oplus</math> are defined as<ref>{{Cite journal|last1=Fréchet|first1=M.|date=1948|title=Les éléments aléatoires de nature quelconque dans un espace distancié|journal=Annales de l'Institut Henri Poincaré|volume=10|issue=4|pages=215–310}}</ref>
<math display="block">\begin{align}
\mu_\oplus &= \underset{\mu \in \mathcal{F}}{\operatorname{argmin}}\ \mathbb{E}\left[d^2(\nu, \mu)\right], \\
V_\oplus &= \mathbb{E}\left[d^2(\nu, \mu_\oplus)\right].
\end{align}</math>
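In the 2-Wasserstein space of univariate distributions, the Fréchet mean has a simple representation: it is the distribution whose quantile function is the pointwise average of the sample quantile functions. A minimal sketch in Python, assuming all quantile functions are evaluated on a common grid (function names are illustrative):

```python
import numpy as np

def wasserstein_barycenter(quantiles):
    """Frechet mean in the 2-Wasserstein space of univariate distributions:
    the pointwise average of the quantile functions (rows of `quantiles`)."""
    return np.mean(quantiles, axis=0)

def frechet_variance(quantiles):
    """Frechet variance: average squared W_2 distance to the barycenter,
    computed as an average squared L^2 distance between quantile functions."""
    bary = wasserstein_barycenter(quantiles)
    return np.mean([np.mean((q - bary) ** 2) for q in quantiles])
```

For three distributions whose quantile functions are shifts of each other by <math>-1, 0, 1</math>, the barycenter is the middle one and the Fréchet variance equals <math>2/3</math>.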
 
== Modes of variation ==
 
[[Modes of variation]] are useful concepts for depicting the variation of data around the mean function. Based on the [[Kosambi-Karhunen-Loève theorem|Karhunen-Loève representation]], modes of variation show the contribution of each [[eigenfunction]] to the variation around the mean.
 
=== Functional principal component analysis ===
 
[[Functional principal component analysis|Functional principal component analysis (FPCA)]] can be applied directly to the probability density functions.<ref>{{Cite journal|last1=Kneip|first1=A.|last2=Utikal|first2=K.J.|date=2001|title=Inference for density families using functional principal component analysis|journal=Journal of the American Statistical Association|volume=96|issue=454|pages=519–532|doi=10.1198/016214501753168235|s2cid=123524014 }}</ref> Consider a distribution process <math>\nu \sim \mathfrak{F}</math> and let <math>f</math> be the density function of <math>\nu</math>. Denote the mean density function by <math>\mu(t) = \mathbb{E}\left[f(t)\right]</math> and the covariance function by <math>G(s,t) = \operatorname{Cov}(f(s), f(t))</math>, with orthonormal eigenfunctions <math>\{\phi_j\}_{j=1}^\infty</math> and eigenvalues <math>\{\lambda_j\}_{j=1}^\infty</math>.
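In practice, FPCA on densities is often carried out by discretizing <math>G(s,t)</math> on a grid and diagonalizing the resulting matrix. The sketch below is a hypothetical helper, assuming a sample of densities evaluated on a common uniform grid; the scaling by the grid spacing turns the matrix eigenproblem into an approximation of the integral-operator eigenproblem.

```python
import numpy as np

def density_fpca(F, grid):
    """FPCA for densities on a common uniform grid.
    F: (n_samples, n_grid) array of density values.
    Returns the mean function, eigenvalues, and L^2-orthonormal
    eigenfunctions of the discretized covariance G(s, t)."""
    dt = grid[1] - grid[0]
    mu = F.mean(axis=0)                   # mean density mu(t)
    Fc = F - mu
    G = Fc.T @ Fc / (len(F) - 1)          # sample covariance G(s, t) on the grid
    vals, vecs = np.linalg.eigh(G * dt)   # discretized integral operator
    idx = np.argsort(vals)[::-1]          # eigh returns ascending order
    lam = vals[idx]
    phi = vecs[:, idx].T / np.sqrt(dt)    # rescale to unit L^2 norm
    return mu, lam, phi
```

For a rank-one sample <math>f_i = \mu + a_i \phi</math> with <math>\|\phi\|_{L^2}=1</math>, the procedure recovers the sample variance of the scores <math>a_i</math> as the leading eigenvalue.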
 
The <math>j</math>-th mode of variation is then
<math display="block">g_j(\alpha, t) = \mu(t) + \alpha \sqrt{\lambda_j} \phi_j(t), \quad \alpha \in [-A, A],</math>
with some constant <math>A</math>, such as 2 or 3.
 
=== Transformation FPCA ===
Assume the probability density functions <math>f</math> exist, and let <math>\mathcal{F}_f</math> be the space of density functions.
Transformation approaches introduce a continuous and invertible transformation <math>\Psi: \mathcal{F}_f \to \mathbb{H}</math>, where <math>\mathbb{H}</math> is a [[Hilbert space]] of functions. For instance, the log quantile density transformation or the centered log-ratio transformation are popular choices.<ref>{{Cite journal|last1=Petersen|first1=A.|last2=Müller|first2=H.-G.|date=2016|title=Functional data analysis for density functions by transformation to a Hilbert space|journal=Annals of Statistics|volume=44|issue=1|pages=183–218|doi=10.1214/15-AOS1363}}</ref><ref>{{Cite journal|last1=van den Boogaart|first1=K.G.|last2=Egozcue|first2=J.J.|last3=Pawlowsky-Glahn|first3=V.|date=2014|title=Bayes Hilbert spaces|journal=Australian and New Zealand Journal of Statistics|volume=56|issue=2|pages=171–194|doi=10.1111/anzs.12074|s2cid=120612578 }}</ref>
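The centered log-ratio transformation and its inverse can be illustrated on a discretized density. The sketch below assumes a strictly positive density on a uniform grid, normalized so that its Riemann sum is one; the function names are illustrative.

```python
import numpy as np

def clr(f):
    """Centered log-ratio transform of a positive density on a uniform grid:
    log f minus its average, a discretized version of the zero-integral
    constraint that maps densities into a linear space."""
    lf = np.log(f)
    return lf - np.mean(lf)

def clr_inverse(g, dt):
    """Inverse map: exponentiate and renormalize to integrate to one."""
    f = np.exp(g)
    return f / (np.sum(f) * dt)
```

Because the inverse renormalizes, the additive constant removed by centering is irrelevant, and <math>\Psi^{-1}(\Psi(f)) = f</math> holds exactly for normalized densities.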
The transformation modes of variation are obtained by carrying out FPCA in the transformed space and mapping back,
<math display="block">g_j(\alpha, t) = \Psi^{-1}\left( \mu_\Psi + \alpha \sqrt{\lambda_j} \phi_j \right)(t),
</math>
where <math>\mu_\Psi</math>, <math>\lambda_j</math>, and <math>\phi_j</math> are the mean, eigenvalues, and eigenfunctions of the transformed process <math>\Psi(f)</math>.
 
=== Log FPCA and Wasserstein Geodesic PCA ===
 
Endowed with metrics such as the Wasserstein metric <math>d_{W_2}</math> or the Fisher-Rao metric <math>d_{FR}</math>, the space <math>\mathcal{F}</math> carries a (pseudo) Riemannian structure that can be exploited. Denote the [[tangent space]] at the Fréchet mean <math>\mu_\oplus</math> as <math>T_{\mu_\oplus}</math>, and define the logarithm and exponential maps <math>\log_{\mu_\oplus}:\mathcal{F} \to T_{\mu_\oplus}</math> and <math>\exp_{\mu_\oplus}: T_{\mu_\oplus} \to \mathcal{F}</math>.
Let <math>Y</math> be the projected density onto the tangent space, <math>Y = \log_{\mu_\oplus}(f)</math>.
Note that the tangent space <math>T_{\mu_\oplus}</math> is a subspace of <math>L^2_{\mu_\oplus}</math>, the Hilbert space of <math>{\mu_\oplus}</math>-square-integrable functions. Obtaining the principal geodesic subspace (PGS) is equivalent to performing PCA in <math>L^2_{\mu_\oplus}</math> under the constraint of lying in a convex and closed subset.<ref name="gpca2"/> A simple approximation of Wasserstein Geodesic PCA is therefore Log FPCA, which relaxes the geodesicity constraint, though alternative techniques have been suggested.<ref name="gpca1"/><ref name="gpca2"/>
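For univariate distributions in the 2-Wasserstein space, the logarithm map at a base measure <math>\mu_\oplus</math> has an explicit form, <math>\log_{\mu_\oplus}(\nu)(t) = Q_\nu(F_{\mu_\oplus}(t)) - t</math>, and the exponential map adds the tangent vector back to the identity to obtain a transport map. A minimal sketch under these assumptions, evaluating the maps on a grid (function names are illustrative):

```python
import numpy as np

def log_map(Q_nu, F_mu, grid):
    """Log map at the barycenter in the 1-d Wasserstein space:
    log_mu(nu)(t) = Q_nu(F_mu(t)) - t, evaluated on a grid.
    Q_nu and F_mu are callables (quantile and distribution functions)."""
    return Q_nu(F_mu(grid)) - grid

def exp_map(v, grid):
    """Exp map: the transport map T(t) = t + v(t) evaluated on the grid;
    the image measure is the pushforward of mu under T."""
    return grid + v
```

For example, with <math>\mu_\oplus</math> uniform on <math>[0,1]</math> and <math>\nu</math> a shifted uniform, the log map is the constant shift, and the exp map recovers the shifted transport map.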
 
== Distributional regression ==
=== Fréchet regression ===
 
Fréchet regression is a generalization of regression with responses taking values in a metric space and Euclidean predictors.<ref name="freg">{{Cite journal|last1=Petersen|first1=A.|last2=Müller|first2=H.-G.|date=2019|title=Fréchet regression for random objects with Euclidean predictors|journal=Annals of Statistics|volume=47|issue=2|pages=691–719|doi=10.1214/17-AOS1624 }}</ref><ref name="review">{{Cite journal|last1=Petersen|first1=A.|last2=Zhang|first2=C.|last3=Kokoszka|first3=P.|date=2022|title=Modeling probability density functions as data objects|journal=Econometrics and Statistics|volume=21|pages=159–178|doi=10.1016/j.ecosta.2021.04.004 |s2cid=236589040 }}</ref> Using the Wasserstein metric <math>d_{W_2}</math>, Fréchet regression models can be applied to distributional objects. The global Wasserstein-Fréchet regression model is defined as
{{NumBlk|::|<math display="block">m_\oplus(x) = \underset{\omega \in \mathcal{W}_2}{\operatorname{argmin}}\ \mathbb{E}\left[ s(X, x)\, d_{W_2}^2(\nu, \omega) \right], \quad s(X, x) = 1 + (X - \mu)^\top \Sigma^{-1} (x - \mu),</math>|{{EquationRef|1}}}}
where <math>\mu = \mathbb{E}[X]</math> and <math>\Sigma = \operatorname{Var}(X)</math>. The local Wasserstein-Fréchet regression model replaces the global weight <math>s(X, x)</math> with a kernel-based local weight
<math display="block">s(X, x, h) = \frac{1}{\sigma_0^2} K_h(X - x)\left[\mu_2 - \mu_1 (X - x)\right],</math>
where <math>\mu_j = \mathbb{E} \left[K_h(X-x)(X-x)^j \right]</math>, <math>j = 0,1,2,</math> and <math>\sigma_0^2 = \mu_0 \mu_2 - \mu_1^2</math>.
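For univariate distributional responses represented by quantile functions on a common grid, the weighted Fréchet minimizer reduces to a weighted average of quantile functions, followed, if necessary, by a projection to restore monotonicity. A minimal sketch of the global model under these assumptions (function names and the simple isotonic step are illustrative):

```python
import numpy as np

def global_frechet_weights(X, x):
    """Global Frechet regression weights s(X_i, x) = 1 + (X_i - Xbar)' Sigma^{-1} (x - Xbar),
    with X an (n, p) predictor array and x a length-p point."""
    X = np.asarray(X, dtype=float)
    Xbar = X.mean(axis=0)
    Xc = X - Xbar
    Sigma = Xc.T @ Xc / len(X)            # empirical covariance (ddof = 0)
    return 1.0 + Xc @ np.linalg.solve(np.atleast_2d(Sigma), np.atleast_1d(x - Xbar))

def wasserstein_frechet_fit(Q, X, x):
    """Fitted quantile function at x: the weighted average of the sample
    quantile functions (rows of Q), then a crude monotone projection."""
    w = global_frechet_weights(X, x)
    q = (w[:, None] * Q).sum(axis=0) / w.sum()
    return np.maximum.accumulate(q)       # enforce nondecreasing quantiles
```

When the quantile functions depend linearly on the predictor, the fit interpolates and extrapolates that linear relationship exactly.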
 
=== Transformation based approaches ===
Consider response variables <math>\nu</math> that are probability distributions. With the space of density functions <math>\mathcal{F}_f</math> and a Hilbert space of functions <math>\mathbb{H}</math>, consider continuous and invertible transformations <math>\Psi: \mathcal{F}_f \to \mathbb{H}</math>. Examples of transformations include the log hazard transformation, the log quantile density transformation, and the centered log-ratio transformation. Linear methods such as [[functional regression#Functional linear models (FLMs)|functional linear models]] are applied to the transformed variables, and the fitted models are interpreted back in the original density space <math>\mathcal{F}_f</math> using the inverse transformation.<ref name="review"/>
 
=== Random object approaches ===
In Wasserstein regression, both predictors <math>\omega</math> and responses <math>\nu</math> can be distributional objects. Let <math>\omega_{\oplus}</math> and <math>\nu_{\oplus}</math> be the Wasserstein means of <math>\omega</math> and <math>\nu</math>, respectively. The Wasserstein regression model is defined as
<math display="block">\mathbb{E}(\log_{\nu_{\oplus}} \nu \mid \log_{\omega_{\oplus}} \omega) = \Gamma(\log_{\omega_{\oplus}} \omega),</math>
Also, the Fisher-Rao metric <math>d_{FR}</math> can be used in a similar fashion.<ref name="review"/><ref name="dai2022">{{Cite journal|last1=Dai|first1=X.|date=2022|title=Statistical inference on the Hilbert sphere with application to random densities|journal=Electronic Journal of Statistics|volume=16|issue=1|pages=700–736|doi=10.1214/21-EJS1942 }}</ref>
 
== Hypothesis testing ==
=== Wasserstein F-test ===
 
The Wasserstein <math>F</math>-test has been proposed to test for the effects of the predictors in the Fréchet regression framework with the Wasserstein metric.<ref name="ftest">{{Cite journal|last1=Petersen|first1=A.|last2=Liu|first2=X.|last3=Divani|first3=A.A.|date=2021|title=Wasserstein F-tests and confidence bands for the Fréchet regression of density response curves|journal=Annals of Statistics|volume=49|issue=1|pages=590–611|doi=10.1214/20-AOS1971 |arxiv=1910.13418 |s2cid=204950494 }}</ref> Consider Euclidean predictors <math>X \in \R^p</math> and distributional responses <math>\nu \in \mathcal{W}_2</math>. Denote the Wasserstein mean of <math>\nu</math> as <math>\mu_\oplus^*</math>, and the sample Wasserstein mean as <math>\hat{\mu}_\oplus^*</math>. Consider the global Wasserstein-Fréchet regression model <math>m_\oplus (x)</math> defined in ({{EquationNote|1}}), which is the conditional Wasserstein mean given <math>X=x</math>. The estimator of <math>m_\oplus (x)</math>, <math>\hat{m}_\oplus (x)</math>, is obtained by minimizing the empirical version of the criterion.
To approximate the null distribution of the test statistic, [[Welch-Satterthwaite_equation|Satterthwaite's approximation]] or a [[Bootstrapping_(statistics)|bootstrap]] approach have been proposed.<ref name="ftest"/>
 
=== Tests for the intrinsic mean ===
The Hilbert sphere <math>\mathcal{S}^\infty</math> is defined as <math>\mathcal{S}^\infty = \left\{f \in \mathbb{H} : \| f \|_{\mathbb{H}}=1 \right\}</math>, where <math>\mathbb{H}</math> is a separable infinite-dimensional Hilbert space with inner product <math>\langle \cdot, \cdot \rangle_{\mathbb{H}}</math> and norm <math>\| \cdot \|_{\mathbb{H}}</math>. Consider the space of square-root densities <math>\mathcal{X} = \left\{ x: D \to \mathbb{R} \mid x = \sqrt{f},\ \int_D f(t)dt = 1 \right\}</math>. Under the Fisher-Rao metric <math>d_{FR}</math>, <math>\mathcal{X}</math> is the positive orthant of the Hilbert sphere <math>\mathcal{S}^\infty</math> with <math>\mathbb{H} = L^2(D)</math>.
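In this square-root representation, the Fisher-Rao distance between two densities is the great-circle distance between their square roots on the sphere, <math>d_{FR}(f_1, f_2) = \arccos \int_D \sqrt{f_1(t)}\sqrt{f_2(t)}\,dt</math>. A minimal sketch on a uniform grid (the function name and discretization are illustrative):

```python
import numpy as np

def fisher_rao(f1, f2, dt):
    """Fisher-Rao distance between densities on a common uniform grid:
    the arc length between their square roots on the unit Hilbert sphere."""
    inner = np.sum(np.sqrt(f1) * np.sqrt(f2)) * dt   # spherical inner product
    return np.arccos(np.clip(inner, -1.0, 1.0))      # clip guards rounding
```

Two densities with disjoint supports have inner product zero and hence distance <math>\pi/2</math>, the maximal value on the positive orthant.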
 
where <math>W_k \overset{iid}{\sim} \chi_1^2</math>. In practice, the test can be carried out by approximating the limiting distribution with Monte Carlo simulations or by bootstrap tests. Extensions to two-sample and paired tests have also been proposed.<ref name="dai2022"/>
 
== Distributional time series ==
[[Autoregressive model|Autoregressive (AR) models]] for distributional time series are constructed by defining [[Stationary process|stationarity]] and utilizing the notion of difference between distributions using <math>d_{W_2}</math> and <math>d_{FR}</math>.
 
where <math>\mu_R = \mathbb{E}[R_t]</math> and the <math>\epsilon_t</math> are mean-zero i.i.d. innovations. An alternative model, the difference-based spherical autoregressive (DSAR) model, is defined with <math>R_t = x_{t+1} \ominus x_t</math>, with natural extensions to order <math>p</math>. A similar extension to the Wasserstein space has also been introduced.<ref>{{Cite journal|last1=Zhu|first1=C.|last2=Müller|first2=H.-G.|date=2023|title=Autoregressive optimal transport models|journal=Journal of the Royal Statistical Society Series B: Statistical Methodology|volume=85|issue=3|pages=1012–1033|doi=10.1093/jrsssb/qkad051 |pmid=37521164 |pmc=10376456 }}</ref>
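The flavor of such autoregressive models can be conveyed with a deliberately simplified example: an AR(1) fitted by least squares to centered quantile functions of a univariate distributional time series. This is an illustration of the autoregressive idea only, not the SAR or DSAR estimator from the cited literature.

```python
import numpy as np

def fit_war1(Q):
    """Least-squares estimate of beta in a simplified Wasserstein AR(1):
    (Q_{t+1} - Qbar) = beta * (Q_t - Qbar) + noise, where Q_t (row t of Q)
    is the quantile function of the distribution observed at time t."""
    Qbar = Q.mean(axis=0)                 # sample Wasserstein mean (quantile average)
    Z = Q - Qbar                          # centered quantile functions
    num = np.sum(Z[1:] * Z[:-1])          # lag-1 cross term
    den = np.sum(Z[:-1] ** 2)
    return num / den
```

For a series that alternates between two distributions around their barycenter, the estimated coefficient is <math>-1</math>, reflecting the period-two oscillation.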
 
== References ==
{{reflist}}
 
[[Category:Statistical analysis]]