{{Short description|Statistical measure}}
In [[statistics]] and in [[probability theory]], '''distance correlation''' or '''distance covariance''' is a measure of [[
Distance correlation can be used to perform a [[Statistical hypothesis testing|statistical test]] of dependence with a [[permutation test]]. One first computes the distance correlation (involving the re-centering of Euclidean distance matrices) between two random vectors, and then compares this value to the distance correlations of many shuffles of the data.
[[Image:Distance Correlation Examples.svg|thumb|400px|right|Several sets of (''x'', ''y'') points, with the distance correlation coefficient of ''x'' and ''y'' for each set. Compare to the graph on [[correlation]]]]
==Background==
The classical measure of dependence, the [[Pearson product-moment correlation coefficient|Pearson correlation coefficient]],<ref>{{harvs|nb|last=Pearson
The distance correlation is derived from a number of other quantities that are used in its specification, specifically: '''distance variance''', '''distance standard deviation''', and '''distance covariance'''. These quantities take the same roles as the ordinary [[Moment (mathematics)|moment]]s with corresponding names in the specification of the [[Pearson product-moment correlation coefficient]].
==Definitions==
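For a sample (''X''<sub>''k''</sub>, ''Y''<sub>''k''</sub>), ''k'' = 1, 2, ..., ''n'', first compute the ''n'' by ''n'' distance matrices (''a''<sub>''j'', ''k''</sub>) and (''b''<sub>''j'', ''k''</sub>) containing all pairwise distances
:<math>
a_{j,k} = \|X_j - X_k\|, \qquad b_{j,k} = \|Y_j - Y_k\|, \qquad j, k = 1, 2, \ldots, n,
</math>
where ||⋅|| denotes the Euclidean norm.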
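The doubly centered distances are
:<math>
A_{j,k} := a_{j,k} - \overline{a}_{j\cdot} - \overline{a}_{\cdot k} + \overline{a}_{\cdot\cdot}, \qquad B_{j,k} := b_{j,k} - \overline{b}_{j\cdot} - \overline{b}_{\cdot k} + \overline{b}_{\cdot\cdot},
</math>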
where <math>\textstyle \overline{a}_{j\cdot}</math> is the {{math|''j''}}-th row mean, <math>\textstyle \overline{a}_{\cdot k}</math> is the {{math|''k''}}-th column mean, and <math>\textstyle \overline{a}_{\cdot\cdot}</math> is the [[grand mean]] of the distance matrix of the {{math|''X''}} sample. The notation is similar for the {{math|''b''}} values. (In the matrices of centered distances (''A''<sub>''j'', ''k''</sub>) and (''B''<sub>''j'',''k''</sub>) all rows and all columns sum to zero.) The squared '''sample distance covariance''' (a scalar) is simply the arithmetic average of the products ''A''<sub>''j'', ''k ''</sub>''B''<sub>''j'', ''k''</sub>:
:<math>
\operatorname{dCov}^2_n(X,Y) := \frac{1}{n^2}\sum_{j=1}^n\sum_{k=1}^n A_{j,k}\,B_{j,k}.
</math>
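The sample computation described above (pairwise distances, double centering, average of products) can be sketched in Python. This is a minimal illustration using only the standard library, not the implementation found in the ''energy'' package; the function names <code>pairwise_distances</code>, <code>double_center</code> and <code>dcov2_sample</code> are chosen here for illustration, and each observation is assumed to be given as a tuple of coordinates.

```python
import math


def pairwise_distances(points):
    """Return the n x n matrix of Euclidean distances between observations."""
    return [[math.dist(p, q) for q in points] for p in points]


def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    n = len(d)
    row_means = [sum(row) / n for row in d]
    col_means = [sum(d[j][k] for j in range(n)) / n for k in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [d[j][k] - row_means[j] - col_means[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]


def dcov2_sample(x, y):
    """Squared sample distance covariance: the average of A[j][k] * B[j][k]."""
    a = double_center(pairwise_distances(x))
    b = double_center(pairwise_distances(y))
    n = len(a)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n ** 2


# Illustrative data (assumed for this example): three points on a line.
x = [(0.0,), (1.0,), (2.0,)]
print(dcov2_sample(x, x))  # squared sample distance variance, 40/81 up to rounding
```

Because the centered matrices have rows and columns summing to zero, a constant sample gives a centered matrix of zeros and hence a distance covariance of zero.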
The statistic ''T''<sub>''n''</sub> = ''n'' dCov<sup>2</sup><sub>''n''</sub>(''X'', ''Y'') determines a consistent multivariate test of independence of random vectors in arbitrary dimensions. For an implementation, see the ''dcov.test'' function in the ''energy'' package for [[R (programming language)|R]].
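One way to calibrate a test based on ''T''<sub>''n''</sub> is the permutation approach: compare the observed statistic with its values over many shuffles of one sample. The following Python sketch illustrates the idea under stated assumptions. The names <code>centered_distances</code>, <code>t_statistic</code> and <code>permutation_test</code>, the default of 199 permutations, and the "add one" p-value convention are choices made for this example; it is not a description of how ''dcov.test'' is implemented.

```python
import math
import random


def centered_distances(points):
    """Doubly centered Euclidean distance matrix of a sample of tuples."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row_means = [sum(row) / n for row in d]  # equal to column means by symmetry
    grand_mean = sum(row_means) / n
    return [
        [d[j][k] - row_means[j] - row_means[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]


def t_statistic(a, b, perm=None):
    """T_n = n * dCov_n^2, with the Y sample optionally reordered by perm."""
    n = len(a)
    if perm is None:
        perm = range(n)
    # Reordering the Y sample permutes rows and columns of B simultaneously.
    total = sum(a[j][k] * b[perm[j]][perm[k]] for j in range(n) for k in range(n))
    return total / n


def permutation_test(x, y, n_permutations=199, seed=0):
    """Return (observed T_n, permutation p-value)."""
    rng = random.Random(seed)
    a, b = centered_distances(x), centered_distances(y)
    observed = t_statistic(a, b)
    indices = list(range(len(x)))
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(indices)
        if t_statistic(a, b, indices) >= observed:
            exceed += 1
    # Counting the observed arrangement itself keeps the p-value above zero.
    return observed, (exceed + 1) / (n_permutations + 1)
```

The centered matrices are computed once, because shuffling the observations of one sample only reorders the rows and columns of its centered distance matrix.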
The population value of '''distance covariance''' can be defined along the same lines. Let ''X'' be a random variable that takes values in a ''p''-dimensional Euclidean space with probability distribution {{math|
:<math>a_\mu(x):= \operatorname{E}[\|X-x\|], \quad D(\mu) := \operatorname{E}[a_\mu(X)], \quad d_\mu(x, x') := \|x-x'\|-a_\mu(x)-a_\mu(x')+D(\mu).
</math>
where '''''E''''' denotes expected value, and <math>\textstyle (X, Y),</math> <math>\textstyle (X', Y'),</math> and <math>\textstyle (X'',Y'')</math> are independent and identically distributed (iid); that is, the primed pairs <math>\textstyle (X', Y')</math> and <math>\textstyle (X'',Y'')</math> are iid copies of <math>\textstyle (X, Y)</math>.
'''cov''', as follows:
This identity shows that the distance covariance is not the same as the covariance of distances, {{nowrap|cov({{norm|''X'' − ''X' ''}}, {{norm|''Y'' − ''Y' ''}})}}. The latter can be zero even if ''X'' and ''Y'' are not independent.
Alternatively, the distance covariance can be defined as the weighted [[Norm (mathematics)#Euclidean_norm|''L''<sup>2</sup> norm]] of the distance between the joint [[Characteristic function (probability theory)|characteristic function]] of the random variables and the product of their marginal characteristic functions:<ref name=SR2009a>{{harvnb|Székely
: <math>
\operatorname{dCov}^2(X,Y)= \frac 1 {c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{\left|
</math>
where
dCov<sup>2</sup>(''X'', ''Y'') = 0 if and only if ''X'' and ''Y'' are independent.
===Distance variance and distance standard deviation===
The ''distance variance'' is a special case of distance covariance when the two variables are identical. The population value of distance variance is the square root of
:<math>
\operatorname{dVar}^2(X) := \operatorname{dCov}^2(X,X),
</math>
where <math>
The ''sample distance variance'' is the square root of
:<math>
\operatorname{dVar}^2_n(X) := \operatorname{dCov}^2_n(X,X) = \frac{1}{n^2}\sum_{k,\ell}A_{k,\ell}^2,
</math>
which is a relative of [[Corrado Gini]]
The ''distance standard deviation'' is the square root of the ''distance variance''.
===Distance correlation===
The ''distance correlation''
:<math>
\operatorname{dCor}^2(X,Y) = \frac{\operatorname{dCov}^2(X,Y)}{\sqrt{\operatorname{dVar}^2(X)\,\operatorname{dVar}^2(Y)}},
</math>
and the ''sample distance correlation'' is defined by substituting the sample distance covariance and distance variances for the population coefficients above.
For easy computation of sample distance correlation see the ''dcor'' function in the ''energy'' package for [[R (programming language)|R]].
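As an illustration of the substitution described above, the following Python sketch computes a sample distance correlation from the centered distance matrices. It is a minimal example, not the ''dcor'' function of the ''energy'' package: the names <code>centered_distances</code>, <code>mean_product</code> and <code>dcor_sample</code> are hypothetical, observations are assumed to be tuples of coordinates, and the value 0 is returned when either sample distance variance is zero.

```python
import math


def centered_distances(points):
    """Doubly centered Euclidean distance matrix of a sample of tuples."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row_means = [sum(row) / n for row in d]  # equal to column means by symmetry
    grand_mean = sum(row_means) / n
    return [
        [d[j][k] - row_means[j] - row_means[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]


def mean_product(a, b):
    """Average of the elementwise products of two n x n matrices."""
    n = len(a)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n ** 2


def dcor_sample(x, y):
    """Sample distance correlation (the square root of dCor_n^2)."""
    a, b = centered_distances(x), centered_distances(y)
    dcov2 = mean_product(a, b)
    denominator = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denominator == 0.0:
        return 0.0  # convention used here when a sample is constant
    # max() guards against a tiny negative value caused by rounding.
    return math.sqrt(max(dcov2, 0.0) / denominator)


# Illustrative data (assumed): y is a nonlinear function of x,
# chosen so that the Pearson correlation of x and y is zero.
x = [(-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,)]
y = [(p[0] ** 2,) for p in x]
print(dcor_sample(x, y))  # strictly positive, unlike the Pearson correlation
```

In this example the relationship between the two samples is purely quadratic, so the Pearson correlation vanishes while the sample distance correlation does not.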
==Properties==
===Distance correlation===
{{Ordered list |list_style_type=lower-roman
|
this is in contrast to Pearson's correlation, which can be negative.
|
|
}}
===Distance covariance===
{{Ordered list |list_style_type=lower-roman
|
|
for all constant vectors <math>a_1, a_2</math>, scalars <math>b_1, b_2</math>, and orthonormal matrices <math>\mathbf{C}_1, \mathbf{C}_2</math>.
|
:<math>
\operatorname{dCov}(X_1 + X_2, Y_1 + Y_2) \leq \operatorname{dCov}(X_1, Y_1) + \operatorname{dCov}(X_2, Y_2).
</math>
Equality holds if and only if <math>X_1</math> and <math>Y_1</math> are both constants, or <math>X_2</math> and <math>Y_2</math> are both constants, or <math>X_1, X_2, Y_1, Y_2</math> are mutually independent.
|
}}
This last property is the most important effect of working with centered distances.
The statistic <math>\operatorname{dCov}^2_n(X,Y)</math> is a biased estimator of <math>\operatorname{dCov}^2(X,Y)</math>. Under independence of X and Y
:<math>
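\operatorname{E}[\operatorname{dCov}^2_n(X,Y)] = \frac{n-1}{n^2}\operatorname{E}[\|X-X'\|]\,\operatorname{E}[\|Y-Y'\|].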
</math>
An [[Bias of an estimator|unbiased estimator]] of <math>\operatorname{dCov}^2(X,Y)</math> is given by Székely and Rizzo.
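The text above does not spell out the unbiased estimator. The Python sketch below assumes it is the one based on ''U''-centered distance matrices from Székely and Rizzo (2014), in which the diagonal is set to zero and the row, column and grand sums are divided by ''n'' − 2 and (''n'' − 1)(''n'' − 2) instead of ''n'' and ''n''<sup>2</sup>. The function names <code>u_centered_distances</code> and <code>dcov2_unbiased</code> are chosen for this example, and observations are assumed to be tuples of coordinates.

```python
import math


def u_centered_distances(points):
    """U-centered distance matrix (assumed form; zero on the diagonal)."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row_sums = [sum(row) for row in d]  # equal to column sums by symmetry
    total = sum(row_sums)
    u = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            if j != k:
                u[j][k] = (
                    d[j][k]
                    - row_sums[j] / (n - 2)
                    - row_sums[k] / (n - 2)
                    + total / ((n - 1) * (n - 2))
                )
    return u


def dcov2_unbiased(x, y):
    """Estimate dCov^2 as the sum of off-diagonal products over n(n - 3)."""
    n = len(x)
    if n < 4:
        raise ValueError("this estimator needs at least four observations")
    a, b = u_centered_distances(x), u_centered_distances(y)
    total = sum(a[j][k] * b[j][k] for j in range(n) for k in range(n) if j != k)
    return total / (n * (n - 3))
```

Unlike <math>\operatorname{dCov}^2_n(X,Y)</math>, an estimator of this form can take negative values, which is the price paid for removing the bias.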
===Distance variance===
{{Ordered list |list_style_type=lower-roman
|
|
|
|
}}
Equality holds in (iv) if and only if one of the random variables {{mvar|X}} or {{mvar|Y}} is a constant.
</math>
Then for every <math>0<\alpha<2</math>, <math>X</math> and <math>Y</math> are independent if and only if <math>\operatorname{dCov}^2(X, Y; \alpha) = 0</math>. This characterization does not hold for exponent <math>\alpha=2</math>: in this case, for bivariate <math>(X, Y)</math>, <math>\operatorname{dCor}(X, Y; \alpha=2)</math> is a deterministic function of the Pearson correlation.
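The special role of <math>\alpha=2</math> can be checked numerically. The Python sketch below assumes that the sample version with exponent <math>\alpha</math> is obtained by raising each pairwise distance to the power <math>\alpha</math> before double centering, and it handles real-valued samples only. The function names <code>centered_powered_distances</code>, <code>dcor_alpha</code> and <code>pearson</code>, and the data, are chosen for this illustration. For <math>\alpha=2</math> the centered entries reduce to <math>-2(x_j-\bar{x})(x_k-\bar{x})</math>, so the sample distance correlation equals the absolute value of the sample Pearson correlation.

```python
import math


def centered_powered_distances(values, alpha):
    """Double-center the matrix of |u - v|**alpha for a real-valued sample."""
    n = len(values)
    d = [[abs(u - v) ** alpha for v in values] for u in values]
    row_means = [sum(row) / n for row in d]  # equal to column means by symmetry
    grand_mean = sum(row_means) / n
    return [
        [d[j][k] - row_means[j] - row_means[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]


def dcor_alpha(x, y, alpha=1.0):
    """Sample distance correlation with exponent alpha (assumed definition)."""
    a = centered_powered_distances(x, alpha)
    b = centered_powered_distances(y, alpha)
    n = len(x)

    def mean_product(p, q):
        return sum(p[j][k] * q[j][k] for j in range(n) for k in range(n)) / n ** 2

    denominator = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denominator == 0.0:
        return 0.0
    return math.sqrt(max(mean_product(a, b), 0.0) / denominator)


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


# Illustrative data (assumed): a negatively correlated pair of samples.
x = [0.5, 1.0, 2.5, 4.0, 7.0]
y = [3.0, 2.5, 2.0, 0.0, -1.0]
print(dcor_alpha(x, y, alpha=2.0), abs(pearson(x, y)))  # agree up to rounding
```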
:<math>
\operatorname{dCov}^2_n(X, Y; \alpha):= \frac{1}{n^2}\sum_{k,\ell}A_{k,\ell}\,B_{k,\ell}.
</math>
:<math>
\operatorname{dCov}^2(X, Y) := \operatorname{E}\big[d_\mu(X,X')d_\nu(Y,Y')\big].
</math>
This is non-negative for all such <math>X, Y</math> if and only if both metric spaces have negative type.
==Alternative definition of distance covariance==
Alternately, one could define '''''distance covariance''''' to be the square of the energy distance:
<math> \operatorname{dCov}^2(X,Y).</math> In this case, the distance standard deviation of <math>X</math> is measured in the same units as <math>X</math> distance, and there exists an unbiased estimator for the population distance covariance.
Under these alternate definitions, the distance correlation is also defined as the square <math>\operatorname{dCor}^2(X,Y)</math>, rather than the square root.
</math>
whenever the subtracted conditional expected value exists and denote by ''Y''<sub>''V''</sub> the ''V''-centered version of ''Y''.
:<math>
\operatorname{cov}_{U,V}^2(X,Y) := \operatorname{E}\left[X_U X_U^\mathrm{'} Y_V Y_V^\mathrm{'}\right]
</math>
:<math>
\operatorname{cov}_{\mathrm{id}}(X,Y) = \left\vert\operatorname{cov}(X,Y)\right\vert.
</math>
==Related metrics==
Other correlational metrics, including kernel-based correlational metrics (such as the Hilbert-Schmidt Independence Criterion or HSIC) can also detect linear and nonlinear interactions. Both distance correlation and kernel-based metrics can be used in methods such as [[canonical correlation analysis]] and [[independent component analysis]] to yield stronger [[statistical power]].
==See also==
* [[RV coefficient]]
* For a related third-order statistic, see [[Skewness#Distance skewness|Distance skewness]].
==Notes==
{{reflist|20em}}
==References==
*
*{{cite book |last=Gini
*{{cite book |last=Klebanov |first=L. B. |year=2005 |title=''N''-distances and their applications |publisher=[[Karolinum Press]], Charles University |place=Prague |isbn=9788024611525}}
*Pearson, K. (1895). "Note on regression and inheritance in the case of two parents", ''[[Proceedings of the Royal Society]]'', 58, 240–242.
*{{cite journal |doi=10.1214/09-AOAS312B |arxiv=1010.0822 |title=Discussion of: Brownian distance covariance |year=2009 |last1=Kosorok |first1=Michael R. |journal=[[The Annals of Applied Statistics]] |volume=3 |issue=4 |pages=1270–1278 |s2cid=88518490 }}
*{{Cite journal |last1=Lyons |first1=Russell |year=2014 |title=Distance covariance in metric spaces |journal=The Annals of Probability |volume=41 |issue=5 |pages=3284–3305 |arxiv=1106.5758 |doi=10.1214/12-AOP803 |s2cid=73677891}}
*{{cite journal |last=Pearson |first=K. |year=1895b |title=Notes on the history of correlation |journal=[[Biometrika]] |volume=13 |pages=25–45 |doi=10.1093/biomet/13.1.25 |url=https://zenodo.org/record/1431597 }}
*{{cite web |last1=Rizzo |first1=Maria |last2=Székely |first2=Gábor |date=2021-02-22 |title=energy: E-Statistics: Multivariate Inference via the Energy of Data |version=Version: 1.7-8 |url=https://cran.r-project.org/web/packages/energy/index.html |access-date=2021-10-31}}
*{{cite journal |last1=Székely |first1=Gábor J. |last2=Rizzo |first2=Maria L. |last3=Bakirov |first3=Nail K. |year=2007 |title=Measuring and testing independence by correlation of distances |journal=[[The Annals of Statistics]] |volume=35 |issue=6 |pages=2769–2794 |doi=10.1214/009053607000000505 |arxiv=0803.4101 |s2cid=5661488}}
*{{cite journal |doi=10.1214/09-AOAS312 |pmid=20574547 |pmc=2889501 |url=http://projecteuclid.org/download/pdfview_1/euclid.aoas/1267453933 |title=Brownian distance covariance |year=2009a |last1=Székely |first1=Gábor J. |last2=Rizzo |first2=Maria L. |journal=[[The Annals of Applied Statistics]] |volume=3 |issue=4 |pages=1236–1265 }}
*{{cite journal |doi=10.1214/09-AOAS312REJ |title=Rejoinder: Brownian distance covariance |year=2009b |last1=Székely |first1=Gábor J. |last2=Rizzo |first2=Maria L. |journal=[[The Annals of Applied Statistics]] |volume=3 |issue=4 |pages=1303–1308 |doi-access=free |arxiv=1010.0844 }}
*{{cite journal |last1=Székely |first1=Gábor J. |last2=Rizzo |first2=Maria L. |title=On the uniqueness of distance covariance |journal=[[Statistics & Probability Letters]] |year=2012 |volume=82 |issue=12 |pages=2278–2282 |doi=10.1016/j.spl.2012.08.007}}
*{{cite journal |arxiv=1310.2926 |last1=Székely |first1=Gabor J. |last2=Rizzo |first2=Maria L. |title=Partial Distance Correlation with Methods for Dissimilarities |journal=[[The Annals of Statistics]] |volume=42 |issue=6 |pages=2382–2412 |year=2014 |doi=10.1214/14-AOS1255 |bibcode=2014arXiv1310.2926S |s2cid=55801702 }}
==External links==
*[http://personal.bgsu.edu/~mrizzo/energy.htm E-statistics (energy statistics)] {{Webarchive|url=https://web.archive.org/web/20190913232038/http://personal.bgsu.edu/~mrizzo/energy.htm |date=2019-09-13 }}
{{DEFAULTSORT:Distance Correlation}}