Genome-wide complex trait analysis: Difference between revisions

AFD closed as keep (XFDcloser)
Line 33:
# Substantial data requirements: the number of SNPs genotyped per person should be in the thousands and ideally the hundreds of thousands for reasonable estimates of genetic similarity (although this is no longer such an issue for current commercial chips which default to hundreds of thousands or millions of markers); and the number of persons, for somewhat stable estimates of plausible SNP heritability, should be at least ''n''>1000 and ideally ''n''>10000.<ref>"GCTA will eventually provide direct DNA tests of quantitative genetic results based on twin and adoption studies. One problem is that many thousands of individuals are required to provide reliable estimates. Another problem is that more SNPs are needed than even the million SNPs genotyped on current SNP microarrays because there is much DNA variation not captured by these SNPs. As a result, GCTA cannot estimate all heritability, perhaps only about half of the heritability. The first reports of GCTA analyses estimate heritability to be about half the heritability estimates from twin and adoption studies for height (Lee, Wray, Goddard, & Visscher, 2011; Yang et al., 2010; Yang, Manolio, et al., 2011), and intelligence (Davies et al., 2011)." pg110, [https://www.dropbox.com/s/1iz7o1hqb8isas2/2012-plomin-behavioralgenetics.pdf ''Behavioral Genetics''], Plomin et al 2012</ref> In contrast, twin studies can offer precise estimates with a fraction of the sample size.
# Computational inefficiency: The original GCTA implementation scales poorly with increasing data size (<math>\mathcal{O}(\text{SNPs} \cdot n^2)</math>), so even if enough data is available for precise GCTA estimates, the computational burden may be infeasible. GCTA estimates can be meta-analyzed as a standard precision-weighted fixed-effect meta-analysis,<ref>[http://gcta.freeforums.net/thread/213/analysis-greml-results-multiple-cohorts "Meta-analysis of GREML results from multiple cohorts"], Yang 2015</ref> so research groups sometimes analyze cohorts or subsets separately and then pool the estimates meta-analytically (at the cost of additional complexity and some loss of precision). This has motivated the creation of faster implementations and variant algorithms which make different assumptions, such as using [[Method of moments (statistics)|moment matching]].<ref>[http://biorxiv.org/content/early/2016/08/18/070177 "Phenome-wide Heritability Analysis of the UK Biobank"], Ge et al 2016</ref>
# Need for raw data: GCTA requires the genetic similarity of all subjects and thus their raw genetic information; due to privacy concerns, individual patient data is rarely shared. GCTA cannot be run on the summary statistics reported publicly by many GWAS projects, and multiple GCTA estimates can be pooled only by [[meta-analysis]]. <br> In contrast, there are alternative techniques which operate on summaries reported by GWASes without requiring the raw data.<ref>Pasaniuc & Price 2016, [https://www.dropbox.com/s/4mgmun29xbund7z/2016-pasaniuc.pdf "Dissecting the genetics of complex traits using summary association statistics"]</ref> For example, [[Linkage disequilibrium score regression|LD score regression]]<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4495769/ "LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies"], Bulik-Sullivan et al 2015</ref> contrasts [[linkage disequilibrium]] statistics (available from public datasets like [[1000 Genomes]]) with the public summary effect sizes to infer heritability and estimate genetic correlations/overlaps of multiple traits. The [[Broad Institute]] runs [http://ldsc.broadinstitute.org/about/ LD Hub], which provides a public web interface to LD score regression for >=177 traits.<ref>[http://biorxiv.org/content/biorxiv/early/2016/05/03/051094.full.pdf "LD Hub: a centralized database and web interface to LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis"], Zheng et al 2016</ref> Another method using summary data is HESS.<ref>[http://biorxiv.org/content/early/2016/01/14/035907 "Contrasting the genetic architecture of 30 complex traits from summary association data"], Shi et al 2016</ref>
# Unreliable confidence intervals: the reported intervals may be incorrect, may extend outside the 0-1 range of heritability, and may be highly imprecise, because they rely on asymptotic approximations.<ref>[http://www.sciencedirect.com/science/article/pii/S0002929716301434 "Fast and Accurate Construction of Confidence Intervals for Heritability"], Schweiger et al 2016</ref>
# Underestimation of SNP heritability: GCTA implicitly assumes all classes of SNPs, rarer or commoner, newer or older, more or less in linkage disequilibrium, have the same effects on average; in humans, rarer and newer variants tend to have larger and more negative effects<ref>[https://www.dropbox.com/s/idh2vm1dkar3qho/2017-gazal.pdf "Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection"], Gazal et al 2017</ref> as they represent [[mutation load]] being purged by [[Negative selection (natural selection)|negative selection]]. As with measurement error, this will bias GCTA estimates towards underestimating heritability.
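
The pooling and confidence-interval points in the list above can be illustrated with a minimal Python sketch. It assumes that each cohort reports a SNP-heritability estimate and its standard error; the function names and all numbers are hypothetical and are not taken from GCTA or from any cited study. The first function is the generic inverse-variance (precision-weighted) fixed-effect estimator, and the second is a plain normal-approximation interval, which is not clamped and can therefore fall outside the 0-1 range.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Pool estimates by inverse-variance (precision) weighting.

    Returns the pooled estimate and its standard error.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need equally long, non-empty lists")
    # Each cohort is weighted by its precision, 1 / SE^2.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * h for w, h in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


def wald_interval(estimate, std_error, z=1.96):
    """Normal-approximation interval; deliberately NOT clamped to [0, 1]."""
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical per-cohort estimates and standard errors (illustration only).
cohort_h2 = [0.30, 0.50, 0.40]
cohort_se = [0.10, 0.20, 0.10]

pooled_h2, pooled_se = fixed_effect_meta(cohort_h2, cohort_se)
print(f"pooled estimate: {pooled_h2:.3f} (SE {pooled_se:.3f})")

# A small hypothetical cohort: the lower bound drops below zero.
low, high = wald_interval(0.10, 0.08)
print(f"interval for a small cohort: ({low:.3f}, {high:.3f})")
```

In this sketch the imprecise middle cohort receives a quarter of the weight of each of the other two, and the interval for the small cohort has a negative lower bound, which is impossible for a heritability.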