Genome-wide complex trait analysis: Difference between revisions

Revision as of 04:42, 3 September 2023 by 50.234.189.42 (edit summary: Removed much of the discussion of twin studies and intelligence from the introduction, which is not specifically relevant to GCTA.)
Latest revision as of 17:14, 5 June 2024 by Josve05a (minor edit to →Disadvantages; edit summary: Add: authors 1-1. Removed parameters. Some additions/deletions were parameter name changes. Altered template type. Add: pmid, doi, pages, issue, volume, journal, title, date, pmc, authors 1-10. Removed URL that duplicated identifier. Changed bare reference to CS1/2.)
(5 intermediate revisions by 3 users not shown)
Line 16:
== History ==
Estimating variance components such as heritability, shared environment, and maternal effects in biology and animal breeding with standard [[Analysis of variance\|ANOVA]]/[[Restricted maximum likelihood\|REML]] methods typically requires individuals of known relatedness, such as parent and child. Pedigree data are often unavailable or unreliable, so the methods either cannot be applied or require strict laboratory control of all breeding (which threatens the [[external validity]] of all estimates). Several authors noted that relatedness could instead be measured directly from genetic markers (and that, if individuals were reasonably closely related, only a few markers would be needed for adequate statistical power), leading Kermit Ritland to propose in 1996 that directly measured pairwise relatedness could be compared to pairwise phenotype measurements (Ritland 1996, [http://www.genetics.forestry.ubc.ca/ritland/reprints/1996_Evolution_HeritInFieldModel.pdf "A Marker-based Method for Inferences About Quantitative Inheritance in Natural Populations"] {{Webarchive\|url=https://web.archive.org/web/20090611224719/http://genetics.forestry.ubc.ca/ritland/reprints/1996_Evolution_HeritInFieldModel.pdf \|date=2009-06-11 }}<ref>see also Ritland 1996b, [http://genetics.forestry.ubc.ca/ritland/reprints/1996_GenetResearch_r.pdf "Estimators for pairwise relatedness and individual inbreeding coefficients"] {{Webarchive\|url=https://web.archive.org/web/20170116084901/http://genetics.forestry.ubc.ca/ritland/reprints/1996_GenetResearch_r.pdf \|date=2017-01-16 }}; Ritland & Ritland 1996, [http://genetics.forestry.ubc.ca/ritland/reprints/1996_Evolution_HeritInFieldMimulus.pdf "Inferences about quantitative inheritance based on natural population structure in the yellow monkeyflower, ''Mimulus guttatus''"] {{Webarchive\|url=https://web.archive.org/web/20160924204921/http://www.genetics.forestry.ubc.ca/ritland/reprints/1996_Evolution_HeritInFieldMimulus.pdf \|date=2016-09-24 }}; Lynch & Ritland 1999, [http://www.genetics.org/content/152/4/1753.full "Estimation of Pairwise Relatedness With Molecular Markers"]; Ritland 2000, [http://www.genetics.forestry.ubc.ca/RITLAND/reprints/2000_ME_Review.pdf "Marker-inferred relatedness as a tool for detecting heritability in nature"] {{Webarchive\|url=https://web.archive.org/web/20160925061647/http://www.genetics.forestry.ubc.ca/RITLAND/reprints/2000_ME_Review.pdf \|date=2016-09-25 }}; Thomas 2005, [https://www.dropbox.com/s/45kxuo2p00lii6k/2005-thomas.pdf "The estimation of genetic relationships using molecular markers and their efficiency in estimating heritability in natural populations"]</ref>). As genome sequencing costs dropped steeply over the 2000s, it became possible to acquire enough markers on enough subjects for reliable estimates using very distantly related individuals. An early application of the method to humans came with Visscher et al. 2006<ref>Visscher et al 2006, [http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020041 "Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings"]</ref>/2007,<ref>Visscher et al 2007, [http://www.sciencedirect.com/science/article/pii/S0002929707638841 "Genome partitioning of genetic variation for height from 11,214 sibling pairs"]</ref> which used SNP markers to estimate the actual relatedness of siblings and to estimate heritability directly from that measured genetic sharing. In humans, unlike the original animal/plant applications, relatedness is usually known with high confidence in the 'wild population', so the benefit of GCTA lies more in avoiding the assumptions of classic behavioral genetics designs, verifying their results, and partitioning heritability by SNP class and chromosome.
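The idea described above, comparing marker-based pairwise relatedness with pairwise phenotype similarity, can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not GCTA's own REML procedure: it builds a relatedness matrix from standardized allele counts and regresses products of standardized phenotypes on relatedness across all distinct pairs (a Haseman–Elston-style moment estimator). All function names, the simulated data, and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def genetic_relationship_matrix(genotypes):
    """Marker-based relatedness from an (individuals x SNPs) array of 0/1/2 allele counts."""
    freqs = genotypes.mean(axis=0) / 2.0
    keep = (freqs > 0) & (freqs < 1)  # drop markers with no variation in the sample
    g, p = genotypes[:, keep], freqs[keep]
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))  # standardize each SNP
    return z @ z.T / z.shape[1]


def pairwise_regression_heritability(grm, phenotype):
    """Slope of phenotype cross-products on relatedness over all distinct pairs."""
    y = (phenotype - phenotype.mean()) / phenotype.std()
    i, j = np.triu_indices(len(y), k=1)  # each unordered pair once, no self-pairs
    x = grm[i, j]
    products = y[i] * y[j]
    xc = x - x.mean()
    return float(np.dot(xc, products - products.mean()) / np.dot(xc, xc))


def simulate(n=400, n_snps=1000, h2=0.5, seed=0):
    """Simulate unrelated individuals with an additive trait (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.9, size=n_snps)
    genotypes = rng.binomial(2, freqs, size=(n, n_snps)).astype(float)
    z = (genotypes - 2 * freqs) / np.sqrt(2 * freqs * (1 - freqs))
    effects = rng.normal(0.0, np.sqrt(h2 / n_snps), size=n_snps)
    noise = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return genotypes, z @ effects + noise


if __name__ == "__main__":
    genotypes, phenotype = simulate()
    grm = genetic_relationship_matrix(genotypes)
    print("estimated SNP heritability:", pairwise_regression_heritability(grm, phenotype))
```

With only a few hundred unrelated individuals the estimate is noisy, which is consistent with the large sample sizes the method needs in practice.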
The first use of GCTA proper in humans was published in 2010, finding that 45% of the variance in human height can be explained by the included SNPs.<ref name="Yang2010">[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3232052/ "Common SNPs explain a large proportion of heritability for human height"], Yang et al 2010</ref><ref>"[https://www.ncbi.nlm.nih.gov/pubmed/21142928 A Commentary on ‘Common SNPs Explain a Large Proportion of the Heritability for Human Height’ by Yang et al. (2010)"], Visscher et al 2010</ref> (Large GWASes on height have since confirmed the estimate.<ref name="Wood2014">[http://neurogenetics.qimrberghofer.edu.au/papers/Wood2014NatGenet.pdf "Defining the role of common variation in the genomic and biological architecture of adult human height"], Wood et al 2014</ref>) The GCTA algorithm was then described and a software implementation published in 2011.<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3014363/ "GCTA: A Tool for Genome-wide Complex Trait Analysis"], Yang et al 2011</ref> It has since been used to study a wide variety of biological, medical, psychiatric, and psychological traits in humans, and has inspired many variant approaches.
Line 40:
# Limited inference: GCTA estimates are inherently limited in that they cannot estimate broad-sense heritability as twin/family studies can, because they only estimate the heritability due to SNPs. Hence, while they serve as a critical check on the unbiasedness of the twin/family studies, GCTAs cannot replace them for estimating total genetic contributions to a trait.
# Substantial data requirements: the number of SNPs genotyped per person should be in the thousands and ideally the hundreds of thousands for reasonable estimates of genetic similarity (although this is no longer such an issue for current commercial chips, which default to hundreds of thousands or millions of markers); and the number of persons, for somewhat stable estimates of plausible SNP heritability, should be at least ''n''>1000 and ideally ''n''>10000.<ref>"GCTA will eventually provide direct DNA tests of quantitative genetic results based on twin and adoption studies. One problem is that many thousands of individuals are required to provide reliable estimates. Another problem is that more SNPs are needed than even the million SNPs genotyped on current SNP microarrays because there is much DNA variation not captured by these SNPs. As a result, GCTA cannot estimate all heritability, perhaps only about half of the heritability. The first reports of GCTA analyses estimate heritability to be about half the heritability estimates from twin and adoption studies for height (Lee, Wray, Goddard, & Visscher, 2011; Yang et al., 2010; Yang, Manolio, et al., 2011), and intelligence (Davies et al., 2011)." p. 110, [https://www.dropbox.com/s/1iz7o1hqb8isas2/2012-plomin-behavioralgenetics.pdf ''Behavioral Genetics''], Plomin et al 2012</ref> In contrast, twin studies can offer precise estimates with a fraction of the sample size.
# Computational inefficiency: The original GCTA implementation scales poorly with increasing data size (<math>\mathcal{O}(\text{SNPs} \cdot n^2)</math>), so even if enough data is available for precise GCTA estimates, the computational burden may be prohibitive.
GCTA can be meta-analyzed as a standard precision-weighted fixed-effect meta-analysis,<ref>[http://gcta.freeforums.net/thread/213/analysis-greml-results-multiple-cohorts "Meta-analysis of GREML results from multiple cohorts"], Yang 2015</ref> so research groups sometimes estimate cohorts or subsets and then pool them meta-analytically (at the cost of additional complexity and some loss of precision). This has motivated the creation of faster implementations and variant algorithms which make different assumptions, such as using [[Method of moments (statistics)\|moment matching]].<ref>{{Cite bioRxiv \|biorxiv=10.1101/070177 \|first1=Tian \|last1=Ge \|first2=Chia-Yen \|last2=Chen \|title=Phenome-wide Heritability Analysis of the UK Biobank \|date=2016 \|last3=Neale \|first3=Benjamin M. \|last4=Sabuncu \|first4=Mert R. \|last5=Smoller \|first5=Jordan W.}}</ref>
# Need for raw data: GCTA requires genetic similarity of all subjects and thus their raw genetic information; due to privacy concerns, individual patient data is rarely shared. GCTA cannot be run on the summary statistics reported publicly by many GWAS projects, and if pooling multiple GCTA estimates, a [[meta-analysis]] must be performed. <br> In contrast, there are alternative techniques which operate on summaries reported by GWASes without requiring the raw data.<ref>Pasaniuc & Price 2016, [https://www.dropbox.com/s/4mgmun29xbund7z/2016-pasaniuc.pdf "Dissecting the genetics of complex traits using summary association statistics"]</ref> For example, "[[Linkage disequilibrium score regression\|LD score regression]]"<ref>{{cite journal \| pmc=4495769 \| date=2015 \| last1=Bulik-Sullivan \| first1=B. K. \| last2=Loh \| first2=P. R. \| last3=Finucane \| first3=H. \| last4=Ripke \| first4=S. \| last5=Yang \| first5=J. \| author6=Schizophrenia Working Group of the Psychiatric Genomics Consortium \| last7=Patterson \| first7=N. \| last8=Daly \| first8=M. J. \| last9=Price \| first9=A. L. \| last10=Neale \| first10=B. M. \| title=LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies \| journal=Nature Genetics \| volume=47 \| issue=3 \| pages=291–295 \| doi=10.1038/ng.3211 \| pmid=25642630 }}</ref> contrasts [[linkage disequilibrium]] statistics (available from public datasets like [[1000 Genomes]]) with the public summary effect-sizes to infer heritability and estimate genetic correlations/overlaps of multiple traits. The [[Broad Institute]] runs [http://ldsc.broadinstitute.org/about/ LD Hub] {{Webarchive\|url=https://web.archive.org/web/20160511100955/http://ldsc.broadinstitute.org/about/ \|date=2016-05-11 }}, which provides a public web interface to >=177 traits with LD score regression.<ref>[http://biorxiv.org/content/biorxiv/early/2016/05/03/051094.full.pdf "LD Hub: a centralized database and web interface to LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis"], Zheng et al 2016</ref> Another method using summary data is HESS.<ref>[http://biorxiv.org/content/early/2016/01/14/035907 "Contrasting the genetic architecture of 30 complex traits from summary association data"], Shi et al 2016</ref>
# Confidence intervals may be incorrect, may fall outside the 0-1 range of heritability, and may be highly imprecise because they rely on asymptotic approximations.<ref>{{cite journal \| doi=10.1016/j.ajhg.2016.04.016 \| title=Fast and Accurate Construction of Confidence Intervals for Heritability \| journal=The American Journal of Human Genetics \| date=2 June 2016 \| volume=98 \| issue=6 \| pages=1181–1192 \| last1=Schweiger \| first1=Regev \| last2=Kaufman \| first2=Shachar \| last3=Laaksonen \| first3=Reijo \| last4=Kleber \| first4=Marcus E. \| last5=März \| first5=Winfried \| last6=Eskin \| first6=Eleazar \| last7=Rosset \| first7=Saharon \| last8=Halperin \| first8=Eran \| pmid=27259052 \| pmc=4908190 }}</ref>
# Underestimation of SNP heritability: GCTA implicitly assumes all classes of SNPs, rarer or commoner, newer or older, more or less in linkage disequilibrium, have the same effects on average; in humans, rarer and newer variants tend to have larger and more negative effects<ref>[https://www.dropbox.com/s/idh2vm1dkar3qho/2017-gazal.pdf "Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection"], Gazal et al 2017</ref> as they represent [[mutation load]] being purged by [[Negative selection (natural selection)\|negative selection]]. As with measurement error, this will bias GCTA estimates towards underestimating heritability.
== Interpretation ==
GCTA provides an unbiased estimate of the total variance in phenotype explained by all variants included in the relatedness matrix (and any variation correlated with those SNPs). This estimate can also be interpreted as the maximum prediction accuracy (<math>R^2</math>) that could be achieved from a linear predictor using all SNPs in the relatedness matrix. The latter interpretation is particularly relevant to the development of polygenic risk scores, as it defines their maximum accuracy. GCTA estimates are sometimes misinterpreted as estimates of total (or narrow-sense, i.e. additive) heritability, but this is not a guarantee of the method.
GCTA estimates are likewise sometimes misinterpreted as "lower bounds" on the narrow-sense heritability, but this is also incorrect: first, because GCTA estimates can be biased (including biased upwards) if the model assumptions are violated; and second, because, by definition (and when model assumptions are met), GCTA can provide an unbiased estimate of the narrow-sense heritability if all causal variants are included in the relatedness matrix. The interpretation of the GCTA estimate in relation to the narrow-sense heritability thus depends on the variants used to construct the relatedness matrix. Most frequently, GCTA is run with a single relatedness matrix constructed from common SNPs and will not capture (or not fully capture) the contribution of the following factors:
Line 113:
* GEMMA<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3386377/ "Genome-wide efficient mixed-model analysis for association studies"], Zhou & Stephens 2012</ref>
* EMMAX<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3092069/ "Variance component model to account for sample structure in genome-wide association studies"], Kang et al 2010</ref>
* [http://www.epcc.ed.ac.uk/projects-portfolio/reacta REACTA (formerly ACTA)] {{Webarchive\|url=https://web.archive.org/web/20160523120255/http://www.epcc.ed.ac.uk/projects-portfolio/reacta \|date=2016-05-23 }} claims order-of-magnitude runtime reductions<ref>[https://web.archive.org/web/20160522235202/http://bioinformatics.oxfordjournals.org/content/early/2012/09/27/bioinformatics.bts571.full.pdf "Advanced Complex Trait Analysis"], Gray et al 2012</ref><ref>[https://www.semanticscholar.org/paper/Regional-heritability-advanced-complex-trait-Cebamanos-Gray/c340835e1baf4b9fcafbfb001841bbd4793f598f/pdf "Regional Heritability Advanced Complex Trait Analysis for GPU and Traditional Parallel Architecture"], Cebamanos et al 2012</ref>
* [http://www.hsph.harvard.edu/alkes-price/software/ BOLT-REML]/BOLT-LMM<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342297/ "Efficient Bayesian mixed model analysis increases association power in large cohorts"], Loh et al 2015</ref> ([https://data.broadinstitute.org/alkesgroup/BOLT-LMM/ manual] {{Webarchive\|url=https://web.archive.org/web/20160611003139/https://data.broadinstitute.org/alkesgroup/BOLT-LMM/ \|date=2016-06-11 }}), which is faster and scales better,<ref name="Loh2015">[http://biorxiv.org/content/biorxiv/early/2015/06/05/016527.full.pdf "Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis"], Loh et al 2015; see also [http://biorxiv.org/content/early/2015/06/05/016527 "Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis"], Loh et al 2015</ref> with potentially better efficiency in the meta-analysis scenario<ref>[http://biorxiv.org/content/early/2015/05/29/020115 "Mixed Models for Meta-Analysis and Sequencing"], Bulik-Sullivan 2015</ref>
* [http://scholar.harvard.edu/tge/software/megha MEGHA]<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4345618/ "Massively expedited genome-wide heritability analysis (MEGHA)"], Ge et al 2015</ref>
* PLINK >1.9 (December 2013) supports [https://www.cog-genomics.org/plink2/ "the use of genetic relationship matrices in mixed model association analysis and other calculations"]
Line 135:
* [http://espace.library.uq.edu.au/view/UQ:342517/UQ342517_OA.pdf "Research review: Polygenic methods and their application to psychiatric traits"], Wray et al. 2014
* [http://medicine.tums.ac.ir:803/Users/Javad_TavakoliBazzaz/Medical%20Genetics-2/Heritability%20in%20the%20genomics%20era.pdf "Heritability in the genomics era — concepts and misconceptions"] {{Webarchive\|url=https://web.archive.org/web/20160522201032/http://medicine.tums.ac.ir:803/Users/Javad_TavakoliBazzaz/Medical%20Genetics-2/Heritability%20in%20the%20genomics%20era.pdf \|date=2016-05-22 }}, Visscher et al. 2008
* [http://www.sciencedirect.com/science/article/pii/S2001037015000458 "Uncovering the Genetic Architectures of Quantitative Traits"], Lee et al. 2016
* [http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12129/pdf "Estimating heritability using genomic data"], Stanton-Geddes et al. 2013
Line 154:
* [https://www.youtube.com/watch?v=b32OwqBPHkI "Genomics, Big Data, Medicine, and Complex Traits"] (Peter Visscher talk)
* [https://www.dropbox.com/s/1otmbu840xejjv1/MCTFR_talk.pdf "The Genetic Architectures of Psychological Traits"], Lee 2014 slides
* [https://www.youtube.com/watch?v=VI-5HlYQpNE "Heritability-based models for prediction of complex traits"], [https://sites.google.com/site/baldingstatisticalgenetics/home David Balding] {{Webarchive\|url=https://web.archive.org/web/20161008184835/https://sites.google.com/site/baldingstatisticalgenetics/home \|date=2016-10-08 }} 2015
[[Category:Behavioural genetics]]