Genome-wide complex trait analysis: Difference between revisions

Added original research tag, illustrated with reference 1.
Tags: Mobile edit Mobile web edit
m Disadvantages: | Add: authors 1-1. Removed parameters. Some additions/deletions were parameter name changes. | Use this tool. Report bugs. | #UCB_Gadget | Altered template type. Add: pmid, doi, pages, issue, volume, journal, title, date, pmc, authors 1-10. Removed URL that duplicated identifier. Changed bare reference to CS1/2. | Use this tool. Report bugs. | #UCB_Gadget
 
(28 intermediate revisions by 20 users not shown)
Line 1:
{{short description|Statistical method for genetic variance component estimation}}
{{Redirect|GCTA|the TV camera used in the Apollo space program|Apollo TV camera#RCA J-Series Ground-Commanded Television Assembly (GCTA){{!}}Apollo TV camera}}
{{Multiple issues|
{{essay-like|date=February 2020}}
{{technical|date=January 2017}}
{{Original research|date=September 2019}}
}}
'''Genome-wide complex trait analysis (GCTA) Genome-based [[restricted maximum likelihood]] (GREML)''' is a statistical method for [[variance]] component estimation in genetics which quantifies the total narrow-sense (additive) contribution to a trait's [[heritability]] of a particular subset of genetic variants (typically limited to [[Single-nucleotide polymorphism|SNPs]] with [[Minor allele frequency|MAF]] >1%, hence terms such as "chip heritability"/"SNP heritability"). This is done by directly quantifying the chance genetic similarity of unrelated individuals and comparing it to their measured similarity on a trait; if two unrelated individuals are relatively similar genetically and also have similar trait measurements, then the measured genetics are likely to causally influence that trait, and the correlation can to some degree tell how much. This can be illustrated by plotting the squared pairwise trait differences between individuals against their estimated degree of relatedness.<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3232052/figure/F3/ Figure 3] of Yang et al 2010, or Figure 3 of Ritland & Ritland 1996</ref> The GCTA framework can be applied in a variety of settings. For example, it can be used to examine changes in heritability over aging and development.<ref name="Deary2012">[https://www.researchgate.net/profile/David_Dave_Liewald/publication/221760226_Genetic_contributions_to_stability_and_change_in_intelligence_from_childhood_to_old_age/links/02e7e52ca9a723a8fa000000.pdf "Genetic contributions to stability and change in intelligence from childhood to old age"], Deary et al 2012</ref>. 
It can also be extended to analyse bivariate [[genetic correlation]]s between traits.<ref name="Lee2012">Lee et al 2012, [http://bioinformatics.oxfordjournals.org/content/28/19/2540.full "Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood"]</ref> There is an ongoing debate about whether GCTA generates reliable or stable estimates of heritability when used on current SNP data.<ref>{{Cite journal |last=Krishna Kumar |first=Siddharth |last2=Feldman |first2=Marcus W. |last3=Rehkopf |first3=David H. |last4=Tuljapurkar |first4=Shripad |date=2016-01-05 |title=Limitations of GCTA as a solution to the missing heritability problem |journal=Proceedings of the National Academy of Sciences of the United States of America |volume=113 |issue=1 |pages=E61–70 |doi=10.1073/pnas.1520109113 |issn=1091-6490 |pmc=4711841 |pmid=26699465}}</ref> The method is based on the outdated and false dichotomy of genes versus the environment. It also suffers from serious methodological weaknesses, such as susceptibility to [[population stratification]].<ref>{{cite journal |last1=BURT |first1=CALLIE H. |last2=SIMONS |first2=RONALD L. |title=HERITABILITY STUDIES IN THE POSTGENOMIC ERA: THE FATAL FLAW IS CONCEPTUAL |journal=Criminology |date=February 2015 |volume=53 |issue=1 |pages=103–112 |doi=10.1111/1745-9125.12060}}</ref>
'''Genome-wide complex trait analysis''' ('''GCTA'''), or '''genome-based [[restricted maximum likelihood]]''' ('''GREML'''), is a statistical method for [[heritability]] estimation in genetics that quantifies the total additive contribution of a set of genetic variants to a trait. GCTA is typically applied to common single-nucleotide polymorphisms ([[Single-nucleotide polymorphism|SNPs]]) on a genotyping array (or "chip"), and the resulting estimate is therefore termed "chip heritability" or "SNP heritability".
 
GCTA operates by directly quantifying the chance genetic similarity of unrelated individuals and comparing it to their measured similarity on a trait: if pairs of unrelated individuals who are relatively similar genetically also tend to have similar trait measurements, then the measured genetics are likely to causally influence that trait, and the strength of the association indicates to what degree. This can be illustrated by plotting the squared pairwise trait differences between individuals against their estimated degree of relatedness.<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3232052/figure/F3/ Figure 3] of Yang et al 2010, or Figure 3 of Ritland & Ritland 1996</ref> GCTA makes a number of modeling assumptions, and whether and when these assumptions are satisfied continues to be debated.
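
The idea of comparing pairwise genetic similarity with pairwise trait similarity can be sketched on simulated data. The following is a minimal illustration only: it uses a simple moment-based (Haseman-Elston-style) regression of phenotype cross-products on relatedness rather than the REML procedure that GCTA itself uses, and cross-products are used in place of squared differences because they carry the same information for a standardized trait. The sample size, SNP count, allele-frequency range, and all function names are assumptions chosen for the example.

```python
import numpy as np

def standardize(genotypes):
    """Center and scale an (individuals x SNPs) matrix of 0/1/2 allele counts."""
    freq = genotypes.mean(axis=0) / 2.0
    polymorphic = (freq > 0.0) & (freq < 1.0)  # drop SNPs with no variation
    g, p = genotypes[:, polymorphic], freq[polymorphic]
    return (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

def relatedness_matrix(z):
    """Average, over SNPs, of the products of standardized genotypes."""
    return z @ z.T / z.shape[1]

def he_regression(grm, phenotype):
    """Slope (through the origin) of phenotype cross-products on relatedness."""
    y = (phenotype - phenotype.mean()) / phenotype.std()
    upper = np.triu_indices_from(grm, k=1)  # each distinct pair exactly once
    relatedness = grm[upper]
    cross_products = np.outer(y, y)[upper]
    return float(relatedness @ cross_products / (relatedness @ relatedness))

def simulate(n=1000, m=500, h2=0.5, seed=0):
    """Simulate unrelated individuals with a purely additive trait (assumed model)."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.5, size=m)            # common variants only
    genotypes = rng.binomial(2, freqs, size=(n, m)).astype(float)
    z = standardize(genotypes)
    effects = rng.normal(0.0, np.sqrt(h2 / z.shape[1]), size=z.shape[1])
    noise = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return z, z @ effects + noise

z, phenotype = simulate()
estimate = he_regression(relatedness_matrix(z), phenotype)
print(f"simulated h2 = 0.50, estimated h2 = {estimate:.2f}")
```

Because every causal variant is included in the simulated relatedness matrix and the trait is strictly additive, the estimate should land near the simulated value; real data violate these assumptions to varying degrees.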
 
The GCTA framework has also been extended in a number of ways: quantifying the contribution from multiple SNP categories (i.e. functional partitioning); quantifying the contribution of gene-environment interactions; quantifying the contribution of non-additive or non-linear effects of SNPs; and bivariate analyses of multiple phenotypes to quantify their genetic covariance (co-heritability or [[genetic correlation]]).
 
GCTA estimates have implications for the potential for discovery from [[Genome-wide association study|genome-wide association studies (GWAS)]] as well as for the design and accuracy of [[polygenic scores]]. GCTA estimates from common variants are typically substantially lower than other estimates of total or narrow-sense heritability (such as those from twin or kinship studies), which has contributed to the debate over the [[missing heritability problem]].
GCTA heritability estimates are useful because they provide lower bounds<ref>{{Cite journal |last=Duncan |first=L. E. |last2=Ratanatharathorn |first2=A. |last3=Aiello |first3=A. E. |last4=Almli |first4=L. M. |last5=Amstadter |first5=A. B. |last6=Ashley-Koch |first6=A. E. |last7=Baker |first7=D. G. |last8=Beckham |first8=J. C. |last9=Bierut |first9=L. J. |date=March 2018 |title=Largest GWAS of PTSD (N=20 070) yields genetic overlap with schizophrenia and sex differences in heritability |journal=Molecular Psychiatry |volume=23 |issue=3 |pages=666–673 |doi=10.1038/mp.2017.77 |issn=1476-5578 |pmc=5696105 |pmid=28439101 |quote="A common misconception about SNP-chip heritability estimates calculated with GCTA and LDSC is that they should be similar to twin study estimates, when in reality twin studies have the advantage of capturing all genetic effects—common, rare and those not genotyped by available methods. Thus, the assumption should be that h2SNP < h2TWIN when using GCTA and LDSC, and this is what we observe for PTSD, as has been observed for many other phenotypes.}}</ref> for the genetic contributions to traits such as [[Heritability of IQ|intelligence]] without relying on the assumptions used in [[twin study|twin studies]] and other family and [[Genealogy|pedigree]] studies, thereby corroborating them<ref>Eric Turkheimer ([http://people.virginia.edu/~ent3c/papers2/StillMissingFinal.pdf "Still Missing"], Turkheimer 2011) discusses the GCTA results in the context of the twin study debate: "Of the three reservations about quantitative genetic heritability that were outlined at the outset—the assumptions of twin and family studies, the universality of heritability, and the absence of mechanism—the new paradigm has put the first to rest, and before continuing to explain my skepticism about whether the most important problems have been solved, it is worth appreciating what a significant accomplishment this is. 
Thanks to the Visscher program of research, it should now be impossible to argue that the whole body of quantitative genetic research showing the universal importance of genes for human development was somehow based on a sanguine view of the equal environments assumption in twin studies, putting an end to an entire misguided school of thought among traditional opponents of classical quantitative (and by association behavioral) genetics (e.g., Joseph, 2010; Kamin & Goldberger, 2002)"; see also [https://www.vox.com/the-big-idea/2017/5/18/15655638/charles-murray-race-iq-sam-harris-science-free-speech Turkheimer, Harden, & Nisbett]: "These methods have given scientists a new way to compute heritability: Studies that measure DNA sequence variation directly have shown that pairs of people who are not relatives, but who are slightly more similar genetically, also have more similar IQs than other pairs of people who happen to be more different genetically. These “DNA-based” heritability studies don’t tell you much more than the classical twin studies did, but they put to bed many of the lingering suspicions that twin studies were fundamentally flawed in some way. Like the validity of intelligence testing, the heritability of intelligence is no longer scientifically contentious."</ref><ref>"This finding of strong genome-wide pleiotropy across diverse cognitive and learning abilities, indexed by general intelligence, is a major finding about the origins of individual differences in intelligence. Nonetheless, this finding seems to have had little impact in related fields such as cognitive neuroscience or experimental cognitive psychology. 
We suggest that part of the reason for this neglect is that these fields generally ignore individual differences.65,66 Another reason might be that the evidence for this finding rested largely on the twin design, for which there have always been concerns about some of its assumptions;6 we judge that this will change now that GCTA is beginning to confirm the twin results." --[http://www.nature.com/mp/journal/vaop/ncurrent/full/mp2014105a.html "Genetics and intelligence differences: five special findings"], Plomin & Deary 2015</ref><ref>[http://www.gwern.net/docs/genetics/2016-plomin.pdf "Top 10 Replicated Findings From Behavioral Genetics"], Plomin et al., 2016: "This research has primarily relied on the twin design in which the resemblance of identical and fraternal twins is compared and the adoption design in which the resemblance of relatives separated by adoption is compared. Although the twin and adoption designs have been criticized separately (Plomin et al., 2013), these two designs generally converge on the same conclusion despite being based on very different assumptions, which adds strength to these conclusions...GCTA underestimates genetic influence for several reasons and requires samples of several thousand individuals to reveal the tiny signal of chance genetic similarity from the noise of DNA differences across the genome (Vinkhuyzen, Wray, Yang, Goddard, & Visscher, 2013). Nonetheless, GCTA has consistently yielded evidence for significant genetic influence for cognitive abilities (Benyamin et al., 2014; Davies et al., 2015; St. Pourcain et al., 2014), psychopathology (L. K. Davis et al., 2013; Gaugler et al., 2014; Klei et al., 2012; Lubke et al., 2012, 2014; McGue et al., 2013; Ripke et al., 2013; Wray et al., 2014), personality (C. A. 
Rietveld, Cesarini, et al., 2013; Verweij et al., 2012; Vinkhuyzen et al., 2012), and substance use or drug dependence (Palmer et al., 2015; Vrieze, McGue, Miller, Hicks, & Iacono, 2013), thus supporting the results of twin and adoption studies."</ref> and enabling the design of well-[[statistical power|powered]] [[Genome-wide association study]] (GWAS) designs to find the specific genetic variants involved. For example, a GCTA estimate of 30% SNP heritability is consistent with a larger total genetic heritability of 70%. However, if the GCTA estimate was ~0%, then that would imply one of three things: a) there is no genetic contribution, b) the genetic contribution is entirely in the form of genetic variants not included, or c) the genetic contribution is entirely in the form of non-additive effects such as [[epistasis]]/[[Dominance (genetics)|dominance]]. Running GCTA on individual chromosomes and regressing the estimated proportion of trait variance explained by each chromosome against that chromosome's length can reveal whether the responsible genetic variants cluster or are distributed evenly across the genome or are [[sex-linked]]. Chromosomes can of course be replaced by more fine-grained or functionally informed subdivisions. Examining genetic correlations can reveal to what extent observed correlations, such as between intelligence and socioeconomic status, are due to the same genetic traits, and in the case of diseases, can indicate shared causal pathways such as can be inferred from the genetic variation jointly associated with schizophrenia and other mental diseases or reduced intelligence.
 
== History ==
 
Estimating variance components such as heritability, shared environment, and maternal effects in biology and animal breeding using standard [[Analysis of variance|ANOVA]]/[[Restricted maximum likelihood|REML]] methods typically requires individuals of known relatedness, such as parents and offspring. Pedigree data are often unavailable or unreliable, which either prevents the methods from being applied or requires strict laboratory control of all breeding (which threatens the [[external validity]] of all estimates). Several authors noted that relatedness could instead be measured directly from genetic markers (and that, if individuals were reasonably closely related, an economically small number of markers would suffice for statistical power), leading Kermit Ritland to propose in 1996 that directly measured pairwise relatedness could be compared to pairwise phenotype measurements (Ritland 1996, [http://www.genetics.forestry.ubc.ca/ritland/reprints/1996_Evolution_HeritInFieldModel.pdf "A Marker-based Method for Inferences About Quantitative Inheritance in Natural Populations"] {{Webarchive|url=https://web.archive.org/web/20090611224719/http://genetics.forestry.ubc.ca/ritland/reprints/1996_Evolution_HeritInFieldModel.pdf |date=2009-06-11 }}<ref>see also Ritland 1996b, [http://genetics.forestry.ubc.ca/ritland/reprints/1996_GenetResearch_r.pdf "Estimators for pairwise relatedness and individual inbreeding coefficients"] {{Webarchive|url=https://web.archive.org/web/20170116084901/http://genetics.forestry.ubc.ca/ritland/reprints/1996_GenetResearch_r.pdf |date=2017-01-16 }}; Ritland & Ritland 1996, [http://genetics.forestry.ubc.ca/ritland/reprints/1996_Evolution_HeritInFieldMimulus.pdf "Inferences about quantitative inheritance based on natural population structure in the yellow monkeyflower, ''Mimulus guttatus''"] {{Webarchive|url=https://web.archive.org/web/20160924204921/http://www.genetics.forestry.ubc.ca/ritland/reprints/1996_Evolution_HeritInFieldMimulus.pdf |date=2016-09-24 }}; Lynch & Ritland 1999, [http://www.genetics.org/content/152/4/1753.full "Estimation of Pairwise Relatedness With Molecular Markers"]; Ritland 2000, [http://www.genetics.forestry.ubc.ca/RITLAND/reprints/2000_ME_Review.pdf "Marker-inferred relatedness as a tool for detecting heritability in nature"] {{Webarchive|url=https://web.archive.org/web/20160925061647/http://www.genetics.forestry.ubc.ca/RITLAND/reprints/2000_ME_Review.pdf |date=2016-09-25 }}; Thomas 2005, [https://www.dropbox.com/s/45kxuo2p00lii6k/2005-thomas.pdf "The estimation of genetic relationships using molecular markers and their efficiency in estimating heritability in natural populations"]</ref>).
 
As genome sequencing costs dropped steeply over the 2000s, acquiring enough markers on enough subjects for reliable estimates using very distantly related individuals became possible. An early application of the method to humans came with Visscher et al. 2006<ref>Visscher et al 2006, [http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020041 "Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings"]</ref>/2007,<ref>Visscher et al 2007, [http://www.sciencedirect.com/science/article/pii/S0002929707638841 "Genome partitioning of genetic variation for height from 11,214 sibling pairs"]</ref> which used SNP markers to estimate the actual relatedness of siblings and to estimate heritability from this directly measured genetic sharing. In humans, unlike the original animal/plant applications, relatedness is usually known with high confidence in the 'wild population', and the benefit of GCTA is connected more to avoiding assumptions of classic behavioral genetics designs and verifying their results, and partitioning heritability by SNP class and chromosomes. The first use of GCTA proper in humans was published in 2010, finding that 45% of the variance in human height can be explained by the included SNPs.<ref name="Yang2010">[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3232052/ "Common SNPs explain a large proportion of heritability for human height"], Yang et al 2010</ref><ref>[https://www.ncbi.nlm.nih.gov/pubmed/21142928 "A Commentary on ‘Common SNPs Explain a Large Proportion of the Heritability for Human Height’ by Yang et al. (2010)"], Visscher et al 2010</ref> (Large GWASes on height have since confirmed the estimate.<ref name="Wood2014">[http://neurogenetics.qimrberghofer.edu.au/papers/Wood2014NatGenet.pdf "Defining the role of common variation in the genomic and biological architecture of adult human height"], Wood et al 2014</ref>) The GCTA algorithm was then described and a software implementation published in 2011.<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3014363/ "GCTA: A Tool for Genome-wide Complex Trait Analysis"], Yang et al 2011</ref> It has since been used to study a wide variety of biological, medical, psychiatric, and psychological traits in humans, and inspired many variant approaches.
 
== Benefits ==
Line 20 ⟶ 26:
{{main|Twin study#Criticism}}
 
Twin and family studies have long been used to estimate variance explained by particular categories of genetic and environmental causes. Across a wide variety of human traits studied, there is typically minimal shared-environment influence, considerable non-shared environment influence, and a large genetic component (mostly additive), which is on average ~50% and sometimes much higher for some traits such as height or intelligence.<ref>[http://www.gwern.net/docs/genetics/2015-polderman.pdf "Meta-analysis of the heritability of human traits based on fifty years of twin studies"], Polderman et al 2015</ref> However, the twin and family studies have been criticized for their reliance on a number of assumptions that are difficult or impossible to verify, such as the equal environments assumption (that the environments of [[monozygotic]] and [[dizygotic]] twins are equally similar), that there is no misclassification of zygosity (mistaking identical for fraternal and vice versa), that twins are representative of the general population, and that there is no [[assortative mating]]. Violations of these assumptions can result in both upwards and downwards bias of the parameter estimates.<ref>{{Cite journal|last1=Barnes|first1=J. C.|last2=Wright|first2=John Paul|last3=Boutwell|first3=Brian B.|last4=Schwartz|first4=Joseph A.|last5=Connolly|first5=Eric J.|last6=Nedelec|first6=Joseph L.|last7=Beaver|first7=Kevin M.|date=2014-11-01|title=Demonstrating the Validity of Twin Research in Criminology|url=https://www.researchgate.net/publication/267158254|journal=Criminology|language=en|volume=52|issue=4|pages=588–626|doi=10.1111/1745-9125.12049|issn=1745-9125}}</ref> (This debate and criticism have particularly focused on the [[heritability of IQ]].)
 
The use of SNP or whole-genome data from unrelated participants (with pairs who are too closely related, typically >0.025 or roughly fourth-cousin levels of similarity, being removed, and several [[Principal component analysis|principal components]] included in the regression to control for [[population stratification]]) bypasses many criticisms of heritability estimates: twins are often entirely uninvolved, there are no questions of equal treatment, relatedness is estimated precisely, and the samples are drawn from a broad variety of subjects.
Line 34 ⟶ 40:
# Limited inference: GCTA estimates are inherently limited in that they cannot estimate broad-sense heritability as twin/family studies can, since they only estimate the heritability due to SNPs. Hence, while they serve as a critical check on the unbiasedness of the twin/family studies, GCTA estimates cannot replace them for estimating total genetic contributions to a trait.
# Substantial data requirements: the number of SNPs genotyped per person should be in the thousands and ideally the hundreds of thousands for reasonable estimates of genetic similarity (although this is no longer such an issue for current commercial chips which default to hundreds of thousands or millions of markers); and the number of persons, for somewhat stable estimates of plausible SNP heritability, should be at least ''n''>1000 and ideally ''n''>10000.<ref>"GCTA will eventually provide direct DNA tests of quantitative genetic results based on twin and adoption studies. One problem is that many thousands of individuals are required to provide reliable estimates. Another problem is that more SNPs are needed than even the million SNPs genotyped on current SNP microarrays because there is much DNA variation not captured by these SNPs. As a result, GCTA cannot estimate all heritability, perhaps only about half of the heritability. The first reports of GCTA analyses estimate heritability to be about half the heritability estimates from twin and adoption studies for height (Lee, Wray, Goddard, & Visscher, 2011; Yang et al., 2010; Yang, Manolio, et al" 2011), and intelligence (Davies et al., 2011)." pg110, [https://www.dropbox.com/s/1iz7o1hqb8isas2/2012-plomin-behavioralgenetics.pdf ''Behavioral Genetics''], Plomin et al 2012</ref> In contrast, twin studies can offer precise estimates with a fraction of the sample size.
# Computational inefficiency: The original GCTA implementation scales poorly with increasing data size (<math>\mathcal{O}(\text{SNPs} \cdot n^2)</math>), so even if enough data is available for precise GCTA estimates, the computational burden may be infeasible. GCTA can be meta-analyzed as a standard precision-weighted fixed-effect meta-analysis,<ref>[http://gcta.freeforums.net/thread/213/analysis-greml-results-multiple-cohorts "Meta-analysis of GREML results from multiple cohorts"], Yang 2015</ref> so research groups sometimes estimate cohorts or subsets and then pool them meta-analytically (at the cost of additional complexity and some loss of precision). This has motivated the creation of faster implementations and variant algorithms which make different assumptions, such as using [[Method of moments (statistics)|moment matching]].<ref>{{Cite bioRxiv |biorxiv=10.1101/070177 |first1=Tian |last1=Ge |first2=Chia-Yen |last2=Chen |title=Phenome-wide Heritability Analysis of the UK Biobank |date=2016 |last3=Neale |first3=Benjamin M. |last4=Sabuncu |first4=Mert R. |last5=Smoller |first5=Jordan W.}}</ref>
# Need for raw data: GCTA requires the genetic similarity of all subjects and thus their raw genetic information; due to privacy concerns, individual patient data is rarely shared. GCTA cannot be run on the summary statistics reported publicly by many GWAS projects, and if pooling multiple GCTA estimates, a [[meta-analysis]] must be performed. <br> In contrast, there are alternative techniques which operate on summaries reported by GWASes without requiring the raw data.<ref>Pasaniuc & Price 2016, [https://www.dropbox.com/s/4mgmun29xbund7z/2016-pasaniuc.pdf "Dissecting the genetics of complex traits using summary association statistics"]</ref> For example, [[Linkage disequilibrium score regression|LD score regression]]<ref>{{cite journal | pmc=4495769 | date=2015 | last1=Bulik-Sullivan | first1=B. K. | last2=Loh | first2=P. R. | last3=Finucane | first3=H. | last4=Ripke | first4=S. | last5=Yang | first5=J. | author6=Schizophrenia Working Group of the Psychiatric Genomics Consortium | last7=Patterson | first7=N. | last8=Daly | first8=M. J. | last9=Price | first9=A. L. | last10=Neale | first10=B. M. | title=LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies | journal=Nature Genetics | volume=47 | issue=3 | pages=291–295 | doi=10.1038/ng.3211 | pmid=25642630 }}</ref> contrasts [[linkage disequilibrium]] statistics (available from public datasets like [[1000 Genomes]]) with the public summary effect-sizes to infer heritability and estimate genetic correlations/overlaps of multiple traits. 
The [[Broad Institute]] runs [http://ldsc.broadinstitute.org/about/ LD Hub] {{Webarchive|url=https://web.archive.org/web/20160511100955/http://ldsc.broadinstitute.org/about/ |date=2016-05-11 }} which provides a public web interface to >=177 traits with LD score regression.<ref>[http://biorxiv.org/content/biorxiv/early/2016/05/03/051094.full.pdf "LD Hub: a centralized database and web interface to LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis"], Zheng et al 2016</ref> Another method using summary data is HESS.<ref>[http://biorxiv.org/content/early/2016/01/14/035907 "Contrasting the genetic architecture of 30 complex traits from summary association data"], Shi et al 2016</ref>
# Unreliable confidence intervals: confidence intervals may be incorrect, may extend outside the 0-1 range of heritability, and may be highly imprecise because they rely on asymptotic approximations.<ref>{{cite journal | doi=10.1016/j.ajhg.2016.04.016 | title=Fast and Accurate Construction of Confidence Intervals for Heritability | journal=The American Journal of Human Genetics | date=2 June 2016 | volume=98 | issue=6 | pages=1181–1192 | last1=Schweiger | first1=Regev | last2=Kaufman | first2=Shachar | last3=Laaksonen | first3=Reijo | last4=Kleber | first4=Marcus E. | last5=März | first5=Winfried | last6=Eskin | first6=Eleazar | last7=Rosset | first7=Saharon | last8=Halperin | first8=Eran | pmid=27259052 | pmc=4908190 }}</ref>
# Underestimation of SNP heritability: GCTA implicitly assumes all classes of SNPs, rarer or commoner, newer or older, more or less in linkage disequilibrium, have the same effects on average; in humans, rarer and newer variants tend to have larger and more negative effects<ref>[https://www.dropbox.com/s/idh2vm1dkar3qho/2017-gazal.pdf "Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection"], Gazal et al 2017</ref> as they represent [[mutation load]] being purged by [[Negative selection (natural selection)|negative selection]]. As with measurement error, this will bias GCTA estimates towards underestimating heritability.
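
The precision-weighted fixed-effect pooling mentioned under computational inefficiency can be sketched as follows. This is a generic inverse-variance meta-analysis, not code from the GCTA software; the function name and the cohort estimates and standard errors are hypothetical values chosen only to show the arithmetic.

```python
import math

def fixed_effect_meta(estimates, standard_errors):
    """Inverse-variance weighted pooling of independent estimates."""
    weights = [1.0 / se ** 2 for se in standard_errors]  # precision of each cohort
    pooled = sum(w * h for w, h in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical per-cohort SNP-heritability estimates and their standard errors.
cohort_h2 = [0.30, 0.40, 0.35]
cohort_se = [0.10, 0.10, 0.05]

pooled, pooled_se = fixed_effect_meta(cohort_h2, cohort_se)
print(f"pooled h2 = {pooled:.3f} (SE {pooled_se:.3f})")
```

Cohorts with smaller standard errors dominate the pooled value, which is why splitting one large sample into subsets and pooling them loses some precision relative to a single joint analysis.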
 
== Interpretation ==
GCTA provides an unbiased estimate of the total variance in phenotype explained by all variants included in the relatedness matrix (and any variation correlated with those SNPs). This estimate can also be interpreted as the maximum prediction accuracy (<math>R^2</math>) that could be achieved from a linear predictor using all SNPs in the relatedness matrix. The latter interpretation is particularly relevant to the development of polygenic risk scores, as it defines their maximum accuracy. GCTA estimates are sometimes misinterpreted as estimates of total (or narrow-sense, i.e. additive) heritability, but the method does not guarantee this. GCTA estimates are likewise sometimes misinterpreted as "lower bounds" on the narrow-sense heritability, but this is also incorrect: first, because GCTA estimates can be biased (including biased upwards) if the model assumptions are violated; and second, because, by definition (and when model assumptions are met), GCTA can provide an unbiased estimate of the narrow-sense heritability if all causal variants are included in the relatedness matrix. The interpretation of the GCTA estimate in relation to the narrow-sense heritability thus depends on the variants used to construct the relatedness matrix.
GCTA estimates are often misinterpreted as "the total genetic contribution", and since they are often much less than the twin study estimates, the twin studies are presumed to be biased and the genetic contribution to a particular trait is minor.<ref>[https://www.independentsciencenews.org/health/still-chasing-ghosts-a-new-genetic-methodology-will-not-find-the-missing-heritability/ "Still Chasing Ghosts: A New Genetic Methodology Will Not Find the 'Missing Heritability'"], Charney 2013</ref> This is incorrect, as GCTA estimates are lower bounds.
 
Most frequently, GCTA is run with a single relatedness matrix constructed from common SNPs and will not capture (or not fully capture) the contribution of the following factors:
A more correct interpretation would be that: GCTA estimates are the expected amount of variance that could be predicted by an indefinitely large GWAS using a simple additive linear model (without any interactions or higher-order effects) in a particular population at a particular time given the limited selection of SNPs and a trait measured with a particular amount of precision. Hence, there are many ways to exceed GCTA estimates:
 
# Any rare or low-frequency variants that are not directly genotyped/imputed.
# SNP genotyping data is typically limited to 200k-1m of the most common or scientifically interesting SNPs, though 150 million+ have been documented by genome sequencing;<ref>[http://biorxiv.org/content/early/2016/07/01/061663 "Deep Sequencing of 10,000 Human Genomes"], Telenti 2015</ref> as SNP prices drop and arrays become more comprehensive or whole-genome sequencing replaces SNP genotyping entirely, the expected narrow-sense heritability will increase as more genetic variants are included in the analysis. The selection can also be expanded considerably using [[haplotype]]s<ref>[http://biorxiv.org/content/biorxiv/early/2015/07/12/022418.full.pdf "Haplotypes of common SNPs can explain missing heritability of complex diseases"], Bhatia et al 2015</ref> and [[Imputation (genetics)|imputation]] (SNPs can proxy for unobserved genetic variants which they tend to be inherited with); e.g. Yang et al. 2015<ref name="Yang2015">[http://www.gwern.net/docs/genetics/2015-yang.pdf "Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index"], Yang et al 2015</ref> finds that with more aggressive use of imputation to infer unobserved variants, the height GCTA estimate expands to 56% from 45%, and Hill et al. 
2017 finds that expanding GCTA to cover rarer variants raises the intelligence estimates from ~30% to ~53% and explains all the heritability in their sample;<ref name="Hill2017">Hill et al 2017, [http://biorxiv.org/content/early/2017/02/06/106203 "Genomic analysis of family data reveals additional genetic effects on intelligence and personality"]</ref> for 4 traits in the UK Biobank, imputing raised the SNP heritability estimates.<ref name="Evans2017">Evans et al 2017, [http://biorxiv.org/content/early/2017/03/10/115527 "Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits"]</ref> Additional genetic variants include ''de novo'' [[mutations]]/[[mutation load]] & [[structural variation]]s such as [[copy-number variations]].
# Any non-linear, dominance, or epistatic genetic effects. Note that GCTA can be extended to estimate the contribution of these effects through more complex relatedness matrices.
# Narrow-sense heritability estimates assume simple additivity of effects and ignore interactions. Because some trait variation is due to these more complicated effects, the total genetic effect will exceed that of the subset measured by GCTA; as the additive SNPs are found and measured, it will become possible to find interactions as well using more sophisticated statistical models.
# The effects of gene–environment (GxE) interactions. Note that GCTA can be extended to estimate the contribution of GxE interactions when the environmental variable E is known, by including additional variance components.
# All correlation and heritability estimates are biased downwards, towards zero, by [[measurement error]]; adjusting for this requires techniques such as [[Spearman's correction for measurement error]]. The underestimate can be quite severe for traits where large-scale, accurate measurement is difficult and expensive,<ref>[https://www.dropbox.com/s/s1yax9n9jgkpmb1/2004-hunterschmidt-methodsofmetaanalysis.pdf ''Methods of Meta-Analysis: Correcting Error and Bias in Research Findings''], Hunter & Schmidt 2004</ref> such as intelligence. For example, an intelligence GCTA estimate of 0.31, based on an intelligence measurement with [[test-retest reliability]] <math>r=0.65</math>, would after correction (<math>\frac{0.31}{0.65}</math>) imply a true value of ~0.48, indicating that common SNPs alone explain about half of the variance. Hence, a GWAS with a better measurement of intelligence can expect to find more intelligence hits than indicated by a GCTA based on a noisier measurement.
# Structural variants, which are typically not genotyped or imputed.
# Measurement error: GCTA does not model any uncertainty or error on the measured trait.
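The reliability correction in the worked example above can be written out as a short calculation. This is a minimal sketch: the function name <code>correct_for_reliability</code> is hypothetical and not part of GCTA, and it assumes, as the example in the text does, that the observed estimate is divided by the reliability of the trait measure.

```python
def correct_for_reliability(h2_observed: float, reliability: float) -> float:
    """Scale an observed heritability estimate by the trait's reliability.

    Assumption (taken from the worked example in the text): the corrected
    value is the observed estimate divided by the test-retest reliability.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in the interval (0, 1]")
    return h2_observed / reliability


# Values from the example: estimate 0.31, test-retest reliability 0.65.
corrected = correct_for_reliability(0.31, 0.65)
print(round(corrected, 2))  # 0.48
```

A perfectly reliable measure (reliability of 1) leaves the estimate unchanged, and the correction grows as the measure becomes noisier.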
 
GCTA makes several model assumptions and may produce biased estimates under the following conditions:
 
# The distribution of causal variants is systematically different from the distribution of variants included in the relatedness matrix (even if all causal variants are included in the relatedness matrix). For example, causal variants may be systematically at a higher or lower frequency, or in higher or lower correlation with one another, than the genotyped variants as a whole. This can produce either an upwards or a downwards bias, depending on the relationship between the causal variants and the variants used. Various extensions to GCTA have been proposed (for example, GREML-LDMS) to account for these distributional shifts.
# Population stratification is not fully accounted for by covariates. GCTA (specifically GREML) accounts for stratification through the inclusion of fixed-effect covariates, typically principal components. If these covariates do not fully capture the stratification, the GCTA estimate will be biased, generally upwards. Accounting for recent population structure is particularly challenging for studies of rare variants.
# Residual genetic or environmental relatedness is present in the data. GCTA assumes a homogeneous population with an independent and identically distributed environmental term. This assumption is violated if related individuals and/or individuals with substantially shared environments are included in the data. In this case, the GCTA estimate will additionally capture the contribution of any variation correlated with the genetic relationship: either direct genetic effects or correlated environment.
# The presence of "indirect" genetic effects. When genetic variants present in the relatedness matrix are correlated with variants present in other individuals that influence the participant's environment, those effects will also be captured in the GCTA estimate. For example, if variants inherited by a participant from their mother influenced their phenotype through their maternal environment, then the effect of those variants will be included in the GCTA estimate even though it is "indirect" (i.e. mediated by parental genetics). This may be interpreted as an upward bias as such "indirect" effects are not strictly causal (altering them in the participant would not lead to a change in phenotype in expectation).
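The relatedness matrix and the principal-component covariates mentioned in the list above can be illustrated with simulated data. The following is a rough sketch, not GCTA's own code: it assumes the commonly used form of the relatedness matrix (standardized genotypes multiplied by their transpose and divided by the number of SNPs), and the data, sample sizes and variable names are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_individuals, n_snps, n_pcs = 200, 2000, 5  # illustrative sizes only

# Simulated allele counts (0, 1 or 2 copies); a real analysis would load genotypes.
allele_freqs = rng.uniform(0.05, 0.5, size=n_snps)
genotypes = rng.binomial(2, allele_freqs, size=(n_individuals, n_snps)).astype(float)

# Standardize each SNP using its sample allele frequency, dropping monomorphic SNPs.
p = genotypes.mean(axis=0) / 2.0
polymorphic = (p > 0.0) & (p < 1.0)
p = p[polymorphic]
z = (genotypes[:, polymorphic] - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

# Relatedness matrix: average over SNPs of products of standardized genotypes.
grm = z @ z.T / z.shape[1]

# Principal components: eigenvectors of the matrix with the largest eigenvalues.
eigenvalues, eigenvectors = np.linalg.eigh(grm)  # eigenvalues in ascending order
top_pcs = eigenvectors[:, ::-1][:, :n_pcs]

# Fixed-effect design matrix: an intercept plus the top principal components.
X = np.column_stack([np.ones(n_individuals), top_pcs])
print(grm.shape, X.shape)
```

In a mixed-model analysis the matrix <code>X</code> would enter as fixed effects and <code>grm</code> as the covariance structure of the random genetic effect; how many components to include is a judgment call that depends on the sample.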
 
== Implementations ==
{{Infobox software
| name = GCTA
| author = [[Jian Yang (geneticist)|Jian Yang]]
| released = {{start date and age|2010|08|30}}<ref name="version history"/>
| ver layout = stacked
| latest release version = 1.26.0
| latest release date = {{start date and age|2016|06|22}}<ref name="version history">{{cite web
| url = https://cnsgenomics.com/software/gcta/#Download
| title = GCTA document
| website = cnsgenomics.com
| access-date = 2021-04-08
}}</ref>
| status = Maintained
| latest preview version = 1.93.2beta
| latest preview date = {{start date and age|2020|05|08}}<ref name="version history"/>
| programming language = C++
| operating system = [[Linux]]<br/> [[macOS]] (not fully tested)<br/> [[Microsoft Windows|Windows]] (not fully tested)<ref name="version history"/>
| platform = [[x86_64]]
| language = English
| genre = Genetics
| license = [[GNU_General_Public_License#Version_3|GPL v3]] (source code)<br/>[[MIT License|MIT]] (executable files)<ref name="version history"/>
| website = {{URL|https://cnsgenomics.com/software/gcta/}}; forums: {{URL|gcta.freeforums.net}}
| AsOf = 8 April 2021
}}
 
* FAST-LMM<ref>[https://www.researchgate.net/profile/David_Heckerman/publication/51618535_FaST_linear_mixed_models_for_genome-wide_association_studies/links/5485d3a70cf268d28f00456a.pdf "Fast linear mixed models for genome-wide association studies"], Lippert 2011</ref>
* FAST-LMM-Select:<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3597090/ "Improved linear mixed models for genome-wide association studies"], Listgarten et al 2012</ref> like GCTA in using [[ridge regression]]<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3989144/ "Advantages and pitfalls in the application of mixed-model association methods"], Yang et al 2014</ref> but including [[feature selection]] to try to exclude irrelevant SNPs which only add noise to the relatedness estimates
* LMM-[[Lasso regression|Lasso]]<ref>[https://web.archive.org/web/20151204193223/http://bioinformatics.oxfordjournals.org/content/29/2/206.full#aff-1 "A lasso multi-marker mixed model for association mapping with population structure correction"], Rakitsch et al 2012</ref>
* GEMMA<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3386377/ "Genome-wide efficient mixed-model analysis for association studies"], Zhou & Stephens 2012</ref>
* EMMAX<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3092069/ "Variance component model to account for sample structure in genome-wide association studies"], Kang et al 2010</ref>
* [http://www.epcc.ed.ac.uk/projects-portfolio/reacta REACTA (formerly ACTA)] {{Webarchive|url=https://web.archive.org/web/20160523120255/http://www.epcc.ed.ac.uk/projects-portfolio/reacta |date=2016-05-23 }} claims order of magnitude runtime reductions<ref>[https://web.archive.org/web/20160522235202/http://bioinformatics.oxfordjournals.org/content/early/2012/09/27/bioinformatics.bts571.full.pdf "Advanced Complex Trait Analysis"], Gray et al 2012</ref><ref>[https://www.semanticscholar.org/paper/Regional-heritability-advanced-complex-trait-Cebamanos-Gray/c340835e1baf4b9fcafbfb001841bbd4793f598f/pdf "Regional Heritability Advanced Complex Trait Analysis for GPU and Traditional Parallel Architecture"], Cebamanos et al 2012</ref>
* [http://www.hsph.harvard.edu/alkes-price/software/ BOLT-REML]/BOLT-LMM<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342297/ "Efficient Bayesian mixed model analysis increases association power in large cohorts"], Loh et al 2015</ref> ([https://data.broadinstitute.org/alkesgroup/BOLT-LMM/ manual] {{Webarchive|url=https://web.archive.org/web/20160611003139/https://data.broadinstitute.org/alkesgroup/BOLT-LMM/ |date=2016-06-11 }}), faster and better scaling;<ref name="Loh2015">[http://biorxiv.org/content/biorxiv/early/2015/06/05/016527.full.pdf "Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis"], Loh et al 2015; see also [http://biorxiv.org/content/early/2015/06/05/016527 "Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis"], Loh et al 2015</ref> with potentially better efficiency in the meta-analysis scenario<ref>[http://biorxiv.org/content/early/2015/05/29/020115 "Mixed Models for Meta-Analysis and Sequencing"], Bulik-Sullivan 2015</ref>
* [http://scholar.harvard.edu/tge/software/megha MEGHA]<ref>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4345618/ "Massively expedited genome-wide heritability analysis (MEGHA)"], Ge et al 2015</ref>
* PLINK >1.9 (December 2013) supports [https://www.cog-genomics.org/plink2/ "the use of genetic relationship matrices in mixed model association analysis and other calculations"]
* LDAK:<ref>Speed et al 2016, [http://biorxiv.org/content/early/2017/01/15/074310 "Re-evaluation of SNP heritability in complex human traits"]</ref> loosens the GCTA assumption that all SNPs, regardless of genotyping quality or frequency, have the same average expected effect, potentially allowing much more SNP heritability to be found
* GREML-IBD:<ref name="Evans2017B">Evans et al 2017, [http://www.biorxiv.org/content/early/2017/07/17/164848 "Narrow-sense heritability estimation of complex traits using identity-by-descent information."]</ref> GCTA for [[identity by descent]], attempting to estimate heritability based on shared genome segments in distant otherwise-unrelated relatives, in order to capture the effect of rarer variants which are not measured by SNP panels or otherwise imputed
 
== Traits ==
 
GCTA analyses frequently yield estimates of 0.1–0.5, consistent with broad-sense heritability estimates (with the exception of personality traits, for which theory and current GWAS results suggest non-additive genetics driven by [[frequency-dependent selection]]<ref name="Verweij2012">[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3518920/ "Maintenance of genetic variation in human personality: Testing evolutionary models by estimating heritability due to common causal variants and investigating the effect of distant inbreeding"], Verweij et al 2012</ref><ref>[http://www.unm.edu/~gfmiller/newpapers_sept6/penke%202007%20targetarticle.pdf "The Evolutionary Genetics of Personality"], Penke et al 2007; [http://www.larspenke.eu/pdfs/Penke_&_Jokela_in_press_-_Evolutionary_Genetics_of_Personality_Revisited.pdf "The Evolutionary Genetics of Personality Revisited"], Penke & Jokela 2016</ref>). Traits to which univariate GCTA has been applied (excluding SNP heritability estimates computed using other algorithms such as LD score regression, and bivariate GCTAs, which are listed in [[genetic correlation]]) include (point-estimate format: "<math>h^2_{SNP}</math>([[standard error]])"):
 
== See also ==
 
* [http://espace.library.uq.edu.au/view/UQ:342517/UQ342517_OA.pdf "Research review: Polygenic methods and their application to psychiatric traits"], Wray et al. 2014
* [http://medicine.tums.ac.ir:803/Users/Javad_TavakoliBazzaz/Medical%20Genetics-2/Heritability%20in%20the%20genomics%20era.pdf "Heritability in the genomics era — concepts and misconceptions"] {{Webarchive|url=https://web.archive.org/web/20160522201032/http://medicine.tums.ac.ir:803/Users/Javad_TavakoliBazzaz/Medical%20Genetics-2/Heritability%20in%20the%20genomics%20era.pdf |date=2016-05-22 }}, Visscher et al. 2008
* [http://www.sciencedirect.com/science/article/pii/S2001037015000458 "Uncovering the Genetic Architectures of Quantitative Traits"], Lee et al. 2016
* [http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12129/pdf "Estimating heritability using genomic data"], Stanton-Geddes et al. 2013
* [https://www.youtube.com/watch?v=b32OwqBPHkI "Genomics, Big Data, Medicine, and Complex Traits"] (Peter Visscher talk)
* [https://www.dropbox.com/s/1otmbu840xejjv1/MCTFR_talk.pdf "The Genetic Architectures of Psychological Traits"], Lee 2014 slides
* [https://www.youtube.com/watch?v=VI-5HlYQpNE "Heritability-based models for prediction of complex traits"], [https://sites.google.com/site/baldingstatisticalgenetics/home David Balding] {{Webarchive|url=https://web.archive.org/web/20161008184835/https://sites.google.com/site/baldingstatisticalgenetics/home |date=2016-10-08 }} 2015
 
[[Category:Behavioural genetics]]
[[Category:Twin studies]]
[[Category:Genetics studies]]
[[Category:Quantitative genetics]]
[[Category:Molecular genetics]]