Genome-wide complex trait analysis: Difference between revisions

Undid revision 1132806771 by 51.155.207.129 (talk) - unexplained removal of validly sourced content
Expanded the interpretation and biases section. Removed most of the contrasts with twin studies and intelligence since these are unrelated to GCTA estimates and make this entry not self-contained.
Tag: references removed
Line 42:
 
== Interpretation ==
GCTA provides an unbiased estimate of the total variance in phenotype explained by all variants included in the relatedness matrix (and any variation correlated with those SNPs). This estimate can also be interpreted as the maximum prediction accuracy (<math>R^2</math>) that could be achieved from a linear predictor using all SNPs in the relatedness matrix. The latter interpretation is particularly relevant to the development of polygenic risk scores, as it defines their maximum accuracy. GCTA estimates are sometimes misinterpreted as estimates of total (or narrow-sense, i.e. additive) heritability, but this is not a guarantee of the method. GCTA estimates are likewise sometimes misinterpreted as "lower bounds" on the narrow-sense heritability, but this is also incorrect: first, because GCTA estimates can be biased (including biased upwards) if the model assumptions are violated, and second, because by definition (and when model assumptions are met) GCTA can provide an unbiased estimate of the narrow-sense heritability if all causal variants are included in the relatedness matrix. The interpretation of the GCTA estimate in relation to the narrow-sense heritability thus depends on the variants used to construct the relatedness matrix.
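The dependence on the variants in the relatedness matrix can be illustrated with a small simulation. The sketch below is not GCTA's REML procedure: it substitutes Haseman-Elston regression (regressing products of standardized phenotypes on off-diagonal relatedness values), which targets the same variance component under simplifying assumptions. All names, sample sizes and parameter values are hypothetical; SNPs are simulated as independent (no linkage disequilibrium) in unrelated individuals, and every SNP is causal.

```python
import numpy as np

def standardize_columns(genotypes):
    """Center and scale each SNP column, dropping monomorphic SNPs."""
    sd = genotypes.std(axis=0)
    keep = sd > 0
    kept = genotypes[:, keep]
    return (kept - kept.mean(axis=0)) / sd[keep]

def relatedness_matrix(z):
    """Relatedness matrix from standardized genotypes: Z Z^T / (number of SNPs)."""
    return z @ z.T / z.shape[1]

def he_regression(grm, phenotype):
    """Haseman-Elston slope: regress y_i * y_j on A_ij over all pairs i < j."""
    y = (phenotype - phenotype.mean()) / phenotype.std()
    upper = np.triu_indices(len(y), k=1)
    x = grm[upper]
    products = np.outer(y, y)[upper]
    x_centered = x - x.mean()
    return float(np.sum(x_centered * (products - products.mean())) / np.sum(x_centered ** 2))

# Hypothetical simulation settings (assumptions, not values from any study).
rng = np.random.default_rng(0)
n, m, h2 = 1000, 500, 0.5

freqs = rng.uniform(0.05, 0.5, size=m)
genotypes = rng.binomial(2, freqs, size=(n, m)).astype(float)
z = standardize_columns(genotypes)

# Every SNP is causal, with effects scaled so the genetic variance is about h2.
effects = rng.normal(0.0, np.sqrt(h2 / z.shape[1]), size=z.shape[1])
phenotype = z @ effects + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)

# Matrix built from all causal SNPs versus only the first half of them.
est_all = he_regression(relatedness_matrix(z), phenotype)
est_half = he_regression(relatedness_matrix(z[:, : z.shape[1] // 2]), phenotype)

print(f"all causal SNPs in matrix:     {est_all:.2f}")
print(f"half of causal SNPs in matrix: {est_half:.2f}")
```

Under these assumptions the first estimate should fall roughly near the simulated value of 0.5 and the second near half of it, since the omitted SNPs are uncorrelated with the included ones. With real genotypes, correlation between included and omitted variants would let the matrix capture part of the omitted signal.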
GCTA estimates are often misinterpreted as "the total genetic contribution", and since they are often much less than the twin study estimates, the twin studies are presumed to be biased and the genetic contribution to a particular trait is minor. This is incorrect, as GCTA estimates are lower bounds.{{cn|date=February 2021}}
 
Most frequently, GCTA is run with a single relatedness matrix constructed from common SNPs and will not capture (or not fully capture) the contribution of the following factors:
A more accurate interpretation is that GCTA estimates are the expected amount of variance that could be predicted by an indefinitely large GWAS using a simple additive linear model (without any interactions or higher-order effects) in a particular population at a particular time, given the limited selection of SNPs and a trait measured with a particular amount of precision. Hence, there are many ways to exceed GCTA estimates:
 
# Any rare or low-frequency variants that are not directly genotyped/imputed.
# SNP genotyping data is typically limited to 200k-1m of the most common or scientifically interesting SNPs, though 150 million+ have been documented by genome sequencing;<ref>[http://biorxiv.org/content/early/2016/07/01/061663 "Deep Sequencing of 10,000 Human Genomes"], Telenti 2015</ref> as SNP prices drop and arrays become more comprehensive or whole-genome sequencing replaces SNP genotyping entirely, the expected narrow-sense heritability will increase as more genetic variants are included in the analysis. The selection can also be expanded considerably using [[haplotype]]s<ref>[http://biorxiv.org/content/biorxiv/early/2015/07/12/022418.full.pdf "Haplotypes of common SNPs can explain missing heritability of complex diseases"], Bhatia et al 2015</ref> and [[Imputation (genetics)|imputation]] (SNPs can proxy for unobserved genetic variants which they tend to be inherited with); e.g. Yang et al. 2015<ref name="Yang2015">[http://www.gwern.net/docs/genetics/2015-yang.pdf "Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index"], Yang et al 2015</ref> finds that with more aggressive use of imputation to infer unobserved variants, the height GCTA estimate expands to 56% from 45%, and Hill et al. 2017 finds that expanding GCTA to cover rarer variants raises the intelligence estimates from ~30% to ~53% and explains all the heritability in their sample;<ref name="Hill2017">Hill et al 2017, [http://biorxiv.org/content/early/2017/02/06/106203 "Genomic analysis of family data reveals additional genetic effects on intelligence and personality"]</ref> for 4 traits in the UK Biobank, imputing raised the SNP heritability estimates.<ref name="Evans2017">Evans et al 2017, [http://biorxiv.org/content/early/2017/03/10/115527 "Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits"]</ref> Additional genetic variants include ''de novo'' [[mutations]]/[[mutation load]] & [[structural variation]]s such as [[copy-number variations]].
# Any non-linear, dominance, or epistatic genetic effects. Note that GCTA can be extended to estimate the contribution of these effects through more complex relatedness matrices.
# Narrow-sense heritability estimates assume simple additivity of effects, ignoring interactions. As some trait values will be due to these more complicated effects, the total genetic effect will exceed that of the subset measured by GCTA, and as the additive SNPs are found and measured, it will become possible to find interactions as well using more sophisticated statistical models.
# The effects of Gene-Environment interactions. Note that GCTA can be extended to estimate the contribution of GxE interactions when the E is known, by including additional variance components.
# All correlation and heritability estimates are biased downwards towards zero by the presence of [[measurement error]]; the need to adjust for this leads to techniques such as [[Spearman's correction for measurement error]], as the underestimate can be quite severe for traits where large-scale and accurate measurement is difficult and expensive,<ref>[https://www.dropbox.com/s/s1yax9n9jgkpmb1/2004-hunterschmidt-methodsofmetaanalysis.pdf ''Methods of Meta-Analysis: Correcting Error and Bias in Research Findings''], Hunter & Schmidt 2004</ref> such as intelligence. For example, an intelligence GCTA estimate of 0.31, based on an intelligence measurement with [[test-retest reliability]] <math>r=0.65</math>, would after correction (<math>\frac{0.31}{0.65}</math>) yield a corrected estimate of ~0.48, indicating that common SNPs alone explain roughly half of the variance. Hence, a GWAS with a better measurement of intelligence can expect to find more intelligence hits than indicated by a GCTA based on a noisier measurement.
# Structural variants, which are typically not genotyped or imputed.
# Measurement error: GCTA does not model any uncertainty or error on the measured trait.
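The measurement-error correction used in the worked example in the list above (0.31 divided by 0.65) can be written as a short sketch. It assumes classical measurement error that is independent of genotype, so that reliability equals the share of observed trait variance that is true-score variance; the function name is hypothetical.

```python
def disattenuate_variance_explained(estimate, reliability):
    """Divide a variance-explained estimate by the trait's reliability.

    Assumes classical measurement error unrelated to genotype, so that
    error inflates only the denominator (observed phenotypic variance).
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return estimate / reliability

# Figures from the example above: estimate 0.31, test-retest reliability 0.65.
corrected = disattenuate_variance_explained(0.31, 0.65)
print(f"{corrected:.2f}")  # prints 0.48
```

Because the quantity being corrected is a proportion of variance rather than a correlation, the division is by the reliability itself and not by its square root.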
 
GCTA makes several model assumptions and may produce biased estimates under the following conditions:
 
# The distribution of causal variants is systematically different from the distribution of variants included in the relatedness matrix (even if all causal variants are included in the relatedness matrix). For example, causal variants may be systematically at a higher or lower frequency, or in higher or lower correlation with other variants, than the genotyped variants as a whole. This can produce either an upwards or downwards bias depending on the relationship between the causal variants and the variants used. Various extensions to GCTA have been proposed (for example, GREML-LDMS) to account for these distributional shifts.
# Population stratification is not fully accounted for by covariates. GCTA (specifically GREML) accounts for stratification through the inclusion of fixed-effect covariates, typically principal components. If these covariates do not fully capture the stratification, the GCTA estimate will be biased, generally upwards. Accounting for recent population structure is particularly challenging for studies of rare variants.
# Residual genetic or environmental relatedness present in the data. GCTA assumes a homogeneous population with an independent and identically distributed environmental term. This assumption is violated if related individuals and/or individuals with substantially shared environments are included in the data. In this case, the GCTA estimate will additionally capture the contribution of any variation correlated with the genetic relationship: either direct genetic effects or correlated environment.
# The presence of "indirect" genetic effects. When genetic variants present in the relatedness matrix are correlated with variants present in other individuals that influence the participant's environment, those effects will also be captured in the GCTA estimate. For example, if variants inherited by a participant from their mother influenced their phenotype through their maternal environment, then the effect of those variants will be included in the GCTA estimate even though it is "indirect" (i.e. mediated by parental genetics). This may be interpreted as an upward bias as such "indirect" effects are not strictly causal (altering them in the participant would not lead to a change in phenotype in expectation).
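
The assumptions in the list above refer to the linear mixed model that GREML fits. As a reminder, it can be written as follows; the notation is the one commonly used for this model, and the symbols are not defined elsewhere in this section:

<math display="block">\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{g} + \boldsymbol\varepsilon, \qquad \mathbf{g} \sim N(\mathbf{0}, \mathbf{A}\sigma_g^2), \qquad \boldsymbol\varepsilon \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)</math>

<math display="block">\hat{h}^2_\text{SNP} = \frac{\hat\sigma_g^2}{\hat\sigma_g^2 + \hat\sigma_e^2}</math>

Here <math>\mathbf{y}</math> is the phenotype vector, <math>\mathbf{X}</math> holds the fixed-effect covariates (such as principal components), <math>\mathbf{A}</math> is the relatedness matrix, and <math>\mathbf{I}</math> is the identity matrix. The first condition concerns how well <math>\mathbf{A}</math> represents the causal variants, the second concerns whether <math>\mathbf{X}</math> captures stratification, and the third and fourth concern violations of the independent environmental term <math>\boldsymbol\varepsilon</math>.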
 
== Implementations ==