Human genetic clustering: Difference between revisions

Millager (talk | contribs)
clarifying note for my reviewers, to be deleted before finishing
Millager (talk | contribs)
updates prior to final draft
Line 1:
= Human genetic clustering =
'''Human genetic clustering''' refers to patterns of relative genetic similarity among human individuals and populations, as well as the wide range of scientific and statistical methods used to study this aspect of [[human genetic variation]].
 
Clustering studies are thought to be valuable for characterizing the general structure of genetic variation among human populations and for contributing to the study of ancestral origins, evolutionary history, and precision medicine. Since the mapping of the human genome, and with the availability of increasingly powerful analytic tools, [[Cluster analysis|cluster analyses]] have revealed a range of ancestral and migratory trends among human populations and individuals.<ref name=":0">{{Cite journal|last=Novembre|first=John|last2=Ramachandran|first2=Sohini|date=2011-09-22|title=Perspectives on Human Population Structure at the Cusp of the Sequencing Era|url=http://dx.doi.org/10.1146/annurev-genom-090810-183123|journal=Annual Review of Genomics and Human Genetics|volume=12|issue=1|pages=245–274|doi=10.1146/annurev-genom-090810-183123|issn=1527-8204}}</ref> Human genetic clusters tend to be organized by geographic ancestry, with divisions between clusters aligning largely with geographic barriers such as oceans or mountain ranges.<ref name=":3">{{Cite journal|last=Maglo|first=Koffi N.|last2=Mersha|first2=Tesfaye B.|last3=Martin|first3=Lisa J.|date=2016-02-17|title=Population Genomics and the Statistical Values of Race: An Interdisciplinary Perspective on the Biological Classification of Human Populations and Implications for Clinical Genetic Epidemiological Research|url=http://dx.doi.org/10.3389/fgene.2016.00022|journal=Frontiers in Genetics|volume=7|doi=10.3389/fgene.2016.00022|issn=1664-8021}}</ref><ref name=":9" /> Clustering studies have been applied to global populations,<ref name=":10" /> as well as to population subsets like post-colonial North America.<ref>{{Cite journal|last=Han|first=Eunjung|last2=Carbonetto|first2=Peter|last3=Curtis|first3=Ross E.|last4=Wang|first4=Yong|last5=Granka|first5=Julie M.|last6=Byrnes|first6=Jake|last7=Noto|first7=Keith|last8=Kermany|first8=Amir R.|last9=Myres|first9=Natalie M.|last10=Barber|first10=Mathew J.|last11=Rand|first11=Kristin A.|date=2017-02-07|title=Clustering of 770,000 genomes reveals post-colonial population structure of North America|url=https://www.nature.com/articles/ncomms14238|journal=Nature Communications|language=en|volume=8|issue=1|pages=14238|doi=10.1038/ncomms14238|issn=2041-1723}}</ref><ref>{{Cite journal|last=Jordan|first=I. King|last2=Rishishwar|first2=Lavanya|last3=Conley|first3=Andrew B.|date=2019-09|title=Native American admixture recapitulates population-specific migration and settlement of the continental United States|url=https://pubmed.ncbi.nlm.nih.gov/31545791/|journal=PLoS Genetics|volume=15|issue=9|pages=e1008225|doi=10.1371/journal.pgen.1008225|issn=1553-7404|pmc=6756731|pmid=31545791}}</ref> Notably, the practice of defining clusters among modern human populations is largely arbitrary and variable; although individual genetic markers can be used to produce smaller groups, there are no models that produce completely distinct subgroups when larger numbers of genetic markers are used.<ref name=":3" /><ref name=":5">{{Cite journal|last=Bamshad|first=Michael J.|last2=Olson|first2=Steve E.|date=2003-12|title=Does Race Exist?|url=http://dx.doi.org/10.1038/scientificamerican1203-78|journal=Scientific American|volume=289|issue=6|pages=78–85|doi=10.1038/scientificamerican1203-78|issn=0036-8733}}</ref><ref name=":2" />
 
Many studies of human genetic clustering have been implicated in discussions of [[Race (human categorization)|race]], [[Ethnic group|ethnicity]], and [[scientific racism]], as some have controversially suggested that genetically derived clusters may be understood as proof of genetically determined races.<ref name=":4">{{Cite journal|last=Jorde|first=Lynn B|last2=Wooding|first2=Stephen P|date=2004-10-26|title=Genetic variation, classification and 'race'|url=http://dx.doi.org/10.1038/ng1435|journal=Nature Genetics|volume=36|issue=S11|pages=S28–S33|doi=10.1038/ng1435|issn=1061-4036}}</ref><ref>{{Cite book|last=Marks|first=Jonathan|url=http://worldcat.org/oclc/1037867598|title=Is science racist?|isbn=978-0-7456-8925-8|oclc=1037867598}}</ref> Although cluster analyses invariably organize humans (or groups of humans) into subgroups, debate is ongoing over how to interpret these genetic clusters with respect to race and its social and phenotypic features. Moreover, because genetic variation between human genotypes is so small overall, genetic clustering approaches are highly dependent on the sampled data, genetic markers, and statistical methods applied to their construction.
 
== Genetic clustering algorithms and methods ==
A wide range of methods have been developed to assess the structure of human populations with the use of genetic data. Early studies of within- and between-group genetic variation used physical phenotypes and blood groups, while modern genetic studies use genetic markers such as [[Alu element|Alu sequences]], [[Microsatellite|short tandem repeat polymorphisms]], and [[Single-nucleotide polymorphism|single nucleotide polymorphisms]] (SNPs), among others.<ref>{{Cite journal|last=Bamshad|first=Michael|last2=Wooding|first2=Stephen|last3=Salisbury|first3=Benjamin A.|last4=Stephens|first4=J. Claiborne|date=2004-08|title=Deconstructing the relationship between genetics and race|url=http://dx.doi.org/10.1038/nrg1401|journal=Nature Reviews Genetics|volume=5|issue=8|pages=598–609|doi=10.1038/nrg1401|issn=1471-0056}}</ref> Models for genetic clustering also vary by the algorithms and programs used to process the data. Most methods for determining clusters can be categorized as '''model-based clustering methods''' (such as the algorithm STRUCTURE) or '''multidimensional summaries''' (often through principal component analysis).<ref name=":0" /><ref name=":1">{{Cite journal|last=Lawson|first=Daniel John|last2=Falush|first2=Daniel|date=2012-09-22|title=Population Identification Using Genetic Data|url=http://dx.doi.org/10.1146/annurev-genom-082410-101510|journal=Annual Review of Genomics and Human Genetics|volume=13|issue=1|pages=337–361|doi=10.1146/annurev-genom-082410-101510|issn=1527-8204}}</ref> By processing a large number of SNPs (or other genetic marker data) in different ways, both approaches to genetic clustering tend to converge on similar patterns by identifying similarities among SNPs and/or [[haplotype]] tracts to reveal ancestral genetic similarities.<ref name=":1" />
 
=== Model-based clustering ===
[[File:Rosenberg 1048people 993markers.jpg|thumb|Human population structure has been inferred from multilocus DNA sequence data (Rosenberg et al. 2002, 2005). Individuals from 52 populations were examined at 993 DNA markers. This data was used to partition individuals into K = 2, 3, 4, 5, or 6 gene clusters. In this figure, the average fractional membership of individuals from each population is represented by horizontal bars partitioned into K colored segments.]]
Common model-based clustering algorithms include STRUCTURE, ADMIXTURE, and HAPMIX. These algorithms operate by finding the best fit for genetic data among an arbitrary or mathematically derived number of clusters, such that differences within clusters are minimized and differences between clusters are maximized. This clustering method is also referred to as "[[Genetic admixture|admixture]] inference," as individual genomes (or individuals within populations) can be characterized by the proportions of [[Allele|alleles]] linked to each cluster.<ref name=":0" /> In other words, algorithms like STRUCTURE assume that discrete ancestral populations, which are operationalized through unique genetic markers, have combined over time to form the admixed populations of the modern day.
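
The idea of characterizing a genome by its proportions of alleles linked to each cluster can be illustrated with a small sketch. The Python code below is a simplified illustration only, not the procedure used by STRUCTURE, ADMIXTURE, or HAPMIX: it assumes the allele frequencies of two hypothetical ancestral clusters are already known and that markers are independent, whereas the published programs also estimate those frequencies from the data. The function name <code>estimate_ancestry</code>, the simulated frequencies, and the "true" proportions of 0.7 and 0.3 are all invented for the example.

```python
import random


def estimate_ancestry(genotype, freqs, n_iter=150):
    """Estimate one individual's ancestry proportions with a simple EM loop.

    genotype: allele counts (0, 1 or 2) at each marker.
    freqs:    freqs[k][j] is the assumed allele frequency of marker j in
              hypothetical ancestral cluster k (strictly between 0 and 1).
    Returns a list of K proportions that sum to 1.
    """
    n_clusters, n_markers = len(freqs), len(genotype)
    q = [1.0 / n_clusters] * n_clusters  # start from equal proportions
    for _ in range(n_iter):
        new_q = [0.0] * n_clusters
        for j, g in enumerate(genotype):
            # Chance that a single allele copy at marker j is the counted allele
            p_counted = sum(q[k] * freqs[k][j] for k in range(n_clusters))
            p_other = 1.0 - p_counted
            for k in range(n_clusters):
                # Expected number of this marker's two copies that trace to cluster k
                if g > 0:
                    new_q[k] += g * q[k] * freqs[k][j] / p_counted
                if g < 2:
                    new_q[k] += (2 - g) * q[k] * (1.0 - freqs[k][j]) / p_other
        q = [v / (2 * n_markers) for v in new_q]
    return q


# Simulated data: every value below is invented for illustration.
rng = random.Random(0)
n_markers = 2000
freqs = [[rng.uniform(0.05, 0.95) for _ in range(n_markers)] for _ in range(2)]
true_q = [0.7, 0.3]

genotype = []
for j in range(n_markers):
    count = 0
    for _ in range(2):  # two allele copies per marker
        k = 0 if rng.random() < true_q[0] else 1  # cluster this copy traces to
        count += rng.random() < freqs[k][j]
    genotype.append(count)

q_hat = estimate_ancestry(genotype, freqs)
print([round(v, 2) for v in q_hat])
```

With enough markers the estimated proportions should land close to the simulated ones, but they describe fit to the assumed clusters rather than any independently defined group.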
 
=== Multidimensional summary statistics ===
Where model-based clustering characterizes populations using proportions of discrete ancestral clusters, multidimensional summary statistics characterize populations on a continuous spectrum. The most common multidimensional statistical method used for genetic clustering is [[principal component analysis]] (PCA), which plots individuals by two or more axes (their "principal components") that represent aggregations of genetic markers that account for the highest variance. Clusters can then be identified by assessing the distribution of data; with larger samples of human genotypes, data tends to cluster in distinct groups as well as admixed positions between groups.<ref name=":0" /><ref name=":1" />
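
As a rough illustration of how a principal component can separate groups while placing admixed individuals between them, the following Python sketch computes only the first principal component of a simulated genotype matrix. It is a minimal example under stated assumptions: the two sets of allele frequencies, the sample sizes, and the helper names (<code>first_principal_component</code>, <code>simulate_group</code>, <code>mean_score</code>) are hypothetical, and power iteration is used here only to avoid external libraries. Published analyses typically rely on dedicated software and many more markers.

```python
import math
import random


def first_principal_component(rows, n_iter=50):
    """Return (direction, scores) for the leading principal component.

    rows: list of equal-length lists (individuals x markers).
    The data are column-centred, then power iteration finds the axis
    that accounts for the most variance.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    X = [[r[j] - means[j] for j in range(m)] for r in rows]
    init = random.Random(0)
    v = [init.gauss(0, 1) for _ in range(m)]
    for _ in range(n_iter):
        scores = [sum(x[j] * v[j] for j in range(m)) for x in X]  # X v
        v = [sum(X[i][j] * scores[i] for i in range(n)) for j in range(m)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    scores = [sum(x[j] * v[j] for j in range(m)) for x in X]
    return v, scores


def simulate_group(freqs, n_individuals, rng):
    """Draw genotypes (0, 1 or 2 allele copies per marker) from given frequencies."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n_individuals)]


def mean_score(values):
    return sum(values) / len(values)


# Simulated data: two hypothetical groups plus individuals admixed 50/50.
rng = random.Random(1)
n_markers = 200
freqs_a = [rng.uniform(0.1, 0.9) for _ in range(n_markers)]
freqs_b = [rng.uniform(0.1, 0.9) for _ in range(n_markers)]
freqs_mix = [(a + b) / 2 for a, b in zip(freqs_a, freqs_b)]

genotypes = (simulate_group(freqs_a, 20, rng)
             + simulate_group(freqs_b, 20, rng)
             + simulate_group(freqs_mix, 10, rng))

direction, pc1 = first_principal_component(genotypes)
group_a, group_b, admixed = pc1[:20], pc1[20:40], pc1[40:]
print(round(mean_score(group_a), 2),
      round(mean_score(admixed), 2),
      round(mean_score(group_b), 2))
```

The simulated groups separate cleanly because their allele frequencies were drawn independently, which exaggerates the differences seen between real human populations.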
 
=== Caveats and limitations ===
Line 20:
 
== Notable applications to human genetic data ==
Modern applications of genetic clustering methods to global-scale genetic data were first marked by studies associated with the [[Human Genome Diversity Project]] (HGDP) data.<ref name=":0" /> These early HGDP studies, such as those by Rosenberg et al. (2002),<ref name=":10">{{Cite journal|last=Rosenberg|first=N. A.|date=2002-12-20|title=Genetic Structure of Human Populations|url=http://dx.doi.org/10.1126/science.1078311|journal=Science|volume=298|issue=5602|pages=2381–2385|doi=10.1126/science.1078311|issn=0036-8075}}</ref><ref>{{Cite journal|last=Rosenberg|first=Noah A|last2=Mahajan|first2=Saurabh|last3=Ramachandran|first3=Sohini|last4=Zhao|first4=Chengfeng|last5=Pritchard|first5=Jonathan K|last6=Feldman|first6=Marcus W|date=2005-12-09|title=Clines, Clusters, and the Effect of Study Design on the Inference of Human Population Structure|url=http://dx.doi.org/10.1371/journal.pgen.0010070|journal=PLoS Genetics|volume=1|issue=6|pages=e70|doi=10.1371/journal.pgen.0010070|issn=1553-7404}}</ref> contributed to theories of the serial founder effect and early human migration out of Africa.
 
A number of landmark genetic cluster studies have been conducted on global human populations since 2002, including the following:
{| class="wikitable"
|Authors
Line 78 ⟶ 76:
|250,000 SNPs
|}
Clustering methods have also been notably applied to population subset studies. For example, #####
 
== Genetic clustering and race ==
Line 84 ⟶ 83:
Many other scholars have challenged the idea that race can be inferred by genetic clusters, drawing distinctions between arbitrarily assigned genetic clusters, ancestry, and race. One recurring caution against thinking of human populations in terms of clusters is the notion that genotypic variation and traits are distributed along gradual [[Cline (biology)|clines]] rather than along discrete population boundaries; so although genetic similarities are usually organized geographically, their underlying populations have never been completely separated from one another. Due to migration, gene flow, and baseline homogeneity, features between groups are extensively overlapping and intermixed.<ref name=":3" /><ref name=":4" /> Moreover, genetic clusters do not typically match socially defined racial groups; many commonly understood races may not be sorted into the same genetic cluster, and many genetic clusters are made up of individuals who would have distinct racial identities.<ref name=":5" /> In general, clusters may most simply be understood as products of the methods used to sample and analyze genetic data: they are not without meaning for understanding ancestry and genetic characteristics, but they are inadequate for fully explaining the concept of race, which is more often described in terms of social and cultural forces.
 
In the related context of [[personalized medicine]], race is currently listed as a [[risk factor]] for a wide range of medical conditions with genetic and non-genetic causes. Questions have emerged regarding whether genetic clusters support the idea of race as a valid construct to apply to medical research and treatment of disease, because there are many diseases that correspond with specific genetic markers and/or with specific populations, as seen with [[Tay–Sachs disease|Tay-Sachs disease]] or [[sickle cell disease]].<ref name=":9">{{Cite journal|date=2012-10-29|editor-last=Goodman|editor-first=Alan H.|editor2-last=Moses|editor2-first=Yolanda T.|editor3-last=Jones|editor3-first=Joseph L.|title=Race|url=http://dx.doi.org/10.1002/9781118233023|doi=10.1002/9781118233023}}</ref><ref name=":6" /> Researchers are careful to emphasize that ancestry, revealed in part through cluster analyses, plays an important role in understanding risk of disease. But racial or ethnic identity does not perfectly align with genetic ancestry, and so race and ethnicity do not reveal enough information to make a medical diagnosis.<ref name=":6" /> Race as a variable in medicine is more likely to reflect social circumstances, whereas ancestry information is more likely to be meaningful when considering genetic contributions to disease.<ref name=":3" /><ref name=":6" />
 
<references />