'''Human genetic clustering''' refers to a wide range of scientific and statistical methods often used to characterize patterns and subgroups within studies of [[human genetic variation]].
Clustering studies are thought to be valuable for characterizing the general structure of genetic variation among human populations, and to contribute to the study of ancestral origins, evolutionary history, and personalized medicine. Since the mapping of the human genome, and with the availability of increasingly powerful analytic tools, cluster analyses have revealed a range of ancestral and migratory trends among human populations and individuals.<ref name=":0">{{Cite journal|last=Novembre|first=John|last2=Ramachandran|first2=Sohini|date=2011-09-22|title=Perspectives on Human Population Structure at the Cusp of the Sequencing Era|url=http://dx.doi.org/10.1146/annurev-genom-090810-183123|journal=Annual Review of Genomics and Human Genetics|volume=12|issue=1|pages=245–274|doi=10.1146/annurev-genom-090810-183123|issn=1527-8204}}</ref> Humans tend to cluster together by geographic ancestry, with divisions between clusters aligning largely with geographic barriers such as oceans or mountain ranges. However, the practice of defining clusters among modern human populations is largely arbitrary and variable, and no genetic markers have been found that completely distinguish between groups.<ref>{{Cite journal|last=Bamshad|first=Michael J.|last2=Olson|first2=Steve E.|date=2003-12|title=Does Race Exist?|url=http://dx.doi.org/10.1038/scientificamerican1203-78|journal=Scientific American|volume=289|issue=6|pages=78–85|doi=10.1038/scientificamerican1203-78|issn=0036-8733}}</ref>
Studies of human genetic clustering have been implicated in discussions of [[Race (human categorization)|race]], [[Ethnic group|ethnicity]], and [[scientific racism]], as some have controversially suggested that genetically derived clusters may be understood as genetically determined races.<ref>{{Cite journal|last=Jorde|first=Lynn B|last2=Wooding|first2=Stephen P|date=2004-10-26|title=Genetic variation, classification and 'race'|url=http://dx.doi.org/10.1038/ng1435|journal=Nature Genetics|volume=36|issue=S11|pages=S28–S33|doi=10.1038/ng1435|issn=1061-4036}}</ref><ref>{{Cite book|last=Marks|first=Jonathan|url=http://worldcat.org/oclc/1037867598|title=Is science racist?|isbn=978-0-7456-8925-8|oclc=1037867598}}</ref> Although cluster analyses invariably organize humans (or groups of humans) into subgroups, debate is ongoing on how to interpret these genetic clusters with respect to race and its social and phenotypic features. Because there is only a small fraction of genetic variation between human genotypes overall, genetic clustering approaches are highly dependent on the sampled data, genetic markers, and statistical methods applied to their construction.
== Genetic clustering algorithms and methods ==
A wide range of methods have been developed to assess the structure of human populations with the use of genetic data. Early studies of within- and between-group genetic variation used physical phenotypes and blood groups, with modern genetic studies using genetic markers such as restriction site polymorphisms, short tandem repeat polymorphisms, and [[Single-nucleotide polymorphism|single nucleotide polymorphisms]] (SNPs), among others.<ref>{{Cite journal|last=Bamshad|first=Michael|last2=Wooding|first2=Stephen|last3=Salisbury|first3=Benjamin A.|last4=Stephens|first4=J. Claiborne|date=2004-08|title=Deconstructing the relationship between genetics and race|url=http://dx.doi.org/10.1038/nrg1401|journal=Nature Reviews Genetics|volume=5|issue=8|pages=598–609|doi=10.1038/nrg1401|issn=1471-0056}}</ref> Models for genetic clustering also vary by the algorithms and programs used to process the data.
Most methods for determining clusters can be categorized as '''model-based clustering methods''' or '''multidimensional summaries'''.<ref>{{Cite journal|last=Novembre|first=John|last2=Ramachandran|first2=Sohini|date=2011-09-22|title=Perspectives on Human Population Structure at the Cusp of the Sequencing Era|url=http://dx.doi.org/10.1146/annurev-genom-090810-183123|journal=Annual Review of Genomics and Human Genetics|volume=12|issue=1|pages=245–274|doi=10.1146/annurev-genom-090810-183123|issn=1527-8204}}</ref><ref name=":1">{{Cite journal|last=Lawson|first=Daniel John|last2=Falush|first2=Daniel|date=2012-09-22|title=Population Identification Using Genetic Data|url=http://dx.doi.org/10.1146/annurev-genom-082410-101510|journal=Annual Review of Genomics and Human Genetics|volume=13|issue=1|pages=337–361|doi=10.1146/annurev-genom-082410-101510|issn=1527-8204}}</ref> By processing a large number of SNPs (or other genetic marker data) in different ways, both approaches to genetic clustering tend to converge on similar patterns by identifying similarities among individual SNPs and/or [[haplotype]] tracts to reveal ancestral genetic similarities.<ref name=":1" />
=== Model-based clustering ===
Common model-based clustering algorithms include STRUCTURE, ADMIXTURE, and HAPMIX. These algorithms operate by finding the best fit for genetic data among an arbitrary or mathematically derived number of clusters, such that differences within clusters are minimized and differences between clusters are maximized. This clustering method is also referred to as "[[Genetic admixture|admixture]] inference," as individual genomes (or individuals within populations) can be characterized by the proportions of [[Allele|alleles]] linked to each cluster.<ref name=":0" /> Of note, algorithms like STRUCTURE have required that populations are chosen for samples before running the cluster analysis.
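The idea of characterizing an individual by proportions of alleles linked to each cluster can be illustrated with a minimal sketch. It is not the STRUCTURE or ADMIXTURE implementation: it assumes the allele frequencies of each cluster are already known (those programs estimate frequencies and proportions jointly), it uses simulated data, and all names (<code>simulate_individual</code>, <code>estimate_ancestry</code>, <code>cluster_freqs</code>) are hypothetical. The proportions are estimated with a standard expectation–maximization update for a mixture model.

```python
import random

def simulate_individual(q, cluster_freqs, rng):
    """Simulate one diploid genotype (0, 1 or 2 allele copies per SNP).

    Each allele copy is drawn from a cluster chosen with probabilities q,
    then from that cluster's allele frequency at the SNP.
    """
    genotype = []
    for j in range(len(cluster_freqs[0])):
        copies = 0
        for _ in range(2):
            c = rng.choices(range(len(q)), weights=q)[0]
            copies += rng.random() < cluster_freqs[c][j]
        genotype.append(copies)
    return genotype

def estimate_ancestry(genotype, cluster_freqs, iterations=100):
    """Estimate ancestry proportions by EM, with cluster frequencies held fixed."""
    k, m = len(cluster_freqs), len(genotype)
    q = [1.0 / k] * k  # start from equal proportions
    for _ in range(iterations):
        totals = [0.0] * k
        for j, g in enumerate(genotype):
            p_alt = sum(q[c] * cluster_freqs[c][j] for c in range(k))
            p_ref = 1.0 - p_alt
            for c in range(k):
                f = cluster_freqs[c][j]
                # Expected number of this SNP's two allele copies that came from cluster c
                totals[c] += g * q[c] * f / p_alt + (2 - g) * q[c] * (1 - f) / p_ref
        q = [t / (2 * m) for t in totals]
    return q

# Assumed toy setup: 2 clusters, 2000 SNPs, frequencies kept away from 0 and 1
rng = random.Random(1)
cluster_freqs = [[rng.uniform(0.05, 0.95) for _ in range(2000)] for _ in range(2)]
true_q = [0.7, 0.3]
genotype = simulate_individual(true_q, cluster_freqs, rng)
estimated_q = estimate_ancestry(genotype, cluster_freqs)
print([round(x, 2) for x in estimated_q])
```

In this toy setting the estimate should land near the simulated proportions, but how close it gets depends on the number of markers and on how much the cluster frequencies differ.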
=== Multidimensional summary statistics ===
Where model-based clustering characterizes populations using proportions of discrete clusters, multidimensional summary statistics characterize populations on a continuous spectrum. The most common multidimensional statistical method used for genetic clustering is [[principal component analysis]] (PCA), which plots individuals by two or more axes (their "principal components") that represent aggregations of genetic markers that account for the highest variance. Clusters can then be identified by assessing the distribution of data; with larger samples of human genotypes, data tends to cluster in discrete groups as well as admixed positions between groups.<ref name=":0" /><ref name=":1" />
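As a rough illustration of how PCA places individuals along an axis of highest variance, the following sketch computes only the first principal component of a small simulated genotype matrix, using power iteration in plain Python. The data, allele frequencies, and function names are assumptions made for the example; published analyses typically use dedicated software and also standardize each marker, which this sketch does not do.

```python
import random

def simulate_group(freqs, n, rng):
    """Simulate n diploid genotypes (0, 1 or 2 allele copies per SNP)."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n)]

def first_pc_scores(rows, rng, iterations=200):
    """Project individuals onto the first principal component (power iteration)."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    axis = [rng.gauss(0.0, 1.0) for _ in range(m)]  # random starting direction
    for _ in range(iterations):
        # Multiply the axis by X^T X (proportional to the covariance matrix), then renormalize
        scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
        axis = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = sum(a * a for a in axis) ** 0.5
        axis = [a / norm for a in axis]
    return [sum(x * a for x, a in zip(row, axis)) for row in centered]

# Assumed toy data: two groups that differ at 10 of 20 SNPs
rng = random.Random(0)
freqs_a = [0.9] * 10 + [0.5] * 10
freqs_b = [0.1] * 10 + [0.5] * 10
group_a = simulate_group(freqs_a, 20, rng)
group_b = simulate_group(freqs_b, 20, rng)

scores = first_pc_scores(group_a + group_b, rng)
scores_a, scores_b = scores[:20], scores[20:]
print(min(scores_a), max(scores_a), min(scores_b), max(scores_b))
```

Because the two simulated groups differ strongly at half of the markers, their scores fall on opposite sides of the first axis; the sign of the axis itself is arbitrary.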
=== Caveats and drawbacks ===
There are many caveats and drawbacks to genetic clustering methods of any type, given the degree of admixture and relative similarity within the human population. All genetic cluster findings are [[Sampling bias|biased]] by the sampling process used to gather data, and by the quality and quantity of that data. For example, many clustering studies use data derived from populations that are geographically distinct and far apart from one another, which may present an illusion of clearly discrete clusters where, in reality, populations are much more blended with one another when intermediary groups are included.<ref name=":0" /> STRUCTURE in particular may be misleading by requiring the data to be sorted into a predetermined number of clusters, which may or may not reflect the actual population's distribution.<ref name=":2">{{Cite journal|last=Kalinowski|first=S T|date=2010-08-04|title=The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure|url=http://dx.doi.org/10.1038/hdy.2010.95|journal=Heredity|volume=106|issue=4|pages=625–632|doi=10.1038/hdy.2010.95|issn=0018-067X}}</ref> Sample size also plays an important moderating role on cluster findings, as different sample size inputs can influence cluster assignment, and more subtle relationships between genotypes may only emerge with larger sample sizes.<ref name=":0" /><ref name=":2" />
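Two of these caveats can be shown with a deliberately simple sketch. It uses one-dimensional k-means as a stand-in for a clustering method that takes a predetermined number of clusters (it is not STRUCTURE), and a hypothetical set of 60 individuals spaced evenly along a smooth genetic gradient. The function names and data are assumptions for illustration only.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's k-means on a list of numbers; returns one cluster label per value."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center, then move each center to its members' mean
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def largest_gap(values):
    """Largest distance between neighbouring values once sorted."""
    ordered = sorted(values)
    return max(b - a for a, b in zip(ordered, ordered[1:]))

# Hypothetical individuals spaced evenly along a continuous gradient (no real groups)
cline = [i / 59 for i in range(60)]

# Caveat 1: the method returns exactly as many clusters as it is asked for
labels_k2 = kmeans_1d(cline, 2)
labels_k3 = kmeans_1d(cline, 3)

# Caveat 2: sampling only the two ends of the gradient creates an apparent gap
ends_only = cline[:15] + cline[-15:]
gap_full = largest_gap(cline)
gap_ends = largest_gap(ends_only)
print(len(set(labels_k2)), len(set(labels_k3)), gap_full, gap_ends)
```

The gradient contains no discrete groups, yet it is split into two or three clusters depending only on the requested number, and leaving out the intermediate individuals makes the two ends look like separate groups.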
== Applications to human genetic data ==