Principal component analysis: Difference between revisions

Content deleted Content added
m changed common redundancy "separate out" to "separate"
Tag: Reverted
m removed typo
 
(2 intermediate revisions by 2 users not shown)
Line 121:
<math>E(\alpha) = \frac{1}{1 - \alpha^2} \sum_{i=1}^{n} (\alpha p_i - q_i)^2</math>; such that setting the derivative of the error function to zero <math>(E'(\alpha) = 0)</math> yields:<math>\alpha = \frac{1}{2} \left( -\lambda \pm \sqrt{\lambda^2 + 4} \right)</math> where<math>\lambda = \frac{p \cdot p - q \cdot q}{p \cdot q}</math>.<ref name="Holmes2023" />
 
[[File:PCA of Haplogroup J using 37 STRs.png|thumb|right|A principal components analysis scatterplot of [[Y-STR]] [[haplotype]]s calculated from repeat-count values for 37 Y-chromosomal STR markers from 354 individuals.<br /> PCA has successfully found linear combinations of the markers that separate out different clusters corresponding to different lines of individuals' Y-chromosomal genetic descent.]]
Such [[dimensionality reduction]] can be a very useful step for visualising and processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible. For example, selecting ''L''&nbsp;=&nbsp;2 and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out, so if the data contains [[Cluster analysis|clusters]] these too may be most spread out, and therefore most visible to be plotted out in a two-dimensional diagram; whereas if two directions through the data (or two of the original variables) are chosen at random, the clusters may be much less spread apart from each other, and may in fact be much more likely to substantially overlay each other, making them indistinguishable.
 
Similarly, in [[regression analysis]], the larger the number of [[explanatory variable]]s allowed, the greater is the chance of [[overfitting]] the model, producing conclusions that fail to generalise to other datasets. One approach, especially when there are strong correlations between different possible explanatory variables, is to reduce them to a few principal components and then run the regression against them, a method called [[principal component regression]].
Line 410:
<li>
'''Compute the cumulative energy content for each eigenvector'''
* The eigenvalues represent the distribution of the source data's energy{{Clarify|date=March 2011}} among each of the eigenvectors, where the eigenvectors form a [[basis (linear algebra)|basis]] for the data. The cumulative energy content ''g'' for the ''j''th eigenvector is the sum of the energy content across all of the eigenvalues from 1 through ''j'' divided by the sum of energy content across all eigenvalues (shown in step 8):{{Citation needed|date=March 2011}} <math display="block">g_j = \sum_{k=1}^j D_{kk} \qquad \text{for } j = 1,\dots,p </math>
 
* The eigenvalues represent the distribution of the source data's energy{{Clarify|date=March 2011}} among each of the eigenvectors, where the eigenvectors form a [[basis (linear algebra)|basis]] for the data. The cumulative energy content ''g'' for the ''j''th eigenvector is the sum of the energy content across all of the eigenvalues from 1 through ''j'':{{Citation needed|date=March 2011}} <math display="block">g_j = \sum_{k=1}^j D_{kk} \qquad \text{for } j = 1,\dots,p </math>
</li>
<li>
Line 540 ⟶ 539:
Market research has been an extensive user of PCA. It is used to develop customer satisfaction or customer loyalty scores for products, and with clustering, to develop market segments that may be targeted with advertising campaigns, in much the same way as factorial ecology will locate geographical areas with similar characteristics.<ref>{{Cite journal |last1=DeSarbo |first1=Wayne |last2=Hausmann |first2=Robert |last3=Kukitz |first3=Jeffrey |date=2007 |title=Restricted principal components analysis for marketing research |url=https://www.researchgate.net/publication/247623679 |journal=Journal of Marketing in Management |volume=2 |pages=305–328 |via=ResearchGate}}</ref>
 
PCA rapidly transforms large amounts of data into smaller, easier-to-digest variables that can be more rapidly and readily analyzed. In any consumer questionnaire, there are series of questions designed to elicit consumer attitudes, and principal components seek out latent variables underlying these attitudes. For example, the Oxford Internet Survey in 2013 asked 2000 people about their attitudes and beliefs, and from these analysts extracted four principal component dimensions, which they identified as 'escape', 'social networking', 'efficiency', and 'problem creating'.<ref>{{Cite book |last1=Dutton |first1=William H |url=http://oxis.oii.ox.ac.uk/wp-content/uploads/2014/11/OxIS-2013.pdf |title=Cultures of the Internet: The Internet in Britain |last2=Blank |first2=Grant |publisher=Oxford Internet Institute |year=2013 |pages=6}}</ref>
 
Another example from Joe Flood in 2008 extracted an attitudinal index toward housing from 28 attitude questions in a national survey of 2697 households in Australia. The first principal component represented a general attitude toward property and home ownership. The index, or the attitude questions it embodied, could be fed into a General Linear Model of tenure choice. The strongest determinant of private renting by far was the attitude index, rather than income, marital status or household type.<ref>{{Cite journal |last=Flood |first=Joe |date=2008 |title=Multinomial Analysis for Housing Careers Survey |url=https://www.academia.edu/33218811 |access-date=6 May 2022 |website=Paper to the European Network for Housing Research Conference, Dublin}}</ref>