Microarray analysis techniques: Difference between revisions

Merged from Gene chip analysis following unopposed 2017 proposal; see Talk:Microarray analysis techniques#Merger proposal
Line 1:
[[Image:Microarray2.gif|thumb|350px|Example of an approximately 40,000 probe spotted oligo microarray with enlarged inset to show detail.]]
'''Microarray analysis techniques''' are used to interpret the data generated by experiments on DNA ('''Gene chip analysis'''), RNA, and protein [[microarray]]s, which allow researchers to investigate the expression state of a large number of genes (in many cases, an organism's entire [[genome]]) in a single experiment.{{citation needed|date=February 2015}} Such experiments can generate very large amounts of data, allowing researchers to assess the overall state of a cell or organism. Data in such quantities are difficult, if not impossible, to analyze without the help of computer programs.
 
==Introduction==
Line 23:
===Identification of significant differential expression===
Many strategies exist to identify array probes that show an unusual level of over-expression or under-expression. The simplest is to call "significant" any probe whose average expression differs at least twofold between treatment groups. More sophisticated approaches are often related to [[t-test]]s or other tests that take both effect size and variability into account. Curiously, the p-values associated with particular genes do not reproduce well between replicate experiments, whereas gene lists generated by straight fold change reproduce much better.<ref>{{cite journal |vauthors=Shi L, Reid LH, Jones WD, etal |title=The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements |journal=Nat. Biotechnol. |volume=24 |issue=9 |pages=1151–61 |year=2006 |pmid=16964229 |doi=10.1038/nbt1239 |pmc=3272078}}</ref><ref>{{cite journal |vauthors=Guo L, Lobenhofer EK, Wang C, etal |title=Rat toxicogenomic study reveals analytical consistency across microarray platforms |journal=Nat. Biotechnol. |volume=24 |issue=9 |pages=1162–9 |year=2006 |pmid=17061323 |doi=10.1038/nbt1238}}</ref> This observation matters because an experiment is useful only to the extent that its findings generalize beyond the particular samples measured. The MAQC group recommends using a fold change assessment plus a non-stringent p-value cutoff, further pointing out that changes in the background correction and scaling process have only a minimal impact on the rank order of fold change differences, but a substantial impact on p-values.
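
The combination of a fold change cutoff with a non-stringent p-value cutoff can be sketched as follows. This is a minimal illustration, not the MAQC procedure itself: the probe names, intensities, and cutoffs are hypothetical, and an exact permutation test on the difference in means stands in for the t-test so that the example needs only the Python standard library.

```python
from itertools import combinations
from math import log2
from statistics import mean


def log2_fold_change(treated, control):
    """Log2 ratio of mean treated intensity to mean control intensity."""
    return log2(mean(treated) / mean(control))


def permutation_p_value(treated, control):
    """Two-sided exact permutation p-value for the difference in group means.

    Assumption: used here in place of a t-test to stay within the stdlib.
    """
    pooled = list(treated) + list(control)
    indices = range(len(pooled))
    observed = abs(mean(treated) - mean(control))
    total = 0
    extreme = 0
    # Enumerate every way of relabelling len(treated) values as "treated".
    for chosen in combinations(indices, len(treated)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance so that ties with the observed value are counted.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


def select_probes(data, min_abs_log2_fc=1.0, max_p=0.05):
    """Keep probes passing both cutoffs, ranked by absolute log2 fold change."""
    hits = []
    for probe, (treated, control) in data.items():
        lfc = log2_fold_change(treated, control)
        p = permutation_p_value(treated, control)
        if abs(lfc) >= min_abs_log2_fc and p <= max_p:
            hits.append((probe, lfc, p))
    return sorted(hits, key=lambda hit: abs(hit[1]), reverse=True)


# Hypothetical intensities: probe -> (treated replicates, control replicates)
example = {
    "probe_A": ([820, 790, 845, 810], [200, 210, 190, 205]),   # large, consistent change
    "probe_B": ([300, 310, 295, 305], [290, 300, 310, 298]),   # essentially unchanged
    "probe_C": ([900, 150, 160, 1000], [200, 210, 190, 205]),  # large but noisy change
}
selected = select_probes(example)
```

With these made-up values only <code>probe_A</code> is selected: <code>probe_B</code> fails the twofold cutoff, and <code>probe_C</code> exceeds twofold on average but is too variable to pass even the lenient p-value cutoff.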
 
=== Clustering ===
Clustering is a data mining technique used to group genes with similar expression patterns. [[Hierarchical clustering]] and [[k-means clustering]] are widely used techniques in microarray analysis.
 
==== Hierarchical clustering ====
{{main|Hierarchical clustering}}
Hierarchical clustering is a statistical method for finding relatively [[Homogeneity and heterogeneity#Homogeneity|homogeneous]] clusters. It consists of two separate phases. First, a [[distance matrix]] containing all the pairwise distances between the genes is calculated. [[Pearson product-moment correlation coefficient|Pearson’s correlation]] and [[Spearman's rank correlation coefficient|Spearman’s correlation]] are often used as the basis for dissimilarity estimates (for example, one minus the correlation), but other measures, such as [[Taxicab geometry|Manhattan distance]] or [[Euclidean distance]], can also be applied. Given the number of distance measures available and their influence on the clustering results, several studies have compared and evaluated different distance measures for the clustering of microarray data, considering their intrinsic properties and robustness to noise.<ref name=Gentleman>{{cite book|last1=Gentleman|first1=Robert|title=Bioinformatics and computational biology solutions using R and Bioconductor|date=2005|publisher=Springer Science+Business Media|location=New York|isbn=978-0-387-29362-2|display-authors=etal}}</ref><ref name=Jaskowiak2013>{{cite journal|last1=Jaskowiak|first1=Pablo A.|last2=Campello|first2=Ricardo J.G.B.|last3=Costa|first3=Ivan G.|title=Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis|journal=IEEE/ACM Transactions on Computational Biology and Bioinformatics|volume=10|issue=4|pages=845–857|doi=10.1109/TCBB.2013.9|year=2013}}</ref><ref name=Jaskowiak2014>{{cite journal|last1=Jaskowiak|first1=Pablo A|last2=Campello|first2=Ricardo JGB|last3=Costa|first3=Ivan G|title=On the selection of appropriate distances for gene expression data clustering|journal=BMC Bioinformatics|volume=15|issue=Suppl 2|pages=S2|doi=10.1186/1471-2105-15-S2-S2|pmid=24564555|pmc=4072854|year=2014}}</ref> Second, after the initial distance matrix has been calculated, the hierarchical clustering algorithm either (A) iteratively joins the two closest clusters, starting from single data points (the agglomerative, bottom-up approach, which is the more commonly used), or (B) iteratively partitions clusters, starting from the complete set (the divisive, top-down approach). After each step, the distances between the newly formed clusters and the other clusters are recalculated. Hierarchical cluster analysis methods include:
 
*Single linkage (minimum method, nearest neighbor)
*Average linkage ([[UPGMA]])
*Complete linkage (maximum method, furthest neighbor)
 
Several studies have shown empirically that the single-linkage clustering algorithm produces poor results when applied to gene expression microarray data and should therefore be avoided.<ref name="Jaskowiak2014" /><ref name="Souto2011" />
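
The two phases of the agglomerative approach, and the three linkage criteria listed above, can be illustrated with a small sketch. It assumes a dissimilarity of one minus Pearson's correlation, uses hypothetical gene names and expression profiles, and favors readability over efficiency; it is not taken from any particular microarray package.

```python
from itertools import combinations
from statistics import mean


def pearson_distance(x, y):
    """Dissimilarity defined as 1 - Pearson correlation.

    Assumes neither profile is constant (otherwise the variance is zero).
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return 1.0 - cov / (var_x * var_y) ** 0.5


# Linkage criteria: how gene-to-gene distances are combined into a
# cluster-to-cluster distance.
LINKAGE = {"single": min, "average": mean, "complete": max}


def agglomerative(profiles, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    names = list(profiles)
    # Phase 1: pairwise distance matrix between all genes.
    dist = {
        frozenset((a, b)): pearson_distance(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }
    combine = LINKAGE[linkage]
    clusters = [[name] for name in names]  # start from single data points

    def cluster_distance(c1, c2):
        # Recomputed from the gene-level matrix at every step; for these
        # three linkages this matches updating a cluster-level matrix.
        return combine([dist[frozenset((a, b))] for a in c1 for b in c2])

    # Phase 2: iteratively join the two closest clusters.
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Hypothetical expression profiles across four conditions.
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.1, 3.9, 6.2, 8.0],  # rises like gene1
    "gene3": [4.0, 3.1, 2.0, 0.9],  # falls
    "gene4": [8.0, 6.0, 4.1, 2.0],  # falls like gene3
}
result = agglomerative(profiles, n_clusters=2, linkage="average")
```

Because the dissimilarity is correlation-based, <code>gene1</code> and <code>gene2</code> are grouped together despite their different magnitudes, as are <code>gene3</code> and <code>gene4</code>. A Euclidean distance would weigh absolute expression levels instead, which is one reason the choice of distance measure affects the result.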
 
==== K-means clustering ====
{{main|k-means clustering}}
K-means clustering is an algorithm for grouping genes or samples into ''K'' groups based on their expression patterns. Grouping is done by minimizing the sum of the squared distances between each data point and the [[centroid]] of the cluster to which it is assigned. The purpose of K-means clustering is thus to classify data based on similar expression.<ref>www.biostat.ucsf.edu</ref> The K-means clustering algorithm and some of its variants (including [[k-medoids]]) have been shown to produce good results for gene expression data (at least better than hierarchical clustering methods). Empirical comparisons of [[k-means]], [[k-medoids]], hierarchical methods, and different distance measures can be found in the literature.<ref name="Jaskowiak2014" /><ref name=Souto2011>{{cite journal|last1=de Souto|first1=Marcilio C. P.|last2=Costa|first2=Ivan G.|last3=de Araujo|first3=Daniel S. A.|last4=Ludermir|first4=Teresa B.|last5=Schliep|first5=Alexander|title=Clustering cancer gene expression data: a comparative study|journal=BMC Bioinformatics|volume=9|issue=1|pages=497|doi=10.1186/1471-2105-9-497|year=2008}}</ref>
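
The minimization described above is commonly carried out with Lloyd's algorithm, which alternates between assigning each profile to its nearest centroid and recomputing each centroid as the mean of its members. The sketch below makes several simplifying assumptions that are not from the source: squared Euclidean distance, initial centroids taken from the first ''K'' profiles (random or k-means++ initialization is more usual in practice), and hypothetical data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    # Simplistic, deterministic initialization: the first k profiles.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Objective minimized by k-means: sum of squared distances to centroids."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical expression profiles (one row per gene, one column per condition).
points = [
    [1.0, 1.2, 0.9],
    [5.0, 5.1, 4.8],
    [1.1, 0.9, 1.0],
    [4.9, 5.2, 5.0],
]
labels, centroids = k_means(points, k=2)
objective = within_cluster_ss(points, labels, centroids)
```

For this toy input the first and third profiles end up in one cluster and the second and fourth in the other. Lloyd's algorithm finds only a local minimum of the objective, so real analyses typically rerun it from several initializations and keep the solution with the smallest within-cluster sum of squares.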
 
===Pattern recognition===
Line 117 ⟶ 134:
<ref name="R7">Zang, S., R. Guo, et al. (2007). "Integration of statistical inference methods and a novel control measure to improve sensitivity and specificity of data analysis in expression profiling studies." Journal of Biomedical Informatics 40(5): 552&ndash;560</ref>
}}
 
==External links==
* [https://web.archive.org/web/20130525084842/http://arrayexplorer.com/ ArrayExplorer - Compare microarray side by side to find the one that best suits your research needs]
Line 125 ⟶ 143:
* [https://doi.org/10.1016/B978-0-12-809633-8.20163-5 Comparative Transcriptomics Analysis] in [https://www.sciencedirect.com/science/referenceworks/9780128096338 Reference Module in Life Sciences]
* [https://web.archive.org/web/20090615060922/http://www-stat-class.stanford.edu/~tibs/clickwrap/sam.html SAM download instructions]
* [http://mmjggl.caltech.edu/microarray/data_analysis_fundamentals_manual.pdf GeneChip® Expression Analysis-Data Analysis Fundamentals] (by Affymetrix)
* [http://www.stat.duke.edu/~mw/ABS04/RefInfo/data_analysis_fundamentals_manual.pdf Duke data_analysis_fundamentals_manual]
 
[[Category:Microarrays]]