Microarray analysis techniques: Difference between revisions

m Reverted edits by 217.149.179.66 (talk) (AV)
Citation bot (talk | contribs)
Add: doi-access. | Use this bot. Report bugs. | Suggested by Abductive | #UCB_webform 1534/3844
Line 14:
Comparing two different arrays or two different samples hybridized to the same array generally involves making adjustments for systematic errors introduced by differences in procedures and dye intensity effects. Dye normalization for two-color arrays is often achieved by [[local regression]]. LIMMA provides a set of tools for background correction and scaling, as well as an option to average on-slide duplicate spots.<ref>{{cite web |url=http://bioinf.wehi.edu.au/limma/ |title=LIMMA Library: Linear Models for Microarray Data |access-date=2008-01-01 }}</ref> A common way to evaluate how well an array has been normalized is to draw an [[MA plot]] of the data. MA plots can be produced using programs and languages such as R, MATLAB, and Excel.{{cn|date=March 2023}}
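As an illustration only, the quantities shown in an MA plot can be computed as below. The sketch assumes the usual definitions, with M as the log ratio and A as the mean log intensity of the two channels; the function name <code>ma_values</code> and the intensity values are hypothetical.

```python
import math

def ma_values(red, green):
    """Return (M, A) pairs for paired two-color spot intensities.

    Assumed definitions: M = log2(R) - log2(G), A = (log2(R) + log2(G)) / 2.
    """
    pairs = []
    for r, g in zip(red, green):
        m = math.log2(r) - math.log2(g)          # log ratio of the two channels
        a = 0.5 * (math.log2(r) + math.log2(g))  # mean log intensity
        pairs.append((m, a))
    return pairs

# Hypothetical background-corrected intensities for three spots.
red = [1024.0, 512.0, 64.0]
green = [256.0, 512.0, 256.0]
ma = ma_values(red, green)
```

Plotting M against A for every spot gives the MA plot; for well-normalized data the points are expected to scatter around M = 0 across the range of A.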
 
Raw Affymetrix data contains about twenty probes for the same RNA target. Half of these are "mismatch spots", which do not precisely match the target sequence. These can theoretically measure the amount of nonspecific binding for a given target. Robust Multi-array Average (RMA)<ref>{{cite journal|last=Irizarry|first=RA|author2=Hobbs, B |author3=Collin, F |author4=Beazer-Barclay, YD |author5=Antonellis, KJ |author6=Scherf, U |author7= Speed, TP |title=Exploration, normalization, and summaries of high density oligonucleotide array probe level data.|journal=Biostatistics|volume=4|issue=2|pages=249–64|year=2003|pmid=12925520 |doi=10.1093/biostatistics/4.2.249|doi-access=free}}</ref> is a normalization approach that does not take advantage of these mismatch spots but must still summarize the perfect-match probes through [[median polish]].<ref>{{cite journal |vauthors=Bolstad BM, Irizarry RA, Astrand M, Speed TP |title=A comparison of normalization methods for high density oligonucleotide array data based on variance and bias |journal=Bioinformatics |volume=19 |issue=2 |pages=185–93 |year=2003 |pmid=12538238 |doi=10.1093/bioinformatics/19.2.185|doi-access=free }}</ref> The median polish algorithm, although robust, behaves differently depending on the number of samples analyzed.<ref>{{cite journal |vauthors=Giorgi FM, Bolger AM, Lohse M, Usadel B |title=Algorithm-driven Artifacts in median polish summarization of Microarray data |journal=BMC Bioinformatics |volume=11 |pages=553 |year=2010 |pmid=21070630 |doi=10.1186/1471-2105-11-553 |pmc=2998528 |doi-access=free }}</ref> Quantile normalization, also part of RMA, is one sensible approach to normalizing a batch of arrays so that further comparisons are meaningful.
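The idea behind quantile normalization can be sketched as follows. This is a minimal illustration, not the RMA implementation: it assumes every array has the same number of probes, ignores special handling of tied values, and uses a hypothetical function name and made-up intensities.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    arrays: list of equal-length lists, one list of intensities per array.
    Returns new lists; the inputs are not modified.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from smallest to largest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by the reference quantile
        normalized.append(out)
    return normalized

# Two hypothetical arrays with three probes each.
raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(raw)
```

After this step each array contains the same set of values, so the arrays differ only in which probe holds which quantile.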
 
The current Affymetrix MAS5 algorithm, which uses both perfect match and mismatch probes, continues to enjoy popularity and to do well in head-to-head tests.<ref>{{cite journal |vauthors=Lim WK, Wang K, Lefebvre C, Califano A |title=Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks |journal=Bioinformatics |volume=23 |issue=13 |pages=i282–8 |year=2007 |pmid=17646307 |doi=10.1093/bioinformatics/btm201|doi-access=free }}</ref>
Line 29:
==== Hierarchical clustering ====
{{main|Hierarchical clustering}}
Hierarchical clustering is a statistical method for finding relatively [[Homogeneity and heterogeneity#Homogeneity|homogeneous]] clusters. Hierarchical clustering consists of two separate phases. Initially, a [[distance matrix]] containing all the pairwise distances between the genes is calculated. [[Pearson product-moment correlation coefficient|Pearson's correlation]] and [[Spearman's rank correlation coefficient|Spearman's correlation]] are often used as dissimilarity estimates, but other methods, like [[Taxicab geometry|Manhattan distance]] or [[Euclidean distance]], can also be applied. Given the number of distance measures available and their influence in the clustering algorithm results, several studies have compared and evaluated different distance measures for the clustering of microarray data, considering their intrinsic properties and robustness to noise.<ref name=Gentleman>{{cite book|last1=Gentleman|first1=Robert|title=Bioinformatics and computational biology solutions using R and Bioconductor|date=2005|publisher=Springer Science+Business Media|location=New York|isbn=978-0-387-29362-2|display-authors=etal}}</ref><ref name=Jaskowiak2013>{{cite journal|last1=Jaskowiak|first1=Pablo A.|last2=Campello|first2=Ricardo J.G.B.|last3=Costa|first3=Ivan G.|title=Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis|journal=IEEE/ACM Transactions on Computational Biology and Bioinformatics|volume=10|issue=4|pages=845–857|doi=10.1109/TCBB.2013.9|pmid=24334380|year=2013|s2cid=760277}}</ref><ref name=Jaskowiak2014>{{cite journal|last1=Jaskowiak|first1=Pablo A|last2=Campello|first2=Ricardo JGB|last3=Costa|first3=Ivan G|title=On the selection of appropriate distances for gene expression data clustering|journal=BMC Bioinformatics|volume=15|issue=Suppl 2|pages=S2|doi=10.1186/1471-2105-15-S2-S2|pmid=24564555|pmc=4072854|year=2014 |doi-access=free }}</ref> After calculation of the initial distance matrix, 
the hierarchical clustering algorithm either (A) iteratively joins the two closest clusters, starting from single data points (agglomerative, bottom-up approach, which is the more commonly used), or (B) iteratively partitions clusters, starting from the complete set (divisive, top-down approach). After each step, the distances between the newly formed clusters and the other clusters are recalculated. Hierarchical cluster analysis methods include:
* Single linkage (minimum method, nearest neighbor)
* Average linkage ([[UPGMA]])
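
The agglomerative procedure with single or average linkage can be sketched as below. This is a simplified illustration under stated assumptions: it uses Euclidean distance, recomputes linkage distances from the raw profiles at each step instead of updating a stored distance matrix, and stops at a requested number of clusters. All function names and the example profiles are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def linkage_distance(c1, c2, points, method):
    """Distance between two clusters given as lists of point indices."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if method == "single":      # nearest neighbor
        return min(dists)
    if method == "average":     # mean of all pairwise distances (UPGMA)
        return sum(dists) / len(dists)
    raise ValueError("unknown linkage method: " + method)

def agglomerate(points, n_clusters, method="average"):
    """Bottom-up clustering: merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start from single data points
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], points, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # join the two closest clusters
        del clusters[b]
    return clusters

# Four hypothetical genes measured in two samples.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
clusters = agglomerate(profiles, n_clusters=2, method="average")
```

Swapping <code>euclidean</code> for a correlation-based dissimilarity would change which clusters are considered closest, which is why the choice of distance measure matters for the result.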
Line 38:
==== K-means clustering ====
{{main|k-means clustering}}
K-means clustering is an algorithm for grouping genes or samples into ''K'' groups based on their expression pattern. Grouping is done by minimizing the sum of the squares of distances between the data and the corresponding cluster [[centroid]]. Thus the purpose of K-means clustering is to classify data based on similar expression.<ref>{{cite web |url=http://www.biostat.ucsf.edu/ |title=Home |website=biostat.ucsf.edu}}</ref> The K-means clustering algorithm and some of its variants (including [[k-medoids]]) have been shown to produce good results for gene expression data (at least better than hierarchical clustering methods). Empirical comparisons of [[k-means]], [[k-medoids]], hierarchical methods, and different distance measures can be found in the literature.<ref name="Jaskowiak2014" /><ref name=Souto2011>{{cite journal|last1=de Souto|first1=Marcilio C. P.|last2=Costa|first2=Ivan G.|last3=de Araujo|first3=Daniel S. A.|last4=Ludermir|first4=Teresa B.|last5=Schliep|first5=Alexander|title=Clustering cancer gene expression data: a comparative study|journal=BMC Bioinformatics|volume=9|issue=1|pages=497|doi=10.1186/1471-2105-9-497|pmid=19038021|pmc=2632677|year=2008 |doi-access=free }}</ref>
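
A minimal sketch of the standard iterative (Lloyd-style) procedure for K-means is given below. It assumes squared Euclidean distance and, purely for reproducibility, initializes the centroids with the first ''K'' data points; practical implementations usually use random or more careful initialization. Function names and data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100):
    """Alternate assignment and centroid update until the labels stop changing."""
    # Assumption: naive initialization from the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Four hypothetical expression profiles forming two obvious groups.
data = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [9.0, 8.0]]
labels, centroids = kmeans(data, k=2)
```

Each update step cannot increase the within-cluster sum of squared distances, but the result depends on the starting centroids, so the procedure is often repeated from several initializations.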
 
===Pattern recognition===
Line 64:
* [[Permutations]] are calculated based on the number of samples
* Block Permutations
** Blocks are batches of microarrays; for example, for eight samples split into two groups (control and affected), there are 4! = 24 permutations for each block and the total number of permutations is (24)(24) = 576. A minimum of 1000 permutations is recommended;<ref name="R1"/><ref name="R2">{{cite journal | last1 = Dinu | first1 = I. P. | last2 = JD | last3 = Mueller | first3 = T | last4 = Liu | first4 = Q | last5 = Adewale | first5 = AJ | last6 = Jhangri | first6 = GS | last7 = Einecke | first7 = G | last8 = Famulski | first8 = KS | last9 = Halloran | first9 = P | last10 = Yasui | first10 = Y. | year = 2007 | title = Improving gene set analysis of microarray data by SAM-GS. | journal = BMC Bioinformatics | volume = 8 | page = 242 | doi=10.1186/1471-2105-8-242| pmid = 17612399 | pmc = 1931607 | doi-access = free }}</ref><ref name="R3">{{cite journal | last1 = Jeffery | first1 = I. H. | last2 = DG | last3 = Culhane | first3 = AC. | year = 2006 | title = Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data | journal = BMC Bioinformatics | volume = 7 | page = 359 | doi=10.1186/1471-2105-7-359| pmid = 16872483 | pmc = 1544358 | doi-access = free }}</ref>
the number of permutations is set by the user when entering the values for the data set to run SAM
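
The block permutation count in the example can be checked with a few lines of code. The sketch follows the counting convention used in the example above (the factorial of the block size for each block, multiplied across blocks); the function name is hypothetical and this is not part of SAM itself.

```python
import math

def block_permutation_count(block_sizes):
    """Multiply the per-block counts, taking n! permutations for a block of n arrays."""
    total = 1
    for size in block_sizes:
        total *= math.factorial(size)
    return total

# Eight samples arranged as two blocks of four arrays each.
total = block_permutation_count([4, 4])
```

Because permutations are restricted to within each block, the total is the product of the per-block counts instead of the factorial of the full sample size.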
 
Line 106:
* Correlates expression data to clinical parameters<ref name="R6"/>
* Correlates expression data with time<ref name="R1"/>
* Uses data permutation to estimate the false discovery rate for multiple testing<ref name="R7"/><ref name="R8"/><ref name="R6"/><ref name="R5">{{cite journal | last1 = Larsson | first1 = O. W. C | last2 = Timmons | first2 = JA. | year = 2005 | title = Considerations when using the significance analysis of microarrays (SAM) algorithm | journal = BMC Bioinformatics | volume = 6 | page = 129 | doi = 10.1186/1471-2105-6-129 | pmid = 15921534 | pmc = 1173086 | doi-access = free }}</ref>
* Reports local false discovery rate (the FDR for genes having a similar d<sub>i</sub> as that gene)<ref name="R1"/> and miss rates<ref name="R1"/><ref name="R7"/>
* Can work with a blocked design, in which treatments are applied within different batches of arrays<ref name="R1"/>