Microarray analysis techniques: Difference between revisions

{{Merge|Gene chip analysis|Significance analysis of microarrays|discuss=talk:Microarray analysis techniques#Merger proposal|date=May 2017}}
 
[[Image:Microarray2.gif|thumb|350px|Example of an approximately 40,000 probe spotted oligo microarray with enlarged inset to show detail.]]
'''Microarray analysis techniques''' are used in interpreting the data generated from experiments on DNA, RNA, and protein [[microarray]]s, which allow researchers to investigate the expression state of a large number of genes - in many cases, an organism's entire [[genome]] - in a single experiment.{{citation needed|date=February 2015}} Such experiments can generate very large amounts of data, allowing researchers to assess the overall state of a cell or organism. Data in such large quantities is difficult - if not impossible - to analyze without the help of computer programs.
Specialized software tools for statistical analysis to determine the extent of over- or under-expression of a gene in a microarray experiment relative to a reference state have also been developed to aid in identifying genes or gene sets associated with particular [[phenotype]]s. One such method of analysis, known as [[Gene Set Enrichment]] Analysis (GSEA), uses a [[Kolmogorov-Smirnov]]-style statistic to identify groups of genes that are regulated together.<ref>{{cite journal |vauthors=Subramanian A, Tamayo P, Mootha VK, etal |title=Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles |journal=Proc. Natl. Acad. Sci. U.S.A. |volume=102 |issue=43 |pages=15545–50 |year=2005 |pmid=16199517 |doi=10.1073/pnas.0506580102 |pmc=1239896}}</ref> This third-party statistics package offers the user information on the genes or gene sets of interest, including links to entries in databases such as NCBI's [[GenBank]] and curated databases such as Biocarta<ref>{{cite web |url=http://www.biocarta.com/ |title=BioCarta - Charting Pathways of Life |accessdate=2007-12-31 |format= |website=}}</ref> and [[Gene Ontology]]. The protein complex enrichment analysis tool (COMPLEAT) provides similar enrichment analysis at the level of protein complexes.<ref>{{cite journal |vauthors=Vinayagam A, Hu Y, Kulkarni M, Roesel C, etal |title= Protein Complex-Based Analysis Framework for High-Throughput Data Sets. 6, rs5 (2013). |journal= Sci. Signal. |volume=6 |issue=r5 |year=2013 |pmid= 23443684 |doi= 10.1126/scisignal.2003629 |url= http://www.flyrnai.org/compleat/ |pages=rs5 |pmc=3756668}}</ref> The tool can identify dynamic protein complex regulation under different conditions or time points.
Related systems, PAINT<ref>{{cite web |url=http://www.dbi.tju.edu/dbi/staticpages.php?page=tools&menu=37 |title=DBI Web |accessdate=2007-12-31 |format= |website= |deadurl=yes |archiveurl=https://web.archive.org/web/20070705061522/http://www.dbi.tju.edu/dbi/staticpages.php?page=tools |archivedate=2007-07-05 |df= }}</ref> and SCOPE,<ref>{{cite web |url=http://genie.dartmouth.edu/scope/ |title=SCOPE |accessdate=2007-12-31 |format= |website=}}</ref> perform a statistical analysis on gene promoter regions, identifying over- and under-representation of previously identified [[transcription factor]] response elements. Another statistical analysis tool is Rank Sum Statistics for Gene Set Collections (RssGsc), which uses rank sum probability distribution functions to find gene sets that explain experimental data.<ref>{{cite web |url=http://rssgsc.sourceforge.net/ |title=RssGsc |accessdate=2008-10-15 |format= |website=}}</ref> A further approach is contextual meta-analysis, i.e. finding out how a gene cluster responds to a variety of experimental contexts. [[Genevestigator]] is a public tool to perform contextual meta-analysis across contexts such as anatomical parts, stages of development, and response to diseases, chemicals, stresses, and [[neoplasms]].
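
As a rough illustration of what a Kolmogorov-Smirnov-style enrichment statistic looks like, the sketch below walks down a ranked gene list and tracks a running sum that rises on gene-set members and falls otherwise. It is an unweighted simplification written for this article, not the published GSEA implementation (which weights hits by their correlation with the phenotype); the function name <code>enrichment_score</code> and its arguments are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted KS-style running-sum statistic (illustrative sketch only).

    ranked_genes : list of gene identifiers, ordered from most up-regulated
                   to most down-regulated
    gene_set     : set of gene identifiers to test for enrichment
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0  # signed maximum deviation of the running sum from zero
    for is_hit in hits:
        # Step up on a member of the set, step down on a non-member.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A score near +1 indicates that the set is concentrated at the top of the ranking, and a score near -1 that it is concentrated at the bottom.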
 
===Significance analysis of microarrays (SAM)===
[[Image:SAM.png|thumb|right]]
{{main|Significance analysis of microarrays}}
'''Significance analysis of microarrays (SAM)''' is a [[statistics|statistical technique]], established in 2001 by Virginia Tusher, [[Robert Tibshirani]] and [[Gilbert Chu]], for determining whether changes in [[gene expression]] are statistically significant. With the advent of [[DNA microarray]]s, it is now possible to measure the expression of thousands of genes in a single hybridization experiment. The data generated is considerable, and a method for sorting out what is significant and what isn't is essential. SAM is distributed by [[Stanford University]] in an [[R (programming language)|R-package]].
 
SAM identifies statistically significant genes by carrying out gene-specific [[Student's t-test|t-tests]] and computes a statistic ''d<sub>j</sub>'' for each gene ''j'', which measures the strength of the relationship between gene expression and a response variable.<ref name="R1">Chu, G., Narasimhan, B, Tibshirani, R, Tusher, V. "SAM 'Significance Analysis of Microarrays' Users Guide and technical document." [http://www-stat.stanford.edu/~tibs/SAM/sam.pdf]</ref><ref name="R7"/><ref name="R8">Zhang, S. (2007). "A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance." BMC Bioinformatics 8: 230.</ref> This analysis uses [[non-parametric statistics]], since the data may not follow a [[normal distribution]]. The response variable describes and groups the data based on experimental conditions. In this method, repeated [[permutations]] of the data are used to determine whether the expression of any gene is significantly related to the response. The use of permutation-based analysis accounts for correlations in genes and avoids [[wikt:Special:Search/parametric|parametric]] assumptions about the distribution of individual genes. This is an advantage over other techniques (e.g., [[ANOVA]] and [[Bonferroni correction]]), which assume equal variance and/or independence of genes.<ref name="R6"/>
 
===Basic protocol===
*Perform [[microarray]] experiments &mdash; DNA microarray with oligo and cDNA primers, SNP arrays, protein arrays, etc.
*Input Expression Analysis in Microsoft Excel &mdash; see below
*Run SAM as a Microsoft Excel add-in
*Adjust the Delta tuning parameter to get a significant number of genes along with an acceptable false discovery rate (FDR), and assess sample size by calculating the mean difference in expression in the SAM Plot Controller
*List Differentially Expressed Genes (Positively and Negatively Expressed Genes)
 
===Running SAM===
*SAM is available for download online at http://www-stat.stanford.edu/~tibs/SAM/ for academic and non-academic users after completion of a registration step.
*SAM is run as an Excel add-in. The SAM Plot Controller allows customization of the false discovery rate and Delta, while the SAM Plot and SAM Output functions generate a list of significant genes, a Delta table, and an assessment of sample sizes
*[[Permutations]] are calculated based on the number of samples
*Block Permutations
**Blocks are batches of microarrays; for example, for eight samples split into two groups (control and affected) there are 4!=24 permutations for each block and the total number of permutations is (24)(24)= 576. A minimum of 1000 permutations is recommended;<ref name="R1"/><ref name="R2">Dinu, I. P., JD; Mueller, T; Liu, Q; Adewale, AJ; Jhangri, GS; Einecke, G; Famulski, KS; Halloran, P; Yasui, Y. (2007). "Improving gene set analysis of microarray data by SAM-GS." BMC Bioinformatics 8: 242.</ref><ref name="R3">Jeffery, I. H., DG; Culhane, AC. (2006). "Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data." BMC Bioinformatics 7: 359.</ref> the number of permutations is set by the user when entering the values for the data set to run SAM
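
The block-permutation arithmetic in the example above can be checked with a few lines of Python. This is only a sketch of the counting rule as stated here (the factorial of each block size, multiplied across blocks); the helper name <code>block_permutation_count</code> is hypothetical and is not part of SAM.

```python
from math import factorial, prod

def block_permutation_count(block_sizes):
    """Number of within-block permutations: the product of n! over all blocks."""
    return prod(factorial(n) for n in block_sizes)

# Eight samples arranged as two blocks of four, as in the example above.
total = block_permutation_count([4, 4])
print(total)  # 4! * 4! = 24 * 24 = 576
```

Because 576 is below the recommended minimum of 1000, a design of this size cannot supply the recommended number of distinct permutations.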
 
====Response formats====
'''Types'''<ref name="R1"/>
:*'''Quantitative''' &mdash; real-valued (such as heart rate)
:*'''One class''' &mdash; tests whether the mean gene expression differs from zero
:*'''Two class''' &mdash; two sets of measurements
::*'''Unpaired''' &mdash; measurement units are different in the two groups; e.g. control and treatment groups with samples from different patients
::*'''Paired''' &mdash; same experimental units are measured in the two groups; e.g. samples before and after treatment from the same patients
:*'''Multiclass''' &mdash; more than two groups with each containing different experimental units; generalization of two class unpaired type
:*'''Survival''' &mdash; data of a time until an event (for example death or relapse)
:*'''Time course''' &mdash; each experimental unit is measured at more than one time point; experimental units fall into a one or two class design
:*'''Pattern discovery''' &mdash; no explicit response parameter is specified; the user specifies an eigengene (principal component) of the expression data and treats it as a quantitative response
 
===Algorithm===
SAM calculates a test statistic for relative difference in gene expression based on permutation analysis of expression data and calculates a false discovery rate. The principal calculations of the program are illustrated below.<ref name="R1"/><ref name="R7"/><ref name="R8"/>
 
[[Image:Samcalc.jpg]] [[Image:RandS.jpg]]
 
The test statistic computed for each gene is

<math> d_{i} = \frac{r_{i}}{s_{i} + s_{o}}, \quad i = 1, 2, \ldots, p </math>

where <math>r_{i}</math> is the linear regression coefficient of gene <math>i</math>, <math>s_{i}</math> is the standard error of <math>r_{i}</math>, and <math>s_{o}</math> is a constant chosen to minimize the coefficient of variation of <math>d_{i}</math>. <math>r_{i}</math> is computed from the expression levels (x) for gene <math>i</math> under y experimental conditions.
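
For a concrete feel for the statistic ''d<sub>i</sub>'' = ''r<sub>i</sub>'' / (''s<sub>i</sub>'' + ''s<sub>o</sub>''), the sketch below evaluates it for a single gene in the two-class unpaired case. It assumes that ''r<sub>i</sub>'' is the difference in group means and ''s<sub>i</sub>'' the pooled standard error, as in an ordinary t-test, and it takes ''s<sub>o</sub>'' as a value supplied by the caller rather than estimating it from the data. The function name <code>sam_d</code> is hypothetical.

```python
from math import sqrt
from statistics import mean

def sam_d(group1, group2, s0=0.0):
    """Relative difference d = r / (s + s0) for one gene, two-class unpaired.

    group1, group2 : expression values of the gene in each class
    s0             : small positive constant added to the denominator
                     (assumed given here; SAM chooses it from the data)
    """
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    r = m2 - m1  # numerator: difference in mean expression
    ss = sum((v - m1) ** 2 for v in group1) + sum((v - m2) ** 2 for v in group2)
    s = sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))  # pooled standard error
    return r / (s + s0)

d_plain = sam_d([1, 2, 3], [4, 5, 6])           # s0 = 0 gives the usual t-statistic
d_damped = sam_d([1, 2, 3], [4, 5, 6], s0=0.5)  # a positive s0 shrinks |d|
```

Adding ''s<sub>o</sub>'' to the denominator keeps genes with very small variance from producing very large statistics.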
 
<math>\mathrm{False \ discovery \ rate \ (FDR) = \frac{Median \ (or \ 90^{th} \ percentile) \ of \ \# \ of \ falsely \ called \ genes}{Number \ of \ genes \ called \ significant}}</math>
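
The ratio above is simple to evaluate once the permutations have been run. The sketch below assumes that the number of falsely called genes has already been counted for each permutation and uses the median variant of the formula; the function name <code>estimate_fdr</code> and the sample counts are hypothetical.

```python
from statistics import median

def estimate_fdr(false_calls_per_permutation, n_called_significant):
    """Median number of falsely called genes divided by the number called significant."""
    if n_called_significant == 0:
        return 0.0  # nothing was called, so there is no rate to report
    return median(false_calls_per_permutation) / n_called_significant

# Hypothetical counts from five permutations, with 30 genes called significant.
fdr = estimate_fdr([2, 4, 3, 5, 1], 30)  # median is 3, so the estimate is 3 / 30
```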
 
'''Fold changes''' (t) are specified to guarantee that genes called significant change by at least a pre-specified amount. This means that the ratio of the average expression levels of a gene under the two conditions must be greater than the fold change (t) for the gene to be called positive and less than the inverse of the fold change (1/t) for it to be called negative.
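
A minimal sketch of such a fold-change filter follows. It assumes the criterion is applied to the ratio of the mean expression levels in the two conditions and that those means are positive, unlogged values; the function name <code>fold_change_call</code> and the labels it returns are hypothetical.

```python
def fold_change_call(mean_cond1, mean_cond2, t):
    """Classify a gene by the ratio of its mean expression in two conditions.

    Returns "positive" if the ratio exceeds t, "negative" if it falls below 1/t,
    and "not called" otherwise. Assumes positive, unlogged means and t > 1.
    """
    ratio = mean_cond2 / mean_cond1
    if ratio > t:
        return "positive"
    if ratio < 1 / t:
        return "negative"
    return "not called"

print(fold_change_call(100, 250, t=2))  # ratio 2.5 is above 2
print(fold_change_call(200, 80, t=2))   # ratio 0.4 is below 0.5
print(fold_change_call(100, 150, t=2))  # ratio 1.5 lies between 0.5 and 2
```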
 
The SAM algorithm can be stated as:
#Order test statistics according to magnitude <ref name="R7"/><ref name="R8"/>
#For each permutation compute the ordered null (unaffected) scores <ref name="R7"/><ref name="R8"/>
#Plot the ordered test statistic against the expected null scores <ref name="R7"/><ref name="R8"/>
#Call each gene significant if the absolute value of the test statistic for that gene minus the mean test statistic for that gene is greater than a stated threshold <ref name="R8"/>
#Estimate the false discovery rate based on expected versus observed values <ref name="R7"/><ref name="R8"/>
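
The five steps above can be sketched in plain Python for the two-class unpaired case. This is a simplified illustration under several assumptions, not the SAM implementation: the statistic is the difference in means over the pooled standard error plus a fixed ''s<sub>o</sub>'', labels are shuffled at random rather than enumerated, each gene is called by comparing its own rank with the threshold, and the false discovery rate uses the median count without any further correction. All function and parameter names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, median

def d_stat(values, labels, s0):
    """d = (mean2 - mean1) / (pooled standard error + s0) for one gene."""
    g1 = [v for v, lab in zip(values, labels) if lab == 0]
    g2 = [v for v, lab in zip(values, labels) if lab == 1]
    m1, m2 = mean(g1), mean(g2)
    ss = sum((v - m1) ** 2 for v in g1) + sum((v - m2) ** 2 for v in g2)
    s = sqrt((1 / len(g1) + 1 / len(g2)) * ss / (len(g1) + len(g2) - 2))
    return (m2 - m1) / (s + s0)

def sam_sketch(data, labels, delta, s0=0.1, n_perm=200, seed=0):
    """data: one list of expression values per gene; labels: 0/1 per sample."""
    rng = random.Random(seed)
    n_genes = len(data)

    # Step 1: order the observed test statistics.
    stats = [d_stat(row, labels, s0) for row in data]
    order = sorted(range(n_genes), key=lambda g: stats[g])
    observed = [stats[g] for g in order]

    # Step 2: ordered null scores for each permutation of the labels.
    null_sorted = []
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        null_sorted.append(sorted(d_stat(row, perm, s0) for row in data))

    # Step 3: expected null score at each rank (these are the plot's x-values).
    expected = [mean(p[k] for p in null_sorted) for k in range(n_genes)]

    # Step 4: call a gene when observed minus expected exceeds delta in size.
    gaps = [o - e for o, e in zip(observed, expected)]
    called_ranks = [k for k in range(n_genes) if abs(gaps[k]) > delta]
    called = [order[k] for k in called_ranks]

    # Step 5: median count of null scores beyond the cutoffs, over genes called.
    pos = [observed[k] for k in called_ranks if gaps[k] > 0]
    neg = [observed[k] for k in called_ranks if gaps[k] < 0]
    cut_up = min(pos) if pos else float("inf")
    cut_low = max(neg) if neg else float("-inf")
    false_counts = [sum(1 for d in p if d >= cut_up or d <= cut_low)
                    for p in null_sorted]
    fdr = median(false_counts) / len(called) if called else 0.0
    return observed, expected, called, fdr
```

Raising <code>delta</code> calls fewer genes and generally lowers the estimated false discovery rate, which is the trade-off adjusted through the Delta tuning parameter.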
 
====Output====
*Significant gene sets
**Positive gene set &mdash; higher expression of most genes in the gene set correlates with higher values of the phenotype y
**Negative gene set &mdash; lower expression of most genes in the gene set correlates with higher values of the phenotype y
 
===SAM features===
*Data from oligo or cDNA arrays, SNP arrays, protein arrays, etc. can be utilized in SAM<ref name="R7"/><ref name="R8"/>
*Correlates expression data to clinical parameters<ref name="R6"/>
*Correlates expression data with time<ref name="R1"/>
*Uses data permutation to estimate the false discovery rate for multiple testing<ref name="R7"/><ref name="R8"/><ref name="R6"/><ref name="R5">Larsson, O. W., C; Timmons, JA. (2005). "Considerations when using the significance analysis of microarrays (SAM) algorithm." BMC Bioinformatics 6: 129.</ref>
*Reports local false discovery rate (the FDR for genes having a similar d<sub>i</sub> as that gene)<ref name="R1"/> and miss rates<ref name="R1"/><ref name="R7"/>
*Can work with blocked designs, in which treatments are applied within different batches of arrays<ref name="R1"/>
*Can adjust the threshold determining the number of genes called significant<ref name="R1"/>
 
==Error correction and quality control==
==References==
{{reflist|refs=
<ref name="R6">Tusher, V. G., R. Tibshirani, et al. (2001). "Significance analysis of microarrays applied to the ionizing radiation response." Proceedings of the National Academy of Sciences 98(9): 5116&ndash;5121. [http://www-stat.stanford.edu/~tibs/SAM/pnassam.pdf]</ref>
<ref name="R7">Zang, S., R. Guo, et al. (2007). "Integration of statistical inference methods and a novel control measure to improve sensitivity and specificity of data analysis in expression profiling studies." Journal of Biomedical Informatics 40(5): 552&ndash;560.</ref>
}}
==External links==
* [http://funrich.org/ FunRich - Perform gene set enrichment analysis] &mdash;software
* [https://doi.org/10.1016/B978-0-12-809633-8.20163-5 Comparative Transcriptomics Analysis] in [https://www.sciencedirect.com/science/referenceworks/9780128096338 Reference Module in Life Sciences]
* [https://web.archive.org/web/20090615060922/http://www-stat-class.stanford.edu/~tibs/clickwrap/sam.html SAM download instructions]
 
 
[[Category:Microarrays]]