Revision as of 18:47, 8 February 2010 edit Bfoteini (talk \| contribs) 70 edits →Algorithms ← Previous edit		Revision as of 19:00, 8 February 2010 edit undo Bfoteini (talk \| contribs) 70 edits No edit summary Next edit →
Line 17: === STREAM === STREAM is an algorithm for clustering data streams described by Guha, Mishra, Motwani and O'Callaghan <ref name=~~"CDS"~~cds > S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1927 Clustering Data Streams]. Proceedings of the Annual Symposium on Foundations of Computer Science, 2000</ref> which achieves a constant factor approximation for the k-Median problem in a single pass and using small space. '''''Theorem''': STREAM can solve the k-Median problem on a data stream with time O(n<sup>1+e</sup>) and space θ(n<sup>ε</sup>) up to a factor 2<sup>O(1/e)</sup>, where n the number of points and e<1/2''. Line 45: The actual STREAM algorithm works as follows: <ol> <li> Input the first ''O(M/k)'' points, where ''M'' denotes bounded memory and ''k'' number of centers. Using the randomized algorithm presented in <ref name=~~"CDS"~~cds /> cluster this to ''2k'' intermediate median points. <li> Use a local search algorithm to cluster'' O(M)'' intermediate medians of level ''i'' to ''2k'' medians of level ''i+1''. <li> Use the primal dual algorithm <ref> K. Jain and V. Vazirani. [http://portal.acm.org/citation.cfm?id=796509 Primal-dual approximation algorithms for metric facility ___location and k-median problems.] Proc. FOCS, 1999</ref> to cluster the ''O(M)'' medians to k medians. Line 53: Other well-known algorithms used for data stream clustering are: * [[BIRCH (data clustering) \| BIRCH]]<ref>T. Zhang, R. Ramakrishnan, M. Linvy. [http://doi.acm.org/10.1145/235968.233324 BIRCH: An Efficient Data Clustering Method for Very Large Databases], Proceedings of the ACM SIGMOD Conference on Management of Data, 1996</ref> : builds a hierarchical data structure to incrementally cluster the incoming points using the available memory and minimizing the amount of I/O required. The complexity of the algorithm is O(N) since one pass suffices to get a good clustering (though, results can be improved by allowing several passes). * COBWEB == References == <blockquote> <references> <references/> </blockquote> {{Reflist}} == Notes ==

Data stream clustering: Difference between revisions