Data stream clustering: Difference between revisions

Content deleted Content added
No edit summary
No edit summary
Line 22:
STREAM is an algorithm for clustering data streams described by Guha, Mishra, Motwani and O'Callaghan <ref name=cds > S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1927 Clustering Data Streams]. Proceedings of the Annual Symposium on Foundations of Computer Science, 2000</ref> which achieves a constant factor approximation for the k-Median problem in a single pass and using small space.
 
'''''Theorem''': STREAM can solve the ''k''-Median problem on a data stream in a single pass, with time ''O(n<sup>1+e</sup>)'' and space ''θ(n<sup>ε</sup>)'' up to a factor ''2<sup>O(1/e)</sup>'', where ''n'' the number of points and ''e<1/2''.
 
To understand STREAM, the first step is to show that clustering can take place in small space (not caring about the number of passes). Small-Space is a [[divide-and-conquer algorithm]] that divides the data, ''S'', into <math>\ell</math> pieces, clusters each one of them (using ''k''-means) and then clusters the centers obtained.
Line 32:
 
<ol>
<li> Divide ''S'' into ''l''<math>\ell</math> disjoint pieces ''X''<sub>1</sub>,...,''X''<sub><math>_{\ell}</math></sub>''.</li>
<li> For each ''i'', find ''O(k)'' centers in ''X<sub>i</sub>''. Assign
each point in ''X<sub>i</sub>'' to its closest center.</li>