Content deleted Content added
No edit summary |
No edit summary |
||
Line 1:
In [[computer science]], data stream [[
<!-- in contrary to the traditional clustering where data are static. -->
== History ==
▲amount of streaming data such as network flows, sensor data, and web click streams. One of the first results on data streams was due to Munro and Paterson <ref>J.Munro and M. Paterson. [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4567985 Selection and Sorting with Limited Storage]. ''Theoretical Computer Science'', pages 315-323, 1980</ref> but the model was formalized much later by Henzinger, Raghavan, and Rajagopalan <ref>M. Henzinger, P. Raghavan, and S. Rajagopalan. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.9554 Computing on Data Streams]. ''Digital Equipment Corporation, TR-1998-011'', August 1998.</ref>. [[k-means clustering | K-means]] is a widely used heuristic for clustering but also alternate algorithms for clustering have been developed such as [[k-Medoids]], [[CURE data clustering algorithm | CURE]] and the popular [[BIRCH(data clustering) | BIRCH]].
== Definition ==
The problem of data stream clustering is defined as:
Input: a sequence of <math>n</math> points in metric space and an integer <math>k</math>.<br />
Output: <math>k</math> centers in the set of the <math>n</math> points so as to minimize the sum of distances from data points to their closest cluster centers.
== Algorithms ==
▲These algorithms have many similarities with [[online algorithms]] but they are not identical. Unlike online algorithms, algorithms for data stream clustering have only a bounded amount of memory available and they may be able to take action after a group of points arrives while online algorithms are required to take action after each point arrives.
=== STREAM ===
Since data stream algorithms have limited memory available, the first goal is to show that clustering can take place in small space (not caring about the number of passes). Small-Space<ref>S. Guha, A. Meyerson, N. Mishra, R. Motwani. ''[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1198387&tag=1 Clustering Data Streams: Theory and Practice]''. IEEE Transactions on Knowledge and Data Engineering, Volume 15, Issue 3, 2003 </ref> is a [[divide-and-conquer algorithm]] that divides the data into pieces, clusters each one of them (using k-means) and then clusters the centers obtained (where each center is weighted by the number of points assigned to it).▼
STREAM is an algorithm for clustering data streams described by Guha, Mishra, Motwani and O'Callaghan <ref>S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1927 Clustering Data Streams]. Proceedings of the Annual Symposium on Foundations of Computer Science, 2000</ref> which achieves a constant factor approximation
'''''Theorem''': STREAM can solve the k-Median problem on a data stream with time <math>O(n^{1+\epsilon })</math> and space <math>\theta(n^{\epsilon})</math> up to a factor <math>2^{O({1/ \epsilon})}</math>, where <math>n</math> the number of points and <math>\epsilon < 1/2</math>''.
▲
<blockquote>
Algorithm Small-Space(S)
1. Divide ''S'' into ''l'' disjoint pieces <math>\chi_1, \dots, \chi_l</math>.
2. For each ''i'', find ''O(k)'' centers in <math>\chi_i</math>, using ''k''-means. Assign
each point in <math>\chi_i</math> to its closest center.
3. Let <math>\chi'</math> be the ''O(lk)'' centers obtained in (2),
where each center ''c'' is weighted by the number
of points assigned to it.
4. Cluster <math>\chi'</math> to find ''k'' centers.
</blockquote>
[[File:Small-Space.jpg | thumb | 500x230px | center | Small-Space Algorithm representation]]
Line 21 ⟶ 40:
The approximation factor of this algorithm is <math>2c(1+2b)+2b</math> and if we generalize it so that it recursively calls itself, <math>i</math> times on a successively smaller set of weighted centers then it gives a constant factor approximation to the k-median problem, which, as expected, worsens with each successive reclustering.
▲STREAM is an algorithm for clustering data streams described by Guha, Mishra, Motwani and O'Callaghan <ref>S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1927 Clustering Data Streams]. Proceedings of the Annual Symposium on Foundations of Computer Science, 2000</ref> which achieves a constant factor approximation algorithm for the k-Median problem in a single pass.
Other well-known algorithms used for data stream clustering are:
Line 40 ⟶ 58:
Requirements for Clustering Data Streams, Daniel Barbara
<ref>S. Guha, A. Meyerson, N. Mishra, R. Motwani. ''[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1198387&tag=1 Clustering Data Streams: Theory and Practice]''. IEEE Transactions on Knowledge and Data Engineering, Volume 15, Issue 3, 2003 </ref>
|