Data stream clustering: Difference between revisions

Content deleted Content added
Bfoteini (talk | contribs)
No edit summary
Bfoteini (talk | contribs)
No edit summary
Line 1:
In [[computer science]], data stream [[clusteringcluster analysis | clustering]] is defined as the clustering of data that arrive continuously such as telephone records, multimedia data, financial transactions etc. Data stream clustering is usually studied under the [[streaming algorithm | data stream model]] of computation and the objective is, given a sequence of points, to maintainconstruct a consistently good clustering of the sequence observed so farstream, using a small amount of memory and time.
 
<!-- in contrary to the traditional clustering where data are static. -->
 
== History ==
amountData stream clustering has recently attracted attention for emerging applications that involve large amounts of streaming data. For clustering, [[k-means clustering | k-means]] is a widely used heuristic but alternate algorithms have also been developed such as network flows[[k-Medoids]], sensor[[CURE data, andclustering webalgorithm click| streams.CURE]] One ofand the firstpopular results[[BIRCH(data onclustering) | BIRCH]]. For data streams, wasone dueof tothe Munrofirst andresults Patersonappeared in 1980 <ref>J.Munro and M. Paterson. [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4567985 Selection and Sorting with Limited Storage]. ''Theoretical Computer Science'', pages 315-323, 1980</ref> but the model was formalized muchin later by Henzinger, Raghavan, and Rajagopalan1998 <ref>M. Henzinger, P. Raghavan, and S. Rajagopalan. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.9554 Computing on Data Streams]. ''Digital Equipment Corporation, TR-1998-011'', August 1998.</ref>. [[k-means clustering | K-means]] is a widely used heuristic for clustering but also alternate algorithms for clustering have been developed such as [[k-Medoids]], [[CURE data clustering algorithm | CURE]] and the popular [[BIRCH(data clustering) | BIRCH]].
The problem of data stream clustering has recently attracted much attention for its applicability to emerging applications that involve large
 
amount of streaming data such as network flows, sensor data, and web click streams. One of the first results on data streams was due to Munro and Paterson <ref>J.Munro and M. Paterson. [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4567985 Selection and Sorting with Limited Storage]. ''Theoretical Computer Science'', pages 315-323, 1980</ref> but the model was formalized much later by Henzinger, Raghavan, and Rajagopalan <ref>M. Henzinger, P. Raghavan, and S. Rajagopalan. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.9554 Computing on Data Streams]. ''Digital Equipment Corporation, TR-1998-011'', August 1998.</ref>. [[k-means clustering | K-means]] is a widely used heuristic for clustering but also alternate algorithms for clustering have been developed such as [[k-Medoids]], [[CURE data clustering algorithm | CURE]] and the popular [[BIRCH(data clustering) | BIRCH]].
 
== Definition ==
The problem of data stream clustering is defined as:
 
Input: a sequence of <math>n</math> points in metric space and an integer <math>k</math>.<br />
Output: <math>k</math> centers in the set of the <math>n</math> points so as to minimize the sum of distances from data points to their closest cluster centers.
 
== Algorithms ==
These<!-- algorithms have many similarities with [[online algorithms]] but they are not identical. Unlike online algorithms, algorithms for data stream clustering have only a bounded amount of memory available and they may be able to take action after a group of points arrives while online algorithms are required to take action after each point arrives. -->
Many algorithms have been proposed for the data stream clustering problem. The performance of an algorithm that operates on data streams is measured by three basic factors:
* The number of passes the algorithm must make over the stream.
* The available memory.
* The running time of the algorithm.
These algorithms have many similarities with [[online algorithms]] but they are not identical. Unlike online algorithms, algorithms for data stream clustering have only a bounded amount of memory available and they may be able to take action after a group of points arrives while online algorithms are required to take action after each point arrives.
 
=== STREAM ===
Since data stream algorithms have limited memory available, the first goal is to show that clustering can take place in small space (not caring about the number of passes). Small-Space<ref>S. Guha, A. Meyerson, N. Mishra, R. Motwani. ''[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1198387&tag=1 Clustering Data Streams: Theory and Practice]''. IEEE Transactions on Knowledge and Data Engineering, Volume 15, Issue 3, 2003 </ref> is a [[divide-and-conquer algorithm]] that divides the data into pieces, clusters each one of them (using k-means) and then clusters the centers obtained (where each center is weighted by the number of points assigned to it).
 
STREAM is an algorithm for clustering data streams described by Guha, Mishra, Motwani and O'Callaghan <ref>S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1927 Clustering Data Streams]. Proceedings of the Annual Symposium on Foundations of Computer Science, 2000</ref> which achieves a constant factor approximation algorithm for the k-Median problem in a single pass and using small space.
 
'''''Theorem''': STREAM can solve the k-Median problem on a data stream with time <math>O(n^{1+\epsilon })</math> and space <math>\theta(n^{\epsilon})</math> up to a factor <math>2^{O({1/ \epsilon})}</math>, where <math>n</math> the number of points and <math>\epsilon < 1/2</math>''.
 
Since data stream algorithms have limited memory available, theThe first goal is to show that clustering can take place in small space (not caring about the number of passes). Small-Space<ref>S. Guha, A. Meyerson, N. Mishra, R. Motwani, L. O''Callaghan. [http://ieeexploreciteseerx.ieeeist.orgpsu.edu/xplsviewdoc/abs_all.jspsummary?arnumber=1198387&tagdoi=10.1.1.32.1927 Clustering Data Streams: Theory and Practice]''. IEEEProceedings Transactionsof onthe KnowledgeAnnual andSymposium Dataon Engineering,Foundations Volumeof 15,Computer Issue 3Science, 2003 2000</ref> is a [[divide-and-conquer algorithm]] that divides the data into pieces, clusters each one of them (using k-means) and then clusters the centers obtained (where each center is weighted by the number of points assigned to it).
 
<blockquote>
Algorithm Small-Space(S)
 
1. Divide ''S'' into ''l'' disjoint pieces <math>\chi_1, \dots, \chi_l</math>.
2. For each ''i'', find ''O(k)'' centers in <math>\chi_i</math>, using ''k''-means. Assign
each point in <math>\chi_i</math> to its closest center.
3. Let <math>\chi'</math> be the ''O(lk)'' centers obtained in (2),
where each center ''c'' is weighted by the number
of points assigned to it.
4. Cluster <math>\chi'</math> to find ''k'' centers.
</blockquote>
 
[[File:Small-Space.jpg | thumb | 500x230px | center | Small-Space Algorithm representation]]
Line 21 ⟶ 40:
The approximation factor of this algorithm is <math>2c(1+2b)+2b</math> and if we generalize it so that it recursively calls itself, <math>i</math> times on a successively smaller set of weighted centers then it gives a constant factor approximation to the k-median problem, which, as expected, worsens with each successive reclustering.
 
STREAM is an algorithm for clustering data streams described by Guha, Mishra, Motwani and O'Callaghan <ref>S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1927 Clustering Data Streams]. Proceedings of the Annual Symposium on Foundations of Computer Science, 2000</ref> which achieves a constant factor approximation algorithm for the k-Median problem in a single pass.
 
Other well-known algorithms used for data stream clustering are:
Line 40 ⟶ 58:
 
Requirements for Clustering Data Streams, Daniel Barbara
 
<ref>S. Guha, A. Meyerson, N. Mishra, R. Motwani. ''[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1198387&tag=1 Clustering Data Streams: Theory and Practice]''. IEEE Transactions on Knowledge and Data Engineering, Volume 15, Issue 3, 2003 </ref>