Data stream clustering: Difference between revisions

Bfoteini (talk | contribs)
No edit summary
Line 4:
 
== History ==
The problem of data stream clustering has recently attracted much attention because of its applicability to emerging applications that involve large amounts of streaming data, such as network flows, sensor data, and web click streams. One of the first results on data streams was due to Munro and Paterson,<ref>J. Munro and M. Paterson. Selection and Sorting with Limited Storage. ''Theoretical Computer Science'', pages 315-323, 1980.</ref> but the model was formalized much later by Henzinger, Raghavan, and Rajagopalan.<ref>M. Henzinger, P. Raghavan, and S. Rajagopalan. ''Computing on Data Streams''. Digital Equipment Corporation, TR-1998-011, August 1998.</ref> The method usually used for data stream clustering is [[k-means clustering|k-means]].
 
 
Line 14:
* The running time of the algorithm.
These algorithms have many similarities with [[online algorithms]], but the two are not identical. Unlike online algorithms, algorithms for data stream clustering have only a bounded amount of memory available, and they may take action after a group of points arrives, whereas online algorithms are required to take action after each point arrives.
 
Since data stream algorithms have limited memory available, the first goal is to show that clustering can be performed in small space, without regard to the number of passes over the data. Small-Space<ref>S. Guha, A. Meyerson, N. Mishra, R. Motwani, L. O’Callaghan, "Clustering Data Streams: Theory and Practice", IEEE Transactions on Knowledge and Data Engineering, Vol. 15, 2003</ref> is a [[divide-and-conquer algorithm]] that divides the data into pieces, clusters each piece (using k-means), and then clusters the resulting centers, where each center is weighted by the number of points assigned to it.
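
The divide-and-conquer idea can be illustrated with a minimal sketch. It is not the implementation from the cited paper: the function names <code>weighted_kmeans</code> and <code>small_space</code>, the <code>piece_size</code> parameter, the fixed iteration count, and the farthest-first choice of initial centers are all assumptions made for the example. Each piece is clustered with a plain Lloyd-style weighted k-means, and the resulting centers, weighted by the number of points assigned to them, are clustered again.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_kmeans(points, weights, k, iters=20):
    """Lloyd-style weighted k-means (illustrative helper, not from the paper).

    Returns (centers, center_weights); a center's weight is the total
    weight of the points assigned to it.
    """
    k = min(k, len(points))
    # Assumption: deterministic farthest-first initialisation.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    dim = len(points[0])
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        # Assignment step: each point goes to its nearest center.
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Update step: weighted mean; an empty cluster keeps its old center.
        centers = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centers[j]
            for j in range(k)
        ]
    kept = [(c, t) for c, t in zip(centers, totals) if t > 0]
    return [c for c, _ in kept], [t for _, t in kept]


def small_space(points, k, piece_size):
    """Cluster each piece, then cluster the weighted centers of all pieces."""
    if not points:
        raise ValueError("points must be non-empty")
    centers, weights = [], []
    for start in range(0, len(points), piece_size):
        piece = points[start:start + piece_size]
        c, w = weighted_kmeans(piece, [1.0] * len(piece), k)
        centers.extend(c)
        weights.extend(w)
    # Second level: the intermediate centers stand in for the original points.
    return weighted_kmeans(centers, weights, k)
```

For example, <code>small_space(points, k=2, piece_size=10)</code> on a list of coordinate tuples returns at most two final centers together with their weights. Only one piece and the intermediate centers need to be processed at a time, which is what keeps the working space small in this sketch.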
 
Some of the most well-known algorithms used for data stream clustering include: