Revision as of 20:32, 27 October 2023 edit Retimuko (talk \| contribs) Extended confirmed users 7,458 edits repetition ← Previous edit		Revision as of 21:04, 5 December 2023 edit undo Neg12345 (talk \| contribs) 1 edit Added a new algorithm for estimating the count-distinct problem Next edit →
Line 50: </ref> for a practical overview with comparative simulation results. ==CVM Algorithm== Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream, however the bound on error is not currently known. Here is the algorithm<ref>{{cite journal \|last1=Knuth \|first1=Donald \|title=The CVM Algorithm for Estimating Distinct Elements in Streams \|date=May 2023}}</ref>: Initialize a counter, {{mvar\|t}}, to zero, {{nowrap\|<math> t \leftarrow 0 </math>.}} Initialize an empty buffer, {{mvar\|B}}. {{nowrap\|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do: {{nowrap\|<math> t \leftarrow t + 1 </math>}} {{nowrap\|<math> a \leftarrow a_t </math>}} {{nowrap\|If <math> a </math> is in {{mvar\|B}} then}} {{nowrap\|Delete <math> a </math> from {{mvar\|B}}}} {{nowrap\|End If}} {{nowrap\|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}} {{nowrap\|If <math> u \geq p </math> then}} {{nowrap\|NEXT}} {{nowrap\|Else If <math> \|B\| < s </math> then}} {{nowrap\|Insert <math> (a, u) </math> into {{mvar\|B}}}} {{nowrap\|NEXT}} {{nowrap\|Else}} {{nowrap\|<math> (a', u') \leftarrow (a, u) </math> of max <math> u </math> in {{mvar\|B}}}} {{nowrap\|If <math> u > u' </math> then}} {{nowrap\|<math> p \leftarrow u </math>}} {{nowrap\|Else}} {{nowrap\|Delete <math> (a', u') </math> from {{mvar\|B}}}} {{nowrap\|Insert <math> (a, u) </math> into {{mvar\|B}}}} {{nowrap\|<math> p \leftarrow u' </math>}} {{nowrap\|End If}} {{nowrap\|End If}} {{nowrap\|End For}} {{nowrap\|return <math> \|B\| / p </math>.}} ==Weighted count-distinct problem==

Count-distinct problem: Difference between revisions