Content deleted Content added
repetition |
Added a new algorithm for estimating the count-distinct problem |
||
Line 50:
</ref>
for a practical overview with comparative simulation results.
==CVM Algorithm==
Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream, however the bound on error is not currently known. Here is the algorithm<ref>{{cite journal |last1=Knuth |first1=Donald |title=The CVM Algorithm for Estimating Distinct Elements in Streams |date=May 2023}}</ref>:
Initialize a counter, {{mvar|t}}, to zero, {{nowrap|<math> t \leftarrow 0 </math>.}}
Initialize an empty buffer, {{mvar|B}}.
{{nowrap|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do:
{{nowrap|<math> t \leftarrow t + 1 </math>}}
{{nowrap|<math> a \leftarrow a_t </math>}}
{{nowrap|If <math> a </math> is in {{mvar|B}} then}}
{{nowrap|Delete <math> a </math> from {{mvar|B}}}}
{{nowrap|End If}}
{{nowrap|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}}
{{nowrap|If <math> u \geq p </math> then}}
{{nowrap|NEXT}}
{{nowrap|Else If <math> |B| < s </math> then}}
{{nowrap|Insert <math> (a, u) </math> into {{mvar|B}}}}
{{nowrap|NEXT}}
{{nowrap|Else}}
{{nowrap|<math> (a', u') \leftarrow (a, u) </math> of max <math> u </math> in {{mvar|B}}}}
{{nowrap|If <math> u > u' </math> then}}
{{nowrap|<math> p \leftarrow u </math>}}
{{nowrap|Else}}
{{nowrap|Delete <math> (a', u') </math> from {{mvar|B}}}}
{{nowrap|Insert <math> (a, u) </math> into {{mvar|B}}}}
{{nowrap|<math> p \leftarrow u' </math>}}
{{nowrap|End If}}
{{nowrap|End If}}
{{nowrap|End For}}
{{nowrap|return <math> |B| / p </math>.}}
==Weighted count-distinct problem==
|