Count-distinct problem: Difference between revisions

Content deleted Content added
Added URL to Knuth's note.
Further edited algorithm description
Line 54:
Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream<ref name=":0" />, in addition to the standard (ε-δ) guarantees<ref>{{Cite journal |last=Chakraborty |first=Sourav |last2=Vinodchandran |first2=N. V. |last3=Meel |first3=Kuldeep S. |date=2022 |title=Distinct Elements in Streams: An Algorithm for the (Text) Book |url=http://arxiv.org/abs/2301.10191 |pages=6 pages, 727571 bytes |doi=10.4230/LIPIcs.ESA.2022.34 |issn=1868-8969}}</ref>. Here is the algorithm:<ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>
 
Initialize a counter, {{mvarnowrap|t}},Initialize to zero, {{nowrap|<math> tp \leftarrow 01 </math>.}}
Initialize an empty buffer, {{mvar|B}}.
{{nowrap|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do:
{{nowrap|If <math> t \leftarrow t + 1a_t </math> is in {{mvar|B}} then}}
{{nowrap|Delete <math> a \leftarrow a_t </math> from {{mvar|B}}}}
{{nowrap|If <math> a </math> is in {{mvar|B}} then}}
{{nowrap|Delete <math> a </math> from {{mvar|B}}}}
{{nowrap|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}}
{{nowrap|If <math> u \leq p </math> then}}
Insert <math> (aa_t, u) </math> into {{mvar|B}}
{{nowrap|If <math> |B| > s </math> then}}
{{nowrap|<math> (a', u') \leftarrow (a, u) </math> of max <math> u </math> in {{mvar|B}}}}