Content deleted Content added
Added URL to Knuth's note. |
Further edited algorithm description |
||
Line 54:
Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream<ref name=":0" />, in addition to the standard (ε-δ) guarantees<ref>{{Cite journal |last=Chakraborty |first=Sourav |last2=Vinodchandran |first2=N. V. |last3=Meel |first3=Kuldeep S. |date=2022 |title=Distinct Elements in Streams: An Algorithm for the (Text) Book |url=http://arxiv.org/abs/2301.10191 |pages=6 pages, 727571 bytes |doi=10.4230/LIPIcs.ESA.2022.34 |issn=1868-8969}}</ref>. Here is the algorithm:<ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>
Initialize an empty buffer, {{mvar|B}}
{{nowrap|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do:
{{nowrap|If <math>
{{nowrap|Delete <math>
{{nowrap|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}}
{{nowrap|If <math> u \leq p </math> then}}
Insert <math> (
{{nowrap|If <math> |B| > s </math> then}}
{{nowrap|<math> (a', u') \leftarrow (a, u) </math> of max <math> u </math> in {{mvar|B}}}}
|