Revision as of 21:12, 20 May 2024 edit 128.100.1.66 (talk) Added URL to Knuth's note. Tag: Visual edit ← Previous edit		Revision as of 21:21, 20 May 2024 edit undo 128.100.1.66 (talk) Further edited algorithm description Next edit →
Line 54: Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream<ref name=":0" />, in addition to the standard (ε-δ) guarantees<ref>{{Cite journal \|last=Chakraborty \|first=Sourav \|last2=Vinodchandran \|first2=N. V. \|last3=Meel \|first3=Kuldeep S. \|date=2022 \|title=Distinct Elements in Streams: An Algorithm for the (Text) Book \|url=http://arxiv.org/abs/2301.10191 \|pages=6 pages, 727571 bytes \|doi=10.4230/LIPIcs.ESA.2022.34 \|issn=1868-8969}}</ref>. Here is the algorithm:<ref name=":0">{{cite journal \|last1=Knuth \|first1=Donald \|date=May 2023 \|title=The CVM Algorithm for Estimating Distinct Elements in Streams \|url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf \|journal=}}</ref> ~~Initialize a counter,~~ {{~~mvar~~nowrap\|~~t}},~~Initialize ~~to zero, {{nowrap\|~~<math> tp \leftarrow 01 </math>.}} Initialize an empty buffer, {{mvar\|B}}. {{nowrap\|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do: {{nowrap\|If <math> ~~t \leftarrow t + 1~~a_t </math> is in {{mvar\|B}} then}} {{nowrap\|Delete <math> ~~a \leftarrow~~ a_t </math> from {{mvar\|B}}}} ~~{{nowrap\|If <math> a </math> is in {{mvar\|B}} then}}~~ ~~{{nowrap\|Delete <math> a </math> from {{mvar\|B}}}}~~ {{nowrap\|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}} {{nowrap\|If <math> u \leq p </math> then}} Insert <math> (aa_t, u) </math> into {{mvar\|B}} {{nowrap\|If <math> \|B\| > s </math> then}} {{nowrap\|<math> (a', u') \leftarrow (a, u) </math> of max <math> u </math> in {{mvar\|B}}}}

Count-distinct problem: Difference between revisions