Count-distinct problem: Difference between revisions

Content deleted Content added
top: Typo fixing, fixed "Mahematical" by choosing a different "short description"
No edit summary
 
(One intermediate revision by one other user not shown)
Line 52:
for a practical overview with comparative simulation results.
 
=== Python Implementationimplementation of Knuth's CVM Algorithmalgorithm ===
<syntaxhighlight lang="python3" line="1">
def algorithm_d(stream, s: int):
m = len(stream) # We assume that this is given to us in advance.
t = -1 # Note that Knuth indexes the stream from 1.
p = 1
a = 0
Line 63:
t += 1
a = stream[t]
u = uniform(0, 1)
buffer = list(filter(lambda x : x[1] != a, buffer))
if u < p:
if (len(buffer) < s):
buffer.append([u, a])
else:
buffer = sorted(buffer)
p = max(buffer[-1][0], u)
buffer.pop()
buffer.append([u, a])
return len(buffer) / p
</syntaxhighlight>
 
=== CVM Algorithmalgorithm ===
Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm<ref>{{Cite journalbook |last1=Chakraborty |first1=Sourav |last2=Vinodchandran |first2=N. V. |last3=Meel |first3=Kuldeep S. |date=2022 |title=Distinct Elements in Streams: An Algorithm for the (Text) Book |series=Leibniz International Proceedings in Informatics (LIPIcs) |volume=244 |pages=6 pages, 727571 bytes |publisher=Schloss Dagstuhl – Leibniz-Zentrum für Informatik |doi=10.4230/LIPIcs.ESA.2022.34 |doi-access=free |arxiv=2301.10191 |isbn=978-3-95977-247-1 |issn=1868-8969}}</ref> (named by [[Donald Knuth]] after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream,<ref name=":0" /> in addition to the standard (ε-δ) guarantees. Below is the CVM algorithm, including the slight modification by Donald Knuth. <ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>
 
{{nowrap|Initialize <math> p \leftarrow 1 </math>}}
Line 148:
 
[[Category:Statistical algorithms]]
[[Category:Articles with example Python (programming language) code]]