Count-distinct problem: Difference between revisions


Revision as of 08:32, 13 November 2024 by Pychron (talk | contribs), 31 edits, edit summary: m typo
Latest revision as of 12:59, 30 April 2025 by Frap (talk | contribs), Extended confirmed users, File movers, Pending changes reviewers, Rollbackers, 35,598 edits, no edit summary
(6 intermediate revisions by 4 users not shown)
Line 1:
{{Short description|~~Mahematical~~Problem ~~arithmertic~~in computer science}}
In computer science, the '''count-distinct problem'''<ref>{{cite journal | last1=Ullman | first1=Jeff |author1-link=Jeffrey Ullman| last2 = Rajaraman | first2 = Anand | last3=Leskovec | first3=Jure |author3-link=Jure Leskovec| title=Mining data streams | url=http://infolab.stanford.edu/~ullman/mmds/ch4.pdf}} </ref>

Line 52:
for a practical overview with comparative simulation results.

=== Python implementation of Knuth's CVM ~~Algorithm~~algorithm ===
<syntaxhighlight lang="python3" line="1">
from random import uniform


def algorithm_d(stream, s: int):
    """Estimate the number of distinct elements in stream with a buffer of at most s entries."""
    m = len(stream)  # We assume that this is given to us in advance.
    t = -1  # Note that Knuth indexes the stream from 1.
    p = 1
    buffer = []  # Holds [u, a] pairs.
    while t < (m - 1):
        t += 1
        a = stream[t]
        u = uniform(0, 1)
        # Discard any earlier entry for a.
        buffer = [entry for entry in buffer if entry[1] != a]
        if u < p:
            if len(buffer) < s:
                buffer.append([u, a])
            else:
                # The buffer is full: find the entry with the largest u.
                largest = max(buffer, key=lambda entry: entry[0])
                if u > largest[0]:
                    p = u  # The new element is rejected.
                else:
                    buffer.remove(largest)  # The new element replaces the largest entry.
                    buffer.append([u, a])
                    p = largest[0]
    return len(buffer) / p
</syntaxhighlight>

=== CVM algorithm ===
Compared to other approximation algorithms for the count-distinct problem, the CVM algorithm<ref>{{Cite ~~journal~~book |last1=Chakraborty |first1=Sourav |last2=Vinodchandran |first2=N. V. |last3=Meel |first3=Kuldeep S. |date=2022 |title=Distinct Elements in Streams: An Algorithm for the (Text) Book |series=Leibniz International Proceedings in Informatics (LIPIcs) |volume=244 |pages=6 pages, 727571 bytes |publisher=Schloss Dagstuhl – Leibniz-Zentrum für Informatik |doi=10.4230/LIPIcs.ESA.2022.34 |doi-access=free |arxiv=2301.10191 |isbn=978-3-95977-247-1 |issn=1868-8969}}</ref> (named by [[Donald Knuth]] after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. The CVM algorithm provides an unbiased estimator for the number of distinct elements in a stream,<ref name=":0" /> in addition to the standard (ε-δ) guarantees. Below is the CVM algorithm, including the slight modification by Donald Knuth.<ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>

{{nowrap|Initialize <math> p \leftarrow 1 </math>}}

Line 70 ⟶ 94:
            <math>p\leftarrow u</math>
        else
            Replace <math>(a',u')</math> with <math> (a_t, u) </math>
            <math>p\leftarrow u'</math>
{{nowrap|End For}}
{{nowrap|return <math> |B| / p </math>.}}

Knuth's modification improves on the previous version of the CVM algorithm by adding a while loop that ensures {{mvar|B}} is reduced.<ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>

{{nowrap|Initialize <math> p \leftarrow 1 </math>}}
Initialize max buffer size <math> s </math>, where <math> s \geq 1 </math>
Initialize an empty buffer, {{mvar|B}}
{{nowrap|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do:
    {{nowrap|If <math> a_t </math> is in {{mvar|B}} then}}
        {{nowrap|Delete <math> a_t </math> from {{mvar|B}}}}
    {{nowrap|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}}
    {{nowrap|If <math> u \leq p </math> then}}
        Insert <math> (a_t, u) </math> into {{mvar|B}}
    {{nowrap|While <math> |B| = s \wedge u < p </math> do}}
        Remove every element <math>(a', u')</math> of {{mvar|B}} with <math> u' > \frac{p}{2} </math>
        {{nowrap|<math> p \leftarrow \frac{p}{2} </math>}}
    {{nowrap|End While}}
    If <math> u < p </math> then
        Insert <math> (a_t, u) </math> into {{mvar|B}}
{{nowrap|End For}}
{{nowrap|return <math> |B| / p </math>.}}

Line 104 ⟶ 148:
[[Category:Statistical algorithms]]
[[Category:Articles with example Python (programming language) code]]
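The pseudocode listing with the while loop can be sketched in Python roughly as follows. This is a minimal illustration, not code from either revision: the function name `cvm_estimate`, the optional `rng` parameter, and the use of a dictionary for the buffer {{mvar|B}} are assumptions made here for readability. The loop tests `len(buffer) >= s` rather than equality, a defensive choice that should behave the same whenever the buffer never exceeds `s`.

```python
import random


def cvm_estimate(stream, s, rng=None):
    """Estimate the number of distinct items in stream using at most s buffer slots."""
    if s < 1:
        raise ValueError("s must be at least 1")
    rng = rng or random.Random()  # hypothetical hook so runs can be seeded
    p = 1.0
    buffer = {}  # maps each buffered element a' to its random value u'
    for a in stream:
        buffer.pop(a, None)  # delete a_t from B if it is present
        u = rng.random()  # random number in [0, 1)
        if u <= p:
            buffer[a] = u
        while len(buffer) >= s and u < p:
            # Remove every entry with u' > p/2, then halve p.
            buffer = {x: ux for x, ux in buffer.items() if ux <= p / 2}
            p /= 2
        if u < p:
            buffer[a] = u  # no effect if a_t survived the pruning above
    return len(buffer) / p


if __name__ == "__main__":
    # Fewer distinct items than buffer slots: p stays 1 and the count is exact.
    print(cvm_estimate([1, 2, 3, 2, 1], s=10))
    # A longer stream gives an approximation that varies with the seed.
    print(cvm_estimate(range(10000), s=256, rng=random.Random(1)))
```

When the stream has fewer than `s` distinct elements, `p` is never halved and the function returns the exact count. Otherwise the result is a randomized estimate whose spread shrinks as `s` grows.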