Count-distinct problem: Difference between revisions

Revision as of 21:12, 20 May 2024 by 128.100.1.66 (edit summary: Added URL to Knuth's note). Latest revision as of 12:59, 30 April 2025 by Frap (no edit summary).
(28 intermediate revisions by 13 users not shown)
{{Short description|Problem in computer science}}
In computer science, the '''count-distinct problem'''<ref>{{cite journal | last1=Ullman | first1=Jeff | author1-link=Jeffrey Ullman | last2=Rajaraman | first2=Anand | last3=Leskovec | first3=Jure | author3-link=Jure Leskovec | title=Mining data streams | url=http://infolab.stanford.edu/~ullman/mmds/ch4.pdf}}</ref>

==Formal definition==
: '''Instance''': Consider a stream of elements <math>x_1, x_2, \ldots, x_s</math> with repetitions. Let <math>n</math> denote the number of distinct elements in the stream, with the set of distinct elements represented as <math>\{e_1, e_2, \ldots, e_n\}</math>.
: '''Objective''': Find an estimate <math>\widehat{n}</math> of <math>n</math> using only <math>m</math> storage units, where <math>m \ll n</math>.

The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element <math>e_j</math> is associated with a uniform [[Random variable|random variable]], <math>h(e_j) \sim U(0,1)</math>, the expected minimum value of <math>h(e_1), h(e_2), \ldots, h(e_n)</math> is <math>1/(n+1)</math>. The hash function guarantees that <math>h(e_j)</math> is identical for all appearances of <math>e_j</math>. Thus, the existence of duplicates does not affect the value of the extreme order statistics.

There are estimation techniques other than min/max sketches. The first paper on count-distinct estimation<ref>{{cite journal | last1=Flajolet | first1=Philippe | author1-link=Philippe Flajolet | last2=Martin | first2=G. Nigel | year=1985 | title=Probabilistic counting algorithms for data base applications | journal=J. Comput. Syst. Sci.
| volume=31 | issue=2 | pages=182–209 | doi=10.1016/0022-0000(85)90041-8 | url=https://hal.inria.fr/inria-00076244/file/RR-0313.pdf}}</ref> describes the [[Flajolet–Martin algorithm]], a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by [[Daniel Kane (mathematician)|Daniel M. Kane]], [[Jelani Nelson]], and David P. Woodruff.<ref name=optimalf0>{{cite journal | last1=Kane | first1=Daniel M. | last2=Nelson | first2=Jelani | last3=Woodruff | first3=David P. | year=2010 | authorlink1=Daniel Kane (mathematician) | authorlink2=Jelani Nelson | title=An Optimal Algorithm for the Distinct Elements Problem | journal=Proceedings of the 29th Annual ACM Symposium on Principles of Database Systems (PODS) | url=https://dash.harvard.edu/bitstream/handle/1/13820438/f0.pdf;sequence=1}}</ref>

===Bottom-''m'' sketches===

for a practical overview with comparative simulation results.

=== Python implementation of Knuth's CVM algorithm ===

The following Python function implements Knuth's version of the CVM algorithm:<ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>

<syntaxhighlight lang="python3" line="1">
from random import random


def algorithm_d(stream, s: int) -> float:
    """Estimate the number of distinct elements in `stream`.

    `s` is the maximum buffer size (s >= 1). Elements must be hashable.
    """
    p = 1.0
    buffer = {}  # maps each sampled element a to its random value u
    for a in stream:
        buffer.pop(a, None)  # forget any earlier occurrence of a
        u = random()  # uniform in [0, 1)
        if u >= p:
            continue
        if len(buffer) < s:
            buffer[a] = u
            continue
        # The buffer is full: locate the entry with the largest u.
        a_max = max(buffer, key=buffer.get)
        u_max = buffer[a_max]
        if u > u_max:
            p = u  # the new element is not stored
        else:
            del buffer[a_max]  # replace the largest entry with (a, u)
            buffer[a] = u
            p = u_max
    return len(buffer) / p
</syntaxhighlight>

=== CVM algorithm ===

Compared to other approximation algorithms for the count-distinct problem, the CVM algorithm<ref>{{Cite book |last1=Chakraborty |first1=Sourav |last2=Vinodchandran |first2=N. V. |last3=Meel |first3=Kuldeep S. |date=2022 |title=Distinct Elements in Streams: An Algorithm for the (Text) Book |series=Leibniz International Proceedings in Informatics (LIPIcs) |volume=244 |pages=6 pages, 727571 bytes |publisher=Schloss Dagstuhl – Leibniz-Zentrum für Informatik |doi=10.4230/LIPIcs.ESA.2022.34 |doi-access=free |arxiv=2301.10191 |isbn=978-3-95977-247-1 |issn=1868-8969}}</ref> (named by [[Donald Knuth]] after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. The CVM algorithm provides an unbiased estimator for the number of distinct elements in a stream,<ref name=":0" /> in addition to the standard (ε-δ) guarantees.
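
In the notation of this article, and assuming the usual meaning of these two terms (the parameters <math>\varepsilon</math> and <math>\delta</math> are not defined elsewhere in this article and are introduced here only for illustration), unbiasedness and the (ε-δ) guarantee can be written as
: <math>\operatorname{E}[\widehat{n}] = n, \qquad \Pr\bigl[\,|\widehat{n} - n| > \varepsilon n\,\bigr] \leq \delta,</math>
where <math>\varepsilon > 0</math> is the target relative error and <math>0 < \delta < 1</math> is the allowed failure probability. Smaller values of either parameter generally call for a larger buffer.
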
Below is the CVM algorithm, including the slight modification by Donald Knuth.<ref name=":0" />

 {{nowrap|Initialize <math> p \leftarrow 1 </math>}}
 Initialize max buffer size <math> s </math>, where <math> s \geq 1 </math>
 Initialize an empty buffer, {{mvar|B}}
 {{nowrap|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do:
     {{nowrap|If <math> (a_t, u) </math> is in {{mvar|B}} for some <math> u </math> then}}
         {{nowrap|Delete <math> (a_t, u) </math> from {{mvar|B}}}}
     {{nowrap|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}}
     {{nowrap|If <math> u < p </math> then}}
         {{nowrap|If <math> |B| < s </math> then}}
             Insert <math> (a_t, u) </math> into {{mvar|B}}
         else
             Let <math>(a',u')</math> be the pair in {{mvar|B}} with <math>u' = \max\{u'' : (a'',u'') \in B\}</math> /* the pair whose <math>u'</math> is maximum in {{mvar|B}} */
             If <math> u > u' </math> then
                 <math>p \leftarrow u</math>
             else
                 Replace <math>(a',u')</math> with <math> (a_t, u) </math>
                 <math>p \leftarrow u'</math>
 {{nowrap|End For}}
 {{nowrap|return <math> |B| / p </math>.}}

The previous version of the CVM algorithm is improved with the following modification by Donald Knuth, which adds a while loop to ensure that {{mvar|B}} is reduced.
<ref name=":0" />

 {{nowrap|Initialize <math> p \leftarrow 1 </math>}}
 Initialize max buffer size <math> s </math>, where <math> s \geq 1 </math>
 Initialize an empty buffer, {{mvar|B}}
 {{nowrap|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do:
     {{nowrap|If <math> a_t </math> is in {{mvar|B}} then}}
         {{nowrap|Delete <math> a_t </math> from {{mvar|B}}}}
     {{nowrap|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}}
     {{nowrap|If <math> u \leq p </math> then}}
         Insert <math> (a_t, u) </math> into {{mvar|B}}
     {{nowrap|While <math> |B| \geq s \wedge u < p </math> do}}
         Remove every element <math> (a', u') </math> of {{mvar|B}} with <math> u' > \frac{p}{2} </math>
         {{nowrap|<math> p \leftarrow \frac{p}{2} </math>}}
     {{nowrap|End While}}
     If <math> u < p </math> then
         Insert <math> (a_t, u) </math> into {{mvar|B}}
 {{nowrap|End For}}
 {{nowrap|return <math> |B| / p </math>.}}

==Weighted count-distinct problem==

[[Category:Statistical algorithms]]
[[Category:Articles with example Python (programming language) code]]