Count-distinct problem: Difference between revisions

Content deleted Content added
Simplified to the version that's easy to understand
No edit summary
 
(22 intermediate revisions by 11 users not shown)
Line 1:
{{Short description|Problem in computer science}}
In computer science, the '''count-distinct problem'''<ref>{{cite journal | last1=Ullman | first1=Jeff |author1-link=Jeffrey Ullman| last2 = Rajaraman | first2 = Anand | last3=Leskovec | first3=Jure |author3-link=Jure Leskovec| title=Mining data streams | url=http://infolab.stanford.edu/~ullman/mmds/ch4.pdf}}
</ref>
Line 5 ⟶ 6:
 
==Formal definition==
: '''Instance''': Consider a stream of elements <math>x_1, x_2, \ldots, x_s </math> with repetitions, and an integer <math> m </math>. Let <math>n</math> denote the number of distinct elements in the stream, representedwith asthe <math>\{e_1,set e_2,of \ldots,distinct e_n\}</math>,elements whererepresented as <math>n = |\{e_1, e_2, \ldots, e_n\}|</math>.
 
: '''Objective''': Find an estimate <math> \widehat{n} </math> of <math> n </math> using only <math> m </math> storage units, where <math> m \ll n </math>.
Line 39 ⟶ 40:
The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element <math> e_j </math> is associated with a uniform [[Random variable|RV]], <math> h(e_j) \sim U(0,1) </math>, the expected minimum value of <math> h(e_1),h(e_2), \ldots, h(e_n) </math> is <math> 1/(n+1) </math>. The hash function guarantees that <math> h(e_j) </math> is identical for all the appearances of <math> e_j </math>. Thus, the existence of duplicates does not affect the value of the extreme order statistics.
 
There are other estimation techniques other than min/max sketches. The first paper on count-distinct estimation<ref>{{cite journal | last1=Flajolet | first1=Philippe|author1-link=Philippe Flajolet | last2 = Martin | first2 = G. Nigel | year=1985 | title=Probabilistic counting algorithms for data base applications | journal=J. Comput. Syst. Sci. | volume=31| issue=2| pages=182–209| doi=10.1016/0022-0000(85)90041-8| url=https://hal.inria.fr/inria-00076244/file/RR-0313.pdf}} </ref> describes the [[Flajolet–Martin algorithm]], a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by [[Daniel Kane (mathematician)|Daniel M. Kane]], [[Jelani Nelson]], and David P. Woodruff.<ref name=optimalf0>{{cite journal | last1=Kane | first1=Daniel M. | last2 = Nelson | first2 = Jelani | last3=Woodruff | first3=David P. | year=2010 | authorlink1=Daniel_Kane_Daniel Kane (mathematician) | authorlink2=Jelani_NelsonJelani Nelson | title=An Optimal Algorithm for the Distinct Elements Problem | journal=Proceedings of the 29th Annual ACM Symposium on Principles of Database Systems (PODS)|url=https://dash.harvard.edu/bitstream/handle/1/13820438/f0.pdf;sequence=1}} </ref>
 
===Bottom-''m'' sketches===
Line 51 ⟶ 52:
for a practical overview with comparative simulation results.
 
=== Python implementation of Knuth's CVM Algorithmalgorithm ===
<syntaxhighlight lang="python3" line="1">
Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm<ref>{{Cite journal |last=Chakraborty |first=Sourav |last2=Vinodchandran |first2=N. V. |last3=Meel |first3=Kuldeep S. |date=2022 |title=Distinct Elements in Streams: An Algorithm for the (Text) Book |url=http://arxiv.org/abs/2301.10191 |pages=6 pages, 727571 bytes |doi=10.4230/LIPIcs.ESA.2022.34 |issn=1868-8969}}</ref> (named by [[Donald Knuth | Donald Knuth]] after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream<ref name=":0" />, in addition to the standard (ε-δ) guarantees. Below is the modification of CVM algorithm, proposed by Donald Knuth, that maintains a buffer of maximum size s<ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>
def algorithm_d(stream, s: int):
m = len(stream) # We assume that this is given to us in advance.
t = -1 # Note that Knuth indexes the stream from 1.
p = 1
a = 0
buffer = []
while t < (m - 1):
t += 1
a = stream[t]
u = uniform(0, 1)
buffer = list(filter(lambda x: x[1] != a, buffer))
if u < p:
if len(buffer) < s:
buffer.append([u, a])
else:
buffer = sorted(buffer)
p = max(buffer[-1][0], u)
buffer.pop()
buffer.append([u, a])
return len(buffer) / p
</syntaxhighlight>
 
=== CVM algorithm ===
Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm<ref>{{Cite journalbook |lastlast1=Chakraborty |firstfirst1=Sourav |last2=Vinodchandran |first2=N. V. |last3=Meel |first3=Kuldeep S. |date=2022 |title=Distinct Elements in Streams: An Algorithm for the (Text) Book |urlseries=http://arxiv.org/abs/2301.10191Leibniz International Proceedings in Informatics (LIPIcs) |volume=244 |pages=6 pages, 727571 bytes |publisher=Schloss Dagstuhl – Leibniz-Zentrum für Informatik |doi=10.4230/LIPIcs.ESA.2022.34 |doi-access=free |arxiv=2301.10191 |isbn=978-3-95977-247-1 |issn=1868-8969}}</ref> (named by [[Donald Knuth | Donald Knuth]] after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream,<ref name=":0" />, in addition to the standard (ε-δ) guarantees. Below is the modification of CVM algorithm, proposedincluding the slight modification by Donald Knuth, that maintains a buffer of maximum size. s<ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>
 
{{nowrap|Initialize <math> p \leftarrow 1 </math>}}
Initialize max buffer size <math> s </math>, where <math> s \geq 1 </math>
Initialize an empty buffer, {{mvar|B}}
{{nowrap|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do:
{{nowrap|If <math> (a_t, u), \forall u</math> is in {{mvar|B}} then}}
{{nowrap|Delete <math> (a_t, u) </math> from {{mvar|B}}}}
{{nowrap|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}}
{{nowrap|If <math> u < p </math> then}}
{{nowrap|If <math> |B| < s </math> then}}
insert <math> (a_t, u) </math> in {{mvar|B}}
else
<math>(a',u')</math> such that <math>u' = \max\{u'':(a'',u'')\in B, \forall a''\}</math> /* <math>(a',u')</math> whose <math>u'</math> is maximum in {{mvar|B}} */
If <math> u > u' </math> then
<math>p\leftarrow u</math>
else
Replace <math>(a',u')</math> with <math> (a_t, u) </math>
<math>p\leftarrow u'</math>
{{nowrap|End For}}
{{nowrap|return <math> |B| / p </math>.}}
 
The previous version of the CVM algorithm is improved with the following modification by Donald Knuth, that adds the while loop to ensure B is reduced. <ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>
 
{{nowrap|Initialize <math> p \leftarrow 1 </math>}}
Initialize max buffer size <math> s </math>, where <math> s \geq 1 </math>
Initialize an empty buffer, {{mvar|B}}
{{nowrap|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do:
Line 62 ⟶ 110:
{{nowrap|If <math> u \leq p </math> then}}
Insert <math> (a_t, u) </math> into {{mvar|B}}
{{nowrap|While <math> |B| \geq= s \wedge u < p </math> then}}
{{nowrap|Remove every element of <math>(a', B u')</math> withof probability{{mvar|B}} with <math> u' > \frac{1p}{2} </math>}}
{{nowrap|<math> p \leftarrow \frac{p}{2} </math>}}
{{nowrap|End ForWhile}}
{{nowrap|return If <math> |B|u /< p </math>.}} then
Insert <math> (a_t, u) </math> into {{mvar|B}}
{{nowrap|End For}}
{{nowrap|return <math> |B| / p </math>.}}
 
==Weighted count-distinct problem==
Line 97 ⟶ 148:
 
[[Category:Statistical algorithms]]
[[Category:Articles with example Python (programming language) code]]