Content deleted Content added
Added URL to Knuth's note. |
No edit summary |
||
(28 intermediate revisions by 13 users not shown) | |||
Line 1:
{{Short description|Problem in computer science}}
In computer science, the '''count-distinct problem'''<ref>{{cite journal | last1=Ullman | first1=Jeff |author1-link=Jeffrey Ullman| last2 = Rajaraman | first2 = Anand | last3=Leskovec | first3=Jure |author3-link=Jure Leskovec| title=Mining data streams | url=http://infolab.stanford.edu/~ullman/mmds/ch4.pdf}}
</ref>
Line 5 ⟶ 6:
==Formal definition==
: '''Instance''': Consider a stream of elements <math>x_1, x_2, \ldots, x_s
: '''Objective''': Find an estimate <math> \widehat{n} </math> of <math> n </math> using only <math> m </math> storage units, where <math> m \ll n </math>.
Line 39 ⟶ 40:
The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element <math> e_j </math> is associated with a uniform [[Random variable|RV]], <math> h(e_j) \sim U(0,1) </math>, the expected minimum value of <math> h(e_1),h(e_2), \ldots, h(e_n) </math> is <math> 1/(n+1) </math>. The hash function guarantees that <math> h(e_j) </math> is identical for all the appearances of <math> e_j </math>. Thus, the existence of duplicates does not affect the value of the extreme order statistics.
There are other estimation techniques other than min/max sketches. The first paper on count-distinct estimation<ref>{{cite journal | last1=Flajolet | first1=Philippe|author1-link=Philippe Flajolet | last2 = Martin | first2 = G. Nigel | year=1985 | title=Probabilistic counting algorithms for data base applications | journal=J. Comput. Syst. Sci. | volume=31| issue=2| pages=182–209| doi=10.1016/0022-0000(85)90041-8| url=https://hal.inria.fr/inria-00076244/file/RR-0313.pdf}} </ref> describes the [[Flajolet–Martin algorithm]], a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by [[Daniel Kane (mathematician)|Daniel M. Kane]], [[Jelani Nelson]], and David P. Woodruff.<ref name=optimalf0>{{cite journal | last1=Kane | first1=Daniel M. | last2 = Nelson | first2 = Jelani | last3=Woodruff | first3=David P. | year=2010 | authorlink1=
===Bottom-''m'' sketches===
Line 51 ⟶ 52:
for a practical overview with comparative simulation results.
=== Python implementation of Knuth's CVM
<syntaxhighlight lang="python3" line="1">
Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream<ref name=":0" />, in addition to the standard (ε-δ) guarantees<ref>{{Cite journal |last=Chakraborty |first=Sourav |last2=Vinodchandran |first2=N. V. |last3=Meel |first3=Kuldeep S. |date=2022 |title=Distinct Elements in Streams: An Algorithm for the (Text) Book |url=http://arxiv.org/abs/2301.10191 |pages=6 pages, 727571 bytes |doi=10.4230/LIPIcs.ESA.2022.34 |issn=1868-8969}}</ref>. Here is the algorithm:<ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>▼
def algorithm_d(stream, s: int):
m = len(stream) # We assume that this is given to us in advance.
t = -1 # Note that Knuth indexes the stream from 1.
p = 1
a = 0
buffer = []
while t < (m - 1):
t += 1
a = stream[t]
u = uniform(0, 1)
buffer = list(filter(lambda x: x[1] != a, buffer))
if u < p:
if len(buffer) < s:
buffer.append([u, a])
else:
buffer = sorted(buffer)
p = max(buffer[-1][0], u)
buffer.pop()
buffer.append([u, a])
return len(buffer) / p
</syntaxhighlight>
=== CVM algorithm ===
▲Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm
Initialize an empty buffer, {{mvar|B}}. ▼
{{nowrap|Initialize <math> p \leftarrow 1 </math>}}
Initialize max buffer size <math> s </math>, where <math> s \geq 1 </math>
{{nowrap|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do:
{{nowrap|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}}
{{nowrap|If <math> u < p </math> then}}
{{nowrap|If <math> |B| < s </math> then}}
insert <math> (a_t, u) </math> in {{mvar|B}}
else
<math>(a',u')</math> such that <math>u' = \max\{u'':(a'',u'')\in B, \forall a''\}</math> /* <math>(a',u')</math> whose <math>u'</math> is maximum in {{mvar|B}} */
If <math> u > u' </math> then
<math>p\leftarrow u</math>
else
Replace <math>(a',u')</math> with <math> (a_t, u) </math>
<math>p\leftarrow u'</math>
{{nowrap|return <math> |B| / p </math>.}}
The previous version of the CVM algorithm is improved with the following modification by Donald Knuth, that adds the while loop to ensure B is reduced. <ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf |journal=}}</ref>
{{nowrap|Initialize <math> p \leftarrow 1 </math>}}
Initialize max buffer size <math> s </math>, where <math> s \geq 1 </math>
Initialize an empty buffer, {{mvar|B}}
{{nowrap|For each element <math> a_t </math>}} in data stream <math> A </math> of size <math> n </math> do:
{{nowrap|If <math>
{{nowrap|Delete <math>
▲ {{nowrap|If <math> a </math> is in {{mvar|B}} then}}
▲ {{nowrap|Delete <math> a </math> from {{mvar|B}}}}
{{nowrap|<math> u \leftarrow </math> random number in <math> [0, 1) </math>}}
{{nowrap|If <math> u \leq p </math> then}}
Insert <math> (
{{nowrap|
{{nowrap|
If <math> u < p </math> then
▲ {{nowrap|End For}}
{{nowrap|End For}}
{{nowrap|return <math> |B| / p </math>.}}
==Weighted count-distinct problem==
Line 100 ⟶ 148:
[[Category:Statistical algorithms]]
[[Category:Articles with example Python (programming language) code]]
|