{{Short description|Problem in computer science}}
In computer science, the '''count-distinct problem'''<ref>{{cite journal | last1=Ullman | first1=Jeff |author1-link=Jeffrey Ullman| last2=Rajaraman | first2=Anand | last3=Leskovec | first3=Jure |author3-link=Jure Leskovec| title=Mining data streams | url=http://infolab.stanford.edu/~ullman/mmds/ch4.pdf}}</ref> is the problem of finding the number of distinct elements in a data stream with repeated elements.
==Formal definition==
: '''Instance''': A stream of elements <math> x_1, x_2, \ldots, x_s </math> with repetitions, and an integer <math> m </math>. Let <math> n </math> denote the number of distinct elements, namely <math> n = |\{x_1, x_2, \ldots, x_s\}| </math>, and let these elements be <math> e_1, e_2, \ldots, e_n </math>.
: '''Objective''': Find an estimate <math> \widehat{n} </math> of <math> n </math> using only <math> m </math> storage units, where <math> m \ll n </math>.
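For contrast with the estimators below, the exact solution stores every distinct element and therefore uses memory proportional to <math> n </math>, which is exactly what the constraint <math> m \ll n </math> rules out. A short illustrative Python sketch (not taken from the references) of this exact baseline:

<syntaxhighlight lang="python3">
def count_distinct_exact(stream) -> int:
    """Exact count: remember every distinct element, using memory proportional to n."""
    seen = set()
    for x in stream:
        seen.add(x)
    return len(seen)
</syntaxhighlight>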
=== Min/max sketches ===
Min/max sketches store only the minimum or maximum hashed values.
The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element <math> e_j </math> is associated with a uniform [[Random variable|RV]], <math> h(e_j) \sim U(0,1) </math>, the expected minimum value of <math> h(e_1),h(e_2), \ldots, h(e_n) </math> is <math> 1/(n+1) </math>. The hash function guarantees that <math> h(e_j) </math> is identical for all the appearances of <math> e_j </math>. Thus, the existence of duplicates does not affect the value of the extreme order statistics.
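As an illustration of this intuition (a minimal sketch, not an estimator taken from the cited literature), the following Python fragment tracks only the minimum hashed value and inverts the identity <math> E[\min] = 1/(n+1) </math>; blake2b is used merely as a stand-in for a hash function with pseudo-uniform output, and a single minimum gives a very noisy estimate, which is why practical sketches combine many such values.

<syntaxhighlight lang="python3">
import hashlib


def min_sketch_estimate(stream) -> float:
    """Track only the minimum h(e); since E[min of n uniforms] = 1/(n+1),
    invert the observed minimum to obtain a (noisy) estimate of n."""
    min_u = 1.0
    for e in stream:
        # Deterministic pseudo-uniform value in (0, 1); duplicates map to the same value.
        h = int.from_bytes(hashlib.blake2b(str(e).encode(), digest_size=8).digest(), "big")
        min_u = min(min_u, (h + 0.5) / 2.0**64)
    return 1.0 / min_u - 1.0
</syntaxhighlight>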
There are estimation techniques other than min/max sketches. The first paper on count-distinct estimation, by Flajolet and Martin (the [[Flajolet–Martin algorithm]]), describes a bit pattern sketch: the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values.
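The single-sketch bit-pattern idea can be illustrated as follows (high-variance, illustrative code only; the original algorithm averages several sketches, blake2b is a stand-in hash, and 0.77351 is the Flajolet–Martin correction constant).

<syntaxhighlight lang="python3">
import hashlib


def fm_bit_pattern_estimate(stream) -> float:
    """Single Flajolet–Martin style bit-pattern sketch: OR together 2**rho(h(e)),
    where rho(x) is the position of the least-significant 1 bit of x."""
    bitmap = 0
    for e in stream:
        h = int.from_bytes(hashlib.blake2b(str(e).encode(), digest_size=8).digest(), "big")
        rho = (h & -h).bit_length() - 1 if h else 63  # index of the lowest set bit
        bitmap |= 1 << rho
    r = 0  # position of the least-significant 0 bit of the bitmap
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / 0.77351
</syntaxhighlight>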
===Bottom-''m'' sketches===
Bottom-''m'' sketches are a generalization of min sketches, which maintain the <math> m </math> minimal values, where <math> m \geq 1 </math>.
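To make the idea concrete, a minimal bottom-''m'' sketch can be written as follows (an illustrative sketch, not code from the cited papers; blake2b is a stand-in hash, and the classical estimator <math> (m-1)/u_{(m)} </math> is used, where <math> u_{(m)} </math> is the largest of the <math> m </math> retained values).

<syntaxhighlight lang="python3">
import hashlib
import heapq


def bottom_m_estimate(stream, m: int) -> float:
    """Maintain the m smallest distinct hashed values; for m >= 2 the estimator
    (m - 1) / u_(m) is unbiased when at least m distinct elements are present."""
    def h(e) -> float:
        x = int.from_bytes(hashlib.blake2b(str(e).encode(), digest_size=8).digest(), "big")
        return (x + 0.5) / 2.0**64  # deterministic pseudo-uniform value in (0, 1)

    kept = []        # max-heap (stored negated) of the m smallest hashed values
    members = set()  # hashed values currently kept, used to skip duplicates
    for e in stream:
        u = h(e)
        if u in members:
            continue
        if len(kept) < m:
            heapq.heappush(kept, -u)
            members.add(u)
        elif u < -kept[0]:
            members.remove(-heapq.heappushpop(kept, -u))  # evict the largest kept value
            members.add(u)
    if len(kept) < m:
        return float(len(kept))  # fewer than m distinct elements: the count is exact
    return (m - 1) / (-kept[0])
</syntaxhighlight>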
See Cosma et al.<ref name=cosma2011 /> for a theoretical overview of count-distinct estimation algorithms, and Metwally et al.<ref>{{Citation | last1=Metwally | first1=Ahmed | last2=Agrawal | first2=Divyakant | last3=Abbadi | first3=Amr El | year=2008 | title=Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic | series=Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology | pages=618–629 | citeseerx=10.1.1.377.4771 }}</ref> for a practical overview with comparative simulation results.
=== Python implementation of Knuth's CVM algorithm ===
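The implementation below follows the version of the CVM algorithm described in the next section (the variant Knuth presents as Algorithm D, with the maximum-key replacement rule); the length of the stream is assumed to be known in advance.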
<syntaxhighlight lang="python3" line="1">
from random import uniform


def algorithm_d(stream, s: int):
    """Estimate the number of distinct elements in stream, following Knuth's
    Algorithm D, using a buffer of at most s (key, element) pairs."""
    m = len(stream)  # We assume that the length of the stream is known in advance.
    t = -1  # Note that Knuth indexes the stream from 1.
    p = 1.0
    buffer = []  # Pairs [u, a]; every stored key u satisfies u < p.
    while t < (m - 1):
        t += 1
        a = stream[t]
        u = uniform(0, 1)
        # Remove any earlier occurrence of a from the buffer.
        buffer = [x for x in buffer if x[1] != a]
        if u < p:
            if len(buffer) < s:
                buffer.append([u, a])
            else:
                buffer.sort()
                max_u = buffer[-1][0]  # largest key currently in the buffer
                if u > max_u:
                    p = u  # do not keep (u, a); shrink the sampling threshold instead
                else:
                    buffer.pop()           # evict the pair with the largest key
                    buffer.append([u, a])  # and store the new pair in its place
                    p = max_u
    return len(buffer) / p
</syntaxhighlight>
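A small sanity check (illustrative, not from Knuth's note) compares the estimate with the exact count:

<syntaxhighlight lang="python3">
from random import randint, seed

seed(2023)
stream = [randint(1, 1000) for _ in range(10_000)]   # close to 1000 distinct values
print(len(set(stream)), algorithm_d(stream, s=100))  # exact count vs. CVM estimate
</syntaxhighlight>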
=== CVM algorithm ===
Compared to other approximation algorithms for the count-distinct problem, the CVM algorithm<ref>{{Cite book |last1=Chakraborty |first1=Sourav |last2=Vinodchandran |first2=N. V. |last3=Meel |first3=Kuldeep S. |date=2022 |title=Distinct Elements in Streams: An Algorithm for the (Text) Book |series=Leibniz International Proceedings in Informatics (LIPIcs) |volume=244 |publisher=Schloss Dagstuhl – Leibniz-Zentrum für Informatik |doi=10.4230/LIPIcs.ESA.2022.34 |doi-access=free |arxiv=2301.10191 |isbn=978-3-95977-247-1 |issn=1868-8969}}</ref> (named by [[Donald Knuth]] after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. It provides an unbiased estimator of the number of distinct elements in a stream,<ref name=":0">{{cite journal |last1=Knuth |first1=Donald |date=May 2023 |title=The CVM Algorithm for Estimating Distinct Elements in Streams |url=https://cs.stanford.edu/~knuth/papers/cvm-note.pdf}}</ref> in addition to the standard (ε, δ) guarantees. Below is the CVM algorithm, including the slight modification by Donald Knuth; the Python implementation in the previous section follows this version.
 Initialize <math> p \leftarrow 1 </math>
 Initialize max buffer size <math> s </math>, where <math> s \geq 1 </math>
 Initialize an empty buffer, {{mvar|B}}
 For each element <math> a_t </math> in data stream <math> A </math> of size <math> n </math> do:
     If <math> (a_t, u') </math> is in {{mvar|B}} for some <math> u' </math> then
         Delete <math> (a_t, u') </math> from {{mvar|B}}
     <math> u \leftarrow </math> random number in <math> [0, 1) </math>
     If <math> u < p </math> then
         If <math> |B| < s </math> then
             Insert <math> (a_t, u) </math> into {{mvar|B}}
         else
             Let <math> (a', u') </math> be the pair in {{mvar|B}} whose <math> u' </math> is maximum
             If <math> u > u' </math> then
                 <math> p \leftarrow u </math>
             else
                 Replace <math> (a', u') </math> with <math> (a_t, u) </math>
                 <math> p \leftarrow u' </math>
 End For
 return <math> |B| / p </math>
The previous version of the CVM algorithm is improved with the following modification by Donald Knuth, which adds a while loop to ensure that {{mvar|B}} is actually reduced.<ref name=":0" />
 Initialize <math> p \leftarrow 1 </math>
 Initialize max buffer size <math> s </math>, where <math> s \geq 1 </math>
 Initialize an empty buffer, {{mvar|B}}
 For each element <math> a_t </math> in data stream <math> A </math> of size <math> n </math> do:
     If <math> (a_t, u') </math> is in {{mvar|B}} for some <math> u' </math> then
         Delete <math> (a_t, u') </math> from {{mvar|B}}
     <math> u \leftarrow </math> random number in <math> [0, 1) </math>
     If <math> u < p </math> then
         While <math> |B| = s \wedge u < p </math> do
             Remove every element <math> (a', u') </math> of {{mvar|B}} with <math> u' \geq \frac{p}{2} </math>
             <math> p \leftarrow \frac{p}{2} </math>
         End While
         If <math> u < p </math> then
             Insert <math> (a_t, u) </math> into {{mvar|B}}
 End For
 return <math> |B| / p </math>
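For comparison with the Python implementation of the first version, a minimal Python sketch of this while-loop variant (written directly from the pseudocode above rather than taken from Knuth's note) could look like this:

<syntaxhighlight lang="python3">
from random import uniform


def cvm_while(stream, s: int) -> float:
    """While-loop variant: halve p until the buffer has room or u is no longer below p."""
    p = 1.0
    buffer = {}  # maps an element to its key u; every stored key satisfies u < p
    for a in stream:
        buffer.pop(a, None)  # remove any earlier occurrence of a
        u = uniform(0, 1)
        if u < p:
            while len(buffer) == s and u < p:
                # Drop every element whose key is at least p/2, then halve p.
                buffer = {b: v for b, v in buffer.items() if v < p / 2}
                p /= 2
            if u < p:
                buffer[a] = u
    return len(buffer) / p
</syntaxhighlight>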
==Weighted count-distinct problem==
Any extreme order statistics estimator (min/max sketches) can be extended to an estimator for the weighted problem, in which each element carries a weight and the goal is to estimate the total weight of the distinct elements.<ref>{{cite journal | last1=Cohen | first1=Reuven | last2=Katzir | first2=Liran | last3=Yehezkel | first3=Aviv | year=2014 | title=A Unified Scheme for Generalizing Cardinality Estimators to Sum Aggregation | journal=Information Processing Letters | doi=10.1016/j.ipl.2014.10.009 | volume=115 | issue=2 | pages=336–342}}</ref>
For example, the weighted estimator proposed by Cohen et al.<ref name=edithCohen /> can be obtained when the continuous max sketches estimator is extended to solve the weighted problem.
In particular, the [[HyperLogLog]] algorithm can be extended in this way to solve the weighted problem.
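As an illustration of such an extension (a sketch under assumptions, not the estimator of the cited papers): if every distinct element <math> e_j </math> with weight <math> w_j </math> is mapped to an exponential random variable with rate <math> w_j </math>, the minimum of these variables is exponential with rate <math> \sum_j w_j </math>, so inverting the observed minimum estimates the total weight. In the Python fragment below, <code>weight</code> is assumed to be a function returning each element's weight, and blake2b is only a stand-in hash.

<syntaxhighlight lang="python3">
import hashlib
import math


def weighted_min_sketch(stream, weight) -> float:
    """Continuous min sketch for the weighted problem: the minimum over distinct
    elements of Exp(w_j) draws is Exp(sum of w_j), so 1/min estimates the total weight."""
    def u(e) -> float:
        x = int.from_bytes(hashlib.blake2b(str(e).encode(), digest_size=8).digest(), "big")
        return (x + 0.5) / 2.0**64  # deterministic pseudo-uniform value in (0, 1)

    min_x = math.inf
    for e in stream:
        # Duplicates of e always yield the same draw, so they do not change the minimum.
        min_x = min(min_x, -math.log(u(e)) / weight(e))
    return 1.0 / min_x  # noisy single-sketch estimate; practical schemes combine many sketches
</syntaxhighlight>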
==See also==
[[Category:Statistical algorithms]]
[[Category:Articles with example Python (programming language) code]]